Enrico Caruso (1873–1921) was a famous Italian tenor who sang to great acclaim in major opera houses throughout the world. A good deal of his popularity was due to the numerous commercial audio recordings he made. Despite the primitive equipment of the time, Caruso’s voice recorded extremely well, and therein lay his advantage over many of his competitors. Plus, of course, he was a rare, consummate artist, and his was a truly exceptional voice.
In this document, I’ll use an excerpt from Caruso’s recording of E lucevan le stelle (a tenor aria from Giacomo Puccini’s opera Tosca) to illustrate an amusing fact: residuals in (non-parametric) regression analysis can be not only displayed, but also heard. (As an aside: Puccini (1858–1924) was arguably the last truly major Italian operatic composer, and his works enjoy enduring popularity in operatic circles. According to statistics published by Operabase, in 2017–2018 there were 136 productions of Tosca worldwide, with a total of 553 performances, which puts Tosca in the lofty 5th position among the most performed operas of the previous season.)
First, some preliminaries: I’ll rely on the following R packages in this document.
library(ggplot2)
library(tidyverse)
library(waveslim)
library(tuneR)
library(EbayesThresh)
Now the core issue. A digital signal of the aria (at the sampling rate of 8192 Hz) is included in WaveLab, a free library of Matlab routines for wavelet analysis. No further information on the recording is provided, but it can be surmised that it was made sometime during the first decade of the 20th century: throughout his career, Caruso recorded E lucevan le stelle several times, and the piano accompaniment we hear in the WaveLab recording would fit the following discs: Pathé from 1901 (matrix number 84004), Zonophone from 1902 (matrix number X 1553), or Gramophone Typewriter from 1903 (matrix number 52349). You can download the dataset here (place the file in the data folder in your working directory). After loading the .txt file in R, the data can be visualised as in the plot below. The full dataset contains 50,000 values; I’ll use only the first 48,000, as with the former number I could compute just one level of the wavelet transform rather than the two I need (see below).
n <- 48000
caruso <- read.table("data/caruso.txt")[1:n,1]
ggplot() +
geom_line(mapping = aes(x = 1:n, y = caruso)) +
xlab("n") +
ylab("signal")
Rather than looking at the plot, it’s admittedly more enlightening to listen to the recording. In R, one can either export the signal as a wave file (using the tuneR package) and play it externally, or call a command-line audio player. The code for the latter option is platform-dependent and requires some tuning (on a Mac, you can employ a little utility called Play: download it from here, set the audio player in R via setWavPlayer() by pointing it to Play, and then play wave files by invoking play(); see, e.g., the directions here), so I’ll use the former and let you do the manual work. All the wave files produced through the code below will be saved to the output folder; you need to create it beforehand in your working directory!
caruso_wave <- Wave(caruso, samp.rate = 8192, bit = 16)
writeWave(caruso_wave, filename = "output/caruso.wav")
You can now play the caruso.wav file on your computer (if you are a Mac user and have set up the Play utility as described above, you can play the wave data directly in R via play(caruso_wave)). For convenience, I’ve also linked it here.
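For completeness, the command-line-player route could look roughly like the snippet below; the path to Play is an assumption on my part and should be adjusted to wherever you installed the utility.

setWavPlayer("/usr/local/bin/Play")   # assumed install location of the Play utility
play(caruso_wave)                     # with the player set, Wave objects play directly from R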
Upon hearing the recording, the presence of a substantial amount of noise on top of the signal becomes apparent. How can this surface noise be removed? There are many cutting-edge audio de-noising techniques, but since my primary goals are illustrative, I’ll stick to the Discrete Wavelet Transform (DWT); see Percival and Walden (2000), Wavelet Methods for Time Series Analysis, Cambridge University Press (doi:10.1017/CBO9780511841040). The DWT is a special kind of linear (orthogonal) transformation. Loosely speaking, it concentrates most of the signal in a few large components (coefficients), while spreading the noise “uniformly” over the transformed version. A wavelet approach to de-noising then consists in (i) computing the DWT of the noisy signal, (ii) shrinking (thresholding) the small wavelet coefficients, which presumably carry mostly noise, towards zero, and (iii) inverting the transform to obtain the de-noised reconstruction.
I’ll use the cutting-edge empirical Bayes approach to wavelet de-noising of Johnstone and Silverman (2005), “Empirical Bayes Selection of Wavelet Thresholds”, Annals of Statistics 33(4), 1700–1752 (doi:10.1214/009053605000000345). In R, the method is implemented in the EbayesThresh package. Although I’ve been talking about de-noising all along, it should be clear that at the end of the day I’m simply doing (non-parametric) regression.
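Before applying the method to the Caruso data, here is a tiny toy illustration (my own made-up example, not part of the original analysis) of what empirical Bayes thresholding does: small, noise-like values are shrunk to zero, while the few large ones survive largely intact.

set.seed(1)
toy <- c(rnorm(90), rnorm(10, mean = 7))   # mostly noise, plus a few large "signal" values
toy_shrunk <- ebayesthresh(toy, prior = "laplace", a = NA, threshrule = "median")
sum(toy_shrunk != 0)                       # mostly just the large values should survive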
The first step is computing the DWT. There are several implementations available in R; I’ll use the waveslim package. In particular, I’ll compute a (partial) DWT with 2 levels of the transform (understanding the options I pass to dwt requires some knowledge of the wavelet transform). De-noising more than 2 levels appears to be too invasive in the present setting.
caruso_dwt <- dwt(caruso, wf = "la8", n.levels = 2, boundary = "reflection")
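Just to get a feel for what dwt() returned: the result is a list holding the detail coefficients at the two levels (d1, d2) together with the smooth coefficients (s2).

names(caruso_dwt)            # detail levels d1, d2 and the smooth s2
sapply(caruso_dwt, length)   # number of coefficients at each level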
The next step is de-noising the wavelet coefficients (understanding the options I pass to ebayesthresh.wavelet requires reading Johnstone and Silverman’s paper).
EBayes <- ebayesthresh.wavelet(caruso_dwt, threshrule = "median",
prior = "laplace", a = NA, vscale = "independent")
Now I’ll invert the transform and round off the resulting values so that the signal can be written out and played as an audio file.
caruso_ebayes <- idwt(EBayes)[1:n] %>%
round()
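A quick optional sanity check before writing the file: for 16-bit audio, the rounded values should lie within the signed 16-bit integer range.

range(caruso_ebayes)   # should stay within c(-32768, 32767) for bit = 16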
Finally, I’ll create and save the wave file of the de-noised signal. It’s also available for download here.
caruso_ebayes %>%
Wave(samp.rate = 8192, bit = 16) %>%
writeWave(filename = "output/caruso_ebayes.wav")
Listen to the file and judge for yourself. To my ears, the results aren’t entirely satisfactory: some noise is certainly gone, but Caruso’s voice in the original “noisy” version sounds more natural, more spontaneous perhaps (play the files again to double-check).
The truth is that the DWT is not the most appropriate transform for audio de-noising purposes. Perhaps the same should be said of Johnstone and Silverman’s empirical Bayes de-noising: it wasn’t designed with audio applications in mind, and hence doesn’t account for features important in that field; cf. Wolfe, Godsill and Ng (2004), “Bayesian Variable Selection and Regularization for Time-Frequency Surface Estimation”, Journal of the Royal Statistical Society, Series B 66(3), 575–589 (doi:10.1111/j.1467-9868.2004.02052.x), and Godsill, Cemgil, Févotte and Wolfe (2007), “Bayesian Computational Methods for Sparse Audio and Music Processing”, 15th European Signal Processing Conference (EURASIP), 345–349. This is despite all the nice mathematical theorems established in Johnstone and Silverman’s work concerning the optimality properties of their method.
One can also judge the quality of the de-noising by examining the residuals left by the empirical Bayes method. These residuals can be plotted, as in the figure below.
caruso_residuals <- caruso - caruso_ebayes
ggplot() +
geom_point(mapping = aes(x = 1:n, y = caruso_residuals)) +
xlab("n") +
ylab("residuals")
A quick visual impression is that the method has removed quite some noise, although, as e.g. the plot of the residuals versus the fitted values below indicates, the residuals aren’t entirely healthy.
ggplot() +
geom_point(mapping = aes(x = caruso_ebayes, y = caruso_residuals)) +
xlab("fitted values") +
ylab("residuals")
Now let’s hear the residuals. I’ve also placed the wave file here.
caruso_residuals %>%
Wave(samp.rate = 8192, bit = 16) %>%
writeWave(filename = "output/caruso_residuals.wav")
It turns out that, in performing the de-noising, the empirical Bayes method has chopped off a decent amount of the signal too: amidst the noise, one can clearly distinguish Caruso’s voice, though at a reduced volume! In hindsight, that perhaps provides a rational explanation of why I found the de-noising outcome somewhat troubling; you might still think otherwise, though. What I like about this example is that the performance of an algorithm can be assessed directly, without any math (I don’t know how to formalise the workings of my perception mathematically) and subjectively (which, in the end, isn’t such a bad thing?).
One possibility to “fix” things here is to use a multi-step method: repeat the whole procedure, but now applied to the residuals from the first step. One may hope thereby to peel off another layer of the signal and add it to the “de-noised” reconstruction from the first step; a sketch of this idea follows below. This doesn’t appear to be a technique commonly considered in the statistics literature on wavelet de-noising, though it has been suggested elsewhere; see Berger, Coifman and Goldberg (1994), “Removing Noise from Music Using Local Trigonometric Bases and Wavelet Packets”, Journal of the Audio Engineering Society 42(10), 808–818. A perhaps more important conclusion is that in complex applications (and real-world problems are mostly complex), an unqualified use of off-the-shelf algorithms and techniques might not always lead to the desired outcome. My results can be improved, but that requires a bit of skill, patience and time.
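Here is a minimal sketch of such a two-step procedure under the same settings as above; the object and file names (e.g. caruso_twostep) are my own, hypothetical choices.

residuals_dwt <- dwt(caruso_residuals, wf = "la8", n.levels = 2, boundary = "reflection")
residuals_ebayes <- ebayesthresh.wavelet(residuals_dwt, threshrule = "median",
                                         prior = "laplace", a = NA, vscale = "independent")
# add whatever signal is recovered from the residuals back to the first reconstruction
caruso_twostep <- caruso_ebayes + round(idwt(residuals_ebayes)[1:n])
caruso_twostep %>%
  Wave(samp.rate = 8192, bit = 16) %>%
  writeWave(filename = "output/caruso_twostep.wav")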
And this brings the present illustrative example to completion. I hope you enjoyed reading it!