Common signal representations

Many problems in signal and image processing consist in estimating a signal of interest from noisy and/or incomplete measurements. Such problems can be addressed by exploiting characteristics or regularities of natural signals and images, which can be identified or highlighted using specific signal representations. In this applied course, we will discuss commonly used signal and image representations.

Basics

Signal

Signal processing starts with a sensor that acquires an analog signal characterizing a physical phenomenon. For instance, a microphone converts an acoustic wave into an electric signal.

Image credit: Meinard Müller, “Fundamentals of Music Processing”, Springer 2015

Digital signal

For digital processing, the signal is sampled and quantized.

Image credit: Meinard Müller, “Fundamentals of Music Processing”, Springer 2015

Below, we import the library soundfile (as sf), and read the information of the bird-thrush-nightingale-03.wav file located in the audio folder using the sf.info function. If you want to know more about this function, just run sf.info? in a code cell.
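As a minimal sketch, that cell might look as follows:

```python
import soundfile as sf

# Read the file metadata (duration, sample rate, format) without loading the samples.
info = sf.info("audio/bird-thrush-nightingale-03.wav")
print(info)
```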

We can also use the IPython package to listen to this audio signal.
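A one-liner is enough for this, for instance:

```python
import IPython.display as ipd

# Embed an audio player in the notebook, reading directly from the file.
ipd.Audio("audio/bird-thrush-nightingale-03.wav")
```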

Question: Using the variable info, compute the number of samples in this audio signal. You will verify your answer by loading the audio signal using the sf.read function of soundfile.

Answer: The number of samples is given by the duration of the signal (in seconds) multiplied by the sample rate (in Hz).
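In code, this might look as follows (a sketch assuming a mono file, using the attributes of soundfile's info object):

```python
import soundfile as sf

info = sf.info("audio/bird-thrush-nightingale-03.wav")
n_samples = round(info.duration * info.samplerate)  # duration (s) x sample rate (Hz)

# Verify by loading the signal: for a mono file, one sample per frame.
x, sr = sf.read("audio/bird-thrush-nightingale-03.wav")
assert len(x) == n_samples == info.frames
```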

The above information tells us that this file is a WAV file using a 16-bit pulse-code modulation (PCM) representation. In particular, it means that the signal is quantized using 16 bits per sample. The amplitude of the signal is therefore represented using $2^{16} = 65 536$ values.

Representations of signals in the discrete time and frequency domains

Time-domain representation

To develop models and algorithms in signal processing we need to work with mathematical notations or representations of the physical signals we acquire in the real world.

Let's start with the simplest representation, in the time domain. We denote by $x(t) \in \mathbb{R}$ a signal whose samples are real valued and which is defined in the discrete-time domain $t \in \mathbb{Z}$.

In signal processing, we can define different classes of signals, which allows us to define different transformations of the signals. For instance, we can assume that the sum of the absolute values of the signal samples is finite, i.e., $\sum\limits_{t \in \mathbb{Z}} |x(t)| < \infty.$ This assumption allows us to define the discrete-time Fourier transform (DTFT). Understanding the DTFT first is useful to understand a more practical transform called the discrete Fourier transform (DFT) (spoiler: the DFT is a sampled version of the DTFT). But our goal is not to create signal processing heroes in this course, so we will jump directly to the DFT. Don't be too disappointed though; you can have a look at the appendices to learn about the DTFT and its relation with the DFT.

Real-world signals

So let's consider a really practical situation, where we acquire signals during a limited period of time and using sensors with limited dynamics. This implies that the signals are of finite length, amplitude, and energy. We therefore always have:

$$ \sum\limits_{t \in \mathbb{Z}} |x(t)| = \sum\limits_{t=0}^{T-1} |x(t)| < \infty. $$

We can therefore also represent a signal as a vector $\mathbf{x} \in \mathbb{R}^T$, where the entry of index $t \in \{0,...,T-1 \}$ contains the sample $x(t)$.

Question: Aren't you puzzled by some incoherence between what we've seen before about digital signals and this mathematical representation of supposedly real-world signals? In 1976, the British statistician George Box wrote that "all models are wrong, but some are useful." The above model of real-world signals is wrong, but definitely useful in many daily-life applications.

Answer: The amplitude is quantized and not real valued.

The waveform

So, a signal is actually a function from the discrete-time domain to the amplitude domain: $x : \mathbb{Z} \rightarrow \mathbb{R}$, or if you prefer $x : \{0,...,T-1\} \rightarrow \mathbb{R}$. Can we look at the graph of this function? Of course we can, and this is called the waveform.

For audio signals, the waveform represents the deviation over time of the air pressure (with respect to the average air pressure) at the microphone location: it is a pressure-time plot. Let's look at the waveform of our bird recording.
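A minimal sketch of the plotting cell:

```python
import numpy as np
import matplotlib.pyplot as plt
import soundfile as sf

x, sr = sf.read("audio/bird-thrush-nightingale-03.wav")
t = np.arange(len(x)) / sr          # time axis in seconds
plt.plot(t, x)
plt.xlabel("time (s)")
plt.ylabel("amplitude")
plt.show()
```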

Question: Why is the signal centered around 0? What pressure does this amplitude correspond to?

Answer: The audio signal results from the vibration of a physical system, which causes oscillations of the surrounding air particles. These should eventually go back to their equilibrium position, corresponding to the average air pressure.

Let's look at another, maybe simpler signal. In the next cell we load an electrocardiogram (ECG) and plot the corresponding waveform. Electrocardiography is a method to record heart activity by using electrodes to measure electrical changes on the skin. An ECG thus corresponds to the recording of a heart's electrical activity, and the waveform is a voltage-time plot.

The image below shows a schematic representation of a normal ECG. According to Wikipedia, normal rhythm produces four entities – a P wave, a QRS complex, a T wave, and a U wave – that each have a fairly unique pattern.

Image credit: Wikipedia.

An ECG captures multiple repetitions of this pattern, corresponding to several cardiac cycles. Unfortunately, a real ECG is quite different from this schematic representation, as it contains other periodic components that make it difficult to visualize the above patterns. Decomposing the signal into different components associated with different frequencies is very useful to remove such perturbations. This decomposition is achieved using the discrete Fourier transform (DFT).

Frequency domain representation

Discrete Fourier transform (DFT)

Let $x(t) \in \mathbb{R}$ be a signal of finite support $ \{ 0,...,T-1 \} $. The discrete Fourier transform (DFT) of order $F \in \mathbb{N}$ is defined by:

$$ X(f) = \sum\limits_{t=0}^{T-1} x(t) \exp\left(-j 2 \pi \frac{f t}{F}\right), \qquad f \in \{0,...,F-1\}.$$

If $F > T$, the signal is zero-padded and if $F < T$ it is truncated. In other words, we artificially change the length $T$ of the signal so that in the end it is equal to the DFT order $F$.

The inverse DFT is defined by:

$$ x(t) = \frac{1}{F} \sum\limits_{f=0}^{F-1} X(f) \exp\left(+j 2 \pi \frac{f t}{F}\right), \qquad t \in \mathbb{Z}. $$

Alternatively but equivalently, we could define the DFT by normalizing both the direct and the inverse transforms by $\frac{1}{\sqrt{F}}$, instead of normalizing the inverse DFT by $\frac{1}{F}$.

For real-valued signals, the DFT is complex valued; it gives the spectrum of the signal. So we will generally look at the modulus of the DFT, which is called the magnitude spectrum, or at the squared modulus, which is called the power spectrum. The argument of the DFT is called the phase spectrum.
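With numpy, and denoting by `x` the signal loaded above, these spectra can be computed as follows (a sketch):

```python
import numpy as np

X = np.fft.fft(x)           # DFT, of order F = len(x) by default
magnitude = np.abs(X)       # magnitude spectrum
power = np.abs(X) ** 2      # power spectrum
phase = np.angle(X)         # phase spectrum
```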

Hermitian symmetry

Let's have a look at the magnitude and power spectrum of our ECG signal.

We observe in the above spectra a form of symmetry around the middle of the frequency axis. This comes from an important property of the DFT of real-valued signals, which is called the Hermitian symmetry and is defined by:

$$X(F-f) = X^*(f), \qquad f \in \{0,...,F-1\},$$

where $\cdot^*$ denotes the complex conjugate.

Let's print the values of the DFT at the zero frequency ($f=0$) and at the Nyquist frequency ($f= F//2$), assuming that $F$ is even.
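For instance (a sketch, reusing the DFT `X` computed above):

```python
F = len(X)
print(X[0])       # zero frequency: imaginary part is exactly 0
print(X[F // 2])  # Nyquist frequency (F even): imaginary part is exactly 0
```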

We observe that the DFT coefficients at the zero and Nyquist frequencies are real-valued, i.e., the imaginary part is exactly zero. This can be easily verified theoretically as follows:

$$ X(0) = \sum\limits_{t=0}^{T-1} x(t) \in \mathbb{R}, \qquad X(F/2) = \sum\limits_{t=0}^{T-1} x(t) e^{-j \pi t} = \sum\limits_{t=0}^{T-1} x(t) (-1)^t \in \mathbb{R}. $$

The DFT can consequently be split into four parts:

  1. The zero-frequency coefficient $X(0)$, which is real valued;
  2. The $F/2-1$ coefficients $X(f)$ for $f = 1,...,F/2-1$, which are sometimes called the positive frequencies;
  3. The Nyquist-frequency coefficient $X(F/2)$, which is real valued;
  4. The $F/2-1$ coefficients $X(f)$ for $f = F/2+1,...,F-1$, which are sometimes called the negative frequencies.

The Hermitian symmetry property tells us that the negative-frequency coefficients are equal to the conjugate of the positive-frequency coefficients when we reverse the frequency axis. Let's verify this.
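A possible verification cell (sketch, reusing `X` and `F` from above):

```python
import numpy as np

pos = X[1:F // 2]             # positive frequencies
neg = X[F // 2 + 1:][::-1]    # negative frequencies, with the frequency axis reversed
print(np.allclose(neg, np.conj(pos)))  # True: X(F - f) = X*(f)
```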

Wow, this was harsh signal processing stuff. Don't worry, what you simply need to remember is the following:

Real-valued signals have a Hermitian-symmetric spectrum, so we usually keep only the non-redundant part of the spectrum, up to the Nyquist frequency. Here we go.
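In numpy, this is exactly what `np.fft.rfft` computes:

```python
import numpy as np

X_half = np.fft.rfft(x)       # only the F//2 + 1 non-redundant coefficients
print(len(x), len(X_half))    # T and T//2 + 1
```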

We see that in this signal, most of the energy is concentrated in the low frequencies, so let's zoom in a bit more. We also plot the waveform again for discussion.

Notice that we have a difference of about 60 dB between the most and least energetic frequency components of this signal. When we divide the amplitude of a signal by 2, we lose $20 \log_{10}(2) \approx 6$ dB, so a 60 dB decrease is quite significant.

Filtering of an ECG signal

We can make other observations from this power spectrum:

In order to use the ECG for diagnosis, we would like to eliminate these frequency components, since they are not associated with heart activity. This is a filtering task in signal processing. The naive approach would simply consist in zeroing the DFT coefficients at the frequencies we want to discard, and reconstructing the time-domain signal by inverse DFT.

Exercise: Complete the cell below to implement this naive filtering approach.

Solution
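The solution cell is not reproduced here, but a minimal sketch could look as follows, assuming `x` is the ECG signal, `fs` its sample rate, and `f_cut` a hypothetical cutoff frequency (in Hz) above which everything is discarded; the actual frequencies to remove depend on the observations above:

```python
import numpy as np

X = np.fft.fft(x)
freqs = np.fft.fftfreq(len(x), d=1 / fs)  # frequency (Hz) of each DFT bin
X[np.abs(freqs) > f_cut] = 0              # zero the bins and their negative mirrors
x_filt = np.real(np.fft.ifft(X))          # back to the time domain
```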

Let's plot the results. What do you observe?

This simple brick-wall filter is known to cause ringing artifacts in the reconstructed time-domain signal. You should probably ask a real signal processing hero to help you with the filter design, which is beyond the scope of this course.

Geometric interpretation and alternatives to the DFT

To finish this part, let's discuss a geometric interpretation of the DFT. This is probably what you should really remember about the DFT, because it shares strong connections with other aspects of signal representations that will be discussed in the following lessons.

Let $\mathbf{x} \in \mathbb{R}^T$ be the vector representation of the signal $x(t)$, $t \in \{ 0,...,T-1 \}$, and let $\mathbf{u}_f \in \mathbb{C}^T$ be the vector of entries $\mathbf{u}_f(t) = \frac{1}{\sqrt{F}} \exp\left(+j 2 \pi \frac{f t}{F}\right)$, $t \in \{ 0,...,T-1 \}$. This vector can be regarded as a complex sinusoid of frequency $f/F$, using Euler's formula:

$$ e^{+j 2 \pi \frac{f t}{F}} = \cos\left( 2 \pi \frac{f t}{F} \right) + j \sin\left( 2 \pi \frac{f t}{F} \right). $$

Then the (normalized) DFT can be expressed as inner products (on the complex vector space $\mathbb{C}^T$) between the signal $\mathbf{x}$ and complex sinusoids $\mathbf{u}_f$ of different frequencies:

$$ X(f) = \langle \mathbf{x}, \mathbf{u}_f \rangle = \mathbf{u}_f^H \mathbf{x}, \qquad f \in \{0,...,F-1\}, $$

where $\cdot^H$ denotes Hermitian transpose (transpose + complex conjugate).

If we assume that $F = T$ (after zero-padding or truncation), the vector $\check{\mathbf{x}} \in \mathbb{C}^F$ of DFT coefficients is obtained by multiplying the signal $\mathbf{x} \in \mathbb{R}^T$ with the DFT matrix $\mathbf{U} \in \mathbb{C}^{F \times T}$ where each row of index $f \in \{0,...,F-1\}$ is defined by ${\mathbf{u}_f^H \in \mathbb{C}^T}$:

$$ \check{\mathbf{x}} = \mathbf{U} \mathbf{x}, \qquad (\mathbf{U})_{f,:} = \mathbf{u}_f^H. $$
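A quick numerical check of this matrix formulation (a sketch, with the normalized convention):

```python
import numpy as np

T = F = 8
t = np.arange(T)
f = np.arange(F)[:, None]
U = np.exp(-2j * np.pi * f * t / F) / np.sqrt(F)       # row f of U is u_f^H
x = np.random.randn(T)
assert np.allclose(U @ x, np.fft.fft(x) / np.sqrt(F))  # matches the (normalized) DFT
assert np.allclose(U.conj().T @ U, np.eye(F))          # U is unitary
```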

The figure below shows the real (left) and imaginary (right) parts of the DFT matrix:

Image credit: Meinard Müller, “Fundamentals of Music Processing”, Springer 2015

Question: Can you explain why we observe white horizontal lines on the right figure?

Answer: These correspond to the zero and Nyquist frequencies, see discussion above.

Now let's focus on the inverse DFT. If we go back to its definition (in the normalized case):

$$ x(t) = \frac{1}{\sqrt{F}} \sum\limits_{f=0}^{F-1} X(f) \exp\left(+j 2 \pi \frac{f t}{F}\right), \qquad t \in \mathbb{Z}, $$

we see that the time-domain signal is represented as a linear combination of complex sinusoids. How come this operation can lead to a real-valued time-domain signal? Well, this is precisely due to the Hermitian symmetry property of the DFT of real-valued signals.

Finally, it can be shown that the unit-norm vectors $\{\mathbf{u}_f \in \mathbb{C}^T\}_{f=0}^{F-1}$ are orthogonal and span $\mathbb{C}^T$, so they form an orthonormal basis of $\mathbb{C}^T$ and the DFT is just a change of basis. The DFT matrix is therefore unitary and we have

$$ \mathbf{U}^H \check{\mathbf{x}} = \mathbf{U}^H \mathbf{U} \mathbf{x} = \mathbf{x}.$$

In summary, the DFT gives us the decomposition of the time-domain signal onto the set of complex sinusoids $\{\mathbf{u}_f, f \in \{0,...,F-1\}\}$, and the inverse DFT reconstructs the time-domain signal as a linear combination of the same complex sinusoids, weighted by the DFT coefficients: $$ \mathbf{x} = \sum\limits_{f=0}^{F-1} X(f) \mathbf{u}_f = \sum\limits_{f=0}^{F-1} \langle \mathbf{x}, \mathbf{u}_f \rangle \mathbf{u}_f. $$

There exist alternatives to Fourier's complex sinusoids for spectral analysis. The above geometric interpretation remains valid: these alternatives simply consist in replacing the complex sinusoids by other functions that also form an orthonormal basis of $\mathbb{C}^T$ (or $\mathbb{R}^T$). For instance, using cosine functions defined by $$ \mathbf{u}_f(t) = \sqrt{\frac{2}{T}} \cos\left[ \frac{\pi}{T} \left(f + \frac{1}{2}\right) \left(t + \frac{1}{2}\right) \right] $$ leads to the discrete cosine transform (DCT) of type IV. The DCT (or variants of the DCT) is very popular in data compression; for instance, it is used in the JPEG and MP3 compression standards. We will come back to this at the end of the notebook.
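As a quick check that this is also an orthonormal change of basis, scipy's type-4 DCT with `norm="ortho"` implements the cosine atoms above (a sketch):

```python
import numpy as np
from scipy.fft import dct, idct

x = np.random.randn(64)
c = dct(x, type=4, norm="ortho")                      # analysis: inner products with cosines
assert np.allclose(idct(c, type=4, norm="ortho"), x)  # synthesis: perfect reconstruction
assert np.isclose(np.sum(x ** 2), np.sum(c ** 2))     # orthonormality preserves energy
```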

In the next section, we will discuss some limitations of the DFT and how they can be addressed using time-frequency representations.

Time-frequency representation

Consider the problem of automatic music transcription, i.e., estimating a musical score given an audio recording. Below is an example of a musical score (with more annotations than usual, for the sake of the example). A musical score is used by musicians to indicate a succession of audio events (notes) with indicators of pitch, dynamics, tempo, and timbre. Tempo and rhythm relate to time (measured in seconds), pitch and timbre relate to frequency (measured in Hertz), and dynamics relates to intensity or power (measured in decibels).

Let's consider a piano recording of "Au clair de la lune" (the melody corresponding to the above musical score) and plot the signal representations that we are familiar with, the waveform and the power spectrum.

Question: In order to solve our automatic music transcription task, we would like to compute a representation that highlights the characteristics of the signal along the same dimensions as in the musical score. Explain what is missing in the waveform and power spectrum representations.

Answer:

In the waveform, the frequency information is missing. In the power spectrum, the frequency information is "averaged" over the entire time domain, such that local phenomena (here, individual musical notes) become global in the DFT.

Fourier analysis is insufficient when the frequency characteristics of the signal vary with time. We speak of non-stationarity. The non-stationarity can be slow, for instance in the case of a vibrato in music, where the frequency characteristics of the sound (notably its pitch) slowly vary with time. Or it can be fast, for instance in the case of a percussive or more generally transient sound.

To compute a representation that highlights the characteristics of the signal along time, frequency, and intensity, we need a local DFT. This is called the short-time Fourier transform (STFT) and leads to a representation called the spectrogram.

The short-time Fourier transform (STFT) is obtained by computing the discrete Fourier transform (DFT) on short overlapping smoothed windows of the signal, as illustrated below.

We are going to define the STFT properly.

Windowing

Let $x(t) \in \mathbb{R}$ be a signal defined in the time domain $t \in \mathbb{Z}$. A frame of the signal is defined for all $n \in \mathbb{Z}$ by:

$$ x_n(t) = x(t + nH) w_a(t),$$

where

$x(t + nH)$ is a shifted version of $x(t)$, where the shift is equal to $n$ times the analysis hop size $H$;

$w_a(t)$ is an analysis window of finite time support $\{0,...,L-1\}$.

Question: Assume $x(t) \neq 0$ only for $0 \le t \le T-1 $, what is the time support of $x(t + nH)$? If $n > 0$, is the signal shifted left or right?

Answer: $x(t + nH) \neq 0$ for $0 \le t + nH \le T-1 \Leftrightarrow - nH \le t \le T-1 - nH$. If $n > 0$ the signal is shifted left.

The time support of the frame $x_n(t)$ is the same as that of the analysis window $w_a(t)$. In audio signal processing, we typically choose a short analysis window (between 10 and 100 ms) so that we can assume the stationarity of the signal over its time support; we speak of local stationarity.

Remark: In the above definition of the frame, we shift the signal $x(t)$ by a number of samples $nH$ instead of shifting the analysis window $w_a(t)$, as represented in the above figure. This definition corresponds to the band-pass convention. There exists another definition, in the low-pass convention, where the analysis window is shifted instead.

Direct STFT

The STFT is defined for all time-frequency points $(f,n) \in \{0,...,F-1\} \times \mathbb{Z}$ by the DFT of the frame $x_n(t)$:

$$ X(f,n) = \sum_{t \in \mathbb{Z}} x_n(t) \exp\left(-j2\pi \frac{ft}{F}\right),$$

where $F$ is the order of the DFT, which can be chosen as $F=L$, the length of the analysis window. The STFT inherits the properties of the DFT: it is complex valued, and the STFT of a real-valued signal exhibits Hermitian symmetry. Thus, as for the DFT, we usually keep only the non-redundant part of the spectrum.

In practice, the time support of the STFT is finite (because the time-domain signal is of finite length $T$), such that we can denote by $\mathbf{X} \in \mathbb{C}^{F \times N}$ the STFT matrix, where $F$ is the number of frequency bins and $N$ the number of time frames (which can be computed from $T$, $L$ and $H$). Then, $(\mathbf{X})_{f,n} = X(f,n)$.

We are going to implement the STFT using the sine window, which is actually the square root of the Hann window (there exist many different windows with different properties):

$$ w_a(t) = \sin\left( \frac{\pi}{L} \left(t + \frac{1}{2}\right) \right), \qquad t = 0,..., L-1. $$

Read the code in the following cells to understand how the STFT is computed.
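Those cells are not reproduced here, but a condensed sketch of the same computation (band-pass convention, $F = L$, keeping only full frames) could be:

```python
import numpy as np

def stft(x, L=1024, H=512):
    w = np.sin(np.pi / L * (np.arange(L) + 0.5))   # sine analysis window
    N = 1 + (len(x) - L) // H                      # number of full frames
    frames = np.stack([w * x[n * H:n * H + L] for n in range(N)], axis=1)
    return np.fft.rfft(frames, axis=0)             # keep the non-redundant part

# X = stft(x); power_spectrogram = np.abs(X) ** 2
```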

Below, we display the resulting power spectrogram.

Question: Explain what you observe in this spectrogram representation by comparing it with the corresponding musical score. Propose a simple method to estimate the musical score from this representation.

Answer: We observe the succession of musical notes. For instance, we see that the audio recording starts with three repetitions of the same note. Indeed, a note is characterized by its fundamental frequency (also called the pitch), which corresponds to the frequency of the first peak in its spectrum. We could therefore estimate the musical score by peak picking from the spectrogram representation.

Inverse STFT and overlap-add

The inverse STFT is computed by taking the inverse DFT of the spectra at all time indices and by a procedure called overlap-add.

For each time frame $n \in \mathbb{Z}$ of the STFT, we first compute the inverse DFT:

$$ \hat{x}_n(t) = \frac{1}{F}\sum_{f=0}^{F-1} X(f,n) \exp\left(+ j2\pi \frac{ft}{F}\right).$$

The inverse STFT is then computed by overlap-add, for all $t \in \mathbb{Z}$:

$$ \hat{x}(t) = \sum_{n \in \mathbb{Z} } w_s(t-nH) \hat{x}_n(t-nH),$$

where $w_s(t)$ is a smooth synthesis window with the same support as the analysis window.

Read the following cells that implement the inverse STFT.
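Again as a sketch, an inverse STFT matching the `stft` function above (same sine window for synthesis):

```python
import numpy as np

def istft(X, L=1024, H=512):
    w = np.sin(np.pi / L * (np.arange(L) + 0.5))   # sine synthesis window
    N = X.shape[1]                                 # number of time frames
    x_hat = np.zeros((N - 1) * H + L)
    for n in range(N):                             # overlap-add
        x_hat[n * H:n * H + L] += w * np.fft.irfft(X[:, n], n=L)
    return x_hat
```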

Question: Plot the difference between the original audio signal and its reconstruction after direct and inverse STFT. What do you conclude?

Answer: The amplitude of the error is about $10^{-16}$, i.e., machine precision, indicating that the reconstruction is perfect up to numerical errors.

Perfect reconstruction

We can show that perfect reconstruction is achieved, i.e., $\hat{x}(t) = x(t)$ for all $t \in \mathbb{Z}$, if the analysis and synthesis windows satisfy:

$$ \sum_{n \in \mathbb{Z} } w_a(t-nH)w_s(t-nH) = 1.$$

Hint for the proof (only for the brave): $\displaystyle \frac{1}{F} \sum_{f=0}^{F-1} \exp\left( -j 2 \pi \frac{ft}{F} \right) = 1$ if $t=0$, and $0$ if $t \in \mathbb{Z}^*$. It corresponds to the sum of the $F$-th roots of unity, and this result can be shown by recognizing a sum of the successive terms of a geometric sequence.

This condition is satisfied for the sine window we have used for both analysis and synthesis ($w_a(t) = w_s(t)$) and with a 50% overlap ($H = L/2$). We are going to verify this by defining an arbitrary number of time frames $N = 10$ for the overlap-add operation, such that $n=0,...,N-1$, and by computing the overlap-add operation in the last equation.
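Here is a possible sketch of this verification:

```python
import numpy as np

L, H, N = 1024, 512, 10                        # 50% overlap: H = L/2
w = np.sin(np.pi / L * (np.arange(L) + 0.5))   # sine window (analysis = synthesis)
ola = np.zeros((N - 1) * H + L)
for n in range(N):
    ola[n * H:n * H + L] += w * w              # w_a(t - nH) * w_s(t - nH)
print(np.allclose(ola[H:-H], 1.0))             # True: condition holds away from the edges
```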

We indeed observe that, except at the edges, the perfect reconstruction condition is verified. In practice, when computing the STFT, we will do some preprocessing (zero-padding) to deal with the edges. When computing the inverse STFT, we will apply the corresponding postprocessing (removing the first and last samples) to ensure perfect reconstruction. Note that we could also simply divide the output of the inverse STFT by the overlap-add $\sum_n w_a(t-nH) w_s(t-nH)$.

Effect of the analysis window length and time-frequency resolution

In the following, we will use librosa, a Python package for music and audio analysis, to compute spectrograms.

Run the following cells to analyze and listen to the recording of a glockenspiel. In particular, we will plot the spectrogram with different lengths for the analysis window.
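A sketch of those cells (the file name is an assumption):

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

x, sr = librosa.load("audio/glockenspiel.wav", sr=None)  # hypothetical file name
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, L in zip(axes, [256, 1024, 4096]):               # analysis window lengths
    S = np.abs(librosa.stft(x, n_fft=L, hop_length=L // 2)) ** 2
    librosa.display.specshow(librosa.power_to_db(S, ref=np.max), sr=sr,
                             hop_length=L // 2, x_axis="time", y_axis="hz", ax=ax)
    ax.set_title(f"L = {L}")
plt.show()
```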

When a glockenspiel bar is struck with a mallet, it vibrates at a fundamental frequency while emitting a series of overtones, which are generally not harmonic (i.e., not integer multiples of the fundamental frequency).

Question: Describe the structure of the spectrogram and comment on the effect of the analysis window length.

Answer:

The spectrogram essentially consists of horizontal and vertical lines. Vertical lines correspond to the strikes of the instrument bars with a mallet, which produce transient sounds. Horizontal lines correspond to the vibration of the bars, which produces steady sounds.

As the analysis window length increases, the vertical lines become thinner and the horizontal lines become thicker. This effect characterizes the trade-off between time and frequency resolution. With a short analysis window, we have a good time resolution, allowing us to precisely localize events in time (the bar being struck), but we cannot at the same time precisely localize the events in frequency (the pitch of the note). With a longer analysis window, it is the contrary: we can precisely localize the events in frequency, but not in time.

To further understand this notion of resolution, read the corresponding section in the appendix. Note also that while the time-frequency resolution is fixed with the STFT, the wavelet transform automatically adapts the time-frequency resolution. With the wavelet transform, the time resolution increases with the frequency, which corresponds to the intuition that high-frequency (i.e., rapidly-varying) signal components need a better time resolution than low-frequency (i.e., slowly-varying) ones.

Another important observation that can be made by looking at the spectrogram concerns the distribution of the energy. We see that the energy in this signal is strongly localized in time or frequency, and most of the time-frequency points are actually black, which corresponds to very small values of the spectrogram. Audio signals tend to be sparse in the time-frequency domain; we will come back to this notion of sparsity later in the course.

Interpretation as the decomposition over a set of time-frequency atoms

Remember, we've seen that the computation of the DFT of a signal can be interpreted as the signal's decomposition over a set of complex sinusoids. And the time-domain signal can be reconstructed by linearly combining the complex sinusoids.

Similarly, we can see the time-frequency analysis of a signal $\mathbf{x} \in \mathbb{R}^T$ as its decomposition over a dictionary (or a set) of time-frequency atoms $\{\boldsymbol{\psi}_{fn} \in \mathbb{C}^T\}_{(f,n) \in \{0,...,F-1\}\times \mathbb{Z}}$:

$$ X(f,n) = \langle \mathbf{x},\boldsymbol{\psi}_{fn} \rangle = \boldsymbol{\psi}_{fn}^H \mathbf{x} = \sum_{t \in \mathbb{Z}} x(t) \psi_{fn}^*(t).$$

In the case of the STFT (and using the normalized definition of the DFT), the time-frequency atom $\psi_{fn}(t)$ is a temporally localized frame of the complex sinusoid at frequency $f$:

$$ \psi_{fn}(t) = \frac{1}{\sqrt{L}} w(t-nH) \exp\left(+j2\pi\frac{f(t-nH)}{L}\right).$$

Inverse STFT then consists in reconstructing the time-domain signal by linear combination of the time-frequency atoms:

$$ \hat{\mathbf{x}} = \sum_{f=0}^{F-1} \sum_{n \in \mathbb{Z}} X(f,n) \boldsymbol{\psi}_{fn} = \sum_{f=0}^{F-1} \sum_{n \in \mathbb{Z}} \langle \mathbf{x},\boldsymbol{\psi}_{fn} \rangle \boldsymbol{\psi}_{fn}.$$

Using different time-frequency atoms leads to different transforms. For instance, the modified discrete cosine transform (MDCT) is a local version of the DCT-IV that uses frames of cosine functions:

$$ \psi_{fn}(t) = \frac{4}{\sqrt{L}} w(t-nH) \cos\left(\frac{2\pi}{L} \left(f+\frac{1}{2}\right) \left(t-nH+\frac{1}{2}+ \frac{L}{4}\right) \right).$$

MDCT is the most widely used lossy compression technique in audio data compression. It is employed in most modern audio coding standards, including MP3.

Beyond audio spectrograms

We have been focusing on spectrograms of audio signals, but time-frequency representations such as the spectrogram are routinely used in countless applications, such as:

Spectrograms are frequently used as the input of machine listening systems for the automatic retrieval of information from audio signals (e.g., voice activity detection, speech recognition, music transcription, speech emotion recognition, sound event classification), with applications in smartphones, smart speakers, robotics, home automation, smart cities, etc. A spectrogram representation is indeed much more discriminative of the constitutive elements of the sound than the raw waveform. Look for instance at the waveforms and spectrograms of the sounds of different musical instruments below:

In the deep learning course, you will for instance develop a system for in-home monitoring of the autonomy of elderly people, using spectrograms and deep neural networks.

Image, structure in the DCT domain, and compression

In a future course, we will see how the specific structure of audio signals in the time-frequency domain can be used to perform audio source separation, that is to separate different sounds from the observation of their mixture. Today, let's focus on the structure of natural images.

From 1D to 2D signal transformations

So far, we have been discussing representations of 1D signals, for instance based on the DFT or the DCT of stationary signals. We have seen that these transforms basically consist of representing a signal $\mathbf{x} \in \mathbb{R}^T$ as a linear combination of basis signals $\{\mathbf{u}_f \in \mathbb{C}^T\}_{f=0}^{F-1}$ (complex sinusoids for the DFT and cosine functions for the DCT). We have been focusing on 1D signals that represent the evolution of a quantity (e.g., the air pressure deviation) over time. However, an image is a 2D signal that represents the evolution of a quantity (e.g., light intensity) over two spatial axes, so how do we extend the 1D signal transformations to the 2D case?

The simplest approach consists of generating 2D basis signals by taking the outer product of 1D basis signals. From the basis $\{\mathbf{u}_f\}_{f=0}^{F-1}$ of $\mathbb{C}^T$ we can build a basis $\left\{\mathbf{u}_{f_1, f_2}\right\}_{f_1=0, f_2=0}^{F-1, F-1}$ of $\mathbb{C}^{T \times T}$ defined by: $$ \mathbf{u}_{f_1, f_2} = \mathbf{u}_{f_1} \mathbf{u}_{f_2}^H \in \mathbb{C}^{T \times T}, $$ or equivalently $$ \mathbf{u}_{f_1, f_2}(t_1, t_2) = \mathbf{u}_{f_1}(t_1) \mathbf{u}_{f_2}(t_2)^*, \qquad t_1, t_2 =0,...,T-1. $$

In the next cell, we compute the 2D DCT (type II) of a grayscale image and plot the image, its log-magnitude spectrum (logarithm of the absolute value of the DCT coefficients), and the image reconstruction after inverse DCT.
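A sketch of that cell, using scipy's multidimensional DCT (the image path is an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import dctn, idctn

im = plt.imread("images/example.png").astype(float)   # hypothetical grayscale image
im_dct = dctn(im, type=2, norm="ortho")               # 2D DCT-II
im_rec = idctn(im_dct, type=2, norm="ortho")          # inverse 2D DCT

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].imshow(im, cmap="gray"); axes[0].set_title("image")
axes[1].imshow(np.log10(np.abs(im_dct) + 1e-12)); axes[1].set_title("log-magnitude DCT")
axes[2].imshow(im_rec, cmap="gray"); axes[2].set_title("reconstruction")
plt.show()
```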

Question: How many pixels do we have in this image?

Answer: 256 x 256 = 65 536 pixels

Structure of the image in the DCT domain

We are now going to see that the image has a specific structure in the DCT domain that it does not have in the original spatial domain, and which can be used for image compression, denoising, or more generally reconstruction of corrupted images (e.g., deblurring, inpainting, super-resolution, demosaicing, etc.). Today, we will focus on image compression.

Question: Complete the next cell to compare the histograms of the absolute value of the image and DCT coefficients. What do you observe?

Answer: The DCT representation is sparse: most DCT coefficients are (close to) zero.
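A sketch of the comparison (reusing `im` and `im_dct` from above):

```python
import numpy as np
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
ax1.hist(np.abs(im).ravel(), bins=100)       # pixel values: spread out
ax1.set_title("image")
ax2.hist(np.abs(im_dct).ravel(), bins=100)   # DCT coefficients: peaked at zero
ax2.set_title("DCT coefficients")
plt.show()
```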

In the next cell, we plot the sorted (decreasing order) absolute values of the image and DCT coefficients (the y-axis is logarithmic).

JPEG image compression

Exercise

The JPEG compression standard leverages the sparsity of the image in the DCT domain. It works by thresholding the DCT coefficients to keep only the largest ones; coefficients smaller than a given threshold are set to zero, and a zero coefficient only takes 1 bit to encode. Complete the next cell to compare the reconstructions of the image after keeping only 1, 5, 10 and 20% of the DCT coefficients with the largest absolute values (the other coefficients are set to zero). You will use im_DCT_flat_sort to determine the value of the threshold.
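A possible sketch of the solution, assuming im_DCT_flat_sort holds the absolute values of the DCT coefficients sorted in decreasing order (as in the notebook):

```python
import numpy as np
from scipy.fft import idctn

for ratio in [0.01, 0.05, 0.10, 0.20]:
    k = int(ratio * im_dct.size)                      # number of coefficients to keep
    thresh = im_DCT_flat_sort[k - 1]                  # k-th largest absolute value
    im_dct_thr = np.where(np.abs(im_dct) >= thresh, im_dct, 0.0)
    im_rec = idctn(im_dct_thr, type=2, norm="ortho")  # reconstruct, then display/compare
```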

The previous image is actually a non-stationary 2D signal. Just as we defined the STFT as a local DFT to analyze non-stationary 1D signals, we can define a local DCT to analyze non-stationary 2D images. This is what the JPEG compression standard actually does: it computes the DCT on image blocks of 8x8 pixels.

Exercise

Apply the same compression technique as in the previous cell but using a block-processing approach, i.e., computing the DCT on 8x8 blocks. Compare the results with the previous global approach, quantitatively and qualitatively. You will use the peak signal-to-noise ratio (PSNR) to quantify the reconstruction quality of the image; see the sketch below.
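A sketch of the block-processing approach and of the PSNR computation (the helper names are illustrative, not from the notebook):

```python
import numpy as np
from scipy.fft import dctn, idctn

def blockwise(im, B=8, inverse=False):
    """Apply the 2D DCT-II (or its inverse) independently on B x B blocks."""
    f = idctn if inverse else dctn
    out = np.zeros_like(im)
    for i in range(0, im.shape[0], B):
        for j in range(0, im.shape[1], B):
            out[i:i + B, j:j + B] = f(im[i:i + B, j:j + B], type=2, norm="ortho")
    return out

def psnr(ref, rec, peak=1.0):
    """Peak signal-to-noise ratio in dB (peak: maximal pixel value, assumed 1 here)."""
    return 10 * np.log10(peak ** 2 / np.mean((ref - rec) ** 2))
```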

Take-home messages

To conclude, let us stress that without signal processing, the phase vocoder would not exist, Auto-Tune would not exist, and Jul would not have become in 2020 the best-selling artist in the history of French rap.

Appendices

Discrete-time Fourier transform

Definition

Let $x(t) \in \mathbb{R}$ be a signal defined in the time domain $t \in \mathbb{Z}$ such that $\sum\limits_{t \in \mathbb{Z}} |x(t)| < \infty$.

The discrete-time Fourier transform is defined by

$$ \hat{x}(\nu) = \sum\limits_{t \in \mathbb{Z}} x(t) e^{-j 2 \pi \nu t}, $$

where $\nu$ is called the normalized frequency.

By definition, the DTFT is periodic with period 1:

$$\hat{x}(\nu + 1) = \sum\limits_{t \in \mathbb{Z}} x(t) e^{-j 2 \pi (\nu +1) t} = \sum\limits_{t \in \mathbb{Z}} x(t) e^{-j 2 \pi \nu t} e^{-j 2 \pi t} = \sum\limits_{t \in \mathbb{Z}} x(t) e^{-j 2 \pi \nu t} = \hat{x}(\nu). $$

We thus limit its representation to the interval $\nu \in [-0.5, 0.5[$ or $\nu \in [0, 1[$.

The inverse discrete-time Fourier transform is defined by:

$$ x(t) = \int_{-1/2}^{1/2} \hat{x}(\nu) e^{+j 2 \pi \nu t} d\nu.$$

Properties of the DTFT (time domain $\leftrightarrow$ frequency domain)

Real-world signals

In practice, we acquire signals during a limited period of time and using sensors with limited dynamics. The consequence is that the signals are of finite length, amplitude, and energy. We therefore always have:

$$ \sum\limits_{t \in \mathbb{Z}} |x(t)| = \sum\limits_{t=0}^{T-1} |x(t)| < \infty. $$

We could therefore also represent this signal as a vector $\mathbf{x} \in \mathbb{R}^T$, where the entry of index $t \in \{0,...,T-1 \}$ contains the sample $x(t)$.

Sampling the DTFT gives the DFT

In practice, we also work with a finite subset of frequencies. We fix a number of frequencies $F \in \mathbb{N}$ and define

$$ \nu_f = \frac{f}{F} \in [0, 1[, \qquad f \in \{0,...,F-1\}.$$

By sampling the DTFT of a signal of finite length $T$ we obtain the discrete Fourier transform (DFT):

$$ \hat{x}(\nu_f) = \sum\limits_{t \in \mathbb{Z}} x(t) e^{-j 2 \pi \frac{f}{F} t} = \sum\limits_{t=0}^{T-1} x(t) e^{-j 2 \pi \frac{f}{F} t} = X(f). $$

A consequence of this sampling is the periodization of the time-domain signal, which is also why we need to assume a finite-support signal when defining the DFT.

DTFT and DFT of simple signals, precision and resolution

Constant signal

We consider the following finite-length constant signal defined for $t \in \mathbb{Z}$ by

$$ x(t) = \begin{cases} 1 &\text{if } t \in \{0,...,T-1\} \\ 0 &\text{otherwise}. \end{cases} $$

Its DTFT is given by:

$$ \hat{x}(\nu) = \frac{\sin(\pi \nu T)}{\sin(\pi \nu)} e^{-j\pi\nu(T-1)}. $$

In the next cell, we plot the magnitude spectrum of this signal computed with the DTFT and the DFT, for different lengths $T$. We observe a spectrum with the shape of a cardinal sine, with a principal lobe of width $2/T$.
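A sketch of that cell, using zero-padding to sample the DTFT densely:

```python
import numpy as np
import matplotlib.pyplot as plt

T = 32
x = np.ones(T)
X_dft = np.abs(np.fft.fft(x))            # DFT of order F = T
X_dtft = np.abs(np.fft.fft(x, n=8192))   # dense sampling of the DTFT via zero-padding
plt.plot(np.arange(8192) / 8192, X_dtft, label="DTFT")
plt.stem(np.arange(T) / T, X_dft, linefmt="C1-", markerfmt="C1o", label="DFT")
plt.xlabel(r"normalized frequency $\nu$")
plt.legend()
plt.show()
```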

We can conclude that the shorter the time-domain signal, the larger the principal lobe, and the lower the resolution.

Sinusoid

We now consider the following finite-length sinusoid defined for $t \in \mathbb{Z}$ by

$$ x(t) = \begin{cases} \sin (2 \pi \nu_0 t) &\text{if } t \in \{0,...,T-1\} \\ 0 &\text{otherwise}, \end{cases} $$

where $\nu_0 = 0.1$ and $T \in \{16,32,64\}$.

In the previous code cell, uncomment the appropriate lines (CTRL + '/') to plot the magnitude spectrum of this sinusoid, for the different lengths. You should observe two principal lobes centered around $\nu_0$ and $- \nu_0$. The same principle applies regarding the resolution: the more periods we observe, the "more certain" we are about the pure tone frequency.

Note that the blue curve is actually an interpolation of the DFT of order $F = 128$. Increasing the order of the DFT simply increases the precision, i.e., the number of points used to sample the DTFT; it does not change the resolution.

Sum of sinusoids

We now consider the following sum of two sinusoids with close normalized frequencies:

$$ \forall t \in \mathbb{Z}, \qquad x(t) = \begin{cases} \sin (2 \pi \nu_0 t) + \sin (2 \pi \nu_1 t) &\text{if } t \in \{0,...,T-1\} \\ 0 &\text{otherwise} \end{cases} $$

where

In the previous code cell, uncomment the appropriate lines (CTRL + '/') to plot the magnitude spectrum of this signal, for the different lengths. You should observe that for $T=16$, the resolution is not sufficient to separate the two frequencies.

References

Parts of this notebook were inspired by or adapted from the following resources: