EECS20N: Signals and Systems

Analysis/Synthesis

The Voder demonstrated the validity of Homer Dudley's model of speech generation by showing that an electromechanical device with relatively few controllable parameters could produce intelligible speech. Dudley's vocoder replaced the Voder's human operator with a second electromechanical device that analyzed incoming speech, making the system fully automatic. This combined analysis/synthesis device is the precursor of most speech coders today, including those used in digital cellular telephony. While modern speech coders use linear prediction in the analysis phase, Dudley's vocoder used a simpler spectral estimator: a bank of bandpass filters.
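To make the idea of a bank of bandpass filters as a spectral estimator concrete, here is a minimal digital sketch. It is not Dudley's circuit: the two-pole resonator, the channel centers, the bandwidth, and the 8 kHz sample rate are all assumptions chosen for illustration.

```python
import math

SAMPLE_RATE = 8000  # Hz; assumed for this sketch


def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Two-pole resonator, used here as a simple bandpass filter (an assumption)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius
    theta = 2 * math.pi * center_hz / sample_rate        # pole angle
    a1, a2 = 2 * r * math.cos(theta), -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = (1 - r) * x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def band_energies(signal, centers_hz, bandwidth_hz, sample_rate):
    """Mean-square output of each channel: a coarse estimate of the spectrum."""
    energies = []
    for center in centers_hz:
        band = resonator(signal, center, bandwidth_hz, sample_rate)
        energies.append(sum(v * v for v in band) / len(band))
    return energies


# A 1000 Hz test tone should excite the 1000 Hz channel most strongly.
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(800)]
energies = band_energies(tone, [500, 1000, 2000], 100, SAMPLE_RATE)
strongest = energies.index(max(energies))
```

Each channel reports how much signal energy falls near its center frequency, so the list of channel energies is a low-resolution picture of the spectral envelope.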

The rough structure of the synthesis side of the vocoder is shown below:

Vocoder synthesis structure
The analysis side of the vocoder provides the parameters for all of the blocks. The "noise" block generates what is called in the demonstrations below "hiss-type energy," while the "periodic pulses" block generates what is called "buzz-type energy." The period of the periodic pulses controls the "pitch."
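The two excitation sources can be sketched digitally as follows. The function names, the uniform noise distribution, the 8 kHz sample rate, and the 100 Hz pitch are assumptions for illustration, not details of the original hardware.

```python
import random

SAMPLE_RATE = 8000  # Hz; assumed for this sketch


def pulse_train(num_samples, period):
    """Buzz-type energy: one unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]


def noise(num_samples, seed=0):
    """Hiss-type energy: zero-mean uniform noise (the distribution is an assumption)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]


pitch_hz = 100                       # hypothetical pitch
period = SAMPLE_RATE // pitch_hz     # pulse spacing in samples
buzz = pulse_train(400, period)
hiss = noise(400)
```

A shorter period gives a higher pitch; the noise source has no period at all, which is why it carries no pitch.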

The vocoder is demonstrated with audio samples below. The audio is from an original Bell Labs recording of 1939. The speaker is alleged to be C. Voderson, although this seems unbelievable.

Introduction

The introduction to the vocoder itself has been processed by the vocoder, demonstrating reasonably good audio quality (by telephone standards, which emphasize intelligibility and speaker recognition over audio fidelity).

Comparison

Here, the vocoder output is compared to uncoded output (over a "public address system").

Unvoiced speech

Whispered speech is generated by setting the vocoder as if all speech were unvoiced (input to the synthesis filter is only "hiss-type energy"). Below, a plot of voiced speech (top) is compared to a plot of unvoiced speech (bottom) in the time domain.

Voiced/Unvoiced comparison
Notice that the voiced speech is much more periodic, while the unvoiced speech is much more random. The horizontal axis is "number of samples," and the sample rate is 8 kHz. In the frequency domain, the spectrum of a segment of unvoiced speech will be smooth, as shown below:
Unvoiced spectrum
(This is not an actual plot of a spectrum, but rather a suggestive sketch.) By contrast, voiced speech, since it is roughly periodic, will have a more discrete spectrum, as shown below:
Voiced spectrum
For most languages, speech is fully intelligible in unvoiced (whispered) form. The voiced spectrum above has the same envelope (shown in red) as the unvoiced spectrum, so we would conclude that the linguistic information is the same in both signals. Indeed, only tonal languages, such as Mandarin Chinese, carry linguistic information in the periodicity of speech (in a tonal language, the pitch inflections themselves distinguish meaning). The examples below reinforce this point by altering the pitch information in interesting ways.
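The claim that a periodic signal has a discrete spectrum can be checked numerically. The sketch below computes a direct DFT of an idealized pulse train; the 64-sample length and 8-sample period are arbitrary assumptions, and real voiced speech is only roughly periodic, so its spectral lines are less sharp.

```python
import cmath


def dft_magnitudes(x):
    """Magnitudes of DFT bins 0..N/2, computed directly from the definition."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


period = 8  # samples; hypothetical pitch period
x = [1.0 if t % period == 0 else 0.0 for t in range(64)]
mags = dft_magnitudes(x)

# Bins that hold energy: only multiples of 64 / period = 8 are nonzero.
peaks = [k for k, m in enumerate(mags) if m > 1e-6]
```

At an 8 kHz sample rate each bin is 125 Hz wide, so the nonzero bins sit at multiples of 1000 Hz, the pitch of this pulse train.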

Voiced speech

Mechanical-sounding speech is generated by setting the vocoder as if all speech were voiced (input to the synthesis filter is only "buzz-type energy").

Monotone speech

Here, both voiced and unvoiced sounds are produced, but the voiced sounds are held at a constant pitch, yielding a monotone effect.

Pitch modifications

Here, pitch is modified under the control of a hand dial.

One octave lower

An octave is a factor of two in frequency. In this demonstration, the vocoder halves the pitch of the speaker.

One octave higher

In this demonstration, the vocoder doubles the pitch of the speaker.
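In terms of the periodic pulse source, an octave shift is simple arithmetic: halving the pitch doubles the pulse period, and doubling the pitch halves it. A small sketch, assuming an 8 kHz sample rate and a hypothetical 200 Hz speaker:

```python
SAMPLE_RATE = 8000  # Hz; assumed for this sketch


def shift_octaves(pitch_hz, octaves):
    """Each octave is a factor of two in frequency."""
    return pitch_hz * 2 ** octaves


def period_in_samples(pitch_hz, sample_rate=SAMPLE_RATE):
    """Spacing between excitation pulses, in samples."""
    return sample_rate / pitch_hz


original = 200.0                       # hypothetical speaker pitch in Hz
lower = shift_octaves(original, -1)    # one octave down
higher = shift_octaves(original, 1)    # one octave up
periods = [period_in_samples(f) for f in (lower, original, higher)]
```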

Inflection

"Inflection" is the variation in pitch over the course of speech. The vocoder can be set to reduce or increase the inflection without shifting the overall pitch up or down.

Inflection manipulations on a song

In this demonstration, inflection reduction and enhancement are demonstrated on a song.

Reversing the inflection

In this demonstration, inflection is reversed. That is, when the pitch of the original speech would be rising, here it is falling, and vice versa.
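One way to picture the inflection controls is as a scaling of the pitch contour's deviation from its average. The sketch below is an assumption about how such a control could work, not a description of the 1939 hardware: it works on log frequency, treats the geometric mean as the fixed center, and ignores unvoiced frames, which have no pitch.

```python
import math


def scale_inflection(pitch_hz, factor):
    """Scale pitch deviations about the contour's (geometric) mean.

    factor > 1 exaggerates inflection, 0 < factor < 1 reduces it,
    factor == 0 gives a monotone, and factor == -1 reverses it.
    """
    logs = [math.log2(f) for f in pitch_hz]
    center = sum(logs) / len(logs)
    return [2 ** (center + factor * (v - center)) for v in logs]


rising = [100.0, 125.0, 160.0, 200.0]      # hypothetical rising contour in Hz
monotone = scale_inflection(rising, 0.0)   # constant pitch
reversed_ = scale_inflection(rising, -1.0) # now falling
```

Because the center of the contour is held fixed, the average pitch stays put while the rises and falls shrink, grow, or flip.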

Special effects sounds

Here, the vocoder is used to synthesize non-speech sounds.

Vibrato

Vibrato is a musical term for a rapid fluctuation in pitch. This illustration uses the vocoder to introduce vibrato into a singing voice signal.
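Vibrato can be modeled as a slow sinusoidal modulation of the pitch parameter. In this sketch the 6 Hz vibrato rate, the half-semitone depth, and the 100 frames-per-second pitch track are hypothetical values chosen for illustration.

```python
import math


def add_vibrato(pitch_hz, frame_rate, rate_hz=6.0, depth_semitones=0.5):
    """Modulate a pitch track sinusoidally; rate and depth are assumed values."""
    out = []
    for i, f in enumerate(pitch_hz):
        t = i / frame_rate  # time of this frame in seconds
        offset = depth_semitones * math.sin(2 * math.pi * rate_hz * t)
        out.append(f * 2 ** (offset / 12))  # 12 semitones per octave
    return out


steady = [220.0] * 100               # hypothetical sustained note, 1 second
wobbly = add_vibrato(steady, 100.0)  # same note with vibrato
```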

Jones family

In this demo, several of the above effects are combined so that a single voice plays several roles in a short skit.

Combining two voices

Here, a voice is shifted in pitch by a frequency interval known to musicians as a major third. The shifted voice signal is combined with the original to achieve a harmonious effect.

Combining three voices

Here, a voice is shifted in pitch by two frequency intervals to make what is known to musicians as a triad. The shifted voice signals are combined with the original to achieve a harmonious effect.
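Musical intervals correspond to frequency ratios, so harmonizing amounts to multiplying the pitch track by fixed factors. The sketch assumes just-intonation ratios (5/4 for a major third, 3/2 for a perfect fifth) and a major triad; the recording does not say which tuning or which triad was used.

```python
MAJOR_THIRD = 5 / 4    # just-intonation ratio (an assumption about the tuning)
PERFECT_FIFTH = 3 / 2  # assumed second interval, giving a major triad


def harmonize(pitch_hz, ratios):
    """Return one shifted copy of the pitch track per frequency ratio."""
    return [[f * r for f in pitch_hz] for r in ratios]


melody = [200.0, 240.0]  # hypothetical pitch track in Hz
two_voices = harmonize(melody, [1, MAJOR_THIRD])
three_voices = harmonize(melody, [1, MAJOR_THIRD, PERFECT_FIFTH])
```

In equal temperament the major third would instead be 2 ** (4 / 12), about 1.26, which is close to but not the same as 5/4.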

Permuting the frequency channels

Here, the three lowest frequency channels are redirected in synthesis to the three middle channels at higher frequencies. The result is a nasal effect, with the low frequencies missing.

Permuting the frequency channels

Here, the three middle frequency channels are redirected in synthesis to the three lowest channels. The result is a strange effect, with the middle frequencies missing.
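The two channel permutations can be pictured as a routing table between analysis and synthesis channels. The nine-channel layout below is hypothetical, and leaving the vacated synthesis channels silent while the upper channels pass straight through is an assumption consistent with the "missing" frequencies described above.

```python
def route_channels(levels, routing):
    """routing[dst] names the analysis channel feeding synthesis channel dst.

    None means the synthesis channel receives no signal.
    """
    return [0.0 if src is None else levels[src] for src in routing]


# Hypothetical analysis levels for nine channels, lowest frequency first.
levels = [9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]

# Three lowest channels drive the three middle ones; the lows go silent.
low_to_mid = [None, None, None, 0, 1, 2, 6, 7, 8]
# Three middle channels drive the three lowest ones; the middles go silent.
mid_to_low = [3, 4, 5, None, None, None, 6, 7, 8]

nasal = route_channels(levels, low_to_mid)
strange = route_channels(levels, mid_to_low)
```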

Complete Audio File

The entire audio for the above demonstrations is available in Sun Audio format (.au files) (8,3700k).


Professor Edward Lee's Home Page.