Analysis/Synthesis
The Voder demonstrated the validity of Homer Dudley's
model of speech generation by showing that an electromechanical device with
relatively few controllable parameters could produce intelligible
speech. His vocoder replaced the Voder's human operator
with a second electromechanical device that analyzed incoming speech,
making the system fully automatic.
The combined analysis/synthesis device is the precursor
of most speech coders today, including those used in digital
cellular telephony. While modern speech coders use linear prediction
in the analysis phase, Dudley's vocoder used a simpler spectral
estimator, a bank of bandpass filters.
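As a rough illustration of a filter-bank spectral estimator, the sketch below measures how much energy a short frame of signal has in each of a few frequency bands. It is not Dudley's circuit: the analog bandpass filters are approximated by summing discrete Fourier transform power within each band, and the function name, band edges, frame length, and 8 kHz sample rate are all assumptions chosen for the example.

```python
import cmath
import math


def band_energies(frame, sample_rate, band_edges_hz):
    """Sum DFT power into bands [lo, hi) defined by consecutive band edges."""
    n = len(frame)
    energies = [0.0] * (len(band_edges_hz) - 1)
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        power = abs(coeff) ** 2
        for b in range(len(energies)):
            if band_edges_hz[b] <= freq < band_edges_hz[b + 1]:
                energies[b] += power
                break
    return energies


sample_rate = 8000                      # assumed sample rate in Hz
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(160)]
edges = [0, 500, 1500, 4000]            # hypothetical band edges in Hz
energies = band_energies(frame, sample_rate, edges)
```

For the 1000 Hz test tone, essentially all of the energy lands in the middle band (500 to 1500 Hz). A vocoder analyzer would repeat this kind of measurement frame after frame to track how the spectral envelope changes over time.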
The rough structure of the synthesis side of the vocoder
is shown below:
The analysis side of the vocoder provides the parameters for
all of the blocks. The "noise" block generates what is called
in the demonstrations below "hiss-type energy," while the
"periodic pulses" block generates what is called "buzz-type
energy." The period of the periodic pulses controls the "pitch."
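The two excitation sources can be sketched in a few lines. This is only an illustration of the idea, assuming an 8 kHz sample rate, unit-height pulses, and uniformly distributed noise; the function name and its parameters are hypothetical and are not taken from the vocoder itself.

```python
import random


def excitation(num_samples, voiced, pitch_period, seed=0):
    """Return buzz-type (pulse train) or hiss-type (noise) excitation."""
    if voiced:
        # One unit pulse every pitch_period samples.
        return [1.0 if n % pitch_period == 0 else 0.0
                for n in range(num_samples)]
    rng = random.Random(seed)           # seeded so the example is repeatable
    return [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]


# A period of 80 samples at an assumed 8 kHz rate gives a 100 Hz pitch.
buzz = excitation(400, voiced=True, pitch_period=80)
hiss = excitation(400, voiced=False, pitch_period=80)
```

Shortening `pitch_period` raises the pitch of the buzz, while the hiss has no period at all.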
The vocoder is demonstrated with audio samples below.
The audio is from an original Bell Labs recording of 1939.
The speaker is alleged to be C. Voderson, although this
seems unbelievable.
Introduction
The introduction to the vocoder itself has been processed by the
vocoder, demonstrating reasonably good audio quality (by telephone
standards, which emphasize intelligibility and speaker recognition
over audio fidelity).
Comparison
Here, the vocoder output is compared
to uncoded output (over a "public address system").
Unvoiced speech
Whispered speech is generated by setting the vocoder as if all speech
were unvoiced (input to the synthesis filter is only "hiss-type
energy"). Below, a plot of voiced speech (top) is compared to
a plot of unvoiced speech (bottom) in the time domain.
Notice that the voiced speech is much more periodic, while
the unvoiced speech is much more random.
The horizontal axis is "number of samples," and the sample
rate is 8 kHz.
In the frequency domain, the spectrum of a segment of
unvoiced speech will be smooth, as shown below:
(This is not an actual plot of a spectrum, but rather
a suggestive sketch.)
By contrast, voiced speech, since it is roughly periodic,
will have a more discrete spectrum, as shown below:
For most languages, speech is fully intelligible in unvoiced
(whispered) form. The voiced spectrum above has the same
envelope (shown in red) as the unvoiced spectrum, so we would conclude that
the linguistic information is the same in both signals.
Indeed, it is mainly tonal languages, such as Mandarin Chinese, that carry
linguistic information in the periodicity of speech: in a tonal language,
the pitch contour of a syllable helps distinguish one word from another.
The demonstrations below reinforce this point by altering the
pitch information in interesting ways.
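The claim that a periodic signal has a discrete spectrum can be checked numerically. The sketch below, which assumes an idealized pulse train at an 8 kHz sample rate rather than real voiced speech, computes a discrete Fourier transform and lists the frequencies at which any energy appears. The helper name and signal length are choices made for the example.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the DFT for bins 0 .. N/2."""
    n = len(signal)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal)))
            for k in range(n // 2 + 1)]


sample_rate = 8000                      # assumed sample rate in Hz
period = 80                             # 80 samples -> 100 Hz pitch
pulses = [1.0 if t % period == 0 else 0.0 for t in range(400)]

mags = dft_magnitudes(pulses)
harmonic_bins = [k for k, m in enumerate(mags) if m > 1e-6]
harmonic_freqs_hz = [k * sample_rate / len(pulses) for k in harmonic_bins]
```

The only nonzero bins fall at multiples of 100 Hz, the pitch frequency. Real voiced speech is only roughly periodic, so its spectral lines are broader, and the vocal tract shapes their heights into the envelope.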
Voiced speech
Mechanical-sounding speech is generated by setting the vocoder as if all
speech were voiced (input to the synthesis filter is only "buzz-type
energy").
Monotone speech
Here, both voiced and unvoiced sounds are produced, but the voiced
sounds are held at a constant pitch, yielding a monotone effect.
Pitch modifications
Here, pitch is modified under the control of a hand dial.
One octave lower
An octave is a factor of two in frequency.
In this demonstration, the vocoder halves the pitch of the speaker.
One octave higher
In this demonstration, the vocoder doubles the pitch of the speaker.
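In terms of the pitch parameter, an octave shift is a multiplication by a power of two. The sketch below assumes the pitch is available as a list of per-frame frequencies in hertz, with 0.0 marking unvoiced frames. That representation and the function name are assumptions for illustration.

```python
def shift_octaves(pitch_track_hz, octaves):
    """Scale every pitch value by 2**octaves (unvoiced frames stay 0.0)."""
    factor = 2.0 ** octaves
    return [f * factor for f in pitch_track_hz]


pitch_track_hz = [100.0, 120.0, 0.0, 110.0]   # hypothetical pitch track
lower = shift_octaves(pitch_track_hz, -1)     # one octave lower
higher = shift_octaves(pitch_track_hz, +1)    # one octave higher
```

Because only the pitch parameter changes, the spectral envelope supplied by the analysis side is left as it was.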
Inflection
"Inflection" is the variation in pitch over the course of speech.
The vocoder can be set to reduce or increase the inflection without
shifting the pitch up or down.
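One way to change inflection without shifting the overall pitch is to stretch or compress the pitch contour about its average. The sketch below does this on a logarithmic frequency scale around the geometric mean of the voiced frames. This is an assumed formulation for illustration, not a description of how the original vocoder controls worked, and the function name and pitch-track format (0.0 for unvoiced frames) are hypothetical.

```python
import math


def scale_inflection(pitch_track_hz, scale):
    """Scale pitch excursions about the geometric-mean pitch by `scale`."""
    voiced = [f for f in pitch_track_hz if f > 0]
    if not voiced:
        return list(pitch_track_hz)
    log_center = sum(math.log(f) for f in voiced) / len(voiced)
    return [
        math.exp(log_center + scale * (math.log(f) - log_center))
        if f > 0 else 0.0
        for f in pitch_track_hz
    ]


pitch_track_hz = [100.0, 200.0, 0.0, 400.0]   # hypothetical pitch track
reduced = scale_inflection(pitch_track_hz, 0.5)
monotone = scale_inflection(pitch_track_hz, 0.0)
reversed_track = scale_inflection(pitch_track_hz, -1.0)
```

In this formulation a scale between 0 and 1 reduces inflection, a scale above 1 exaggerates it, a scale of 0 yields a monotone, and a scale of -1 mirrors the contour so that rising pitch becomes falling pitch.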
Inflection manipulations on a song
In this demonstration, inflection reduction and enhancement
are demonstrated on a song.
Reversing the inflection
In this demonstration, inflection is reversed.
That is, when the pitch of the original speech would be rising,
here it is falling, and vice versa.
Special effects sounds
Here, the vocoder is used to synthesize non-speech sounds.
Vibrato
Vibrato is a musical term for a rapid fluctuation in pitch.
This illustration uses the vocoder to introduce vibrato into
a singing voice signal.
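Vibrato can be modeled as a slow sinusoidal modulation of the pitch parameter. The sketch below is an assumption-laden illustration: the 6 Hz vibrato rate, the half-semitone depth, the 100 frames-per-second pitch track, and the function name are all hypothetical values chosen for the example, not measurements from the recording.

```python
import math


def add_vibrato(pitch_track_hz, frame_rate_hz, rate_hz=6.0,
                depth_semitones=0.5):
    """Modulate each pitch value sinusoidally by up to depth_semitones."""
    out = []
    for i, f in enumerate(pitch_track_hz):
        t = i / frame_rate_hz                       # frame time in seconds
        offset = depth_semitones * math.sin(2 * math.pi * rate_hz * t)
        out.append(f * 2.0 ** (offset / 12.0))      # semitones -> ratio
    return out


steady = [220.0] * 100                  # one second of a held 220 Hz note
wobbly = add_vibrato(steady, frame_rate_hz=100.0)
```

The modulated track swings above and below 220 Hz several times per second while its average stays close to the original note.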
Jones family
In this demo, several of the above effects are combined so that
a single voice plays several roles in a short skit.
Combining two voices
Here, a voice is shifted in pitch by a frequency interval known
to musicians as a major third. The shifted voice signal is combined
with the original to achieve a harmonious effect.
Combining three voices
Here, a voice is shifted in pitch by two frequency intervals to make
what is known
to musicians as a triad. The shifted voice signals are combined
with the original to achieve a harmonious effect.
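The musical intervals in the two demonstrations above correspond to fixed frequency ratios. The sketch below assumes equal temperament, in which each semitone is a factor of 2^(1/12), and assumes that the triad is a major triad (root, major third, perfect fifth). The recording may have used different tuning or a different triad, and the names below are illustrative only.

```python
def interval_ratio(semitones):
    """Frequency ratio of an interval in equal temperament."""
    return 2.0 ** (semitones / 12.0)


MAJOR_THIRD = 4       # semitones
PERFECT_FIFTH = 7     # semitones


def harmonize(pitch_hz, semitone_offsets):
    """Return the pitch shifted by each interval in semitone_offsets."""
    return [pitch_hz * interval_ratio(s) for s in semitone_offsets]


two_voices = harmonize(200.0, [0, MAJOR_THIRD])
three_voices = harmonize(200.0, [0, MAJOR_THIRD, PERFECT_FIFTH])
```

For a 200 Hz voice, the major third lands near 252 Hz and the perfect fifth near 300 Hz. In just intonation the same intervals would be the ratios 5/4 and 3/2.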
Permuting the frequency channels
Here, the three lowest frequency channels are redirected in synthesis
to the three middle channels at higher frequencies. The result is a nasal
effect, with the low frequencies missing.
Permuting the frequency channels
Here, the three middle frequency channels are redirected in synthesis
to the three lowest channels. The result is a strange effect,
with the middle frequencies missing.
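Both channel permutations amount to rerouting the measured band levels before synthesis. The sketch below assumes ten channels, assumes that channels 3 to 5 count as the "middle" ones, and assumes that channels not involved in the rerouting pass straight through. None of these details is stated above, so treat them, along with the function name, as hypothetical.

```python
def redirect_channels(levels, sources, destinations):
    """Route analysis channels `sources` to synthesis channels `destinations`.

    The synthesis channels at the source positions are silenced, and
    channels not mentioned keep their own levels (an assumption).
    """
    out = list(levels)
    for s in sources:
        out[s] = 0.0
    for s, d in zip(sources, destinations):
        out[d] = levels[s]
    return out


# Hypothetical band levels for one frame, lowest frequency first.
levels = [9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.5]

low_to_middle = redirect_channels(levels, [0, 1, 2], [3, 4, 5])
middle_to_low = redirect_channels(levels, [3, 4, 5], [0, 1, 2])
```

In the first case the low synthesis channels fall silent and the middle ones carry the low-frequency envelope. In the second case the roles are exchanged.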
Complete Audio File
The entire audio for the above demonstrations is available in
Sun Audio format (.au files) (8,3700k).
Professor Edward Lee's Home Page.