Posts Tagged ‘speech recognition’

Recently I attended two series of six talks each, one by Jiri Navratil of the IBM Thomas J. Watson Research Center and one by Frédéric Bimbot of IRISA. Both are among the best researchers I have met: very helpful and extremely humble. In one of his opening talks, Jiri mentioned an adage from the literature of speech/language/speaker recognition that interested me. It applies not only to speech processing research but to recognition problems in general.


[Jiri Navratil speaking : Photo taken by me after his permission]

It went like this:

It is easier to reject imposters than it is to accept true speakers.

People’s voices are distinctive. That is, a person’s speech exhibits distinctive characteristics that indicate the identity of the speaker. We are all familiar with this and we all use it in our everyday lives to help us interact with others. Of course from time to time we might notice that a person sounds very much like another person we know. Or we might even momentarily mistake one person for another because of the sound of the person’s voice. But this similarity between voices of different individuals is not what the technical challenge in speaker recognition is all about.

The challenge in speaker recognition is variance, not similarity. That is, the challenge is to decode a highly variable speech signal into the characteristics that indicate the speaker’s identity. These variations are formidable and myriad. The principal cause of variance is the speaker.

An explanation for why the speaker’s variability is such a vexing problem is that the use of speech, unlike fingerprints or handprints or retinal patterns, is to a very large degree a result of what the person “does” rather than “who the person is”: speech is a “performing art and each performance is unique”.

The passages above are excerpts from “The NIST Speaker Recognition Evaluation – Overview, Methodology, Systems, Results, Perspective” by G. R. Doddington et al., Speech Communication, vol. 31, pp. 225-254, 2000.

Over the course of their talks, Dr Navratil spoke mainly on acoustics and phonotactics in language identification, while Dr Bimbot spoke on Gaussian mixture models and universal background models.


Quick Links:

1. Official Webpage of Jiri Navratil

2. Official Webpage of Frédéric Bimbot


Onionesque Reality Home >>



Motivation: A couple of months back I was thinking about designing a speaker-dependent speech recognizer on the 8051 microcontroller, or any of its derivatives, for simple machine control. We would of course need an isolated word (or digit) recognizer.

Problem: Speech recognizers can be implemented using Hidden Markov Models or Artificial Neural Networks, and there are plenty of such systems in place. The problem with these algorithms is that they are computationally quite intensive and thus cannot be implemented on a simple 8-bit fixed-point microprocessor, which is what we need for simple machine control applications. So there is a need for a simpler algorithm.

All these algorithms also work on short-term feature vectors to cope with the non-stationary nature of speech. Generally the analysis window is chosen short enough that the signal within it is quasi-stationary. Feature vectors are an area of active research. At the university level, however, Mel Frequency Cepstral Coefficients (MFCC) or Linear Prediction Coefficients (LPC) are generally taken as features. These too require computations that are beyond the reach of a simple processor or microcontroller like the 8051.

Solution: I was thinking about what could be done to reduce this burden and choose a simpler feature that could be implemented on the 8051. While researching this I came across a paper [1] that deals with exactly this problem!

The researchers use only the zero crossings of the speech signal to determine the feature vector. Since this novel feature extraction method is based on zero crossings only, it needs just a one-bit A-to-D conversion [2]. The feature extraction is computationally very simple and does not require the speech signal to be pre-processed.

This feature vector is basically the histogram of the time intervals between successive zero crossings of the utterance in a short time window [1]. The feature vectors for all windows are then combined to form a feature matrix. Since we are dealing only with short time series (isolated words), we can employ Dynamic Time Warping to compare the input matrix with the stored reference matrices. I will discuss this in another post sometime.

To obtain this vector the following steps need to be followed.

1. The speech signal x(t) is band-pass filtered to give s(t).

2. s(t) is then subjected to infinite amplitude clipping with the help of a zero-crossing detector (ZCD) to give u(t).

3. u(t) is then sampled at, say, 8 kHz to give u[n]. The feature extraction is carried out on u[n].

4. u[n] is divided into a number of short time windows of W samples each.

5. The histogram for each of these short time windows is found. The histogram (or vector) is built as follows: the number of times exactly one sample is recorded between successive zero crossings is element 1 of the vector, the number of times exactly two samples are recorded between successive zero crossings is element 2, and so on. In this way we construct a histogram that serves as the feature vector.
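Steps 2 to 5 can be sketched in a few lines of Python. This is only a rough illustration of the idea, not the implementation from [1]: the function names, the non-overlapping windows, the choice to treat a zero sample as positive, and the choice to fold long intervals into the last bin are all my own assumptions, and the band-pass filter of step 1 is left out.

```python
from itertools import groupby


def one_bit_quantize(samples):
    """Infinite clipping: keep only the sign of each sample (1 if >= 0, else 0)."""
    return [1 if s >= 0 else 0 for s in samples]


def zc_interval_histogram(bits, num_bins):
    """Histogram of the number of samples between successive zero crossings.

    Element k-1 counts the intervals that are exactly k samples long.
    Intervals longer than num_bins go into the last bin (an assumption).
    """
    # Lengths of the runs of equal sign; every run boundary is a zero crossing.
    runs = [len(list(group)) for _, group in groupby(bits)]
    # The first and last runs are not enclosed by two crossings, so skip them.
    hist = [0] * num_bins
    for length in runs[1:-1]:
        hist[min(length, num_bins) - 1] += 1
    return hist


def feature_matrix(samples, window_len, num_bins):
    """One histogram per non-overlapping window of window_len samples.

    Rows are windows, columns are histogram terms. An incomplete window at
    the end of the signal is dropped (an assumption).
    """
    bits = one_bit_quantize(samples)
    return [
        zc_interval_histogram(bits[start:start + window_len], num_bins)
        for start in range(0, len(bits) - window_len + 1, window_len)
    ]


# Toy signal of ten samples, split into two windows of W = 5 samples.
samples = [1, 2, -1, 3, -2, -2, -5, 4, 1, -3]
print(feature_matrix(samples, window_len=5, num_bins=4))
# prints [[2, 0, 0, 0], [0, 1, 0, 0]]
```

On an 8051 the same counting could be done with a run-length counter and a small table of bins, which is what makes the feature attractive for such a processor.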

These vectors can then be combined for all windows to get the feature matrix. As I said earlier, these matrices can be compared using DTW, DDTW, Fast DTW or some other algorithm.
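To give a rough idea of what such a comparison looks like, here is a minimal sketch of classic DTW between two feature matrices, each a list of histogram rows. The city-block distance between rows, the unconstrained warping path and the names `dtw_distance` and `recognize` are assumptions of mine for illustration; the paper [1] may well use a different local distance or path constraints.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two feature matrices (lists of equal-width rows)."""

    def frame_dist(u, v):
        # City-block distance between two histograms (an assumption).
        return sum(abs(x - y) for x, y in zip(u, v))

    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def recognize(feature, templates):
    """Return the label of the stored template closest to `feature` under DTW."""
    return min(templates, key=lambda label: dtw_distance(feature, templates[label]))


# Toy templates: the input is a time-stretched copy of the first one.
templates = {"teen": [[1, 0], [0, 1]], "other": [[0, 1], [1, 0]]}
spoken = [[1, 0], [1, 0], [0, 1]]
print(recognize(spoken, templates))  # prints teen
```

The full cost table needs memory proportional to the product of the two matrix lengths, so on a small microcontroller one would probably keep only two rows of it at a time.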

As an example, take an utterance of the number three in Hindi, which is spoken as “teen”. The first plot below gives the waveform of the utterance. The second plot gives the end-point-detected version of the same. End-point detection reduces computation (and hence the memory required) by removing the “useless” portions of the utterance that do not carry any information.

The surface plot of the feature matrix prepared for my utterance above (where, as I have mentioned implicitly, the rows represent the windows and the columns represent the histogram terms) is:


[1] Ashutosh Saxena and Abhishek Singh, A Microprocessor based Speech Recognizer for Isolated Hindi Digits, IEEE ACE.

[2] Lipovac, Zero-Crossing-Based Linear Prediction for Speech Recognition, Electronics Letters, vol. 25, issue 2, pp. 90-92, 19 Jan 1989.

