Motivation: A couple of months back I was thinking about designing a speaker-dependent speech recognizer on the 8051 microcontroller, or one of its derivatives, for simple machine control. For that we would of course need an isolated-word (or isolated-digit) recognizer.
Problem: Speech recognizers can be implemented using Hidden Markov Models or Artificial Neural Networks, and plenty of such systems are in place. The problem with these algorithms is that they are computationally intensive, so they cannot be implemented on a simple 8-bit fixed-point microprocessor, which is exactly what we need for simple machine control applications. So there is a need for a simpler algorithm.
All these algorithms also employ short-term feature vectors to take care of the non-stationary nature of speech. The window length is generally chosen so that the signal within the window is quasi-stationary. Feature vectors are an area of active research. At the university level, however, Mel Frequency Cepstral Coefficients (MFCC) or Linear Prediction Coefficients (LPC) are generally taken as features. These too require computations that are beyond the scope of a simple processor/microcontroller like the 8051.
Solution: I was thinking about what could be done to reduce this burden and choose a simpler feature that could be implemented on the 8051. While researching this I came across a paper that deals with exactly this problem!
The researchers use only the zero crossings of the speech signal to determine the feature vector. Since this novel feature extraction method is based on zero crossings only, it needs just a one-bit A/D conversion. The feature extraction is computationally very simple and does not require the speech signal to be pre-processed.
The feature vector is basically the histogram of the time intervals between successive zero crossings of the utterance within a short time window. The feature vectors for all windows are then combined to form a feature matrix. Since we are dealing only with short time series (isolated words), we can employ Dynamic Time Warping (DTW) to compare the input matrix with the stored reference matrices. I will discuss this in another post sometime.
To obtain this vector the following steps need to be followed.
1. The speech signal x(t) is band-pass filtered to give s(t).
2. s(t) is then subjected to infinite amplitude clipping using a zero-crossing detector (ZCD) to give u(t).
3. u(t) is then sampled at, say, 8 kHz to give u[n]. The feature extraction is carried out on u[n].
4. u[n] is divided into a number of short time windows, each W samples long.
5. The histogram for each of these short time windows is found. The histogram (or vector) is built as follows: the number of times exactly ONE sample is recorded between successive zero crossings becomes element number 1 of the vector. The number of times exactly TWO samples are recorded between successive zero crossings becomes element number 2 of the feature vector, and so on. In this way we construct a histogram, which serves as the feature vector.
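Steps 4 and 5 can be sketched in a few lines of Python. This is only my own minimal sketch, not the code from the paper. It assumes u[n] is already a list of one-bit samples (0 or 1), that the windows do not overlap, that an interval is measured as the index distance between two successive sign changes, and that intervals longer than the histogram are counted in the last bin. The names `zero_crossing_histogram`, `feature_matrix`, `window_len` and `max_interval` are made up for illustration.

```python
def zero_crossing_histogram(u, max_interval):
    """Histogram of intervals between successive zero crossings in one window.

    u            -- one-bit samples (0/1) of a single window
    max_interval -- number of histogram bins (assumed cap on interval length)
    hist[k - 1] counts the intervals that are exactly k samples long.
    """
    hist = [0] * max_interval
    last_crossing = None  # index of the previous zero crossing, if any
    for n in range(1, len(u)):
        if u[n] != u[n - 1]:  # the sign bit flipped: a zero crossing
            if last_crossing is not None:
                interval = n - last_crossing
                # Assumption: overly long intervals go into the last bin.
                hist[min(interval, max_interval) - 1] += 1
            last_crossing = n
    return hist


def feature_matrix(u, window_len, max_interval):
    """One histogram row per non-overlapping window of window_len samples."""
    return [
        zero_crossing_histogram(u[start:start + window_len], max_interval)
        for start in range(0, len(u) - window_len + 1, window_len)
    ]


# Tiny usage example on a hand-made one-bit sequence.
window = [1, 0, 1, 1, 0, 0, 0, 1, 1, 0]
print(zero_crossing_histogram(window, 4))
print(feature_matrix(window + window, window_len=10, max_interval=4))
```

The loop only compares neighbouring bits and increments counters, which is why the same idea looks feasible on an 8-bit controller.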
The vectors for all windows can then be combined to get the feature matrix. As I said earlier, these matrices can be compared using DTW, DDTW, FastDTW or some other algorithm.
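To give a rough idea of the comparison step, here is a minimal sketch of plain DTW between two feature matrices (rows are windows, columns are histogram bins). The city-block distance between rows, the unconstrained warping path and the names `dtw_distance` and `recognize` are my own assumptions for illustration; the paper may well use a different local distance or path constraints.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two feature matrices (lists of rows)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed local distance: city-block distance between histograms.
            d = sum(abs(x - y) for x, y in zip(a[i - 1], b[j - 1]))
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def recognize(templates, matrix):
    """Return the label of the stored template closest to the input matrix."""
    return min(templates, key=lambda label: dtw_distance(templates[label], matrix))


# Usage with made-up two-bin histograms: the input is a time-stretched "teen".
templates = {
    "teen": [[1, 0], [0, 2]],
    "silence": [[0, 0], [0, 0]],
}
spoken = [[1, 0], [1, 0], [0, 2]]
print(recognize(templates, spoken))
```

Only additions, absolute differences and minimum operations are needed, so the arithmetic itself stays within integer range for small histograms, though the cost table would need care on a memory-limited part.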
As an example, take an utterance of the number three in Hindi, which is spoken as “teen”. The first plot below gives the waveform of the utterance. The second plot gives the end-point-detected version of the same. End-point detection reduces the computations (and hence the memory required) by removing the “useless” portions of the utterance, which carry no information.
The surface plot of the feature matrix prepared for my utterance above is shown below (as I have mentioned implicitly, the rows represent the windows and the columns represent the histogram terms):
Ashutosh Saxena and Abhishek Singh, A Microprocessor based Speech Recognizer for Isolated Hindi Digits, IEEE ACE.
Lipovac, Zero-Crossing-Based Linear Prediction for Speech Recognition, Electronics Letters, vol. 25, issue 2, pages 90-92, 19 Jan 1989.