Have you ever seen a robot talking to a human? I used to see that in my dreams
as a child. But it is no longer a matter of dreams or imagination.
It has become practical, made possible by the technique of speech recognition.
Speech recognition is the ability of a machine to identify words and phrases in spoken language and convert them into a machine-readable form. To work with speech recognition, the software must be sophisticated enough to capture the speech clearly. Call routing and voice dialing are typical tasks that rely on the recognition process.
Speech recognition systems are classified into two categories: speaker dependent and speaker independent.
Speaker-dependent systems are trained by a single person, the person who uses the system. In other words, the system is trained by its user. These systems are very efficient and can handle a high command count, but they respond only to the person who trained them.
A speaker-independent system is trained to respond to a word regardless of who speaks it. The system must therefore respond to a large variety of speech patterns.
The voice input device is mounted on the controller so that movement commands can be given by voice. When a command is spoken into a microphone, the analog electrical signal that represents the voice is first converted into digital form by an analog-to-digital converter. The digital signal is then given as input to the robot controller. The robot controller must have a filtering stage to filter the incoming voice data. To improve accuracy, a conversion modelling process is applied before the system forms its response.
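The digitizing and filtering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 8-bit resolution, the 8 kHz sample rate, the moving-average filter and the function names quantize and moving_average are all chosen for the example and do not come from any particular robot controller.

```python
import math

def quantize(samples, bits=8, v_max=1.0):
    """Map analog values in [-v_max, v_max] to signed integer codes (a simple ADC model)."""
    levels = 2 ** (bits - 1) - 1              # e.g. 127 for 8 bits
    codes = []
    for v in samples:
        v = max(-v_max, min(v_max, v))        # clip to the converter's input range
        codes.append(round(v / v_max * levels))
    return codes

def moving_average(codes, width=3):
    """Smooth the digital signal with a causal moving-average (low-pass) filter."""
    out = []
    for i in range(len(codes)):
        window = codes[max(0, i - width + 1): i + 1]
        out.append(sum(window) / len(window))
    return out

# Stand-in for the microphone voltage: a 200 Hz tone sampled at 8 kHz (assumed values).
analog = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(80)]
digital = quantize(analog)            # analog-to-digital conversion
filtered = moving_average(digital)    # filtering stage before the controller uses the data
```

A real controller would most likely use a filter designed for the speech band rather than a plain moving average; the sketch only shows where each step sits in the chain.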
Speech recognition circuit
The circuit can be trained with up to 40 words. To train word number 1, press '1' on the keypad. When any number is pressed, the red LED turns off and the number is shown on the digital display. Then press the "#" button to train. Pressing "#" signals the chip to listen for the training word and turns the LED back on. Next, speak the word you want the circuit to recognize into the microphone. When the word is accepted, the LED blinks. Similarly, to enter a third word, press '3' followed by "#". The circuit is able to listen continuously. Each word number that is entered should appear on the display.
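The training procedure is easier to follow as a small state machine. The sketch below is a software model of the steps described above, not the chip's real firmware: the class TrainerSketch, its methods and its attributes are hypothetical names introduced for illustration.

```python
class TrainerSketch:
    """Hypothetical model of the keypad-driven training flow (not real firmware)."""
    MAX_WORDS = 40                  # the circuit can be trained with up to 40 words

    def __init__(self):
        self.digits = ""            # digits typed so far, as shown on the display
        self.listening = None       # slot number waiting for a spoken word
        self.words = {}             # slot number -> trained word
        self.led_on = True

    def press(self, key):
        if key.isdigit():
            self.digits += key
            self.led_on = False     # LED turns off when a number is pressed
        elif key == "#":
            if not self.digits:
                return              # "#" without a number: nothing to train
            slot = int(self.digits)
            self.digits = ""
            if 1 <= slot <= self.MAX_WORDS:
                self.listening = slot
                self.led_on = True  # LED on: the chip is listening

    def hear(self, word):
        """Store the spoken word in the selected slot; True stands for the LED blinking."""
        if self.listening is None:
            return False
        self.words[self.listening] = word
        self.listening = None
        return True

trainer = TrainerSketch()
trainer.press("3")
trainer.press("#")
trainer.hear("forward")             # word number 3 is now trained
```

Rejecting slot numbers outside 1 to 40 is an assumption made for the sketch; the text does not say how the circuit reacts to an invalid number.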
The speech recognition system consists of four parts:
- Linear separation of the sources
- Multi-channel post filtering
- Computation of the missing feature mask from the post-filtered output
- Speech recognition using the separated audio
You may now be wondering about the microphone array used in this case. The array
is composed of 'n' omni-directional elements mounted on the robot. The sound
sources are detected and localized with the help of any appropriate algorithm.
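The text leaves the choice of localization algorithm open. As one possible illustration, the sketch below estimates the direction of a source from the time delay between two microphones, found by brute-force cross-correlation. The two-microphone setup, the far-field assumption, the speed of sound of 343 m/s and the function names estimate_delay and arrival_angle are all assumptions made for this example.

```python
import math

def estimate_delay(mic_a, mic_b, max_lag):
    """Return the lag (in samples) by which mic_b trails mic_a, via cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, a in enumerate(mic_a):
            m = n + lag
            if 0 <= m < len(mic_b):
                score += a * mic_b[m]     # correlation of mic_a with shifted mic_b
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def arrival_angle(lag, sample_rate, mic_distance, speed_of_sound=343.0):
    """Convert a delay in samples to an arrival angle in degrees (far-field assumption)."""
    x = speed_of_sound * lag / (sample_rate * mic_distance)
    x = max(-1.0, min(1.0, x))            # guard against rounding outside asin's domain
    return math.degrees(math.asin(x))

# The same pulse reaches the second microphone two samples later.
pulse   = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
delayed = [0, 0, 0, 0, 1, 2, 1, 0, 0, 0]
lag = estimate_delay(pulse, delayed, max_lag=4)
angle = arrival_angle(lag, sample_rate=16000, mic_distance=0.2)
```

With an array of 'n' elements, the same idea can be applied to several microphone pairs and the results combined.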
Source separation stage: This stage consists of a linear separation based on
Geometric Source Separation (GSS). The method can be modified to get faster
adaptation and to work with shorter time-frame estimates.
Post-filter
The separation performed by GSS is followed by a multi-channel post-filter. It
is based on a generalization of beamformer post-filtering to multiple sources.
The post-filter performs a spectral estimation of the background noise, and the
estimated noise is decomposed into stationary and transient components.
Mask Computation
The multi-channel post-filter does more than reduce the amount of noise present
at a given time and frequency. It is also used to estimate the missing feature
mask, which indicates how reliable each spectral feature is.
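A minimal sketch of how a reliability mask could be derived from a post-filter gain is given here. It assumes a simple Wiener-style gain (signal power divided by signal plus noise power) as a stand-in for the actual post-filter, and it assumes reliability is a number between 0 and 1. The function names and the optional threshold are hypothetical.

```python
def postfilter_gain(signal_power, noise_power):
    """Wiener-style gain: near 1 in a clean band, near 0 when noise dominates."""
    total = signal_power + noise_power
    return signal_power / total if total > 0 else 0.0

def missing_feature_mask(signal_frames, noise_frames, threshold=None):
    """Build a time-frequency mask: one row per time frame, one value per frequency band.

    With threshold=None the mask is continuous (the gain itself);
    otherwise it is discrete: 1.0 for reliable bands, 0.0 for unreliable ones.
    """
    mask = []
    for sig_row, noise_row in zip(signal_frames, noise_frames):
        row = [postfilter_gain(s, n) for s, n in zip(sig_row, noise_row)]
        if threshold is not None:
            row = [1.0 if g >= threshold else 0.0 for g in row]
        mask.append(row)
    return mask

# One time frame with two frequency bands: the first is clean, the second is noisy.
soft_mask = missing_feature_mask([[3.0, 1.0]], [[1.0, 3.0]])                 # [[0.75, 0.25]]
hard_mask = missing_feature_mask([[3.0, 1.0]], [[1.0, 3.0]], threshold=0.5)  # [[1.0, 0.0]]
```

The noisier band gets the lower gain and therefore the lower reliability, which is the behaviour the mask is meant to capture.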
Recognition: To recognize the speech, we can use any recognition toolkit that is
based on missing feature theory. In this process, an acoustic model is used
together with a search algorithm.
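To give a feel for how missing feature theory enters recognition, here is a deliberately simplified sketch. A real recognizer would use a statistical acoustic model and a proper search procedure; this example swaps in fixed word templates and an exhaustive comparison, and weights each feature by its reliability. The vocabulary, the feature values and the function names are invented for illustration.

```python
def masked_distance(features, template, mask):
    """Squared distance that weights each feature by its reliability (0 = ignore, 1 = trust)."""
    weighted = sum(m * (f - t) ** 2 for f, t, m in zip(features, template, mask))
    weight = sum(mask)
    return weighted / weight if weight > 0 else float("inf")

def recognize(features, mask, templates):
    """Return the word whose template is closest on the reliable features."""
    return min(templates, key=lambda w: masked_distance(features, templates[w], mask))

# Toy "acoustic model": one template of three spectral features per word.
templates = {"go": [1.0, 0.0, 1.0], "stop": [0.0, 1.0, 0.0]}

# The speaker said "go", but the middle feature was corrupted by noise.
observed = [1.0, 5.0, 1.0]

with_mask = recognize(observed, [1.0, 0.0, 1.0], templates)     # "go"
without_mask = recognize(observed, [1.0, 1.0, 1.0], templates)  # "stop"
```

Marking the corrupted feature as unreliable lets the clean features decide, while trusting every feature equally leads to the wrong word in this toy case.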
The frequency-domain post-filter is based on an optimal estimator. We assume
that all interferences (except the background noise) are localized, that is,
detected by the localization algorithm. We also assume that the leakage between
channels is constant. Leakage is caused by localization errors or by differences
in the microphones' frequency responses.
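One way to read these assumptions is that the noise seen by a given source, in one frequency band, is the stationary background noise plus a constant fraction of the power of the other separated sources. The sketch below follows that reading. The leakage value of 0.25, the function name noise_estimate and its parameters are assumptions made for illustration, not values taken from a specific system.

```python
def noise_estimate(stationary, source_powers, target, leakage=0.25):
    """Noise for source `target` in one frequency band.

    stationary    -- estimated stationary background noise power
    source_powers -- power of every separated source in this band
    target        -- index of the source being cleaned
    leakage       -- constant fraction of other sources' power that leaks in (assumed)
    """
    leak = leakage * sum(p for i, p in enumerate(source_powers) if i != target)
    return stationary + leak

# Three separated sources; estimate the noise affecting source 0 in this band.
noise_for_first = noise_estimate(0.5, [4.0, 2.0, 1.0], target=0)
```

Here the other two sources contribute 0.25 * (2.0 + 1.0) = 0.75 of leaked power, on top of the 0.5 of stationary noise.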
Missing Feature Mask
The missing feature mask is a matrix that represents the reliability of each
feature in the time-frequency plane. The reliability can be either continuous or
discrete, with values ranging from 0 to 1. The more noise present in a frequency
band, the lower the post-filter gain will be for that band.
A circuit is used for the speech recognition itself. The key component of the
circuit is a speech recognition IC. Such chips can recognize a set of words
within a given time span, which is measured in seconds. For memory, the circuit
uses static RAM. The chip has two operational modes: manual mode and CPU mode.
CPU mode is designed to let the chip work under a host computer. Note that
listening and recognition do not consume any of the host computer's CPU time.
Manual mode, on the other hand, lets the user build a stand-alone speech
recognition board that does not require a host computer.
Speech recognition has several applications:
- Command and control of equipment
- Telephone assistance systems
- Data entry
Speech recognition is not about understanding speech. We should not forget that
a computer is a machine, and a machine never understands a vocal command; it can
only respond to it.