ISTC project

ISTC PROJECT #1993P
Title of the Project: Mathematical Basis of Knowledge Discovery and Autonomous Intelligent Architectures
Duration of the Project: 2000-2003
Sponsored by Air Force Research Laboratory (AFRL), USA
Project Manager: Prof. Vladimir I. Gorodetski

Task 4 of the Project: Voice Operated Flying Objects
Principal Investigator: Dr. Andrey Ronzhin

Summary of the project:
The main objective of Task 4 of Project #1993P, “Voice operated flying object”, was to develop a model of a voice-operated flying object. To make the control process as robust as possible to interfering factors, data on human speech perception should be used as fully as possible. To that end, the project analyzed approaches and methods for building systems with a speech interface and elaborated an integral speech understanding model.
The model is based on the integral data-processing structure developed in the Speech Informatics Group of SPIIRAS. A spoken message is understood through the integral interaction of all partial processing levels (acoustic, semantic-syntactic and pragmatic). Every recognition level produces a set of hypotheses with corresponding estimates. The number of ambiguities and errors that are passed to the current level, or that arise at it, decreases during integral processing. The essence of integral processing is that the input signal is evaluated against several kinds of knowledge. The integral estimate is calculated from the separate estimates, and the final decision is the hypothesis with the minimum integral deviation. This makes the model robust to likely inaccuracies in the spoken phrase and distinguishes it favorably from the commonly accepted concept of sequential parsing.
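The decision rule described above can be illustrated with a minimal sketch. It assumes that each level reports a deviation between 0 (perfect match) and 1, and that the integral deviation is a weighted sum of these values. The weights, the function names and the example phrases are hypothetical; the summary does not give the actual formula.

```python
from typing import Dict

# Hypothetical level weights; the project summary does not state real values.
LEVEL_WEIGHTS: Dict[str, float] = {
    "acoustic": 0.5,
    "semantic_syntactic": 0.3,
    "pragmatic": 0.2,
}


def integral_deviation(deviations: Dict[str, float],
                       weights: Dict[str, float] = LEVEL_WEIGHTS) -> float:
    """Combine per-level deviations (0 = perfect match) into one estimate.

    A weighted sum is an assumption made for this sketch only.
    """
    return sum(weights[level] * deviations[level] for level in weights)


def choose_hypothesis(hypotheses: Dict[str, Dict[str, float]]) -> str:
    """Return the phrase hypothesis with the minimum integral deviation."""
    return min(hypotheses, key=lambda name: integral_deviation(hypotheses[name]))


# Invented example: the acoustic level alone slightly prefers the second
# phrase, but the pragmatic level (current situation) strongly rejects it.
hypotheses = {
    "retract landing gear": {"acoustic": 0.30, "semantic_syntactic": 0.10, "pragmatic": 0.05},
    "extend landing gear": {"acoustic": 0.25, "semantic_syntactic": 0.10, "pragmatic": 0.90},
}
best = choose_hypothesis(hypotheses)
```

In this invented example a purely sequential scheme that committed to the acoustically best phrase first would keep the second hypothesis, while the combined estimate selects the first.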
The accuracy and flexibility of the system clearly depend on its ability to adapt to various aspects of the specific application. Moreover, during debugging and operation the system can obtain new data, so the databases have to be adjusted. For this purpose a method for the integral adaptation of databases has been developed. The integral approach takes into account the acoustic aspect, the language aspect and the subject area, as well as the integral optimization of the model parameters. It makes it possible to achieve the required efficiency of database adjustment and to port the understanding model to new applied tasks.
The initial data for building a specific implementation of a speech understanding model are descriptions of the application domain and of the language used. In the proposed integral speech understanding method, the initial data are a model of the object's operating logic in the form of a state diagram, which reflects all possible situations of the controlled object, all possible transitions between situations, and the spoken commands required to initiate these transitions. High-level processing also requires the semantic weights of words. The whole state diagram can easily be divided into fragments, each of which contains the possible transitions from a specific situation, the paraphrases required to accomplish these transitions, and the semantic weights of the words in those phrases.
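One way to picture such a fragment is the following sketch. The class names, the situation names, the weights and the matching rule (weighted word overlap) are all assumptions made for illustration; the summary does not describe the actual data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Transition:
    target: str              # situation reached after the command
    paraphrases: List[str]   # alternative wordings of the same command


@dataclass
class Fragment:
    situation: str                                   # current situation
    transitions: List[Transition] = field(default_factory=list)
    semantic_weights: Dict[str, float] = field(default_factory=dict)


def match_score(spoken: str, paraphrase: str, weights: Dict[str, float]) -> float:
    """Share of the paraphrase's semantic weight that is found in the input."""
    spoken_words = set(spoken.lower().split())
    words = paraphrase.lower().split()
    total = sum(weights.get(w, 1.0) for w in words)
    matched = sum(weights.get(w, 1.0) for w in words if w in spoken_words)
    return matched / total if total else 0.0


def resolve(fragment: Fragment, spoken: str) -> Tuple[str, float]:
    """Pick the transition whose best paraphrase matches the input best."""
    candidates = [
        (match_score(spoken, p, fragment.semantic_weights), t.target)
        for t in fragment.transitions
        for p in t.paraphrases
    ]
    score, target = max(candidates, key=lambda c: c[0])
    return target, score


# Hypothetical fragment for one situation of the take-off phase.
fragment = Fragment(
    situation="initial_climb",
    transitions=[
        Transition("gear_retracted", ["retract landing gear", "gear up"]),
        Transition("takeoff_aborted", ["abort takeoff"]),
    ],
    semantic_weights={"retract": 1.0, "gear": 1.0, "landing": 0.3,
                      "up": 0.8, "abort": 1.0, "takeoff": 0.5},
)
target, score = resolve(fragment, "retract the gear")
```

Because the low-weight word "landing" contributes little, the incomplete phrase "retract the gear" still resolves to the intended transition in this sketch.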
At the current stage of research we have developed an aircraft emulator, which describes the current aircraft state through a set of parameters with sufficient accuracy and clarity. The real prototype of the emulator is the IL-76 aircraft of the Russian civil aviation. The specific features of the flying object that matter for the voice control model were analyzed. As a result, a detailed situational state diagram describing the flight process during the take-off and climb phases has been developed. Special attention was paid to emergency situations such as fire, engine failure, cabin decompression, etc. The developed diagram contains over 50 situations, and the vocabulary exceeds 150 words.
Moreover, to create an effective, robust and competitive model, we studied all levels of speech processing, carried out a wide range of additional work and elaborated several original methods.
To detect the speech signal in noisy environments, a method based on spectral entropy analysis has been developed. The difference between the entropy of speech segments and the entropy of background noise is used for speech endpoint detection. Such a criterion is less sensitive to variations in signal amplitude. Experiments with the method have shown that speech fragments are successfully detected in sound signals that contain various kinds of intense noise (including non-stationary noise) and sound artifacts. Moreover, the method is fast enough to be used in real-time (on-line) speech recognition systems.
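The idea can be sketched as follows. This is not the project's detector: the frame length, the threshold, the naive DFT and the synthetic test signal are assumptions chosen only to show why a normalized spectral entropy separates structured (speech-like) frames from broadband noise and why it does not change with signal gain.

```python
import cmath
import math
import random
from typing import List

FRAME = 64        # frame length in samples (assumed)
THRESHOLD = 0.6   # normalized-entropy threshold (assumed, needs tuning)


def power_spectrum(frame: List[float]) -> List[float]:
    """Power of DFT bins 1..N/2 (DC skipped), computed with a naive DFT."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))) ** 2
        for k in range(1, n // 2 + 1)
    ]


def spectral_entropy(frame: List[float]) -> float:
    """Entropy of the normalized power spectrum, scaled to the range 0..1."""
    power = power_spectrum(frame)
    total = sum(power)
    if total == 0.0:
        return 1.0  # treat digital silence like flat noise
    probs = [p / total for p in power]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(power))


def detect_speech(signal: List[float]) -> List[bool]:
    """Flag each non-overlapping frame whose entropy is below the threshold."""
    return [
        spectral_entropy(signal[start:start + FRAME]) < THRESHOLD
        for start in range(0, len(signal) - FRAME + 1, FRAME)
    ]


def noise(n: int, sigma: float) -> List[float]:
    return [random.gauss(0.0, sigma) for _ in range(n)]


def voiced(n: int, sigma: float = 0.05) -> List[float]:
    """Synthetic stand-in for voiced speech: three harmonics plus weak noise."""
    return [
        math.sin(2 * math.pi * 4 * t / FRAME)
        + 0.5 * math.sin(2 * math.pi * 8 * t / FRAME)
        + 0.25 * math.sin(2 * math.pi * 12 * t / FRAME)
        + random.gauss(0.0, sigma)
        for t in range(n)
    ]


random.seed(0)
signal = noise(2 * FRAME, 0.3) + voiced(2 * FRAME) + noise(2 * FRAME, 0.3)
flags = detect_speech(signal)
```

Because the spectrum is normalized to sum to one before the entropy is taken, multiplying the whole signal by a constant leaves the decision unchanged, which matches the amplitude insensitivity mentioned above.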
For the parametric representation of the speech signal, two methods robust to variations in the signal amplification level have been developed: the sign autocorrelation function and spectral-difference features. The second method has also shown high robustness to accidental nonlinear spectrum deformations. In this method, a subset of pairs of spectral bands is chosen from the discrete spectrum, and further processing consists of comparing the energies of the chosen bands, taking certain weight coefficients into account. In principle, this method can describe any form of speech spectrum with any required accuracy, but very high accuracy is not necessary because the natural speech spectrum has redundant variability. In the experiments this feature system demonstrated the best accuracy and robustness compared with two other spectral feature systems: cepstral and autocorrelation features.
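Both representations can be sketched in a few lines. The exact definitions used in the project are not given in this summary, so the formulas below are assumptions: the sign autocorrelation is computed on the signs of the samples only, and each spectral-difference feature is a binary comparison of two band energies with a hypothetical weight.

```python
from typing import List, Sequence, Tuple


def sign(x: float) -> int:
    return 1 if x >= 0 else -1


def sign_autocorrelation(signal: Sequence[float], max_lag: int) -> List[float]:
    """Autocorrelation of the sample signs for lags 0..max_lag."""
    signs = [sign(x) for x in signal]
    n = len(signs)
    return [
        sum(signs[t] * signs[t + lag] for t in range(n - lag)) / (n - lag)
        for lag in range(max_lag + 1)
    ]


def spectral_difference_features(
    band_energies: Sequence[float],
    band_pairs: Sequence[Tuple[int, int, float]],
) -> List[int]:
    """One binary feature per (i, j, weight): is E[i] greater than weight * E[j]?"""
    return [
        1 if band_energies[i] > weight * band_energies[j] else 0
        for i, j, weight in band_pairs
    ]


# Invented band energies and band pairs, for illustration only.
band_energies = [4.0, 1.0, 0.5, 2.0]
band_pairs = [(0, 1, 1.0), (1, 2, 3.0), (3, 0, 0.25), (2, 3, 1.0)]
features = spectral_difference_features(band_energies, band_pairs)

# Applying a gain of 8 to the energies does not change the features.
louder = spectral_difference_features([8.0 * e for e in band_energies], band_pairs)

samples = [1.0, 0.5, -0.5, -1.0, 1.0, 0.5, -0.5, -1.0]
sac = sign_autocorrelation(samples, 2)
```

Both functions depend only on signs or on comparisons between energies, so a positive gain applied to the signal cancels out. Adding more band pairs gives a finer description of the spectrum shape.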
For continuous speech recognition we have developed a method that is robust to grammatical deviations in a spoken phrase and suitable for use in the real-time speech understanding model. The proposed method does not use the composite-templates approach; instead, word hypotheses are detected by sliding analysis of the input signal. The method consists of the following steps: (1) a multi-alternative search for word hypotheses in the input signal by the “sliding analysis” method, with simultaneous estimation of their acoustic likelihood; (2) the recurrent construction of a set of word-chain hypotheses of any length; (3) the estimation of the acoustic-lexical probability of these word chains (phrase hypotheses) based on the acoustic probability of the words, their mutual location in time and the total duration of the word hypotheses contained in the phrase. The developed continuous speech recognition module was integrated into the previously developed base model of integral speech understanding.
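Steps (2) and (3) can be illustrated with the sketch below. It assumes that step (1) has already produced word hypotheses with start and end frames and an acoustic likelihood between 0 and 1. The chain-building rule (a limited gap between consecutive words) and the scoring formula (mean likelihood multiplied by the share of the signal covered by words) are assumptions made for this example, not the formulas used in the project.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WordHypothesis:
    word: str
    start: int         # first frame (inclusive)
    end: int           # last frame (exclusive), must be greater than start
    likelihood: float  # acoustic likelihood in 0..1


def build_chains(hyps: List[WordHypothesis], max_gap: int) -> List[List[WordHypothesis]]:
    """Recursively build all time-ordered chains of non-overlapping words."""
    ordered = sorted(hyps, key=lambda h: h.start)
    chains: List[List[WordHypothesis]] = []

    def extend(chain: List[WordHypothesis]) -> None:
        chains.append(chain)
        last = chain[-1]
        for h in ordered:
            # The next word must start after the last one, within max_gap frames.
            if 0 <= h.start - last.end <= max_gap:
                extend(chain + [h])

    for h in ordered:
        extend([h])
    return chains


def score_chain(chain: List[WordHypothesis], signal_len: int) -> float:
    """Assumed score: mean word likelihood times the covered share of the signal."""
    mean_likelihood = sum(h.likelihood for h in chain) / len(chain)
    coverage = sum(h.end - h.start for h in chain) / signal_len
    return mean_likelihood * coverage


# Invented output of the sliding analysis for a 100-frame signal.
hypotheses = [
    WordHypothesis("retract", 0, 40, 0.9),
    WordHypothesis("contract", 0, 42, 0.6),
    WordHypothesis("gear", 60, 100, 0.8),
    WordHypothesis("rear", 62, 100, 0.5),
]
chains = build_chains(hypotheses, max_gap=25)
best = max(chains, key=lambda c: score_chain(c, 100))
best_words = [h.word for h in best]
best_score = score_chain(best, 100)
```

No fixed phrase template is required here: any word order and any number of words that fit the timing constraint form a candidate, which is one way to tolerate omitted or extra words in the phrase.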
The developed method of continuous speech recognition and the integral processing structure provide robustness to various distorting factors (acoustic-phonetic and grammatical deviations in spoken phrases, etc.), which makes the interaction between the speaker and the system more natural.
During the project we paid special attention to making the speech dialogue model more robust. This issue is of growing interest to researchers and users of such systems, because many attempts to apply speech technologies in practice have been unsuccessful. To improve the robustness of the model, the following were developed: (1) a spectral-difference representation of the speech signal; (2) a continuous speech recognition model, robust to semantic-syntactic deviations in the input phrase, based on sliding analysis using dynamic programming methods and fuzzy set theory; (3) a phrase meaning recognition method based on integral data processing.
In addition, we investigated the problem of extralinguistic information and its paramount importance in the speech understanding process. Special attention was paid to the situational aspect (the problem of formalizing and using situational information). A structure for the situational model of the application domain has been proposed for tasks connected with the control of technical objects (a car, a plane, a robot, etc.).
Thus, within the framework of the project, a detailed analysis of speech recognition and understanding methods was carried out. The main problems and tasks of speech signal processing were identified: speech detection, parametric representation, isolated speech recognition, continuous speech recognition, high-level processing, speech understanding, use of situational context, etc.
The following results were obtained during the project: (1) a mathematical model of integral understanding, based on speech act theory, the concept of integral processing and the results of psycho-physiological experiments on human speech perception; (2) a research prototype of the voice control system based on integral speech understanding; (3) a demonstration model of the voice-operated emulator of the flying object.
During the project the group participated in eleven international conferences and other scientific activities devoted to speech recognition and understanding. Papers were published in two peer-reviewed international journals.
Further fundamental research should be oriented toward creating more adequate models of speech perception and stochastic models of speech signal processing, and toward accounting for the various deviations and inaccuracies characteristic of natural speech.
From the applied point of view, this research direction should aim at creating intelligent systems and services for human-computer interaction and at introducing them into technical devices for voice control, modern network systems, new intelligent applications and mobile devices, where speech is becoming the most promising means of requesting and obtaining information.