LET'S START OFF WITH A PROBLEM, as is common in Science and Engineering. Suppose you are in the living room playing on your favourite gaming console. It is noisy in the room, with friends chatting on the couch and some feisty background music.
It is only natural to interact with the console using (unrestricted) speech, which increases the game's level of immersion. During the heat of the moment, especially in noisy situations, it's also common to raise your voice (known as the Lombard effect). All state-of-the-art speech recognizers repeatedly fail in this scenario .
But let's not stop there. Wouldn't it be great if your console allowed interaction beyond speech recognition,in case you'd like to jump in the discussion with your friends on the couch while gaming. While the speech recognizer tried to plainly map sound to text, now semantics will play a role, the field of Natural Language Understanding (NLU), speech comprehension or simply speech understanding.
Humans are great in reading verbal and non-verbal cues that speech is addressed to them. Computers not so, hence the use of trigger words like "Computer " (and then often followed by a coded command like "go to channel 9"). More so, humans use these verbal and non-verbal cues in the actual understanding of speech.
Now we hit a complexity level that is almost non-explored in Science and Engineering. To unfold or learn the "Semantic Dimension", the speech recognizer needs to learn that speech is more than a reference to plain text. Natural speech often contains faulty references, odd or broken speech patterns and pauses. Still, for humans, the intentions remain (in general) clear and speech is processed correctly.
A solution to this problem is apparently not trivial. 50 years of AI research produced an endless stream of (derivations of) components that are relevant, such as various knowledge representations. For sure, much ground can be won using this classical divide-and-conquer approach and remains important in the future. However in speech, the whole is greater than the sum of the parts. It's not just a reference to text symbols, it means something.
This inspired me to build a framework that goes beyond standard speech recognition - the Real Time Cognition Framework. The main requirements to the framework is that it allows processing in real-time since speech is very sensitive to the moment it is said. Secondly it must allow for a multitude of modalities, most importantly vision, since speech often references features from the physical world (like objects) that cannot be deduced from audio only.
Thirdly, it should allow rapid prototyping. Plug-in a new object recognizer, change parameters or the flow of processing and multimodal synchronization. It should allow facilities to record, manipulate, and visualize data out-of-the-box. The first generation of the framework has a built-in Speech Recognizer, Gesture Recognizer, Object Recognizer and Image Processor.
It also features a Smart Buffer to fuse the multimodal input and a Semantic Analyser that uses machine learning and advanced pattern recognition techniques to further process the input. All recognizers and the semantic analyser can be re-trained with your own custom data. This initial version is built using the Kinect version 1 as provider of the (3D) video streams, (multi channel) audio streams and Skeleton streams.
In the next few months I will gradually show more of this project and eventually, after graduation, release the source code including the (huge) data set that is used. In the mean time, read my paper on the theoretical foundation of the framework like my definitions of meaning and understanding and enjoy the small teaser of the UI and UX of an early version.
I hope it peaked your interest!