StevenBos.com - about (me)aning, understanding, robots

Introducing the .NET Real Time Cognition Framework

posted on: 6 march 2014

LET'S START OFF WITH A PROBLEM, as is common in Science and Engineering. Suppose you are in the living room playing on your favourite gaming console. It is noisy in the room, with friends chatting on the couch and some feisty background music.

It is only natural to interact with the console using (unrestricted) speech, which increases the game's level of immersion. In the heat of the moment, especially in noisy situations, it's also common to raise your voice (known as the Lombard effect). All state-of-the-art speech recognizers repeatedly fail in this scenario.

But let's not stop there. Wouldn't it be great if your console allowed interaction beyond speech recognition, in case you'd like to jump into the discussion with your friends on the couch while gaming? Where the speech recognizer tried to plainly map sound to text, now semantics plays a role: this is the field of Natural Language Understanding (NLU), also called speech comprehension or simply speech understanding. Humans are great at reading the verbal and non-verbal cues that signal speech is addressed to them. Computers are not, hence the use of trigger words like "Computer" (often followed by a coded command like "go to channel 9"). What's more, humans use these verbal and non-verbal cues in the actual understanding of speech.

Now we hit a complexity level that is almost unexplored in Science and Engineering. To unfold or learn the "Semantic Dimension", the speech recognizer needs to learn that speech is more than a reference to plain text. Natural speech often contains faulty references, odd or broken speech patterns and pauses. Still, for humans, the intentions remain (in general) clear and speech is processed correctly.

A solution to this problem is clearly not trivial. 50 years of AI research produced an endless stream of relevant components (and derivations of them), such as various knowledge representations. For sure, much ground can be won using this classical divide-and-conquer approach, and it will remain important in the future. However, in speech the whole is greater than the sum of the parts. Speech is not just a reference to text symbols, it means something. This inspired me to build a framework that goes beyond standard speech recognition - the Real Time Cognition Framework.

The first requirement for the framework is that it allows processing in real time, since speech is very sensitive to the moment it is said. Secondly, it must allow for a multitude of modalities, most importantly vision, since speech often references features from the physical world (like objects) that cannot be deduced from audio only. Thirdly, it should allow rapid prototyping: plugging in a new object recognizer, or changing parameters, the flow of processing and the multimodal synchronization. It should offer facilities to record, manipulate, and visualize data out-of-the-box.

The first generation of the framework has a built-in Speech Recognizer, Gesture Recognizer, Object Recognizer and Image Processor. It also features a Smart Buffer to fuse the multimodal input and a Semantic Analyser that uses machine learning and advanced pattern recognition techniques to further process the input. All recognizers and the semantic analyser can be re-trained with your own custom data. This initial version is built using the Kinect version 1 as provider of the (3D) video streams, (multi channel) audio streams and Skeleton streams.
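To give a feel for what fusing multimodal input can involve, here is a minimal sketch of one common approach: aligning streams by nearest timestamp. It is written in Python purely for illustration (the framework itself is .NET), and it is an assumption about how such a buffer could work, not the Smart Buffer's actual design. The names `nearest_sample` and `fuse` and the stream layout are hypothetical.

```python
from bisect import bisect_left


def nearest_sample(timestamps, samples, t):
    """Return the sample whose timestamp is closest to t.

    timestamps must be sorted ascending and match samples one-to-one.
    """
    i = bisect_left(timestamps, t)
    if i == 0:
        return samples[0]
    if i == len(timestamps):
        return samples[-1]
    before, after = timestamps[i - 1], timestamps[i]
    # Pick whichever neighbour lies closer in time to t.
    return samples[i] if after - t < t - before else samples[i - 1]


def fuse(audio, video, skeleton):
    """Attach the nearest video and skeleton sample to each audio event.

    Hypothetical layout: every stream is a time-sorted list of
    (timestamp_seconds, payload) tuples.
    """
    v_ts, v_data = [t for t, _ in video], [d for _, d in video]
    s_ts, s_data = [t for t, _ in skeleton], [d for _, d in skeleton]
    return [
        {
            "time": t,
            "speech": word,
            "video": nearest_sample(v_ts, v_data, t),
            "skeleton": nearest_sample(s_ts, s_data, t),
        }
        for t, word in audio
    ]


if __name__ == "__main__":
    audio = [(0.10, "that"), (0.52, "cup")]
    video = [(0.0, "f0"), (0.5, "f1"), (1.0, "f2")]
    skeleton = [(0.1, "s0"), (0.6, "s1")]
    for event in fuse(audio, video, skeleton):
        print(event)
```

Nearest-timestamp matching is the simplest option; a real-time system would also need to bound how long it waits for late samples before emitting a fused event.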

In the next few months I will gradually show more of this project and eventually, after graduation, release the source code including the (huge) data set that is used. In the meantime, read my paper on the theoretical foundation of the framework, including my definitions of meaning and understanding, and enjoy the small teaser of the UI and UX of an early version. I hope it piqued your interest!

Steven Bos

The Microsoft Kinect version 2

posted on: 6 march 2014

I LOVE BEING ON THE CUTTING EDGE of human computer interaction: the FIN, a ring based interaction device, the MYO, an arm based interaction device, or the EPOC, a brain based one. Or a full body sensor suit like PRIOVR, or the more mainstream novelties like the GLASS and RIFT.

I wish I had the time to take these wildly crazy inventions to new User Experiences. They help solve one of the biggest challenges of this digital millennium: "how to make sense of unbounded and unstructured Big Data". How? They allow you to interact with data in natural ways using gestures, speech and even thoughts. And just as interestingly, they convert the body or its states to strategic data points in the process. It's called "the quantified self", a term to remember. Another such tool is the Kinect, a 3D camera. The first generation Kinect was so revolutionary - bringing the price point down from more than €10.000 to €200 - that one can wonder what kept 3D camera technology from reaching consumers earlier. The Kinect sparked a new age for gamers and interaction researchers.

But that was 2010. This year, Microsoft will release the Kinect V2, which is a huge upgrade in almost every aspect: HD resolution, better depth sensing technology (Time-of-Flight instead of Structured Light) and feature streams that are beyond thrilling - heart rate detection, physics models, etc. Compared to the new PS4 Eye, which uses a different tech to determine depth, it is superior in every way but one: the frame rate of the Kinect remains at 30fps, while the Eye can do up to 240fps at 320x192. In Computer Vision research the general complexity level can be ordered as Detection, then Recognition, then Tracking, with Tracking being the hardest task. So for tracking objects, faster is better. At 30fps, fast motions will suffer from motion blur and thus require some sort of post-processing (a research field in itself). Ah, but there still needs to be room for a V3, right ;)
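To put numbers on "faster is better", here is a small back-of-the-envelope calculation in Python of how far an object travels between two consecutive frames. The hand speed of 2 m/s is an assumed figure for illustration, and the function name is hypothetical. Actual blur depends on exposure time rather than frame rate alone, so read the result as the gap a tracker has to bridge from one frame to the next.

```python
def displacement_per_frame_cm(speed_m_per_s, fps):
    """Distance in centimetres an object moves between two consecutive frames."""
    return speed_m_per_s * 100.0 / fps


if __name__ == "__main__":
    hand_speed = 2.0  # metres per second, an assumed value for a fast gesture
    for fps in (30, 240):
        step = displacement_per_frame_cm(hand_speed, fps)
        print(f"{fps} fps: {step:.2f} cm between frames")
```

Under that assumption the hand moves several centimetres per frame at 30fps and under one centimetre at 240fps, an eightfold difference.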

To be honest, up to this point I haven't played a lot with the V2. The Speech Recognition API is not out yet, but it might arrive with the next Developer Preview SDK update. I'm only interested in the whole picture - the multimodal bells and whistles. The moment it hits my computer, I will start integrating the V2 into my RTC Framework (see other post) and show off the results here. I'll keep you posted!

Steven Bos

GEAR: The Titan Supercomputer

posted on: 6 march 2014

AT SOME POINT DURING DEVELOPMENT OF A KINECT BASED APP I faced an annoying issue. I was working on an HP Elitebook 8530w, a high end workstation model that my university recommended (and offered with a hefty discount in 2009) for the practicals of my Computer Science studies.

The model features Windows 8.1 Pro, 4 GB DDR2 RAM, an Intel Core2Duo T9600 @ 2.8 GHz, a dedicated 512 MB Quadro FX770m graphics card and a 250 GB old school platter disk. When playing with (read: manipulating) the Kinect output in Visual Studio (2013), my screen often just froze. Other times it became very unresponsive. Obviously my computer was not able to handle the computations. The solution to the unresponsiveness was offloading work from the UI thread to a separate one. After implementing a simple frame counter I was able to measure performance - around 1-5 frames per second. And I was not even doing fancy things! Just the usual image processing like depth slicing (from 3D to 2D), pixel corrections over the whole image, etc. Next was playing with pattern recognition algorithms, and that's when things turned really sour. At that point experimenting with code samples went from fast trial-and-error to slow think-before-trying.
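For readers who have not met these terms, here is a minimal Python sketch of depth slicing and a simple frame counter. It is an illustration only, not the original .NET code: the names `depth_slice` and `FrameCounter` are hypothetical, and the tiny depth frame (values in millimetres) is made up.

```python
import time


def depth_slice(depth_mm, near_mm, far_mm):
    """Reduce a depth image to a 2D binary mask.

    A pixel becomes 1 when its depth lies in [near_mm, far_mm], else 0.
    """
    return [[1 if near_mm <= d <= far_mm else 0 for d in row] for row in depth_mm]


class FrameCounter:
    """Counts processed frames and reports the average frames per second."""

    def __init__(self):
        self.frames = 0
        self.start = time.perf_counter()

    def tick(self):
        self.frames += 1

    def fps(self):
        elapsed = time.perf_counter() - self.start
        return self.frames / elapsed if elapsed > 0 else 0.0


if __name__ == "__main__":
    # Made-up 3x4 depth frame in millimetres.
    frame = [
        [800, 1200, 1500, 4000],
        [900, 1250, 1600, 4000],
        [4000, 1300, 4000, 4000],
    ]
    counter = FrameCounter()
    for _ in range(100):
        mask = depth_slice(frame, near_mm=1000, far_mm=2000)
        counter.tick()
    print(mask)
    print(f"{counter.fps():.0f} frames per second")
```

Calling `tick()` once per processed frame and reading `fps()` periodically is enough to spot whether a change to the processing code helped or hurt.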

The solution was a supercomputer. As a student, I tend not to have one of those, or the Bitcoins to buy one, lying around. Sure, I could opt for time at the TU Delft, but that would be impractical - their systems are well suited to batch processing rather than real time development. So with a fair bit of help from my family I was able to buy my dream computer.

I wanted to play with the SIFT/SURF algorithms and offload the calculations to my GPU. At that point I had started to love working with the Emgu CV framework, so the build required the best CUDA graphics card out there. It became the Nvidia GTX Titan, a 6 GB GDDR5 card with CUDA compute capability 3.5. Oh, and it performs roughly 5 teraFLOPS in single precision. That's crazy considering that 20 years ago you needed a house + power plant for that kind of power.

Next was the processor. Ideally that would have been TheNextTel 128 core @ 4GHz processor, but alas, it's not out yet. Second best was the XEON E7 series with 15 cores, but at €6000 for just the CPU that was not really feasible. Third best was the new Extreme edition or the Haswell i7-4770K. They perform toe to toe, but in my case the 2 extra cores of the Extreme edition would be a benefit. Still, given its lack of native USB 3.0 support and the new features Haswell brings, I chose to go for the i7-4770K.

To make a long story short, I went for 32 GB of low profile RAM (allowing for a RAM disk :D ) and a RAID 0 SSD array of 2x 250 GB Crucials with power loss protection. 1000 MB/s throughput. Woop Woop! Other components are: Dark Rock Pro 2 + Arctic MX-4 cooling paste + ArctiClean purifier, MSI Z87-G45, Seasonic Platinum 660 Watt, Fractal Design Define R4 Black w/ Window and an Acer T232HLbmidz 23" Touchscreen IPS LED.

Well, the system flies. I had to refactor my code into tasks and really embrace the parallel programming paradigm. The new .NET 4.5 with async/await, tasks and the TPL Dataflow library really helps to that end; maybe more on that in another post. There are still many performance gains to be made by exploiting the GPU more and using smart caching strategies. But for prototyping, this machine is really the ultimate development workstation.
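The idea of refactoring into tasks can be sketched in a few lines. The example below uses Python's `concurrent.futures` as a stand-in for .NET tasks, so it shows the shape of the approach rather than my actual code. The processing steps `correct_pixels` and `extract_features` are hypothetical placeholders, and frames are plain lists of pixel values.

```python
from concurrent.futures import ThreadPoolExecutor


def correct_pixels(frame):
    """Placeholder pixel correction: brighten by 10, clamped to 255."""
    return [min(255, p + 10) for p in frame]


def extract_features(frame):
    """Placeholder feature: mean pixel intensity of the frame."""
    return sum(frame) / len(frame)


def process_frame(frame):
    """The full per-frame work, packaged as one independent task."""
    return extract_features(correct_pixels(frame))


def process_frames(frames, workers=4):
    """Run one task per frame on a worker pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_frame, frames))


if __name__ == "__main__":
    frames = [[0, 10, 20], [250, 250, 250]]
    print(process_frames(frames))
```

The key design point is that each frame's work is independent, so the pool can schedule it freely. Note that in CPython threads will not speed up pure Python arithmetic like this; the sketch illustrates the structure, and the real speed-up comes from runtimes or libraries that execute the tasks truly in parallel.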

Steven Bos

First blog post!

posted on: 1 november 2012

YEAH! AFTER A FEW MONTHS of on-and-off website engineering it's finally here. A humble blog on robots and other systems that (try to) understand. Hope you enjoy and plz leave a comment!

Steven Bos