Introduction and video wizardry

Microsoft’s HowOld site for guessing your age from a photo was a viral success. The new My Moustache site that tells you how the moustache you’re growing for the annual ‘Movember’ charity is coming along (and offers to give you a fake moustache if you just want to join in the fun) might not take off the same way.

But it does show off some of the new tools Microsoft has added to the Project Oxford APIs that let developers use machine learning to find faces, understand what users say and type – and now, how they might be feeling.


“The emotion API detects emotions in human faces,” Ryan Galgon of Microsoft’s Technology and Research group told techradar. It suggests up to eight emotions that he calls ‘universal’ for faces detected in an image – anger, contempt, fear, disgust, happiness, neutral, sadness or surprise (or a mix of those) – and it can work with multiple faces in a picture. “We can already tell what’s happening in photos and who is in photos, and now we can move beyond that, with sentiment analysis.”

Imagine a photo app that automatically composites faces from multiple images so you get a family photo where everyone is smiling. “Or you could pick the best photo in an album based on whether people are smiling or not,” Galgon suggests.
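As a rough illustration of the album idea, here is a minimal sketch in Python. It does not call the emotion API; it assumes each photo has already been analysed and that the result is a list of per-face score dictionaries keyed by emotion name. That data shape, the file names and the function names are all hypothetical. A photo is ranked by its least happy face, so one frown pulls the whole picture down.

```python
def happiness_floor(faces):
    """Return the lowest happiness score among the faces (0.0 if none were found)."""
    if not faces:
        return 0.0
    return min(face.get("happiness", 0.0) for face in faces)


def best_photo(album):
    """Pick the photo whose least happy face is still the happiest."""
    return max(album, key=lambda name: happiness_floor(album[name]))


# Hypothetical per-face emotion scores for two photos of the same two people.
album = {
    "beach.jpg": [
        {"happiness": 0.9, "neutral": 0.1},
        {"happiness": 0.2, "sadness": 0.8},
    ],
    "garden.jpg": [
        {"happiness": 0.7, "neutral": 0.3},
        {"happiness": 0.8, "surprise": 0.2},
    ],
}

print(best_photo(album))
```

Ranking by the minimum rather than the average is a design choice for the "everyone is smiling" case; an average would let one beaming face hide a miserable one.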

Detecting beards and moustaches is another of the new face recognition options that developers will be able to use. “We also have significant improvements for detecting age and gender,” Galgon told us. Some of the new options are available straight away, and others will be available over the coming weeks.

Video wizardry

The existing face detection options will now work for video as well as still images, and the APIs can follow a particular person’s face through a video. Initially that’s about finding a face in the video, which includes knowing that faces do not usually vanish between frames, so a face that is not detected in one frame is still likely to be there.
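The "faces do not usually disappear" idea can be sketched as simple gap filling over per-frame detections. This is an assumption about one way to do it, not Microsoft's method: the function name and the `max_gap` parameter are hypothetical, and the input is just a list of true/false flags saying whether the face was detected in each frame.

```python
def fill_short_gaps(detected, max_gap=2):
    """Treat short runs of missed frames between two detections as still present.

    detected: list of booleans, one per frame.
    max_gap: longest run of missed frames that will be filled in.
    """
    filled = list(detected)
    last_seen = None  # index of the most recent frame with a detection
    for i, seen in enumerate(detected):
        if not seen:
            continue
        missed = i - last_seen - 1 if last_seen is not None else 0
        if 0 < missed <= max_gap:
            for j in range(last_seen + 1, i):
                filled[j] = True
        last_seen = i
    return filled


frames = [True, False, True, False, False, False, True]
print(fill_short_gaps(frames, max_gap=2))
```

With `max_gap=2`, the single missed frame is filled in while the three-frame gap is left alone, on the assumption that the person may really have left the shot.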

In time, though, you’re likely to be able to do the same kind of things for faces detected in a video that you can for faces detected in photos, Galgon says – so you could detect the emotions displayed during the video and look for when they change. “The APIs we have are starting to be able to work together, like the face detection and emotion detection. The direction we’re going for is to have them provide a common set of capabilities, regardless of the type of input.”

Not all of the frames in video will be interesting, or fully in focus, of course. Two further new video tools in Project Oxford do image stabilisation to clean up the video (using similar research to Microsoft’s Hyperlapse high-speed video) and motion detection. “The problem with motion detection is the false positives,” Galgon points out. “You do not want to detect motion every time a cloud moves across the sky or a car drives past; you want to detect where there is motion in the foreground.”

Learning new words

A new spell checking service is designed to clean up text users are typing into apps, especially on mobile devices, where it’s easy to miss off a letter or put a space in the middle of words, both of which the API can fix, as well as looking at the context to catch mistakes like ‘four’ instead of ‘for’. “There might be misspellings that can throw off the system,” Galgon pointed out. “If they’re looking for Chicago, typing hicago is not going to find it.”

Instead of the traditional spell check that just looks up words in a dictionary, the idea is to have the spelling API be able to deal with slang and ‘informal’ language. “The challenge is adapting over time when new phrases get coined or when a new startup becomes popular. So all of a sudden ‘lift’ is spelled ‘Lyft’ and it’s a valid word that was not a word a year ago. The nice thing about making this a web service is that when we have new words and models, we update those in the back end and developers get better results for free.”

The spell check API will not learn how different people misspell words (although that’s a possible area of research), but you can give it specific terms for your application. Galgon suggests: “Imagine being able to build a better speller for a particular domain, you can tell the API, here’s a set of our product names that might not get recognised correctly.”
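To make the domain-terms idea concrete, here is a small local stand-in written with Python’s standard `difflib` module. It is not the Project Oxford spelling API and uses simple string similarity rather than a learned model; the function name, the term list and the `cutoff` value are assumptions chosen for illustration.

```python
import difflib


def correct_term(word, domain_terms, cutoff=0.8):
    """Return the closest domain term to `word`, or `word` itself if none is close."""
    # Compare case-insensitively, but return the term with its original casing.
    by_lowercase = {term.lower(): term for term in domain_terms}
    matches = difflib.get_close_matches(
        word.lower(), list(by_lowercase), n=1, cutoff=cutoff
    )
    return by_lowercase[matches[0]] if matches else word


# Hypothetical list of names an app wants recognised.
domain_terms = ["Chicago", "Lyft", "Seattle"]

print(correct_term("hicago", domain_terms))
print(correct_term("banana", domain_terms))
```

A dropped first letter such as ‘hicago’ is still close enough to match, while an unrelated word falls below the cutoff and is passed through unchanged.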

Audio services and powerful AI

Speaker recognition

Two audio services will be available later in the year. The new speaker recognition API will be able to work out who’s talking – not just to tell people apart in an audio track, but to recognise them specifically, based on a speech model built from existing recordings. “People can enrol their voice – we let them say a phrase and build a speech model from that, then when you send audio from them we can say ‘this is Ryan, or that is Mary’.”

That’s the equivalent of the face verification API for speech, he explains. “That tells you with two images, what’s the likelihood that this is the same face in both of them. Here we can say, given this audio file and this historical audio file, what’s the likelihood it’s the same person speaking.”

A voice is unique and apps could use it instead of a password in some situations, he suggests. “It’s not as secure as chip and pin, but it’s useful for apps that only need lighter authentication.”

Background noise

And Custom Recognition Intelligent Services – CRIS for short – learns the acoustics of difficult environments, or the speaking style of people whose speech is currently harder to recognise, to make voice recognition more accurate.

“Right now, the speech APIs do not do a great job with kids’ voices or with elderly folks or people who speak English as a second language,” he explains. “They’ve mainly been trained with people working in an office and in an acoustic model of somewhere like a conference room. If you’re at a kiosk at an airport or a baseball stadium, or you’ve got a mascot at a sports event and you want the system to be able to hear users and talk back to them in some way – the acoustic environment is very challenging at a sports game. There’s a lot of background noise, there might be echoing.”

CRIS needs five or ten minutes of audio, which takes ten or twenty minutes to process, so you cannot yet do it in real time, but it can significantly improve the accuracy of the recognition.

The system can also build a model of how people speak from a couple of sample sentences, and you can add labelled phrases for unusual words, such as, Galgon notes, “player names or specific sports terms that a default recogniser is not going to recognise.”

And crucially, it’s not difficult to use. “That’s been a complex task that’s required a lot of expertise in the past. Pretty much anyone can do this.”

Ease of use

Making these powerful AI features easy enough for developers to use with a couple of lines of code is what Galgon thinks is really different about Project Oxford (which remains free while it’s in preview, although some of the features are now included in the Cortana Analytics Suite, so businesses can use them to recognise customers using face verification or analyse sentiment in customer feedback on their website).

“We’re going to keep expanding the portfolio and the set of APIs over time. But we’ve focused on making it as easy as possible for developers to use, regardless of what platform they use – you can use this for any OS, any website. People without any experience of AI could make software understand what someone was saying.”

In time, he thinks we’ll just expect software that has this kind of smarts built in. “These are things that are human and natural to do. Our apps and our software should be able to hear and understand the world around them.”
