Taking on a new working language means gathering the right ingredients so that automatic speech recognition can reach accuracy levels that are both successful and feasible. It is therefore essential to build the language and acoustic models from quality audio and textual data, with clean speech suited to its final purpose, striking a balance between general language and the language of the target domain or operating area (Courts, Town Halls, Media, for example). However, all of the areas we have been working in contain large amounts of spontaneous, everyday speech, overlapped and noisy speech, dialectal variation, and speakers using a second language as their main one. This is a huge challenge for recognition, not to mention a thorn in our side.

Everything starts with the spoken word and with the sounds a language makes available for that word, so that a match between sound, or phoneme, and written word becomes possible. This match is our acoustic model, which, in conjunction with the language model, makes recognition a reality.
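
As a purely illustrative sketch (the candidate transcriptions, scores, and weight below are invented, not our actual decoder), a recognizer can be thought of as picking the hypothesis whose combined acoustic and language-model score is highest:

```python
import math

# Minimal sketch (invented scores): a recognizer scores each candidate transcription
# by combining the acoustic likelihood P(O|W) with the language-model prior P(W).
def score_hypothesis(acoustic_log_prob, lm_log_prob, lm_weight=10.0):
    # lm_weight is an illustrative scaling factor, not a real system setting
    return acoustic_log_prob + lm_weight * lm_log_prob

# Two hypothetical transcriptions for the same piece of audio:
hypotheses = {
    "recognize speech":   {"acoustic": math.log(0.0009), "lm": math.log(0.01)},
    "wreck a nice beach": {"acoustic": math.log(0.0011), "lm": math.log(0.0001)},
}

best = max(hypotheses, key=lambda w: score_hypothesis(hypotheses[w]["acoustic"],
                                                      hypotheses[w]["lm"]))
print(best)  # the language model tips the balance toward the plausible word sequence
```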

Each word has its own identity, and thus particular features that demand specific words around it so that every sentence is syntactically well-formed. Some words pair with certain words but not with others, and that is because, whenever we speak or write, there is always a message behind it, an audience we address, a vehicle we use to share it (a book, a newspaper, a call, a report, etc.), a textual domain (Sports, Economy, Finance, etc.) and subdomain (football, financial crisis) or theme (Coronavirus, Trump's Election, Refugees).

We call this procedure word mapping (word sketch). Through it, we can see which words surround or sit close to the word under analysis, typically the ones that appear with it most often (close to the word "problem", adjectives with a negative meaning are very likely to appear, such as "serious", "huge", "painful", or others from the same semantic field).
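
A minimal sketch of that idea, assuming a tiny invented corpus and a simple co-occurrence window (not the actual word-sketch tooling), could look like this:

```python
from collections import Counter

# Toy illustration of word mapping: count which words appear within a small
# window around a target word in a corpus (the sentences below are invented).
def collocates(sentences, target, window=2):
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts

corpus = [
    "the country faces a serious problem",
    "a huge problem for the economy",
    "this painful problem will not disappear",
]
print(collocates(corpus, "problem").most_common(5))
```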

It is also possible to identify compound words by their fixity, by the high frequency with which a given group of words occurs together. That frequency and fixity can establish the group as a unit in one textual domain but not in another (around the word "team", words such as "football", "development", or "research" are strongly placed). It is from this behavioral analysis and the probability of its distribution, together with the use of different methodologies, that we extract the language model.
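
Again as a toy illustration rather than our production methodology, a simple bigram estimate shows how word-pair frequencies in a (tiny, invented) domain corpus translate into language-model probabilities; real models are trained on far larger corpora and use smoothing:

```python
from collections import Counter

# Toy bigram estimation: relative frequency of each word pair given its first word.
def bigram_probabilities(sentences):
    pair_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        context_counts.update(tokens[:-1])          # every word that has a follower
        pair_counts.update(zip(tokens, tokens[1:])) # consecutive word pairs
    return {pair: count / context_counts[pair[0]]
            for pair, count in pair_counts.items()}

sports_corpus = [
    "the football team won the match",
    "the football team trained yesterday",
    "the research team published a paper",
]
probs = bigram_probabilities(sports_corpus)
print(probs[("the", "football")])   # 0.5  -- "football" often follows "the" here
print(probs[("the", "research")])   # 0.25 -- less typical of this toy domain
```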

These are the ingredients that make the speech recognition task possible. But within them lie relevant subtleties, to mention just a few: physical variation (recording conditions), geographic variation (different pronunciations of the same word), discursive variation (tone of discourse, new words), human variation (emotions), and physiological variation (speaker's age, vocal tract). These are what make the difference between a successful recognition and a failed one.

VoiceInteraction's R&D team works hard every day to deliver new developments in the field, giving us a significant technological edge!

For more information, please contact us: info@voiceinteraction.tv