Urdu Speech Corpus: Development of Urdu Speech Recognition Software at ITU

A group of researchers at ITU’s Center for Speech and Language Technologies (CSaLT) laboratory has begun developing Urdu speech recognition software.

Speech recognition software now ships on almost every phone. From the Pakistani public's point of view, however, the main problem is how well this software works for them: because speech recognition software defaults to English, most people find it difficult to use.

Until now, no software has been developed to recognize and translate Urdu speech. For a computer to recognize a language, it must be fed a collection of words and sentences from that language. The collection includes words and sentences spoken in different dialects and with different sounds, which makes it easier for the algorithm to recognize a given word.

Dr. Agha Ali Raza, an assistant professor at Information Technology University, Lahore, who holds a Ph.D. in Language Technologies, has, along with his team, released a collection of Urdu words and sentences that a speech recognition algorithm can use. With this collection, the development of the software can be said to be halfway done.

CSaLT Phonetically Rich Urdu Speech Corpus

This collection of words, called a corpus, has been named the “CSaLT Phonetically Rich Urdu Speech Corpus”. It consists of 70 minutes of transcribed speech and contains 708 sentences that cover all 63 phonemes (the distinct sounds of a language). The corpus contains 5,656 words in total. According to Dr. Raza:

“Speech recognition is a two-step process. Additionally, the corpus will give the computer application access to all possible phonemes used in the formation of meaningful Urdu words from everyday speech”

Phonemes in the Urdu Language

He explained that although Urdu has only 63 phonemes, recognizing them is still difficult because the way a phoneme sounds depends on the phonemes spoken before and after it. In other words, any phoneme x can be preceded by any of 63 phonemes and followed by any of 63 phonemes, giving 63 × 63 = 3,969 possible three-phoneme (tri-phoneme) contexts for x. Astonishingly, the corpus he released covers all of them.
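
The arithmetic behind the tri-phoneme count can be sketched in a few lines of Python. This is only an illustration: the constant NUM_PHONEMES uses the figure of 63 quoted above, and the function names are hypothetical helpers, not part of any CSaLT tool.

```python
NUM_PHONEMES = 63  # phoneme count quoted in the article


def triphone_contexts(num_phonemes: int) -> int:
    """Number of (left, right) neighbour pairs around one fixed phoneme."""
    return num_phonemes * num_phonemes


def total_triphones(num_phonemes: int) -> int:
    """Upper bound on distinct left-centre-right phoneme triples."""
    return num_phonemes ** 3


print(triphone_contexts(NUM_PHONEMES))  # 3969
print(total_triphones(NUM_PHONEMES))    # 250047
```

The second figure is only an upper bound: many of these combinations may never occur in meaningful Urdu words.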

Dr. Raza developed the corpus with the help of a dedicated team, under the supervision of Dr. Sarmad Hussain. The corpus was developed as part of his master's thesis at NUCES, Lahore. Huda Sarfraz, Inaam Ullah, and Zahid Sarfaraz were among the others who helped him.

So far, the corpus has provided the raw material for developing Urdu speech recognition software. What is needed now is a database to store these words and make them accessible to the computer.

Dr. Raza stated:

“We hope that release of this corpus will also prove beneficial for regional languages in the country and languages lacking ample linguistic resources all over the world. Those interested in working on those languages can follow our technique to develop similar corpora of sentences in those languages”

One advantage of this work is that the technique used to build the corpus can now serve as a template for others. It should work for any language for which written material is available.
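
One common way to build a phonetically rich sentence set from written material is greedy selection: repeatedly pick the sentence that adds the most sound units not yet covered. The sketch below assumes that approach. The article does not describe the team's actual procedure, and the function name, the toy sentences, and the use of letters as stand-ins for phonemes are all illustrative assumptions.

```python
from typing import Dict, List, Set


def greedy_select(candidates: Dict[str, Set[str]]) -> List[str]:
    """Pick sentences until no candidate adds an uncovered unit.

    candidates maps each sentence to the set of sound units it contains.
    """
    covered: Set[str] = set()
    chosen: List[str] = []
    remaining = dict(candidates)
    while remaining:
        # Sentence contributing the most units not yet covered.
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # no remaining sentence adds anything new
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen


# Toy data: letters stand in for phonemes (hypothetical, not from the corpus).
toy = {
    "s1": {"a", "b", "c"},
    "s2": {"c", "d"},
    "s3": {"a", "b"},
    "s4": {"d", "e", "f", "g"},
}
selected = greedy_select(toy)
print(selected)  # ['s4', 's1']
```

In a real setting, the unit sets would come from a pronunciation lexicon for the target language, and the units could be single phonemes or tri-phonemes depending on the coverage goal.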

