The power of speech and Natural Language Processing
Updated: Mar 14
Now that we have sight, it becomes more powerful with the ability to understand and respond to human language. Responding requires the ability to speak, so let's start with that, as it's the easier piece of the puzzle.
Text to Speech
I used a package called gtts (Google Text-to-Speech).
Reading out text is as simple as passing the text and a language code (for example "en" for English) to gTTS.
So let's make it say "These are my first words":
output = gTTS(text="These are my first words", lang=language, slow=False)
Then we save the output to an audio file and play it.
So there we have it. As simple as that.
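The save step can be sketched roughly as below. This is only a minimal sketch, assuming the gTTS package is installed and network access is available, since gTTS calls Google's service. The function name `speak_to_file`, the file name `first_words.mp3`, and the `tts_class` parameter are illustrative choices of mine and not part of the original program.

```python
def speak_to_file(text, language="en", path="first_words.mp3", tts_class=None):
    """Convert text to speech, save it as an MP3 file, and return the path."""
    if tts_class is None:
        # Imported lazily so the function can be tried out with a stand-in class.
        from gtts import gTTS  # third-party package: pip install gTTS
        tts_class = gTTS
    output = tts_class(text=text, lang=language, slow=False)
    output.save(path)  # gTTS writes the spoken audio to this file
    return path
```

Calling `speak_to_file("These are my first words")` should leave an MP3 file that any audio player can open. How to play it from Python depends on the platform, so that part is left out here.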
Combining Vision and Speech
Now we will go one step further and connect vision to speech: the output of the object detection (vision) ability, that is, the text labels of the detected objects, is fed into this simple text-to-speech step so that the AI reads out what it sees.
This requires extracting the detected objects from the object detection program, plus some logic to collect each label only once in a Python list. In the object detection program the labels appear on every frame, and we don't want the AI to repeat a label 24 times every second! So a little thinking on the logic and there we have it: our AI, which speaks what it sees!
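The "collect each label only once" logic could look something like the sketch below. It is a guess at the idea rather than the actual program: the function name `collect_new_labels` and the sample frames are made up for illustration, and `print` stands in for the text-to-speech call.

```python
def collect_new_labels(frame_labels, seen):
    """Return the labels in this frame that were not seen before, in order.

    `seen` is a plain list that is updated in place across frames.
    """
    new_labels = []
    for label in frame_labels:
        if label not in seen:
            seen.append(label)
            new_labels.append(label)
    return new_labels


# Made-up labels for three consecutive frames.
frames = [["person", "dog"], ["person", "dog"], ["dog", "car"]]
seen = []
for labels in frames:
    for label in collect_new_labels(labels, seen):
        print(label)  # the real program would speak the label instead
```

Each label is announced the first time it shows up and is skipped on every later frame, so the AI does not repeat itself.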
Want to play an object detection game with my AI? Then take the challenge and check who wins. (Needs audio.) Good luck!
So did you beat the AI? Comment below and let me know!
You will notice that it could not detect the boat and missed other objects such as tomatoes. I will need to lower the confidence threshold to cover more objects. As for tomatoes, the AI is still a baby and its training did not include the lesson on identifying tomatoes. The program can currently identify 90 objects, but the more objects we include in the training, the more it will understand. Unlike for humans, however, learning in this case is not a matter of time.
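To show what lowering the confidence threshold does, here is a small sketch. The detections and scores are invented for illustration and are not output from my program, and `labels_above` is a hypothetical helper name.

```python
def labels_above(detections, threshold):
    """Keep the labels whose confidence score is at least the threshold."""
    return [label for label, score in detections if score >= threshold]


# Invented (label, confidence) pairs for a single frame.
detections = [("person", 0.92), ("boat", 0.41), ("dog", 0.67)]

print(labels_above(detections, 0.5))  # stricter threshold: the boat is dropped
print(labels_above(detections, 0.4))  # lower threshold: the boat is kept
```

The trade-off is that a lower threshold also lets through more wrong guesses, so it needs some tuning.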
Now that we think about what we have achieved, it's remarkable. We have just created an AI that can beat a human in recalling and speaking out the objects it saw in fairly large videos.
Next, I will be working on deploying this prototype on AWS so that anyone can try any YouTube video they want and the AI will speak out the contents of the video. And for large videos, let's see who wins: you or the AI...
Speech to Text
Now let's do the reverse and convert speech to text. I again use a Google API: the audio captured from the microphone is passed to the API, which returns the text. Here is the output of me reading out the first paragraph of the Wikipedia article on Canada.
Making sense of the Natural Language
The more challenging part is to make the AI understand human language.
I implemented a very basic sentiment analyzer that counts the ratio of positive to negative words in a given text. The lists of positive and negative words come from the paper: Minqing Hu and Bing Liu, "Mining and Summarizing Customer Reviews."
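A word-counting analyzer of this kind can be sketched as below. The two word sets are tiny made-up samples standing in for the full Hu and Liu lists, and the exact score formula, (positive - negative) / (positive + negative), is my assumption about one reasonable way to turn the counts into a ratio.

```python
import re

# Tiny illustrative samples; the real lists from the paper are far longer.
POSITIVE = {"good", "great", "love", "excellent", "amazing"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "disappointing"}


def sentiment_score(text, positive=POSITIVE, negative=NEGATIVE):
    """Return a score from -1.0 (all negative) to 1.0 (all positive).

    Returns 0.0 when the text has no words from either list.
    """
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in positive for word in words)
    neg = sum(word in negative for word in words)
    total = pos + neg
    return (pos - neg) / total if total else 0.0


# Made-up sentences, not real reviews.
print(sentiment_score("I love this watch, the screen is great"))
print(sentiment_score("Great screen but bad battery"))
```

This approach ignores negation ("not good") and sarcasm, which is one reason the results are not perfect.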
I tried it on a few customer reviews from Amazon for the Apple Watch, and here are the results. Not perfect, but not bad.
Finally, let's combine the speech to text and the sentiment analysis to make an AI that is more conversational in understanding sentiment.