Unlocking the Power of Voice Recognition and AI

🔗

In this blog post, I will expand upon the concepts introduced in my previous article, Leveraging Embedded Systems and C for Sensor-Based Logic. This post is part three of the Jarvis v3.0 series. I'll delve into integrating voice capabilities into the Jarvis system and introduce part four of the series, which will focus on Apple's Vision Pro AR Headset.

At its core, Jarvis is intended to be a personal AI-powered assistant designed to streamline daily tasks and assist me with various activities. While popular voice assistants like Google Home, Siri, and Alexa have made significant strides as voice-controlled assistants, my version of Jarvis sets itself apart by offering full programmability and access to the internet's vast resources. It's not just a voice assistant; it's a fully-fledged computer companion.

Full Fledged Computer Companion:

One key aspect distinguishing Jarvis from other AI assistants is its ability to function as a fully-fledged computer companion. Unlike typical cloud-based AI assistants, Jarvis runs locally on your computer, allowing it to interact directly with the operating system and leverage many features a device offers. This local operation provides Jarvis with flexibility and power unavailable to cloud-based systems.

For instance, Jarvis can interact with your computer's file system, allowing it to create, read, update, and delete files. You can ask Jarvis to organize your files, back up important documents, or even clean up your desktop. For example, you could say, "Jarvis, organize my documents folder by file type," Jarvis would sort all the files in your documents folder into subfolders based on their file type.

Furthermore, Jarvis can interact with other software on your computer. It can open applications, control media playback, and even automate tasks within specific applications. For instance, you could ask Jarvis to open your email client and read out your new emails, or you could ask it to open your web browser and search for a specific term.

Jarvis can also leverage the power of your computer's hardware. For example, Jarvis can use computer vision algorithms to recognize objects or people if your computer has a webcam. If your computer has a microphone, Jarvis can listen for voice commands, transcribe audio, or monitor for specific sounds.

One of the most powerful features of Jarvis is its ability to automate tasks. With its access to your computer's operating system and software, Jarvis can automate many tasks, from simple tasks like setting reminders or sending emails to complex tasks like data analysis or web scraping. For example, you could ask Jarvis to scrape data from a specific website daily and save it to a spreadsheet, or you could ask it to analyze a dataset and generate a report.

Inspired by the Jarvis system from Iron Man, I aimed to create an assistant that responds to voice commands and integrates with embedded sensor systems. The latest iteration of Jarvis is a significant leap towards this goal. It merges the foundational elements of voice recognition, generational and conversational AI, and multimedia displays, bridging the gap between fiction and reality.

Voice Recognition:

One of Jarvis's core user experience features is the ability to comprehend and interpret human speech. Jarvis can convert spoken words into text by leveraging advanced voice recognition algorithms offered by any of the following services.

Google Speech Recognition
Google Cloud Speech
Wit.ai
Microsoft Bing Voice Recognition
Microsoft Azure Speech
Houndify
IBM Speech to Text
Whisper API

I chose to use Google's free voice-to-text service. Still, an alternative approach is utilizing a service such as PicoVoice, which essentially does the same thing and offers a Python module capable of callbacks. In my implementation, I bypassed the need for callbacks by parsing the speech-to-text (STT) content for specific keywords, which enables the distinguishing of commands from questions or general conversation.

Generational and Conversational AI:

Jarvis goes beyond voice recognition and employs generational/conversational AI techniques to enable dynamic interactions and deliver more human-like responses to users via text-to-speech (TTS) audio. By leveraging existing Large Langauge Models (LLMs) and public AI models, Jarvis can understand and process various queries while facilitating natural and effortless communication with a user. For instance, if a user asks Jarvis about the weather and then asks, "What about tomorrow?" Jarvis understands that the second question refers to the weather forecast for the next day.

By leveraging existing LLMs and public AI models, Jarvis can understand and process various queries while facilitating natural and effortless communication with a user. Many text-based/conversational AI services are available; however, I have tested Open AI's GPT and Google's PALM AI and plan to write up a post comparing the pros/cons of each.

Enhancing User Experience with Multimedia Displays:

Another advancement in this version of Jarvis is incorporating multimedia. Through the power of Python programming, computer vision, and libraries like OpenCV, Jarvis brings visual feedback to the forefront. Whether displaying videos, images, or user feedback, Jarvis adds an immersive and engaging element to our interactions.

Intelligent Command Parsing and Media Integration:

What sets Jarvis apart from Google Home or Alexa is its ability to be more than a single command-based system. It can seamlessly transition between a command executor, query engine, and multimedia player by parsing inputs and recognizing key phrases. For instance, while enjoying a movie or a series of images, Jarvis can intelligently detect when a new command is issued and temporarily pause media playback, prioritizing user interaction and ensuring a smooth flow of communication. Once the command has been processed, Jarvis smoothly resumes the paused media.

The Path Towards Personalized AI Assistance:

By combining voice recognition, generational AI, and multimedia displays, Jarvis can transform interaction while making the experience more intuitive and immersive. In the upcoming blog post, I will delve into the integration of Jarvis with Apple's Vision Pro AR Headset, unlocking the potential of Heads-Up Display (HUD) capabilities. Imagine seeing Jarvis's responses as text or images displayed in your field of vision or using gestures to interact with Jarvis.