
Voice Technology: What You Should Know In 2020

September 28, 2020

Introduction

Imagine you are playing your favorite video game, Call of Duty, on your smartphone. The match will run for another 30 minutes, and you are getting hungry. You want to place a food order, but you can’t exactly take your eyes or hands off the game. What do you do now? Luckily, your voice is still free. If only there were a device that could listen to your spoken order and handle the rest, so that your food arrives at your doorstep 20 minutes later. Such a future is not far-fetched at all. In fact, a solution to your wishes already exists, and it’s called voice recognition technology. In this article, you will learn some key facts about voice technology and meet several well-known voice assistants.

Definition of voice technology

Figure 1: Voice search on the web. Source: pixabay.com

So, what is voice tech? Voice recognition technology can be a program (software) or a standalone device (hardware). What lies at its core, however, is the ability to take a user’s voice as input, or as a command, and either produce a desired output or execute a task. For example, to check the weather, instead of touching your smartphone, you can simply say “What’s the weather now?” and your phone will speak the result back to you.

Classification

Although there are many ways to classify voice recognition technology, for simplicity we can divide it into two types of speech recognition systems: speaker-dependent and speaker-independent. A speaker-dependent system must first be given training data (clips of the user’s voice) so it can get used to that voice before it starts recognizing it. A speaker-independent system needs no such training data from the user; the trade-off is that it can only recognize a limited vocabulary of words learned before the system shipped. Because consumer-grade products must work for anyone out of the box, the voice technology embedded in them is mostly in the form of speaker-independent systems.
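The distinction can be illustrated with a toy Python sketch. It is purely illustrative: it works on already-transcribed text and simply matches against a fixed command list, the way a speaker-independent system is limited to words it already knows; real systems operate on audio.

```python
from typing import Optional

# Purely illustrative: a speaker-independent system needs no per-user
# training, but only recognizes a fixed, limited vocabulary. This toy
# matcher works on transcribed text; real systems operate on audio.

LIMITED_VOCABULARY = {"play", "pause", "stop", "next", "previous"}

def recognize_command(transcript: str) -> Optional[str]:
    """Return the first known command word in the transcript, or None."""
    for word in transcript.lower().split():
        if word in LIMITED_VOCABULARY:
            return word
    return None
```

Any speaker can say “please pause the music” and be understood, but a request outside the vocabulary (say, “order a pizza”) is simply not recognized.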

Architecture 

Like many other computer-powered technologies, a typical voice-enabled app has two layers: 

  • Software (the frontend, a.k.a. the “face” of the app, covering the user interface and UX, plus the backend running behind-the-scenes services such as AI)
  • Hardware (a smart device: computers, smartphones, smart speakers, other IoT devices, etc.)

As Moore’s law shows no signs of slowing, voice technology can keep benefiting from the latest advances in both software and hardware.
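To make the software layer concrete, here is a minimal Python sketch of the typical pipeline inside a voice-enabled app. The speech-to-text step is stubbed out (a real app would call an STT engine there), and the intent names are made up for illustration.

```python
# Minimal sketch of the software layer of a voice-enabled app:
# hardware captures audio, software turns it into an action.

def speech_to_text(audio_bytes: bytes) -> str:
    # Stub: a real app would send the audio to an STT engine
    # (cloud-based or on-device) and return the transcript.
    return "what's the weather now"

def parse_intent(text: str) -> str:
    # Toy intent detection; real backends use trained NLU models.
    if "weather" in text:
        return "GetWeather"
    return "Unknown"

def handle(audio_bytes: bytes) -> str:
    intent = parse_intent(speech_to_text(audio_bytes))
    if intent == "GetWeather":
        return "It is sunny and 25 degrees."
    return "Sorry, I didn't catch that."
```

The returned string would then be fed to a text-to-speech engine so the device can speak the answer back.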

Applications

As you may have guessed, the potential for voice technology is enormous, with applications in many fields and more to come (maybe one day you will come up with a voice-tech solution to a pressing problem!). Here are a few of them:

  • Healthcare: recording patients’ medical history without having to interact digitally with the EHR (electronic health record).
  • Virtual assistants (or voice assistants), which can carry out almost any task a smart device can do using voice alone; these are explored further below.
  • Web browsing without touching a key.
  • Automatic subtitling (speech in an audio clip is transcribed to text and synced automatically).
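The automatic-subtitling idea can be sketched in a few lines of Python: given transcript segments with start and end times, as an STT engine would emit them, render the segments in the standard SRT subtitle format so they stay synced to the audio. The segment data here is hypothetical.

```python
# Sketch: render timestamped transcript segments as SRT subtitles.

def to_srt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def to_srt(segments):
    """segments: list of (start_sec, end_sec, text) tuples."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n"
        )
    return "\n".join(blocks)
```

For example, `to_srt([(0.0, 2.0, "Hello there")])` produces a numbered SRT block covering the first two seconds of the clip.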

Market forecast

Judging from the huge prospects of voice recognition technology, you’d be right to guess that demand for it can only grow. The market is projected to reach 8 billion voice assistants in use by 2023, most of them on smartphones, with a growing number of voice-enabled apps developed for smart TVs. Consequently, there are many opportunities for developers to use voice tech APIs to build apps that improve, and even transform, the user experience.

Software

For developers, since many voice tech hardware solutions are closed-source (e.g., smart speakers), the main avenue for working with voice tech is the software side. The software can be proprietary too, as the most popular voice technologies are. That said, there is a small but growing set of open-source, community-backed voice tech projects that may interest developers willing to take risks.

  • Popular closed-source voice recognition technologies (Alexa, Google Assistant, etc.) expose APIs that let developers and users create apps exploiting the potential of voice assistants.
  • Open-source voice assistant tech that is compatible with many types of hardware: Mycroft, Firefox Voice.

Next, we will introduce a number of popular voice assistants.

Well-known voice assistants

Amazon’s Alexa

Figure 2: Amazon Echo – A smart speaker with built-in Alexa developed by Amazon. Source: pixabay.com
  • Applications: You can build Alexa “skills” (voice-enabled apps), voice-forward devices, and enterprise solutions (e.g., Alexa for Hospitality).
  • Languages supported: Node.js and Python; other back-end languages are also supported.
  • How it works: A simple Alexa skill takes 4 steps to build: create the voice user interface, provide training data (utterances, intents, and slots) to the skill, test the skill, and finally deploy it.
  • Advantages: Alexa accounts for about 80% of the smart speaker market, and Amazon is also the most popular cloud provider with AWS. Alexa has thorough documentation, with detailed written and video tutorials for beginners.
  • Disadvantages: Alexa is not yet as widely used on mobile devices as Google Assistant.
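To make the training-data step concrete, here is a sketch of what an Alexa skill defines. The JSON shape follows Alexa’s interaction model schema, but the invocation name, intent name, slot, and sample utterances are made-up examples.

```python
import json

# Sketch of an Alexa interaction model: an intent, its slot, and
# sample utterances. "OrderFoodIntent" and "food helper" are
# illustrative names; AMAZON.Food is one of Alexa's built-in slot types.
interaction_model = {
    "interactionModel": {
        "languageModel": {
            "invocationName": "food helper",
            "intents": [
                {
                    "name": "OrderFoodIntent",
                    "slots": [
                        {"name": "dish", "type": "AMAZON.Food"}
                    ],
                    "samples": [
                        "order {dish}",
                        "get me a {dish}",
                    ],
                }
            ],
        }
    }
}

print(json.dumps(interaction_model, indent=2))
```

This JSON is what you supply in the developer console (or via the CLI) before testing and deploying the skill.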

Google Assistant 

Figure 3: A section of Google Assistant’s developer homepage, introducing common uses of Google Assistant
  • Applications: Google Assistant is available on most Android devices. As a result, a user can command Android apps hands-free by voice.
  • Programming languages supported: Node.js, Python, Java.
  • How it works: There are 4 ways to create Google Assistant apps, each with its own library and templates: mobile apps, personalized conversations on smart devices, smart home device control, and web content.
  • Advantages: Since Android devices dominate the smartphone market, Google Assistant is just a touch away. Google Assistant’s developer resources include extensive documentation, use cases, and a catalog of available apps. Google also offers an online training platform (Qwiklabs) to help you learn Google Assistant.
  • Disadvantages: Google Home’s share of the smart speaker market is small compared to Alexa’s (about 20% versus 80%).
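As a sketch of the “personalized conversations” route, one common pattern is a fulfillment webhook that receives the parsed intent and returns a spoken response. The function below follows the Dialogflow ES webhook format (a `queryResult` object in, a `fulfillmentText` reply out); the intent name is a made-up example.

```python
# Sketch of a Dialogflow ES fulfillment webhook handler: the
# assistant does the speech recognition and intent matching, then
# POSTs a JSON request; we only decide what to say back.

def fulfill(request: dict) -> dict:
    intent = request["queryResult"]["intent"]["displayName"]
    if intent == "GetWeather":  # illustrative intent name
        return {"fulfillmentText": "It is sunny and 25 degrees."}
    return {"fulfillmentText": "Sorry, I can't help with that yet."}
```

In production this function would sit behind an HTTPS endpoint registered as the agent’s fulfillment URL.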

Apple’s Siri

Figure 4: A section of Siri’s homepage, hinting at Siri’s popular “shortcuts” feature, which recommends actions users can take by voice
  • Applications: Similar to Google Assistant, Siri replaces a user’s touch input with voice to control an Apple device. Siri is also available on Apple Home (a smart home control system).
  • Programming languages supported: Swift.
  • How it works: For example, to create a Siri “shortcut” (a personalized action Siri recommends to users), you build custom intents and responses, then donate the shortcut for the action to Siri.
  • Advantages: Suitable for developers who want to build apps for an Apple audience.
  • Disadvantages: The documentation is less user-friendly than other voice recognition systems’, and only the Swift programming language is supported.

Facebook’s Wit

Figure 5: Wit’s homepage, showing a brief introduction to common usages
  • Applications: Wit can be integrated into smart homes, wearable devices, bots, mobile apps, etc.
  • Programming languages supported: Node.js, Python, Ruby, Go.
  • How it works: First, you provide training data to Wit; then your app makes an HTTP request to the API; finally, the parsed result is returned to your app.
  • Advantages: Wit is open-source software developed by Facebook, with an evolving online community supporting its development.
  • Disadvantages: Despite support for many programming languages, documentation for each language SDK is sparse.
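The HTTP-request step can be sketched with Python’s standard library. Wit’s `/message` endpoint takes an utterance and returns the detected intents and entities; the token below is a placeholder, and the request is only built, not sent.

```python
import urllib.parse
import urllib.request

# Sketch of calling Wit's /message endpoint: send an utterance,
# get back JSON with detected intents and entities.
WIT_TOKEN = "YOUR_SERVER_ACCESS_TOKEN"  # placeholder, not a real token

def build_wit_request(utterance: str) -> urllib.request.Request:
    query = urllib.parse.urlencode({"q": utterance})
    return urllib.request.Request(
        f"https://api.wit.ai/message?{query}",
        headers={"Authorization": f"Bearer {WIT_TOKEN}"},
    )

req = build_wit_request("what's the weather now")
# Sending it with urllib.request.urlopen(req) would return the
# parsed-meaning JSON (requires a valid token and network access).
```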

Mycroft

Figure 6: A section of Mycroft’s homepage, introducing its multi-platform availability
  • Applications: With Mycroft, you can make voice-enabled apps on computers, cars, smartphones, and even microcomputers like Raspberry Pi.
  • Programming languages supported: Python.
  • How it works: The steps to create a Mycroft “skill” (similar to an Alexa skill) are: define the input phrases, build intents, create intent handlers, and finally write a function that returns the skill.
  • Advantages: Mycroft is fully open-source, so you can modify its code as needed. It is available on multiple platforms, privacy-friendly and cloud-independent (meaning you can store your data locally).
  • Disadvantages: There is a learning curve for developers coming from other voice assistants, as you need to be comfortable with Python before making even a simple Mycroft skill. Overall, Mycroft is not yet popular with users or voice-enabled app developers.
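The skill steps above can be mirrored in a pure-Python stub. Note that this is not Mycroft’s actual API: real skills subclass `MycroftSkill` from the `mycroft` library and register handlers with decorators; the sketch below only illustrates the phrases-to-intents-to-handlers flow with made-up names.

```python
# Illustrative stub of the Mycroft skill flow, NOT the real
# MycroftSkill API: phrases map to an intent, the intent maps
# to a handler, and the handler produces the spoken reply.

class TimeSkillSketch:
    def __init__(self):
        # Steps 1-2: input phrases mapped to an intent name.
        self.intents = {
            "what time is it": "time.intent",
            "tell me the time": "time.intent",
        }
        # Step 3: intent handlers.
        self.handlers = {"time.intent": self.handle_time}

    def handle_time(self) -> str:
        return "It is twelve o'clock."

    def handle(self, utterance: str) -> str:
        """Dispatch an utterance to its registered handler."""
        intent = self.intents.get(utterance.lower())
        if intent is None:
            return "I don't know that one."
        return self.handlers[intent]()
```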

A conclusion about voice technology in 2020

Which voice-assistant development kit suits you depends on your project or product, your skill level, and the programming languages you know well. In the case of Siri, even if the documentation is a bit cryptic, there are plenty of easy-to-follow Siri tutorials on the web to get you started. Since the core of voice tech is the same everywhere, the skills you learn building a voice recognition app on one platform will transfer, and you will not have much difficulty learning another voice assistant.
