Best Options for Automatic Voice Transcription for Call Centers

Your data is valuable: not just the customer data you have on hand or the market data you use to forecast business strategy, but your call data too. Contact centers handle tens of thousands of calls per day. If you are recording 100k hours of calls and you are not taking full advantage of this data through analysis and machine learning, you are missing out on low-hanging fruit that can give you insights into best sales practices, efficiency gains, and customer-agent matching. Even if your company records only 1,000 hours of call data per week, you are missing out on massive potential gains.

This article is part of a series on speech analytics for call centers. Transcription is an essential nexus of the speech analytics pipeline. One should remember the adage “garbage in, garbage out” when evaluating the transcription process. Spending a little more on ensuring that your call transcription data gives you everything you need for the feature extraction and analysis will be well worth it. Transcription takes place near the beginning of the pipeline and everything else after it is built upon its integrity.

In this article we first discuss automatic call transcription and the details you need to know when looking for the best fit for your call data. We then discuss hands-off approaches to call transcription that let you leave most or all of the legwork to professionals. Finally, we discuss hands-on approaches that involve taking cutting-edge transcription models and tuning them to fit the idiosyncrasies of your contact center’s call data.

Example Pipeline for Call Data Processing

Automatic Call Transcription

Automatic call transcription vs manual transcription

The methods covered in this article are for automatic transcription only. Manual transcription is usually the most costly option, and its accuracy is less consistent from one human transcriber to the next than it is with machines. The best use for manually transcribed calls in the world of big data is that they provide a very accurate gold label for each example. This could be helpful in training a transcription model from scratch or in evaluating how accurately someone else’s model is transcribing your specific call type. Comparing Google’s API, your own model, and a professional transcription service would be prudent at the research or proof-of-concept phase of your speech analytics pipeline. Create a table with the cost, accuracy, turnaround time, and other factors for each transcription option.
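One way to fill in the accuracy column of that comparison is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn a candidate transcript into the gold transcript, divided by the number of words in the gold transcript. The article does not prescribe a metric, so treat the sketch below as one reasonable choice. The function name and the sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical gold label and machine output for one utterance.
gold = "thank you for calling"
candidate = "thank you four calling"
print(word_error_rate(gold, candidate))  # one wrong word out of four
```

Averaging this score over a few hundred manually transcribed calls for each candidate service gives a like-for-like accuracy figure to put next to cost and turnaround time.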

Post call transcription vs live transcription

Many transcription services tout a 12 hour turnaround time for returning your transcribed calls to you after recording them. Other options offer quicker turnaround or even live transcription. Receiving your transcriptions the day after the calls are recorded is not all that bad: you are able to analyze your data up through yesterday. This could enable you to keep a keen eye on your sales or service floor’s performance. It also gives you the data you need in order to start mining your calls for hidden features and to train artificial intelligence that can optimize your processes.

Live transcription offers all of the benefits of post call transcription, but it also enables you to adjust tactics while the agent is still on the call! One possible application is automatic flagging for upset customers. If the customer expresses behavior indicative of a meltdown, the agent receives a popup message with instructions and the manager receives a notice so that she can check in if necessary. Another possible application is a popup message for the agent with a tip on how to best sell to that customer based on that customer’s speech in the call thus far. Another use for live transcription in a contact center is to get a real-time pulse on how well your agents are adhering to the proven sales language behaviors you use in your speech analytics model.
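As a rough illustration of the upset-customer flag, the sketch below scans transcript segments as they arrive and raises a single alert once the customer has used enough flagged phrases. Everything here is an assumption made for the example: the `Segment` shape, the phrase list, and the threshold are hypothetical, and a production system would more likely use a trained model than a fixed phrase list.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

# Hypothetical phrase list; in practice this would come from your own call analysis.
ESCALATION_PHRASES = ("cancel my account", "speak to a manager", "this is ridiculous")


@dataclass
class Segment:
    speaker: str  # "agent" or "customer"
    text: str


def escalation_alerts(segments: Iterable[Segment], threshold: int = 2) -> Iterator[str]:
    """Yield one alert when the customer's flagged-phrase count reaches the threshold."""
    hits = 0
    alerted = False
    for segment in segments:
        if segment.speaker != "customer":
            continue  # only the customer channel counts toward the flag
        text = segment.text.lower()
        hits += sum(phrase in text for phrase in ESCALATION_PHRASES)
        if hits >= threshold and not alerted:
            alerted = True
            yield f"Escalation risk: {hits} flagged phrases so far"


# Simulated live stream of transcript segments.
stream = [
    Segment("agent", "I can help you cancel my account notes later"),
    Segment("customer", "This is ridiculous"),
    Segment("customer", "I want to speak to a manager"),
]
for alert in escalation_alerts(stream):
    print(alert)
```

Because the function is a generator, it can consume segments one at a time from a live feed and push the alert to the agent's popup and the manager's notice as soon as the threshold is crossed.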

Dual Channel

Contact centers split the audio data from recorded calls into two channels, one for the agent and one for the customer. You should maintain this split throughout your pipeline because you will want to be able to filter by speaker when you are doing analysis later on down the pipeline. You do not want to be trying to answer a question like “How often do my sales agents say x?” and suddenly realize that you can find how many times x was said in the call, but not by whom it was said.

Transcription Services (Hands-off approach)

Full-service style

For those who are looking for a more turnkey operation when it comes to transcribing their calls, a full-service transcription company can give you transcriptions with minimal effort from you and your team. These companies often offer additional services like analytics or an interactive dashboard that may be appealing to you if you want to pass everything off to the professionals and receive a finished product. One weakness of this approach is that it is the most costly way to get your calls transcribed. Full-service transcription providers include CallMiner and Verint.

API as black box

If you have an analytics team or development team on hand at your company, you may want to use an API (Application Programming Interface) for your transcription needs. Using an API gives you the opportunity to integrate transcription into code you already have. You get more options to customize where transcription fits into your pipeline and how you manage your call data, and you can configure any analytics or data visualization to your company’s needs.

Three popular APIs for speech-to-text transcription include IBM Watson, Google, and Amazon Transcribe. Google’s pricing is $0.006 per minute. Google caps the monthly usage at 1,000,000 minutes. Watson’s pricing starts at $0.02 per minute for the first 250,000 minutes, getting cheaper with each tier until the price bottoms out at $0.01 per minute after 1,000,000 minutes. The cost of using Amazon’s Transcribe API is $0.0004 per second (not in minutes as the others are) with a minimum charge per request of 15 seconds.

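Speech-to-text API costs

                          Amazon Transcribe   Google Speech-to-text   IBM Watson (first tier)
Cost per second USD       0.0004              0.0001                  0.00033
Cost per minute USD       0.024               0.006                   0.02
Cost per 15 minute call   0.36                0.09                    0.30

Figures are derived from the prices quoted above. The IBM Watson column uses the first-tier rate of $0.02 per minute, and its per-second figure is rounded.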
Of the APIs covered here, Google’s is the least expensive. However, Google caps usage at 1,000,000 minutes per month, which would limit you to no more than 100k 10 minute calls per month or around 66k 15 minute calls per month.
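The arithmetic behind this comparison can be sketched in a few lines. The rates are the ones quoted in this article and may not match current price lists. The sketch assumes Google and Watson bill in exact proportion to call length with no rounding, which is a simplification, and it uses only Watson's first-tier rate because the intermediate tiers are not listed here. All function and constant names are made up for the example.

```python
# Rates as quoted in this article (USD); check current price lists before relying on them.
AMAZON_PER_SECOND = 0.0004
AMAZON_MIN_SECONDS = 15               # minimum charge per request
GOOGLE_PER_MINUTE = 0.006
GOOGLE_MONTHLY_CAP_MINUTES = 1_000_000
WATSON_FIRST_TIER_PER_MINUTE = 0.02   # applies to the first 250,000 minutes only


def amazon_cost(call_seconds: float) -> float:
    """Per-second billing with a 15 second minimum per request."""
    return max(call_seconds, AMAZON_MIN_SECONDS) * AMAZON_PER_SECOND


def google_cost(call_seconds: float) -> float:
    """Assumes billing is exactly proportional to duration (no rounding)."""
    return call_seconds / 60 * GOOGLE_PER_MINUTE


def watson_cost(call_seconds: float) -> float:
    """First-tier rate only; assumes billing proportional to duration."""
    return call_seconds / 60 * WATSON_FIRST_TIER_PER_MINUTE


def max_calls_under_google_cap(call_minutes: int) -> int:
    """Whole calls of a given length that fit under the monthly cap."""
    return GOOGLE_MONTHLY_CAP_MINUTES // call_minutes


call_seconds = 15 * 60
for name, cost in [("Amazon", amazon_cost), ("Google", google_cost), ("Watson", watson_cost)]:
    print(f"{name}: ${cost(call_seconds):.2f} per 15 minute call")
print(max_calls_under_google_cap(10), max_calls_under_google_cap(15))
```

Under these assumptions a 15 minute call costs about $0.36 with Amazon, $0.09 with Google, and $0.30 with Watson at its first tier, and the Google cap works out to 100,000 ten minute calls or 66,666 fifteen minute calls per month.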

Hands-on Approach

API personalized for your unique call type

The APIs mentioned above offer some ways to tailor the transcription to your needs. Watson’s phrase flagging could be useful when you want to look for specific phrases in your calls. One example would be making sure that your agents read a certain disclaimer at a given phase in the call. Since you curate the code that calls the API, you will be able to adjust the parameters as needed.
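The disclaimer check can also be done on your side once transcripts come back, whichever API produced them. The sketch below does not call Watson or any other service. It assumes a hypothetical transcript shape, a list of (start time in seconds, text) pairs taken from the agent channel, and checks whether the disclaimer wording appears within an opening window of the call.

```python
import re
from typing import List, Tuple


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()


def disclaimer_read(
    agent_segments: List[Tuple[float, str]],
    disclaimer: str,
    within_seconds: float,
) -> bool:
    """True if the disclaimer words appear, in order and contiguously,
    in agent speech that starts within the first `within_seconds`."""
    target = normalize(disclaimer)
    if not target:
        raise ValueError("disclaimer must contain at least one word")

    # Join the early agent segments so a disclaimer that spans
    # two or more segments is still found.
    words: List[str] = []
    for start, text in agent_segments:
        if start <= within_seconds:
            words.extend(normalize(text))

    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


# Hypothetical agent-channel transcript: (start time in seconds, text).
agent_channel = [
    (0.0, "Thanks for calling."),
    (4.2, "This call may be recorded"),
    (9.0, "for quality purposes."),
]
print(disclaimer_read(agent_channel, "This call may be recorded for quality purposes", 30))
```

This is an exact-wording match, so a transcription error inside the disclaimer would cause a miss. A fuzzier comparison may be worth considering if your transcripts are noisy.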

Open source solutions

CMU Sphinx is an open source project from Carnegie Mellon University that brings speech-to-text to anyone who can download and run the software. This approach is by far the least expensive in terms of cost per minute to transcribe: it is effectively zero. The challenges of using CMU Sphinx are that you install and run the software on your own machines and are responsible for updates. If any issues arise you may reach out to the open source community for help, but you will ultimately be responsible for keeping your transcriber running.

Build your own model

Building your own speech-to-text model is only for the most audacious. It would be an arduous task, and it is hard to know up front how many resources it would take. But if you want a model that fits the idiosyncrasies of your specific call type as closely as possible, it may be useful. However, you should soberly weigh the costs of building your own against the costs of using another’s model. Additionally, even if your team is composed of several PhDs in computational linguistics, you may find it extremely difficult to achieve an accuracy score on par with the services and APIs above.

Next steps

What does your call processing pipeline look like? Are you wondering what next steps to take in modernizing it? At Front Analytics we have experience in speech pipeline engineering, optimization, reporting, and integrating machine learning and AI into clients’ current processes. We have empowered numerous clients with the tools and knowledge to utilize the most up-to-date technology in artificial intelligence, machine learning, and data streaming on their valuable data. Let us see how we can help you extract the most from your call data.


Call Now 1 (801) 716-8101
