What exactly is AI Voice Over and how will this affect the Voiceover Industry? Previously known as TTS, or Text To Speech, AI Voice Over (Artificially Intelligent Voice Over) is a huge and growing part of the industry that has far-reaching consequences for all voiceover artists, agents and buyers alike.
This article looks at AI Voice Over in-depth, explains what needs to be explained, and debunks myths that need to be debunked!
We are going to look at:
- The History of AI in Voice
- Concatenated Phrasing
- Phonemic Concatenation
- Algorithmic AI and the Tacotron 2 model
- Where AI Voice Over is Now?
- The Different Types of AI VO in the Industry
- AI Voice Contracts
- Your Current VO Contracts
- AI VO Contracts Themselves
- Is TTS / AI Voice Forever?!
- How To Choose Whether to Do an AI Voice Over Job?
- How to Price AI Voice Over Jobs
- Conclusion
So. To bring you up to speed and understand the current industry, we first need to look at what has gone before so that all is clear. So first up….
The History of AI Voice Over
AI Voice Over has gone through several evolutions in the past few decades. Let’s look at them all in chronological order:
Concatenated Phrasing
The very first iteration of AI Voice Over was concatenated phrasing. Concatenation means the joining of more than one thing together, in this case vocal phrases. The most obvious examples of this were the sentences created on telephone systems or at train stations from a bank of pre-recorded material:
“The train arriving on platform – 1 – is the – 13 – 35 – Great Western – Service to – Scunthorpe – calling at – Wembley Stadium – High Wycombe – Princes Risborough…” … and so on
This was an effective, yet basic way of creating multiple sentences from previously recorded material. It’s debatable whether this could be actually described as AI Voice Over in its truest sense, but this approach was the first application of it in the real world.
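To make the idea concrete, here is a minimal sketch of concatenated phrasing in Python. The clip bank, the file names and the `build_announcement` function are all made up for illustration; a real station system would store and play actual audio rather than file names.

```python
# Hypothetical bank of pre-recorded clips, keyed by the phrase spoken.
# File names stand in for the real audio recordings.
CLIP_BANK = {
    "The train arriving on platform": "intro.wav",
    "1": "num_1.wav",
    "is the": "is_the.wav",
    "13": "num_13.wav",
    "35": "num_35.wav",
    "Great Western": "operator_gw.wav",
    "service to": "service_to.wav",
    "Scunthorpe": "stn_scunthorpe.wav",
}


def build_announcement(phrases):
    """Return the clip files to play, in order, for a list of phrases."""
    missing = [p for p in phrases if p not in CLIP_BANK]
    if missing:
        # Concatenated phrasing can only say what has been recorded.
        raise KeyError(f"No recording for: {missing}")
    return [CLIP_BANK[p] for p in phrases]


playlist = build_announcement(
    ["The train arriving on platform", "1", "is the", "13", "35",
     "Great Western", "service to", "Scunthorpe"]
)
```

The limitation is visible straight away: ask for a station that was never recorded and the function can only fail, which is exactly why the later approaches moved to smaller building blocks.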
Phonemic Concatenation
AI Voice Over then advanced to Phonemic Concatenation. Wait, what on earth is a Phoneme!?
Ok, a phoneme is a linguistic term for the smallest unit that speech can be broken into, or to use the dictionary definition: “The perceptually distinct units of sound in a specified language that distinguish one word from another”
To give you an example of these phonemes in the simplest sense, let’s take the word ‘Cup’. The phonemes in Cup are C-u-p. The phonemes in Bathroom are B-aaah-th-r-oo-m.
Phonemic Concatenation is the same approach as the original Concatenated Phrasing, just with phonemes instead of phrases.
So if you record the B, the Aah, the Th, the R, the OO and the M, and then glue them all together, you get….Bathroom.
Now obviously, this is a much more complicated thing to do, and there is a much, much greater chance of this sounding completely rubbish, especially if you actually recorded the phonemes individually – it just wouldn’t work and would sound very stilted.
So the initial masters of phonemic concatenation, the tech giants, found algorithmic ways to record large collections of dialogue (voiced of course, by voiceover artists and actors, not machines), cut them up into phonemes, then glue them back together to make sentences. They would then apply clever smoothing algorithms to make the end-result less ‘bumpy’.
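The article doesn't say which smoothing algorithms the tech giants used, so treat the following purely as an illustration of the general idea. It shows a linear crossfade, one of the simplest ways to soften the join between two pieces of audio. The `crossfade` function and the toy sample values are assumptions of mine, not anyone's production method.

```python
def crossfade(a, b, overlap):
    """Join two lists of audio samples, blending the last `overlap`
    samples of `a` into the first `overlap` samples of `b`."""
    if overlap <= 0 or overlap > min(len(a), len(b)):
        raise ValueError("overlap must be between 1 and the shorter clip length")
    blended = []
    for i in range(overlap):
        # Weight of `b` rises steadily across the overlap region.
        w = (i + 1) / (overlap + 1)
        blended.append((1 - w) * a[len(a) - overlap + i] + w * b[i])
    return a[:len(a) - overlap] + blended + b[overlap:]


# Toy example: a loud clip followed by a silent one.
joined = crossfade([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], overlap=3)
# joined == [1.0, 0.75, 0.5, 0.25, 0.0]
```

Instead of jumping straight from 1.0 to 0.0, the join ramps down over three samples, which is the 'less bumpy' effect described above in miniature.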
Massive Collections of AI Voice Over Data
For the sake of everyone’s sanity, I won’t get into the linguistic terms of Diaphonemes and Allophones, suffice it to say that these phonemic concatenations were created on-the-fly and live, in a matter of milliseconds.
You typed into the system “Hey I want to be a robot!” and within milliseconds, the system had spat out the speech to you.
But how did they actually do it?
The truth of the phonemic concatenation system is that vast amounts of data needed to be recorded. Not only covering all possible different constituent parts of the language in question, but also many different versions to allow matching of tones, pitch, speed and so on.
The recorded sentences were then cut into tiny parts, ready to be glued back together later.
So if we were making the word ‘Robot’, we need:
- The “space to the R” sound
- the “R to the O” sound
- the “O to the B” sound
- the “B to the O” sound
- the “O to the T” sound
- and then the “T to a space” sound.
…taking just one of these, for example, the algorithm could look at the 25 “R to the O” sounds it had stored in its database, pick the most appropriate based on its pitch or prosody, select that, move to the next one and so on.
Et voila! Robot!
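For the technically curious, the selection process for ‘Robot’ can be sketched in a few lines of Python. Everything here is hypothetical: the tiny `DATABASE`, the pitch values, and the scoring rule (pick the candidate closest in pitch to a target and to the previous choice) are my simplifications. Letters also stand in for phonemes, as in the examples above.

```python
# Hypothetical database: each transition has several recorded candidates,
# described here only by an id and an average pitch in Hz.
DATABASE = {
    "_R": [("_R#1", 110.0), ("_R#2", 128.0)],
    "RO": [("RO#1", 105.0), ("RO#2", 122.0), ("RO#3", 140.0)],
    "OB": [("OB#1", 118.0), ("OB#2", 150.0)],
    "BO": [("BO#1", 100.0), ("BO#2", 121.0)],
    "OT": [("OT#1", 119.0), ("OT#2", 135.0)],
    "T_": [("T_#1", 90.0), ("T_#2", 115.0)],
}


def to_transitions(word):
    """Split a word into transitions, with '_' standing for a space."""
    padded = "_" + word.upper() + "_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def select_units(word, target_pitch):
    """Greedy selection: for each transition, pick the candidate closest
    in pitch to the target and to the previously chosen unit."""
    chosen = []
    previous_pitch = target_pitch
    for transition in to_transitions(word):
        best = min(
            DATABASE[transition],
            key=lambda c: abs(c[1] - target_pitch) + abs(c[1] - previous_pitch),
        )
        chosen.append(best[0])
        previous_pitch = best[1]
    return chosen


units = select_units("robot", target_pitch=120.0)
# units == ['_R#2', 'RO#2', 'OB#1', 'BO#2', 'OT#1', 'T_#2']
```

With only two or three candidates per transition the choice is crude. With 25 or more, the system has a far better chance of finding pieces that sit naturally next to each other.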
This is why the voice sessions were often 6 months +, recording 5 days a week! There was such a vast amount of voice data that needed to be captured, to get enough coverage to make the whole thing work.
The more data you had, the less chance of glitches you had. The less data you had, the more bumpy it was going to sound.
What you Speak is What You Get!
One of the more interesting things about the phonemic concatenation method, and indeed, AI Voice Over in general, is that the style in which you deliver the dialogue into the system (i.e., how you record it) is how it sounds when it comes out the other end.
If you deliver dialogue in a very sad way, the AI Voice Over at the end will sound sad!
Playback Dialogue
But fantastic though the ‘big 5’s text-to-speech services sounded – you know them all, Amazon, Google, Apple, Samsung etc – not everything was as it seemed. Sometimes you listened to the output and thought…..”Holy cow, that’s impressive!”
But as with many forms of media and entertainment, there was (and is) a certain amount of smoke and mirrors involved.
Many of the firms actually mixed the AI Voice Over output with normally recorded, or ‘wild’ lines of dialogue.
It was genuinely impressive because it was just the voice actor being great!
To make their systems more efficient still, the companies would cache requests so that less processing had to be done. If they received a string request for “What size shoes does Hugh Grant wear?”, that was processed and stored so that the next time that string was asked for, it was already there to be delivered.
This form of mixing playback lines with AI Voice Over lines still exists to this day, and is one of the reasons that the voice over artists worrying about their livelihoods should not panic just yet. More on that later.
Either way, the production costs were prohibitively high for large scale production of the phonemic concatenation model due to the sheer scale of data needed; the big tech companies could afford to do it, but no-one else could. So, the next evolution occurred….
Algorithmic AI Voiceover & “Tacotron 2”
For many reasons, primarily cost but also giant leaps in machine learning technology – oh, and notice that I don’t use the term Artificial Intelligence here, because it’s not actually AI, just machine learning – the industry moved to a more algorithmic model.
The first widely adopted model was called the Tacotron 2 model, which uses machine learning to analyse voice data, create a ‘model’ and then use that model to output speech files.
For anyone who is interested in a slightly more in-depth look into Tacotron 2, check out Google’s notes and documents on the subject.
Tacotron 2 had its flaws; it had a comparatively low bitrate and bit-depth, making it sound a little ‘lossy’, and took a lot of processing time to build a model. It was also comparatively slow to process, meaning that most companies or teams using it couldn’t process in ‘real-time’, or at least real-time enough for a customer to not notice the lag time needed.
But it was a huge leap forwards. No longer did you need to record 500,000 words of dialogue; now 50,000 would do. Then 40,000. Then 20,000, and so on.
Not only that, platforms like Google released their Google Cloud Development platform, making machine learning commercially available to anyone who wanted to pay.
It was now readily available to the entire world, and this spawned the gold-rush of AI companies wanting to work in AI Voice Over. Just from my own tracking records, and the Samsung Now AI reports, in 2018 we counted around 15 companies working in the sector, by mid 2020, we stopped counting at 250, and now there are thousands around the world, all forging new ground, and in innovative ways.
Where AI Voice Over is Now
The AI models have now evolved. Some are still using Tacotron, some have evolved their own systems and algorithms. Some are quite fantastic sounding, some are pretty terrible, and there are many iterations in-between.
Some companies, like Replica Studios and Veritone, are actively trying to engage the voice community and come up with fair ways of remunerating the artists.
Some companies, like Speechelo, are actively working against the voice community, are VC backed and are trying to disrupt the market to their own advantage.
Some companies like Voice123, the pay-to-play job site, have started taking an active role and engaged AI Voice Over companies to create models for them, as tests to start the research process.
But suffice it to say, Pandora’s Box cannot be closed and the AI Voice Over industry is firmly here to stay.
So it’s not going away. How much should we as voice artists be worried, and how is it going to affect us? Well to understand that, we need to answer a few more questions first. Let’s look at:
The Different types of AI Voice Over in the industry
This section looks not from the consumer’s point of view, but from ours, the voiceover artist’s. There are a few different types of AI Voice Over that we should be aware of:
- Company AI Voice Over
- Contract AI Voice Over
- Prospective AI Voice Over
- P2P AI Voice Over
- Own Model AI Voice Over
- Training Model Jobs
There are so many (and ever increasing) opportunities for doing this kind of work, that it can be a bit of a minefield, and many VOs just don’t know what to agree to, nor why they should. Or shouldn’t.
So let’s go through each of these in-turn.
Company AI Voice Over
This is where you are being employed by a company (for example, Amazon, Google, Apple etc) to be the voice of their own brand TTS / AI voice.
Some important things to consider here.
Positives:
- You know where and for whom this is going to be used.
- You know which platforms this will be used on
- You can estimate a rough shelf-life of the product based on the company
- You are likely to get a lot of exposure
- You are likely to be asked to iteratively record more in the future
- You can charge additionally for exclusivity
- The casting processes are large and time-consuming; you have leverage to negotiate once you are chosen as the final voice
Negatives:
- The company will almost definitely insist on a full buyout, in-perpetuity
- You may not be able to get work with direct competitors
- The company might sell your voice in the future which you may have no control over
Conclusion:
Generally, if you’re going to do AI Voice Over work at all, these are often the safe bets. You know what you’re getting into, where the voice will be used and what you are compensated for it – you can make a legitimate and educated choice to do the work or not.
Contract AI Voice Over
This is quite similar to Company, in that you know what you are getting yourself into. Ostensibly, you are being hired to do one contract and fulfil that contract with dialogue recordings which will then be made into a voice model for use on that particular contract.
A good example of this would be creating a character model for a game, which is very likely only to be used for that game.
Another example might be creating a model for some of the new speech-to-speech technologies, such as those developed by companies like Altered AI, where you are creating a voice model for another actor to perform in, just like wearing a ‘voice’ for a performance.
Positives:
- You know where and for whom this is going to be used.
- You know which platforms this will be used on
- You can estimate a rough shelf-life of the product based on the contract
- You might get a lot of exposure, and indeed are likely to not be under NDA, so could use this on your resume
- You are likely to be asked to iteratively record more in the future
- You can charge additionally for exclusivity if required
- Full-buyout in perpetuity can be negotiated down to the lifetime and limitations of that particular contract
Negatives:
- The company might sell your voice in the future which you may have no control over
Conclusion:
Properly negotiated and contracted, Contract is likely to be a very good bet, and to present huge opportunities for voice artists in the industry. As the AI Voice Over industry grows, these contracts will grow and grow and be more and more available.
These should be seen as opportunities, and a great potential of the future of voice artists working in the sector.
Prospective AI Voice Over
If you’ve been working in the VO industry for any length of time, you’ve doubtless seen these gigs come up already, and likely been contacted by some of the companies doing them.
The process goes something like this. They contract you to provide the voice. They then create a model. They then find a customer to buy the model. They sell the model to that customer.
Positives:
- You get some voice work.
- You might get a lot of exposure
Negatives:
- The company will sell your voice in the future which you may have no control over. You cannot decide or vote on where this is sold, whether this is to a reputable company, or to be the voice of a porn-site or a sex-doll (I do not jest: both of which we have had reports of happening).
- You do not know which platforms this will be used on
- You cannot estimate a rough shelf-life of the product based on the contract
- You are unlikely to be asked to iteratively record more in the future
- You cannot charge additionally for exclusivity if required
- Full-buyout in perpetuity will absolutely be required
- You will very likely exclude yourself from many markets. For example, if your model is sold to British Airways, you can no longer work for any other airline. If they also sell it to Ford, you can no longer work for any other car company etc.
- You will likely encounter legal problems in the future where you are asked to be exclusive for a company but cannot due to your existing models in the marketplace.
- You are paid once and then never again, even though the hiring company will sell, and resell, and resell – you are excluded from future usage.
Conclusion:
It is your choice how you work, entirely. However, we do not advise doing prospective AI Voice Over work.
The caveat to this would be unless you have watertight contracts that allow you to decide on usage, where it is sold, get future royalty payments and so on…..just as you do now with your own voice.
P2P AI Voice Over
This section of the industry is where you are involved in creating a model, and the host company then sells the dialogue line-by-line, or in bundles, or subscriptions (with x lines or unlimited lines per month) and so on.
These are generally sold line by line, as companies such as Murf AI do, or sometimes in bundles or subscriptions, sometimes for games and so on.
Positives:
- You get some voice work.
- You might get a lot of exposure
- Some companies will pay royalties or ongoing fees or usage
Negatives:
- The company will sell your voice in the future which you may have no control over. As with the Prospective model, you cannot decide or vote on where this is sold, or whether it goes to a reputable company.
- You do not know which platforms this will be used on
- You cannot estimate a rough shelf-life of the product based on the contract
- You are unlikely to be asked to iteratively record more in the future
- You cannot charge additionally for exclusivity if required
- Full-buyout in perpetuity will absolutely be required
- You will very likely exclude yourself from many markets.
- You have no control over the subject matter being used; you may be a Republican but have your voice used promoting the Democrats, or for tobacco advertising, or alcohol etc.
- There is no policing on the internet where this can be used, nor how much.
- You will likely encounter legal problems in the future where you are asked to be exclusive for a company but cannot due to your existing models in the marketplace.
- You are paid once and then never again, even though the hiring company will sell, and resell, and resell – you are excluded from future usage.
Conclusion:
We do not advise doing P2P AI Voice Over work. The caveat to this, again, would be unless you have watertight contracts that allow you to decide on usage, where it is sold, get future royalty payments and so on, in the same way as Prospective.
One other point to note though, is that there is often Character work in the P2P genre, and if you are doing character work that does not sound like your natural voice, you might not be worried about signing away the rights to this particular character as it might not affect you so much in the future.
Conclusion – Part 2 – Tracking & Usage
One of the main reasons that this part of the industry is so difficult is that there is no standard, global way of tracking or watermarking in the industry yet. But, we are seeing huge amounts of VC money being thrown into non-fungible tokens and blockchain technology, which may well yield a solution to this problem.
Once that problem is solved, then tracking, and therefore usage, could become a large part of our earnings and the advice to steer clear may well change. But that’s not currently the case.
Your Own Model AI Voice Over
So, being a human, you kind of have to sleep. But the internet doesn’t and is a 24×7 global industry.
Although it’s not at all common at the moment, there is a future train of thought that we as voice artists can create our own models (likely by a company we partner with, or pay for the service), and then sell them in an evergreen-fashion ourselves, or on a marketplace.
The idea is that you would continue doing your own VO work, but also have your model working for you alongside that.
Because this hasn’t really taken off yet, it’s not really clear what the pros and cons of this situation are.
How much would it cost to create the model? How much work would you get from it? Would it be worth it? How much control would you actually have on who purchases, what it’s used for and where?
The one thing that is clear is that the contracts you have with the partner are still going to be king, and just as important as the contracts in any other section of the AI Voice Over industry.
Training Model AI Voice Over
This one is a little bit of a red-herring, as it’s not really a genre. But there are lots of jobs in the industry at the moment that focus on this, and they are widely misunderstood. So I thought I’d include it for you.
When a model is created there are lots of component parts and data points. There’s the actual voice that’s used and heard, yes (see the above types of jobs!) but there are others too.
One of these is prosody patterns: the inflection and pitch of your sentence delivery, over time. Your natural prosody pattern is not the same as mine. Mine is not the same as your mother’s. Your mother’s is not the same as my son’s first-grade teacher’s, and so on.
Training jobs are just that. Your voice is used as a basis for training the model to do other things – it’s not going to be using your actual voice to do it, and your voice won’t be heard in the final result.
Clearly it’s hugely important to check your contract with these jobs to make sure they aren’t pulling a fast one, but these are legitimate jobs, and don’t really affect your future career, usage, reputation and so on. But because these are not actually using your voice, they tend to be much lower paid.
AI Voice Over Contracts
…and this is where it starts to get tricky. I’m not a lawyer, but have had a lot of experience with AI Voice Contracts. Do not take this section lightly, skim it, skip it, or otherwise not ingest this section in the same way that James Michael Collins ingests lobsters…
GFTB’s own Bev Standing recently took TikTok to court for misuse of her voice model, so getting these right at the start is very important.
Your Current Voice Contracts
That’s right! We haven’t even gotten onto the AI Voice Over Contracts yet!
I recently asked the attendees at the One Voice Conference if they used voice contracts on every job they did. Out of 100 people in my session, only 3 put their hands up. This is madness, people!
If you do not have a contract with your employers you are leaving yourself wide open, and worse, stopping future earnings with that company!
Please go and review your contracts, and the Contracts webinars at Gravy For The Brain now, in order to protect yourself.
But, what does this have to do with AI Voice Over I hear you ask?
It’s this. With all your current jobs that are not AI related, for example, e-learning….have you signed a bought-out, in-perpetuity contract?
Or maybe you remember some language in a contract that said something like…..”on any known platforms, items, instances or devices in the future whether known or not known now”?
If you did either of those…and let’s be honest, we all have…. that company can now legally go and create a voice model with the recordings you did for them in the past, sell it to whoever they want and you have no rights to that at all.
The Moral of this Section!
….could not be clearer:
Go and update all your current T&C’s, contracts, templates, whatever, to include language that specifies… that the recordings you are hired for may not be used to create artificial voice models of any type, now or in the future, on any platforms or devices known or not yet known….
Now is the time to protect yourself for everything that you do in your profession.
The Actual AI Voice Over Contracts!
Ok, so having scared you to death about your current contracts, let’s look at things you might want to consider with the AI Voice Over contract when you get it, or when you’re negotiating it:
- Exactly how much dialogue are you recording?
- Do you have an exclusivity period, and if so how long? What happens at the end of this period?
- What fee are you getting?
- What usage are you getting?
- What is your future working / pickup rate?
- What kind of model are your recordings going to be used for?
- How long can they use your model for?
- Where can they use your model?
- On what platforms can they use your model?
- In which territories can they use your model?
- Can your model or individual recordings, or individual lines be sold? Can they be resold? Do you have any say in this?
- What happens if the host company is sold? What happens to your model? Do you have any say in this?
- Can they use your model or recordings or lines on broadcast media, such as television, film, on-demand, radio, podcasts, and if so how are you remunerated for the usage?
- Can your recordings be used to train other models?
- Can your model be performed by another human (e.g., speech to speech)?
- Can your recordings / model be amalgamated with another person’s model or recordings?
- Can your recordings or model be manipulated, e.g., pitch shifted, made to be emotional etc.?
- How is the host company going to track or watermark the recordings, or model?
- What specifically are your rights, and which are being licensed or signed away by the contract?
- Are you being contracted by a third party and not directly by the client? If so, who is the client? How will you contact them when the job is over?
Now clearly that is a long list, but it’s by no means exhaustive. My advice would be to consult an experienced legal entity well-versed in these matters, and become a member of your local union, such as Equity or SAG-AFTRA, who are all working hard on these subjects on your behalf.
Google and peer-led advice are also your friends here.
Is TTS / AI Voice Forever?!
It’s often bandied about that if you sign an in-perpetuity deal, the evil companies have your voice forever. And technically, yes that’s true. But what’s the reality?
The reality is that technology and development move faster now than at any point in human history.
Jon Briggs and Susan Bennett were the first TTS voices for Apple’s Siri, and went global in fame (largely because they both were involved in Prospective AI that didn’t give them the remuneration they deserved) – but were replaced on Siri in less than a decade.
Standards change. Approaches and methodologies change. Voices and fashions change.
A voice you record now is not very likely to still be in circulation in 20 years’ time. That’s worth bearing in mind when thinking about whether to actually take an AI Voice Over job.
How To Choose Whether to Do an AI Voice Over Job?
The answer to this is, of course, highly subjective. But I hope the information given within this article will help you.
Think about:
- The type of job it is
- The future ramifications to you (even if they aren’t obvious)
- Whether the remuneration is worth it
How would you advise yourself if you could look back in ten years’ time to now!? (Sorry, I’m a trekkie…)
Crucially though – have a good conversation with the hiring party. Ask them questions. All the questions. The list of considerations I posed above.
Immerse yourself in the detail of this project.
If the client isn’t willing to answer all the questions openly and honestly, it’s probably time to thank them for their time, and walk away.
If they do answer you properly and engage with you, get all the information you need and then make an educated & informed decision.
Good AI Voice Over jobs are fantastic, and can be excellent for your career and progression. Bad ones…not so much.
How to Price AI Voice Over Jobs
This is probably the hardest question to answer. The fees that I have experience of, through my own knowledge and through seeing other jobs come and go, range from the low thousands to the medium hundreds of thousands.
It’s almost impossible to come up with a consistent pricing structure for AI Voice Over gigs, because by their very definition, they are so inconsistent.
Take a look at the GFTB Rate Guide and look up TTS / AI for more information and you’ll get a better idea.
The most important thing is to be fully aware of the project, its reach, the usage and the client, and then discuss with peers an appropriate compensation level.
Remember, once you’re actually chosen, you have way more leverage than you think.
Conclusion
I hope you’ve found this useful. This is still an industry in its infancy and things are changing rapidly. I would encourage you not to be a naysayer or a prophet of doom regarding AI Voice Over jobs because….
Good AI gigs are absolutely fantastic and will enhance your career!
It’s just that….
Bad AI gigs have the potential to be very harmful to you!
…and each case must be looked at in turn to see their individual merits, problems and details. Yes the waters here are often shark-infested, but that doesn’t mean that there aren’t dolphins out there.
As always we will be updating our education on this subject at Gravy For The Brain, and I wish you the best of luck!
Hugh Edwards
CEO, Gravy For The Brain
(Casting Director / Voice Director on 45+ TTS/AI Voice Projects)
John Nelles says
Thanks for this very informative and thorough article. In Canada, ACTRA is also dealing with this. Currently it’s in dispute in the commercial industry and the use of non-union talent.
It’s a frightening future as this technology advances so swiftly.
Thanks again
Chris Lines says
Well done. This was very useful, though your caveats on contracts are pretty much par for the industry.
Pat Jones says
Thank you for this factual and in-depth explanation and for your instruction on how to respond.
Jinny Martin says
Thank you Hugh!
Incredibly helpful. Clearly explained and you’ve hit all the points about which I’ve spent considerable time researching and remained confused!
Many many thanks!
Jinny
Jen Lawson says
Thanks for a fabulously detailed article team!
Lots of great stuff in here that will be super useful to many of us.
Always a huge concern, but you’ve de-mystified much of it for us. Thank you.
Jen Lawson
Ebony says
This is a topic that needs to be published on any voice acting website. There is a desperate need to address this issue full on. I remember when they wanted to motion capture Jet Li’s moves for Kung Fu Panda (I believe). He declined as it was a smart thing to do. Any studio can just reach in, retrieve data and create a character they no longer need to pay for at all. “The Congress” is a great film to watch regarding the technological advances that give rise to the question: do we necessarily need people to create performances or can we recycle those performances on screen and make 5x the money back? After all, most backgrounds are green-screened now, there is no need to film in locations as there used to be, and the same could be said for actors and voice actors. They can just insert you into wherever you need to be, and reuse the performance repeatedly. Studios are looking for methods to cut costs, and unfortunately yes if they could reuse “Peter Cullens”, “Frank Welkers”, or even the infamous “Mel Blanc” VA performance. It’s something to look out for in the future.
Abbe Holmes says
Excellent article. Thank you.
Such clear and important information
I’m a voice actor and vo coach in Australia and this has been causing confusion and anxiety amongst us all.
🙂
Tony Harris says
It would seem that even if we never knowingly take an AI job, we are already losing jobs, on low-budget ‘explainer’ videos for example, to TTS software – even if the results are not very good! And if the client doesn’t mind or notice that it’s poor, he’s saving money.