2023 has seen a significant rise in the popularity of AI-based programmes and apps to help users produce increasingly sophisticated results, from copywriting and networking to artwork and music.
The latest field showcasing significant progress is text-to-speech (TTS). TTS technology transforms the written word into speech, and its output is becoming increasingly human-sounding and difficult to distinguish from a real person.
One of the most sophisticated models currently in testing is Microsoft’s VALL-E. According to the project’s demo website, VALL-E is a TTS framework that utilises massive and diverse data to improve speech synthesis. It generates discrete audio codec codes from phoneme and acoustic code prompts, which allows for zero-shot TTS, voice editing, and content creation.
Trained on over 60,000 hours of speech from more than 7,000 distinct English speakers, the framework uses advanced sampling techniques to provide a variety of synthesised outputs.
One of the most notable features of VALL-E is the way it manages to preserve a speaker’s emotions and acoustic environment, creating an unrivalled naturalness in its output. For example, if the input sample is from a lecture, the output will recreate the reverberations and echoes that can occur in larger spaces.
While VALL-E is not currently available to the public, there will be significant opportunities for businesses to utilise it when it comes to market. As with any technology, it’s important to weigh a variety of factors to ensure it meets the needs of the specific business and its core customer base.
Whether it is eventually put to use for customer service, training, safety or as a guide, three considerations must always be taken into account:
1. Naturalness and intelligibility: One of the most important factors to consider when evaluating any TTS program is how natural and intelligible the generated speech sounds. With so many distinct speakers incorporated into the VALL-E AI system, naturalness and intelligibility should be among its biggest strengths.
2. Customisation options: Another factor to consider is the level of customisation available with a TTS program. Can you adjust the speed, volume, and pitch of the generated speech? These will all need to be optimised based on your intended end goal for the technology. With its superior data bank, VALL-E should offer a wide range of customisation options.
3. Accessibility: As with any TTS program, any integration of the system needs to be accessible to all users, including those with disabilities. As VALL-E is rolled out for commercial use, it must meet accessibility standards.
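To make the customisation point concrete: VALL-E has no public interface yet, so how it will expose speed, volume, and pitch controls is unknown. Many existing TTS systems accept these settings through the W3C’s Speech Synthesis Markup Language (SSML), and the sketch below shows how such a request might be built in Python. The `build_ssml` helper and its defaults are hypothetical and for illustration only, and the output is a minimal fragment rather than a complete SSML document.

```python
import xml.etree.ElementTree as ET


def build_ssml(text, rate="medium", pitch="medium", volume="medium"):
    """Wrap text in an SSML <prosody> element (hypothetical helper).

    rate, pitch and volume take SSML prosody values, for example
    rate="slow", pitch="+2st" (two semitones up), volume="loud".
    """
    # Root element; xml:lang tells the synthesiser which language to expect.
    speak = ET.Element("speak", {"version": "1.0", "xml:lang": "en-GB"})
    # <prosody> carries the speed, pitch and volume settings.
    prosody = ET.SubElement(
        speak, "prosody", {"rate": rate, "pitch": pitch, "volume": volume}
    )
    prosody.text = text  # special characters such as & are escaped on output
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("Your order has shipped.", rate="slow", volume="loud"))
```

A business evaluating any TTS product could use a fragment like this to test whether slower, louder, or higher-pitched output suits its intended use, provided the product accepts SSML input.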
VALL-E is a significant step forward from existing TTS systems such as ‘Pocket’ or ‘Speechify’. Whether it becomes a commercial success is still to be determined, but the security and autonomy of key data will be among the chief concerns for investors and businesses implementing the technology.
However, if the factors outlined above are addressed, the technology could deliver significant changes to the way consumers interact with brands.