Introducing Amazon Nova Sonic: A New Gen AI Model for Building Voice Applications and Agents
Amazon has unveiled Amazon Nova Sonic, a groundbreaking foundation model that combines speech understanding and generation into a single unified system. Available through Amazon Bedrock, this new model simplifies voice application development across various industries.
Nova Sonic demonstrates superior performance with a 51.0% and 69.7% win-rate against OpenAI's GPT-4o and Google's Gemini Flash 2.0 respectively in American English masculine voice tests. The model achieves a 4.2% word error rate on Multilingual LibriSpeech, 36.4% lower than OpenAI's GPT-4o Transcribe model.
Key features include real-time speech processing with an average latency of 1.09 seconds, tool-use capabilities for enterprise applications, and support for multiple English accents. Nova Sonic is notably 80% less expensive than OpenAI's GPT-4o (Realtime), making it the most cost-efficient model in its category.
- Achieves 51.0% and 69.7% win-rate against major competitors OpenAI and Google in voice quality tests
- 36.4% lower word error rate compared to OpenAI's model in multilingual performance
- Nearly 80% less expensive than OpenAI's GPT-4o (Realtime)
- Faster response time at 1.09 seconds compared to competitors' 1.18-1.41 seconds
- Successfully implemented by major enterprises including ASAPP, Education First, and Stats Perform
- Currently limited to English language support with only three voice options
- Requires integration with Amazon Bedrock platform for implementation
Insights
Amazon's launch of Nova Sonic represents a strategic advancement in the competitive generative AI landscape, directly challenging OpenAI's GPT-4o and Google's Gemini Flash 2.0. The unified architecture that combines speech understanding and generation into a single model addresses a significant pain point in voice application development.
Benchmark results are particularly compelling: Nova Sonic demonstrates 36.4% lower word error rates than OpenAI's offering across multiple languages and 46.7% lower error rates in noisy conditions. The cost advantage is substantial - nearly 80% less expensive than OpenAI's competing model - positioning Amazon for potential market share gains in the enterprise AI space.
Early customer adoption across diverse sectors (customer service, education, and sports data) suggests strong product-market fit. The strategic importance extends beyond immediate revenue potential, as voice AI represents a critical technology layer that could drive expanded usage of AWS services and strengthen customer lock-in.
The speed advantage (1.09 seconds of latency versus competitors' 1.18-1.41 seconds) might seem minor numerically but represents a meaningful improvement in user experience for real-time applications. Amazon's ability to outperform specialized AI companies in speech technology builds on its long-term investments in Alexa and voice services, creating defensible competitive advantages in this high-growth market segment.
“From the invention of the world’s best personal AI assistant with Alexa, to developing AWS services like Connect, Lex, and Polly that are used across a wide range of industries, Amazon has long believed that voice-powered applications can make all of our customers’ lives better and easier,” said Rohit Prasad, SVP of Amazon Artificial General Intelligence. “With Amazon Nova Sonic, we are releasing a new foundation model in Amazon Bedrock that makes it simpler for developers to build voice-powered applications that can complete tasks for customers with higher accuracy, while being more natural, and engaging.”
Traditional approaches to building voice-enabled applications involve complex orchestration of multiple models, such as speech recognition to convert speech to text, large language models (LLMs) to understand and generate responses, and text-to-speech to convert text back to audio. This fragmented approach not only increases development complexity but also fails to preserve crucial acoustic context and nuances like tone, prosody, and speaking style that are essential for natural conversations.
Nova Sonic solves these challenges through a unified model architecture that delivers speech understanding and generation, without requiring a separate model for each of these steps. This unification enables the model to adapt the generated voice response to the acoustic context (e.g., tone, style) and the spoken input, resulting in more natural dialog. Nova Sonic even understands the nuances of human conversation, including the speaker’s natural pauses and hesitations, waiting to speak until the appropriate time, and gracefully handling barge-ins. It also generates a text transcript for the user’s speech, enabling developers to use that text to call specific tools and APIs for building voice-enabled AI agents (e.g., an AI-powered travel agent that can book flights by retrieving up-to-date flight information). These capabilities, along with its lightning-fast inference, make voice applications powered by Nova Sonic more natural and useful.
State-of-the-art accuracy and quality
Nova Sonic has been rigorously tested against a wide range of industry standard benchmarks for speech understanding and generation, demonstrating exceptional quality and accuracy for human-like, real-time voice conversations.
The model excels in natural dialog handling, seamlessly understanding and adapting to pauses, hesitations, and interruptions while maintaining conversational context throughout the interaction. This capability contributed to strong performance for overall quality and accuracy in turn-taking tests.
Nova Sonic demonstrates strong performance on overall conversation quality compared to other models in the industry, which at this time include a select few with similar real-time conversational speech capabilities, such as OpenAI's GPT-4o (Realtime) and Google Gemini Flash 2.0 (available via Gemini’s experimental live API). For example, single-turn dialogs in its American English masculine-sounding voice achieved a win-rate of 51.0% against OpenAI's GPT-4o (Realtime) and 69.7% against Google's Gemini Flash 2.0.
Since recognizing spoken words is critical in generating accurate responses, measuring Nova Sonic's speech recognition accuracy in terms of word error rate (WER) across a wide range of languages, dialects, and accents is also critical. On the Multilingual LibriSpeech benchmark, Nova Sonic achieved a WER of 4.2%.
On English utterances of the Multilingual LibriSpeech (MLS) data set, it has a 36.4% lower WER than OpenAI's GPT-4o Transcribe model.
Nova Sonic is also robust to noisy conditions, with a 46.7% lower error rate than OpenAI's offering.
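For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer's output, divided by the number of words in the reference. The following is a minimal sketch of that standard definition, not the evaluation code used for the benchmarks above; the function name and the simple lowercase-and-split tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: real benchmark scoring usually applies more
    careful text normalization than lowercasing and whitespace splitting.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Two substituted words out of five reference words.
    print(word_error_rate("book a flight to boston", "book the flight to austin"))
```

In the usage example, two of the five reference words are substituted, so the function returns 0.4, that is, a WER of 40%.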
Tool-use for function calling and agentic workflows
Nova Sonic also supports tool-use for applications—like customer service call automation—that require the responses to be factually grounded in enterprise data, such as pricing plans, available inventory, and schedule availability. Nova Sonic’s native tool-use also enables the model to resolve complex customer queries and complete tasks on behalf of customers, for example, “make a reservation” or “find alternate flights.”
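As a rough illustration of what tool-use means in practice, the sketch below routes a structured tool request to a matching application function and returns the result so the model can ground its spoken answer in it. Everything here is hypothetical: the request shape (`name` plus `arguments`), the tool names, and the sample data are assumptions for illustration and do not represent Nova Sonic's or Amazon Bedrock's actual event format or API.

```python
import json
from typing import Any, Callable, Dict


# Hypothetical enterprise lookups standing in for real inventory
# and booking systems.
def check_inventory(item: str) -> Dict[str, Any]:
    stock = {"router": 12, "modem": 0}
    return {"item": item, "available": stock.get(item, 0)}


def make_reservation(name: str, party_size: int) -> Dict[str, Any]:
    return {"status": "confirmed", "name": name, "party_size": party_size}


# Registry mapping tool names the model may request to local functions.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "check_inventory": check_inventory,
    "make_reservation": make_reservation,
}


def dispatch_tool_call(request_json: str) -> str:
    """Run the requested tool and return its result as a JSON string."""
    request = json.loads(request_json)
    tool = TOOLS.get(request.get("name"))
    if tool is None:
        return json.dumps({"error": "unknown tool: %s" % request.get("name")})
    result = tool(**request.get("arguments", {}))
    return json.dumps(result)


if __name__ == "__main__":
    print(dispatch_tool_call(
        '{"name": "make_reservation", "arguments": {"name": "Ana", "party_size": 2}}'
    ))
```

Keeping the registry explicit means the model can only trigger functions the application has deliberately exposed, which matters when responses must be grounded in enterprise data.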
Multiple native voices and speaking styles
Nova Sonic supports three expressive voices, including both masculine-sounding and feminine-sounding voices now generally available in English, and supports speech generation in different English accents including American and British. Support for additional languages and accents will be coming soon.
Industry-leading speed and price performance
Nova Sonic delivers an average customer-perceived latency of 1.09 seconds from the time the customer is done talking to the time the system generates the first speech response. This is compared to 1.18 seconds for OpenAI’s GPT-4o (Realtime), and 1.41 seconds for Google’s Gemini Flash 2.0 (available via Gemini’s experimental live API), per benchmarking by Artificial Analysis.
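To put the quoted latencies in perspective, the short calculation below converts them into absolute and relative gaps. It uses only the three figures cited above; expressing the relative gap as a share of the competitor's latency is a choice made here for illustration, not something stated in the benchmark.

```python
from typing import Tuple

# Average latencies in seconds, as quoted in the announcement.
LATENCY_SECONDS = {
    "Nova Sonic": 1.09,
    "GPT-4o (Realtime)": 1.18,
    "Gemini Flash 2.0": 1.41,
}


def latency_gap(baseline: float, other: float) -> Tuple[float, float]:
    """Return (gap in milliseconds, gap as a percentage of `other`)."""
    gap_seconds = other - baseline
    return gap_seconds * 1000.0, gap_seconds / other * 100.0


if __name__ == "__main__":
    nova = LATENCY_SECONDS["Nova Sonic"]
    for model, seconds in LATENCY_SECONDS.items():
        if model == "Nova Sonic":
            continue
        gap_ms, gap_pct = latency_gap(nova, seconds)
        print(f"{model}: {gap_ms:.0f} ms slower ({gap_pct:.1f}% of its latency)")
```

On these numbers, the gap is roughly 90 ms (about 7.6%) against GPT-4o (Realtime) and roughly 320 ms (about 22.7%) against Gemini Flash 2.0.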
Nova Sonic is the most cost-efficient model in the industry, when compared to models that have similar functionality of real-time speech conversations and have public pricing available. For example, it is nearly 80% less expensive than OpenAI's GPT-4o (Realtime).
Amazon Nova Sonic is helping companies drive better customer satisfaction and productivity
ASAPP empowers enterprise customers’ contact centers to deliver unmatched customer service through GenerativeAgent, a fully conversational generative AI voice agent. “At ASAPP, we are focused on using generative AI to deliver reliable, secure, and high-performing solutions for improving customer service in contact centers. We’ve been particularly impressed by Amazon Nova Sonic’s highly accurate speech understanding capabilities which allow for more natural voice interactions and precise dialog handling over telephony,” said Nirmal Mukhi, VP of AI Engineering at ASAPP. “We’re excited to continue using Nova Sonic to deliver secure, high-quality, and precise conversations that meet the demands of enterprise contact centers.”
Education First (EF) is a leader in international education through its networks of schools and offices in over 50 countries. “Amazon Nova Sonic enables EF students to practice new vocabulary and refine their pronunciation in a dynamic learning environment, while the interactive nature of the model allows students to receive immediate feedback on their pronunciation attempts, contributing to a more efficient and effective learning process. The model is capable of accurately understanding non-native English speakers with a variety of accents. We were also impressed with the barge-in feature of Nova Sonic, where the model quickly reacts to interruptions,” said Tim Hesse, VP of AI and Data at EF. “The scalability and reliability of the technology will allow us to expand our capacity to serve a larger student population simultaneously, without compromising the quality of instruction.”
Stats Perform is a sports data and AI technology provider, serving global media organizations, betting operators, and professional sports teams. “At Stats Perform, our goal is to empower the world’s top sports broadcasters, media, federations and teams with magic in the detail of our vast live and historical Opta sports dataset, to help them win audiences, customers and trophies. With the Opta AI Chat they can generate unique, accurate, and contextual responses, driven by live data insights with remarkable speed, in multiple formats and languages, to find a winning analytical or storytelling edge,” said Mike Perez, Chief Operating Officer at Stats Perform. “We’ve been testing Amazon Nova Sonic and have been particularly impressed by the system's low latency, which enables near-instantaneous responses even to complex queries of our model, creating a seamless user experience that turns human experts into superhuman experts. The intuitive prompting capability and ease of setup have exceeded our expectations, making implementation simple. Overall, Nova Sonic has proven to be a fantastic solution.”
Amazon is committed to the responsible development of artificial intelligence
Amazon Nova models are built with integrated safety measures and protections. The company has launched AWS AI Service Cards for Nova models, offering transparent information on use cases, limitations, and responsible AI practices.
To get started with Amazon Nova models, visit: https://aws.amazon.com/nova/
To learn more about today’s announcement, visit About Amazon.
View source version on businesswire.com: https://www.businesswire.com/news/home/20250408227167/en/
Amazon.com, Inc.
Media Hotline
Amazon-pr@amazon.com
www.amazon.com/pr
Source: Amazon.com, Inc.