Today, Amazon.com Inc (NASDAQ: AMZN) introduced Amazon Nova
Sonic, a new foundation model that unifies speech understanding
and speech generation into a single model, to enable more
human-like voice conversations in artificial intelligence (AI)
applications. Available in Amazon Bedrock via a new bi-directional
streaming API, the model simplifies the development of voice
applications, such as customer service call automation and AI
agents across a broad range of industries, including travel,
education, healthcare, entertainment, and more.
“From the invention of the world’s best personal AI assistant
with Alexa, to developing AWS services like Connect, Lex, and Polly
that are used across a wide range of industries, Amazon has long
believed that voice-powered applications can make all of our
customers’ lives better and easier,” said Rohit Prasad, SVP of
Amazon Artificial General Intelligence. “With Amazon Nova Sonic, we
are releasing a new foundation model in Amazon Bedrock that makes
it simpler for developers to build voice-powered applications that
can complete tasks for customers with higher accuracy, while being
more natural and engaging.”
Traditional approaches to building voice-enabled applications
involve complex orchestration of multiple models, such as speech
recognition to convert speech to text, large language models (LLMs)
to understand and generate responses, and text-to-speech to convert
text back to audio. This fragmented approach not only increases
development complexity but also fails to preserve crucial acoustic
context and nuances like tone, prosody, and speaking style that are
essential for natural conversations.
Nova Sonic solves these challenges through a unified model
architecture that delivers speech understanding and generation,
without requiring a separate model for each of these steps. This
unification enables the model to adapt the generated voice response
to the acoustic context (e.g., tone, style) and the spoken input,
resulting in more natural dialog. Nova Sonic even understands the
nuances of human conversation, including the speaker’s natural
pauses and hesitations, waiting to speak until the appropriate
time, and gracefully handling barge-ins. It also generates a text
transcript for the user’s speech, enabling developers to use that
text to call specific tools and APIs for building voice-enabled AI
agents (e.g., an AI-powered travel agent that can book flights by
retrieving up-to-date flight information). These capabilities,
along with its lightning-fast inference, make voice applications
powered by Nova Sonic more natural and useful.
State-of-the-art accuracy and quality
Nova Sonic has been rigorously tested against a wide range of
industry standard benchmarks for speech understanding and
generation, demonstrating exceptional quality and accuracy for
human-like, real-time voice conversations.
The model excels in natural dialog handling, seamlessly
understanding and adapting to pauses, hesitations, and
interruptions while maintaining conversational context throughout
the interaction. This capability contributed to strong performance
for overall quality and accuracy in turn-taking tests.
Nova Sonic demonstrates strong performance on overall
conversation quality compared to other models in the industry,
which at this time include a select few with similar real-time
conversational speech capabilities, such as OpenAI's GPT-4o
(Realtime) and Google Gemini Flash 2.0 (available via Gemini’s
experimental live API). For example, single-turn dialogs in its
American English masculine-sounding voice achieved a 51.0% and
69.7% win-rate against OpenAI’s GPT-4o (Realtime) and Google’s
Gemini Flash 2.0 respectively, based on the Common Eval data set.
Likewise, Nova Sonic’s American English feminine-sounding voice
scored 50.9% and 66.3% win-rate against OpenAI’s GPT-4o (Realtime)
and Google’s Gemini Flash 2.0 respectively on the same data set.
Nova Sonic’s British English feminine-sounding voice also performed
strongly, scoring a 58.3% win-rate against OpenAI’s GPT-4o
(Realtime).
Because recognizing spoken words correctly is critical to
generating accurate responses, it is also important to measure Nova
Sonic’s speech recognition accuracy, in terms of word error rate
(WER), across a wide range of languages, dialects, and accents. On
the Multilingual LibriSpeech (MLS) data set, Nova Sonic achieved a
WER of 4.2% when averaged across English, French, Italian, German,
and Spanish, which is 36.4% lower in relative terms than OpenAI’s
GPT-4o Transcribe model.
On the English utterances of MLS, its WER is 24.2% lower in
relative terms than that of OpenAI’s GPT-4o Transcribe model.
Nova Sonic is also robust to noisy conditions, with a WER for
English that is 46.7% lower in relative terms than that of OpenAI’s
GPT-4o Transcribe model, as measured on the Augmented Multi-Party
Interaction (AMI) meeting benchmark, which consists of real-world
noisy, multi-speaker interactions.
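For readers unfamiliar with the metric, the following is a minimal sketch of how WER and a relative WER reduction are commonly computed. It is not Amazon's evaluation code. The function names and the sample sentences are illustrative, and the 6.6% baseline is a hypothetical value chosen so that the arithmetic lands near the 36.4% figure quoted above; it is not a number reported in this announcement.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(ours: float, baseline: float) -> float:
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


# One deleted word ("a") and one substituted word out of five.
print(word_error_rate("book a flight to boston", "book flight to austin"))  # 0.4

# Hypothetical baseline of 6.6% WER against the reported 4.2%.
print(round(relative_reduction(0.042, 0.066), 3))  # 0.364
```

A relative reduction compares the gap to the baseline's own error rate, which is why a difference of a few absolute percentage points can correspond to a relative figure in the tens of percent.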
Tool-use for function calling and agentic workflows
Nova Sonic also supports tool-use for applications—like customer
service call automation—that require the responses to be factually
grounded in enterprise data, such as pricing plans, available
inventory, and schedule availability. Nova Sonic’s native tool-use
also enables the model to resolve complex customer queries and
complete tasks on behalf of customers, for example, “make a
reservation” or “find alternate flights.”
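On the application side, tool-use generally means routing a tool request from the model to developer code and returning the result. The sketch below illustrates that pattern in plain Python. It does not use the Amazon Bedrock streaming API, and it assumes the request arrives as a tool name plus a JSON-encoded argument string. The tool, its parameters, and the flight numbers are hypothetical placeholders.

```python
import json
from typing import Callable, Dict


def find_alternate_flights(origin: str, destination: str) -> dict:
    """Hypothetical tool; a real one would query a flight inventory service."""
    return {
        "origin": origin,
        "destination": destination,
        "flights": ["XX100", "XX200"],  # placeholder data
    }


# Registry mapping tool names the model may request to Python callables.
TOOLS: Dict[str, Callable[..., dict]] = {
    "find_alternate_flights": find_alternate_flights,
}


def handle_tool_request(name: str, arguments_json: str) -> str:
    """Run the requested tool and return a JSON string for the model."""
    tool = TOOLS.get(name)
    if tool is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        arguments = json.loads(arguments_json)
        return json.dumps(tool(**arguments))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or arguments that do not match the tool signature.
        return json.dumps({"error": str(exc)})


result = handle_tool_request(
    "find_alternate_flights", '{"origin": "SEA", "destination": "BOS"}'
)
print(result)
```

Returning errors as structured JSON, rather than raising, lets the conversation continue and gives the model a chance to recover or ask the customer for clarification.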
Multiple native voices and speaking styles
Nova Sonic supports three expressive voices, including both
masculine-sounding and feminine-sounding voices now generally
available in English, and supports speech generation in different
English accents including American and British. Support for
additional languages and accents will be coming soon.
Industry-leading speed and price performance
Nova Sonic delivers an average customer-perceived latency of
1.09 seconds from the time the customer is done talking to the time
the system generates the first speech response. This compares with
1.18 seconds for OpenAI’s GPT-4o (Realtime), and 1.41 seconds for
Google’s Gemini Flash 2.0 (available via Gemini’s experimental live
API), per benchmarking by Artificial Analysis.
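The latency figures above can be expressed as relative differences with simple arithmetic. The short sketch below uses only the three averages quoted in this release; the helper name is illustrative.

```python
# Average customer-perceived latency in seconds, as quoted above.
LATENCY_SECONDS = {
    "Nova Sonic": 1.09,
    "GPT-4o (Realtime)": 1.18,
    "Gemini Flash 2.0": 1.41,
}


def percent_lower(ours: float, other: float) -> float:
    """How much lower `ours` is than `other`, as a percentage of `other`."""
    return (other - ours) / other * 100


nova = LATENCY_SECONDS["Nova Sonic"]
for name, seconds in LATENCY_SECONDS.items():
    if name != "Nova Sonic":
        print(f"{name}: {percent_lower(nova, seconds):.1f}% lower latency")
# GPT-4o (Realtime): 7.6% lower latency
# Gemini Flash 2.0: 22.7% lower latency
```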
Nova Sonic is the most cost-efficient model in the industry when
compared to models that offer similar real-time speech conversation
functionality and have public pricing available.
For example, it is nearly 80% less expensive than OpenAI’s GPT-4o
(Realtime).
Amazon Nova Sonic is helping companies drive better customer
satisfaction and productivity
ASAPP empowers enterprise customers’ contact centers to
deliver unmatched customer service through GenerativeAgent, a fully
conversational generative AI voice agent. “At ASAPP, we are focused
on using generative AI to deliver reliable, secure, and
high-performing solutions for improving customer service in contact
centers. We’ve been particularly impressed by Amazon Nova Sonic’s
highly accurate speech understanding capabilities which allow for
more natural voice interactions and precise dialog handling over
telephony,” said Nirmal Mukhi, VP of AI Engineering at ASAPP.
“We’re excited to continue using Nova Sonic to deliver secure,
high-quality, and precise conversations that meet the demands of
enterprise contact centers.”
Education First (EF) is a leader in international
education through its networks of schools and offices in over 50
countries. “Amazon Nova Sonic enables EF students to practice new
vocabulary and refine their pronunciation in a dynamic learning
environment, while the interactive nature of the model allows
students to receive immediate feedback on their pronunciation
attempts, contributing to a more efficient and effective learning
process. The model is capable of accurately understanding
non-native English speakers with a variety of accents. We were also
impressed with the barge-in feature of Nova Sonic, where the model
quickly reacts to interruptions,” said Tim Hesse, VP of AI and Data
at EF. “The scalability and reliability of the technology will
allow us to expand our capacity to serve a larger student
population simultaneously, without compromising the quality of
instruction.”
Stats Perform is a sports data and AI technology
provider, serving global media organizations, betting operators,
and professional sports teams. “At Stats Perform, our goal is to
empower the world’s top sports broadcasters, media, federations and
teams with magic in the detail of our vast live and historical Opta
sports dataset, to help them win audiences, customers and trophies.
With the Opta AI Chat they can generate unique, accurate, and
contextual responses, driven by live data insights with remarkable
speed, in multiple formats and languages, to find a winning
analytical or storytelling edge,” said Mike Perez, Chief Operating
Officer at Stats Perform. “We’ve been testing Amazon Nova Sonic and
have been particularly impressed by the system's low latency, which
enables near-instantaneous responses even to complex queries of our
model, creating a seamless user experience that turns human experts
into superhuman experts. The intuitive prompting capability and
ease of setup have exceeded our expectations, making implementation
simple. Overall, Nova Sonic has proven to be a fantastic
solution.”
Amazon is committed to the responsible development of
artificial intelligence
Amazon Nova models are built with integrated safety measures and
protections. The company has launched AWS AI Service Cards for Nova
models, offering transparent information on use cases, limitations,
and responsible AI practices.
To get started with Amazon Nova models, visit:
https://aws.amazon.com/nova/
To learn more about today’s announcement, visit About Amazon.
View source version on businesswire.com: https://www.businesswire.com/news/home/20250408227167/en/
Amazon.com, Inc. Media Hotline Amazon-pr@amazon.com
www.amazon.com/pr