
Speech To Text: Definition and Examples

Speech To Text (STT), often loosely called voice recognition, is an artificial intelligence technology that converts human speech into written text, allowing machines to transcribe audio content automatically.

Full definition

Speech To Text refers to all technologies capable of transforming an audio signal containing human speech into a textual transcription. The technology relies on deep learning models that analyze the audio signal and map it to words and sentences. Classical pipelines identify phonemes as an explicit intermediate step, whereas end-to-end models learn the mapping from audio to text directly.

Modern STT systems mainly use neural network architectures such as Transformers, which have significantly improved transcription accuracy. Models such as OpenAI's Whisper and services such as Google Cloud Speech-to-Text can transcribe dozens of languages and cope with accents and background noise. Some services can also distinguish multiple speakers (diarization).

In prompt engineering and generative AI, Speech To Text is often the first step of a larger pipeline. For example, a voice assistant first uses STT to transcribe the user's request, then a large language model (LLM) to generate a response, and finally a Text To Speech system to vocalize it. The quality of the initial transcription directly affects the relevance of the generated response, because every later step works only from the transcribed text.
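The three-step pipeline described above can be sketched as follows. This is a minimal illustration, not a reference implementation: `voice_assistant_turn` and the three `fake_*` functions are hypothetical names, and the stubs stand in for whatever STT, LLM, and Text To Speech services are actually used.

```python
from typing import Callable


def voice_assistant_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    generate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one STT -> LLM -> TTS turn; the three callables are placeholders."""
    request_text = transcribe(audio)     # Speech To Text
    reply_text = generate(request_text)  # language model
    return synthesize(reply_text)        # Text To Speech


# Stub components (assumptions) so the sketch runs without external services.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_generate(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = voice_assistant_turn(
    b"what time is it", fake_transcribe, fake_generate, fake_synthesize
)
```

Passing the components as functions keeps the order of the steps explicit and makes it easy to swap one service for another.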

Applications of STT are ubiquitous: automatic video subtitling, voice dictation, meeting transcription, accessibility for people who are deaf or hard of hearing, voice control of connected devices, and conversation analysis in call centers. The rapid evolution of these technologies makes automatic transcription increasingly reliable, approaching or even exceeding human accuracy in some contexts.
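Transcription accuracy is commonly quantified with the word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the transcription into a reference text, divided by the number of words in the reference. The sketch below is one simple way to compute it. The function name `word_error_rate` is an assumption, and the normalization (lowercasing and whitespace splitting only) is deliberately minimal.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

Here one word out of six is substituted, so the WER is 1/6. A lower value means a more accurate transcription.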

Etymology

The term "Speech To Text" combines three English words, "speech", "to", and "text", and literally describes the process of converting speech into text. The field is also known as "Automatic Speech Recognition" (ASR); the French equivalent is "reconnaissance automatique de la parole" (RAP). The field has existed since the 1950s, when early systems could recognize isolated digits, but the advent of deep learning in the 2010s made the technology practical at scale.

Concrete examples

Transcription of a meeting to extract minutes

Here is the automatic transcription of a 45-minute team meeting. Generate structured meeting minutes listing the decisions made, the actions to be taken, and the person responsible for each action. Correct any transcription errors based on context.

Automatic subtitling of a YouTube video

From this Speech To Text transcription of a tutorial video, generate subtitles in SRT format with at most 42 characters per line and at most 2 lines per subtitle. Correct punctuation and segmentation.
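The constraints in this prompt can also be applied or checked in code. The following sketch, using only the Python standard library, wraps timed segments to 42 characters per line and 2 lines per subtitle. The `(start, end, text)` input shape, the function names, and the equal sharing of time among split subtitles are assumptions made for illustration.

```python
import textwrap

MAX_CHARS = 42  # per line, as in the prompt above
MAX_LINES = 2   # per subtitle


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    cues = []
    for start, end, text in segments:
        lines = textwrap.wrap(text, width=MAX_CHARS)
        # Group the wrapped lines into blocks of at most MAX_LINES lines.
        blocks = [lines[i:i + MAX_LINES] for i in range(0, len(lines), MAX_LINES)]
        # Assumption: blocks from one segment share its duration equally.
        step = (end - start) / max(len(blocks), 1)
        for k, block in enumerate(blocks):
            cues.append((start + k * step, start + (k + 1) * step, block))

    parts = []
    for index, (cue_start, cue_end, block) in enumerate(cues, start=1):
        header = f"{index}\n{srt_timestamp(cue_start)} --> {srt_timestamp(cue_end)}\n"
        parts.append(header + "\n".join(block))
    return "\n\n".join(parts) + "\n"


srt_text = to_srt([(0.0, 2.5, "Welcome to this tutorial.")])
```

A segment too long for one subtitle is split into several numbered subtitles rather than truncated, so no transcribed text is lost.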

Sentiment analysis on transcribed customer calls

Analyze the following transcriptions of customer service calls. For each call, identify the overall customer sentiment (positive, neutral, negative), the friction points mentioned, and the satisfaction level at the end of the call.

Practical usage

In prompt engineering, Speech To Text is often used as an input step to feed an LLM with transcribed oral content. It is crucial to ask the model to correct typical transcription errors (homophones, proper nouns, missing punctuation) before any processing. For optimal results, always specify the context of the source audio (meeting, interview, podcast) so that the model can tailor its corrections.
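As an illustration of this advice, the sketch below assembles a correction prompt that states the context of the source audio. The helper name `build_correction_prompt`, its parameters, and the wording of the template are assumptions, not a standard. The resulting string would be sent to whichever LLM is in use.

```python
def build_correction_prompt(
    transcript: str,
    source_type: str,
    participants: tuple[str, ...] = (),
    vocabulary: tuple[str, ...] = (),
) -> str:
    """Assemble a prompt asking an LLM to clean up a raw STT transcript."""
    lines = [
        f"The text below is an automatic transcription of a {source_type}.",
        "Correct typical transcription errors (homophones, proper nouns, "
        "missing punctuation) without changing the meaning.",
    ]
    # Optional context that helps the model resolve names and jargon.
    if participants:
        lines.append("Participants: " + ", ".join(participants) + ".")
    if vocabulary:
        lines.append("Expected technical terms: " + ", ".join(vocabulary) + ".")
    lines += ["", "Transcript:", transcript.strip()]
    return "\n".join(lines)


prompt = build_correction_prompt(
    "so we decided to ship the knew api on friday",
    source_type="meeting",
    participants=("Ana", "Luc"),
    vocabulary=("API", "rollback"),
)
print(prompt)
```

Keeping the context in separate, optional lines makes it easy to reuse the same template for meetings, interviews, or podcasts.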

Related concepts

Text To Speech, Natural Language Processing, Whisper, Voice recognition

FAQ

What is the difference between Speech To Text and voice recognition?
The two terms are often used interchangeably, but there is a nuance. Speech To Text refers specifically to the conversion of speech into written text (transcription). Voice recognition is a broader term that also covers speaker identification (voice biometrics) and the understanding of voice commands without necessarily producing a full transcription.

What are the best Speech To Text tools in 2025?
Among the most effective solutions are OpenAI's Whisper (open source, multilingual, highly accurate), Google Cloud Speech-to-Text, Amazon Transcribe, and Microsoft Azure Speech Services. For free, local use, Whisper is a common reference point. For professional needs with advanced features (diarization, custom vocabulary), cloud services are generally preferred.

How can I improve the quality of a Speech To Text transcription with an LLM?
After the raw transcription, you can use a prompt asking the LLM to correct common errors: misinterpreted homophones, distorted proper nouns, missing punctuation, and poor paragraph segmentation. Provide context (topic of conversation, participant names, expected technical vocabulary) so that the model can make more relevant corrections.
