Speech To Text: Definition and Examples

Speech To Text (STT), often called voice recognition or speech recognition, is an artificial intelligence technology that converts human speech into written text, allowing machines to transcribe audio content automatically.

Full definition

Speech To Text refers to the family of technologies that transform an audio signal containing human speech into a textual transcription. These technologies rely on deep learning models that analyze the sound signal, map it to phonemes or other sub-word units, and then reconstruct the corresponding words and sentences.

Modern STT systems mainly use neural network architectures such as Transformers, which have significantly improved transcription accuracy. Models such as OpenAI's Whisper and services such as Google Cloud Speech-to-Text can transcribe dozens of languages and cope with accents and background noise. Some systems can also distinguish multiple speakers (diarization).
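To make diarization more concrete, here is a minimal sketch that merges consecutive transcript segments from the same speaker into readable turns. The Segment shape (start, end, speaker, text) and the function name merge_speaker_turns are assumptions made for illustration; each real STT service returns its own output format.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed chunk. Hypothetical shape, not a real service's output."""
    start: float   # seconds
    end: float     # seconds
    speaker: str   # diarization label, e.g. "SPEAKER_1"
    text: str


def merge_speaker_turns(segments: list[Segment]) -> list[Segment]:
    """Merge consecutive segments spoken by the same speaker into one turn."""
    turns: list[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            # Same speaker as the previous turn: extend it instead of starting a new one.
            turns[-1] = Segment(last.start, max(last.end, seg.end), last.speaker,
                                f"{last.text} {seg.text.strip()}")
        else:
            turns.append(Segment(seg.start, seg.end, seg.speaker, seg.text.strip()))
    return turns


segments = [
    Segment(0.0, 2.1, "SPEAKER_1", "Good morning everyone."),
    Segment(2.1, 4.0, "SPEAKER_1", "Let's get started."),
    Segment(4.2, 6.5, "SPEAKER_2", "Thanks, I have an update."),
]
for turn in merge_speaker_turns(segments):
    print(f"[{turn.start:.1f}-{turn.end:.1f}] {turn.speaker}: {turn.text}")
```

A speaker-labelled transcript of this kind is usually easier for an LLM to summarize than a flat block of text.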

In the context of prompt engineering and generative AI, Speech To Text often constitutes the first step of a larger pipeline. For example, a voice assistant first uses STT to understand the user's request, then a language model (LLM) to generate a response, and finally a Text To Speech system to vocalize it. The quality of the initial transcription directly impacts the relevance of the generated response.
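The three-stage voice assistant described above can be sketched as a simple chain of functions. This is only an illustration: run_voice_assistant and the three fake_ stages are hypothetical stand-ins, and a real system would replace them with calls to an actual STT engine, LLM, and Text To Speech engine.

```python
from typing import Callable


def run_voice_assistant(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    generate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Chain STT -> LLM -> TTS; each stage is passed in as a plain function."""
    user_text = transcribe(audio)      # 1. Speech To Text
    reply_text = generate(user_text)   # 2. language model response
    return synthesize(reply_text)      # 3. Text To Speech


# Stand-in stages so the sketch runs without any external service.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")       # pretend the audio bytes are already text


def fake_generate(prompt: str) -> str:
    return f"You asked: {prompt}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")        # pretend encoded text is audio


reply_audio = run_voice_assistant(
    b"what time is it", fake_transcribe, fake_generate, fake_synthesize
)
print(reply_audio)
```

Because the second stage only ever sees the output of the first, any word the transcription gets wrong is passed unchanged to the language model.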

Applications of STT are ubiquitous: automatic video subtitling, voice dictation, meeting transcription, accessibility for deaf and hard-of-hearing users, voice control of connected devices, and conversation analysis in call centers. The rapid evolution of these technologies makes automatic transcription increasingly reliable, approaching or even exceeding human accuracy in some contexts.
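Transcription accuracy is commonly reported as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the transcription into a reference text, divided by the number of words in the reference. The sketch below is a minimal version that assumes simple lowercase, whitespace-based tokenization; real evaluations usually apply more careful text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "please send the report by friday",
    "please send a report friday",
)
print(f"WER: {wer:.2f}")
```

A lower value means a more accurate transcription, and 0 means the transcription matches the reference word for word.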

Etymology

The term "Speech To Text" combines three English words, "speech", "to", and "text", and literally describes the process of converting speech into text. The technology is also called "Automatic Speech Recognition" (ASR); the French equivalent is "reconnaissance automatique de la parole" (RAP). The field dates back to the 1950s, when early systems could recognize isolated digits, but it was the advent of deep learning in the 2010s that made the technology usable at scale.

Concrete examples

Transcription of a meeting to extract minutes

Here is the automatic transcription of a 45-minute team meeting. Generate structured meeting minutes listing the decisions made, the actions to be taken, and the person responsible for each action. Correct any transcription errors based on context.

Automatic subtitling of a YouTube video

From this Speech To Text transcription of a tutorial video, generate subtitles in SRT format with at most 42 characters per line and at most 2 lines per subtitle. Correct punctuation and segmentation.
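The same constraints can also be applied or checked in code. The sketch below converts timed segments into SRT cues while respecting the 42-character and 2-line limits. The input shape (start, end, text) and the even split of a segment's duration across its cues are simplifying assumptions; a production tool would use word-level timestamps instead.

```python
import textwrap

MAX_CHARS = 42   # maximum characters per line
MAX_LINES = 2    # maximum lines per subtitle


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Turn (start, end, text) segments into SRT cues within the line limits."""
    cues = []
    for start, end, text in segments:
        lines = textwrap.wrap(text, width=MAX_CHARS)
        # Group the wrapped lines two by two; each group becomes one cue.
        groups = [lines[i:i + MAX_LINES] for i in range(0, len(lines), MAX_LINES)]
        # Simplification: split the segment's duration evenly across its cues.
        step = (end - start) / max(len(groups), 1)
        for k, group in enumerate(groups):
            cues.append((start + k * step, start + (k + 1) * step, group))
    blocks = [
        f"{n}\n{srt_timestamp(s)} --> {srt_timestamp(e)}\n" + "\n".join(group)
        for n, (s, e, group) in enumerate(cues, start=1)
    ]
    return "\n\n".join(blocks) + "\n"


segments = [
    (0.0, 2.5, "Welcome to this tutorial."),
    (2.5, 8.5, "In this video we will install the tool, configure it, "
               "and run a first transcription from the command line."),
]
print(to_srt(segments))
```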

Sentiment analysis on transcribed customer calls

Analyze the following transcriptions of customer service calls. For each call, identify the overall customer sentiment (positive, neutral, negative), the friction points mentioned, and the satisfaction level at the end of the call.

Practical usage

In prompt engineering, Speech To Text is often used as an input step that feeds an LLM with transcribed spoken content. It is crucial to ask the model to correct typical transcription errors (homophones, proper nouns, missing punctuation) before any processing. For optimal results, always specify the context of the source audio (meeting, interview, podcast) so that the model can tailor its corrections.
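One way to apply this advice consistently is to build the correction prompt from a small template. The sketch below is illustrative only: build_correction_prompt, its parameters, and the exact prompt wording are assumptions, not a standard API.

```python
from typing import Optional, Sequence


def build_correction_prompt(
    transcript: str,
    source_type: str,
    participants: Optional[Sequence[str]] = None,
    vocabulary: Optional[Sequence[str]] = None,
) -> str:
    """Assemble a prompt asking an LLM to clean up a raw STT transcript."""
    lines = [
        f"The text below is an automatic transcription of a {source_type}.",
        "Before any other processing, correct typical transcription errors:",
        "- misheard homophones",
        "- distorted proper nouns",
        "- missing punctuation",
    ]
    # Optional context helps the model choose the right spelling of names and terms.
    if participants:
        lines.append("Participants: " + ", ".join(participants))
    if vocabulary:
        lines.append("Expected technical vocabulary: " + ", ".join(vocabulary))
    lines += ["", "Transcript:", transcript.strip()]
    return "\n".join(lines)


prompt = build_correction_prompt(
    "so ana said the kubernetes cluster is down again",
    source_type="meeting",
    participants=["Ana", "Ben"],
    vocabulary=["Kubernetes"],
)
print(prompt)
```

The resulting string can be pasted into any chat assistant or sent through an API of your choice.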

Related concepts

Text To Speech, Natural Language Processing, Whisper, Voice recognition

FAQ

What is the difference between Speech To Text and voice recognition?

The two terms are often used interchangeably, but there is a nuance. Speech To Text specifically refers to the conversion of speech into written text (transcription). Voice recognition is a broader term that also includes speaker identification (voice biometrics) and understanding voice commands without necessarily producing a full transcription.

What are the best Speech To Text tools in 2025?

Among the most effective solutions are OpenAI's Whisper (open source, multilingual, highly accurate), Google Cloud Speech-to-Text, Amazon Transcribe, and Microsoft Azure Speech Services. For local and free use, Whisper remains the benchmark. For professional needs with advanced features (diarization, custom vocabulary), cloud services are generally preferred.

How can I improve the quality of a Speech To Text transcription with an LLM?

After raw transcription, you can use a prompt asking the LLM to correct common errors: misinterpreted homophones, distorted proper nouns, missing punctuation, and paragraph segmentation. Provide context (topic of conversation, participant names, expected technical vocabulary) so the model can make more relevant corrections.
