Speech To Text: Definition and Examples
Speech To Text (STT), also called automatic speech recognition (ASR) and often loosely referred to as voice recognition, is an artificial intelligence technology that converts human speech into written text, allowing machines to transcribe audio content automatically.
Full definition
Speech To Text refers to any technology that transforms an audio signal containing human speech into a textual transcription. It relies on deep learning models that analyze the audio signal and reconstruct the corresponding words and sentences, either by identifying phonemes as an intermediate step or by mapping acoustic features directly to text.
Modern STT systems mainly use neural network architectures such as Transformers, which have significantly improved transcription accuracy. Models such as OpenAI's Whisper and services such as Google Speech-to-Text can transcribe dozens of languages and cope with accents and background noise. Some services can also distinguish between multiple speakers (diarization); Whisper does not do this on its own and is typically paired with a separate diarization model.
In the context of prompt engineering and generative AI, Speech To Text is often the first step of a larger pipeline. For example, a voice assistant first uses STT to transcribe the user's request, then a large language model (LLM) to generate a response, and finally a Text To Speech (TTS) system to vocalize it. The quality of the initial transcription directly affects the relevance of the generated response.
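The three-stage pipeline described above can be sketched as follows. This is a minimal illustration, not a real assistant: the VoiceAssistant class and the fake_stt, fake_llm, and fake_tts functions are hypothetical stand-ins, and in practice each stage would call an actual STT, LLM, or TTS model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceAssistant:
    """Chains the three stages of the pipeline: STT, then LLM, then TTS."""

    transcribe: Callable[[bytes], str]   # STT: audio in, text out
    generate: Callable[[str], str]       # LLM: text in, text out
    synthesize: Callable[[str], bytes]   # TTS: text in, audio out

    def respond(self, audio: bytes) -> bytes:
        request_text = self.transcribe(audio)     # step 1: understand the request
        reply_text = self.generate(request_text)  # step 2: generate a response
        return self.synthesize(reply_text)        # step 3: vocalize it


# Hypothetical stand-ins so the sketch runs without any real model.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You asked: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


assistant = VoiceAssistant(fake_stt, fake_llm, fake_tts)
reply_audio = assistant.respond(b"what time is it")
```

Because the LLM only ever sees the output of the first stage, any word the STT stage gets wrong is passed on unchanged to the rest of the chain.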
Applications of STT are widespread: automatic video subtitling, voice dictation, meeting transcription, accessibility for people who are deaf or hard of hearing, voice control of connected devices, and conversation analysis in call centers. These technologies are evolving quickly and automatic transcription is increasingly reliable, approaching or even exceeding human accuracy in some contexts, typically clean and well-recorded speech, while noisy audio, overlapping speakers, and specialized vocabulary remain harder.
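Accuracy comparisons of this kind are commonly expressed as a word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the transcription into a reference text, divided by the number of words in the reference. The sketch below is one simple way to compute it, assuming whitespace tokenization and lowercasing only. The function name is illustrative, and real evaluations usually apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Dynamic-programming edit distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the meeting starts at nine", "the meeting starts at night")
```

Here one of the five reference words is substituted, so the WER is 0.2. A lower value means a more accurate transcription.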
Etymology
The term "Speech To Text" combines three English words, "speech", "to", and "text", and literally describes the process of converting speech into text. The technology is also called "Automatic Speech Recognition" (ASR) in English and "reconnaissance automatique de la parole" (RAP) in French. The field dates back to the 1950s, when early systems could recognize isolated spoken digits, but the advent of deep learning in the 2010s made the technology practical at scale.
Concrete examples
Transcription of a meeting to extract minutes
Here is the automatic transcription of a 45-minute team meeting. Generate structured meeting minutes listing the decisions made, the actions to be taken, and the person responsible for each action. Correct any transcription errors based on context.
Automatic subtitling of a YouTube video
From this Speech To Text transcription of a tutorial video, generate subtitles in SRT format with a maximum of 42 characters per line and a maximum of 2 lines per subtitle. Correct punctuation and segmentation.
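The layout constraints in this prompt can also be applied or checked in code. The sketch below formats a single SRT cue (index, a timestamp line of the form HH:MM:SS,mmm --> HH:MM:SS,mmm, then the text) and enforces the 42-character and 2-line limits. The function names and sample text are illustrative assumptions, and a real subtitling tool would also need to split segments that are too long.

```python
import textwrap

MAX_CHARS_PER_LINE = 42  # limit taken from the prompt above
MAX_LINES = 2            # limit taken from the prompt above


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Build one SRT cue, wrapping the text to the line-length limit."""
    lines = textwrap.wrap(text, width=MAX_CHARS_PER_LINE)
    if len(lines) > MAX_LINES:
        raise ValueError("text does not fit in one subtitle; split the segment first")
    timing = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
    return f"{index}\n{timing}\n" + "\n".join(lines) + "\n"


cue = srt_cue(1, 0.0, 2.5, "Welcome to this tutorial on speech to text.")
```

The same functions can serve as a sanity check on subtitles returned by an LLM, since a cue that violates the limits raises an error instead of passing silently.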
Sentiment analysis on transcribed customer calls
Analyze the following transcriptions of customer service calls. For each call, identify the overall customer sentiment (positive, neutral, negative), the friction points mentioned, and the satisfaction level at the end of the call.
Practical usage
In prompt engineering, Speech To Text is often used as an input step to feed an LLM with transcribed oral content. It is crucial to ask the model to correct typical transcription errors (homophones, proper nouns, missing punctuation) before any processing. For optimal results, always specify the context of the source audio (meeting, interview, podcast) so that the model can tailor its corrections.
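One way to apply this advice consistently is to build the prompt from a small template that always states the audio context and the error types to correct. The sketch below is illustrative only: the function name and the exact wording of the prompt are assumptions, to be adapted to the model and task at hand.

```python
# Error types named in the paragraph above.
ERROR_TYPES = ("homophones", "proper nouns", "missing punctuation")


def build_cleanup_prompt(transcript: str, audio_context: str) -> str:
    """Wrap a raw transcript in instructions that ask for correction first."""
    errors = ", ".join(ERROR_TYPES)
    return (
        "The text below is an automatic Speech To Text transcription.\n"
        f"Source audio: {audio_context}.\n"
        f"Correct typical transcription errors ({errors}) based on context, "
        "without changing the meaning.\n"
        "Return only the corrected transcript.\n\n"
        f"Transcript:\n{transcript}"
    )


prompt = build_cleanup_prompt(
    "their going to ship the new release next weak",
    "weekly engineering team meeting",
)
```

The resulting string can then be sent to the LLM as a first cleanup pass, before any summarization or analysis step.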
FAQ
What is the difference between Speech To Text and voice recognition?
What are the best Speech To Text tools in 2025?
How can I improve the quality of a Speech To Text transcription with an LLM?