Chain-of-thought prompting is a technique in which you explicitly ask an AI language model to work through a problem step by step before giving an answer. Instead of demanding the result immediately, you let the model write out its reasoning — the individual considerations, intermediate steps, and conclusions. On logic, math, and multi-step tasks this raises the hit rate dramatically, because the model can build and check its own argument rather than guessing blindly.
The idea sounds simple, yet it fundamentally changed the practice of prompt engineering. The trigger was the paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" by Jason Wei and colleagues at Google Research (2022). It showed that the very same models that fail a task suddenly solve it once you ask them to reason step by step. This guide explains exactly what chain-of-thought is, why it works, how to use it yourself — and when you are better off skipping it.
What is chain-of-thought prompting?
Chain-of-thought prompting (CoT) is a prompting technique that gets a model to spell out its reasoning explicitly before delivering a final answer. The core is a single instruction: "Think step by step." That request produces a chain of intermediate thoughts — hence the name — that leads the model to the solution, instead of having it blurt out a result directly.
There are two variants. With zero-shot CoT, the phrase "Let's think step by step" is enough, with no further examples. With few-shot CoT, you show the model one to three fully worked examples with the reasoning written out, which it then imitates. According to the original Google study (Wei et al., 2022), few-shot CoT lifted the accuracy of the largest model tested, PaLM 540B, on the GSM8K math benchmark from 17.9 to 56.9 percent. That is more than a tripling, driven purely by how the request was phrased, with no change to the model itself.
Where CoT came from
Before 2022 the prevailing assumption was that bigger models keep getting better but hit a wall on genuine reasoning. The CoT paper showed that the ability was latent and merely needed the right prompt to unlock it. Later in 2022, Kojima et al. showed in "Large Language Models are Zero-Shot Reasoners" that simply adding "Let's think step by step," with no example at all, produces large jumps. Since then CoT has been a fixture of serious prompt engineering practice, and the same idea of reasoning before answering underlies modern reasoning models.
An emergent ability, not a trick
A key finding of the original study: CoT only kicks in above a certain model size. On small models the instruction to reason step by step brought little improvement, sometimes even a regression, because they could not produce the necessary depth of intermediate steps. Only above roughly 100 billion parameters did the effect break through — the researchers called it an "emergent ability." That explains why CoT works so reliably on the large 2026 models: they have the capacity to sustain a coherent train of thought across many steps. In practice this means CoT is not a magic word you throw at any system, but a technique that deliberately activates the existing competence of a capable model. The gain is largest on strong models and may vanish on very small or heavily pruned ones.
CoT versus the direct answer
The contrast with a direct answer makes the mechanism clear. A direct answer forces the model to land the end result in a single prediction step — it has to run through the whole logic implicitly, invisibly, with no chance to correct. CoT distributes the same work across many small, explicit steps. Each step is simpler on its own and therefore more reliable, and each builds on the text generated before it. That is exactly why CoT is not a cosmetic add-on but changes how the model arrives at the answer. It is the difference between "write the result down immediately" and "work it out in front of me." The latter is slower but markedly more precise on anything that takes more than one mental operation.
Why does chain-of-thought improve answers?
Chain-of-thought improves answers because it lets the model break a complex problem into smaller, individually solvable substeps and makes each intermediate state visible. A language model predicts the most probable next token. Demand the answer immediately and it has to produce the end result in a single leap. Let it reason first, and each intermediate step adds context that the next step can build on.
Technically, CoT creates more "compute room" in the output. The model uses the extra tokens to organize assumptions, track numbers, and unfold logic, rather than doing everything implicitly in one shot. An Anthropic analysis of visible reasoning (2025) confirms that written-out reasoning lowers the error rate especially on multi-step tasks. A pleasant side effect: you see the path and can check it. When the model is wrong, you immediately spot where — a huge advantage over a bare, unjustified number that could simply be false.
The difference between guessing and computing
Take the question: "A shirt costs 48 euros after a 20 percent discount. What was the original price?" Without CoT the model often jumps to a plausible-sounding but wrong number. With CoT it writes: "48 euros equals 80 percent of the original. One percent is 48 / 80 = 0.60 euros. The original is 100 percent, so 60 euros." That is the same path a human would take on paper, just written out. Making each step visible forces consistency.
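The shirt calculation can be checked mechanically. The following is a minimal sketch, not part of the original example: the function names `original_price` and `naive_guess` are made up here, and the "wrong shortcut" of adding 20 percent back onto the reduced price is only an assumed illustration of a typical slip.

```python
from fractions import Fraction


def original_price(discounted_price, discount_percent):
    """Recover the pre-discount price by following the written-out chain."""
    remaining_percent = 100 - discount_percent                     # 80 percent of the original is left
    one_percent = Fraction(discounted_price, remaining_percent)    # 48 / 80 = 0.60 euros
    return one_percent * 100                                       # 100 percent is the original price


def naive_guess(discounted_price, discount_percent):
    """Assumed example of a wrong shortcut: add the discount back onto the reduced price."""
    return Fraction(discounted_price) * (100 + discount_percent) / 100


print(float(original_price(48, 20)))  # 60.0
print(float(naive_guess(48, 20)))     # 57.6
```

The two results differ because the 20 percent refers to the original price, not to the discounted one. That is exactly the kind of step a written-out chain makes visible.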
Where the gain is largest
Not every task benefits equally. The biggest jump appears on problems with several interdependent steps, where an early error topples the whole solution. These include multi-step word problems, logic puzzles, if-then chains, sorting by multiple criteria, and weighing conditions. CoT also helps on tasks that tempt the model to jump to an "obvious" answer, because the written-out reasoning brakes the premature intuition. The overview below ranks typical tasks by expected benefit.
| Task type | Benefit from CoT |
|---|---|
| Multi-step math problems | Very high |
| Logic puzzles, if-then chains | Very high |
| Planning, prioritization with conditions | High |
| Code debugging, error analysis | High |
| Simple factual question | Low to none |
| Translation, short classification | Low to none |
Valuable for explanations too
Beyond accuracy, CoT has an instructional value. When you want to understand not just the result but the path to it, the written-out chain delivers a ready-made explanation. That is useful when learning, when following a recommendation, or when you have to trust a model output in front of others. An answer that shows its work can be checked, questioned, and corrected — a bare number cannot. Especially in areas of responsibility, such as financial or medical reasoning, this traceability often matters more than a few percentage points of extra accuracy. The chain is thus not only a means to a better answer but also evidence by which you can judge the quality of the answer yourself.
How do you write a chain-of-thought prompt?
You write a chain-of-thought prompt by placing a clear reasoning instruction before the actual task and specifying the format of the solution path. The simplest entry point is zero-shot CoT: append the sentence "Think step by step and explain your reasoning before stating the result" to your question. That alone is enough for most everyday cases.
For more demanding tasks you structure the prompt more explicitly. A proven template reads: "Solve the following task. First list the given quantities. Then describe the required steps one by one. Carry out each calculation. Only at the very end, output the final result on a separate line, prefixed with 'Answer:'." Separating the reasoning from the final result makes the output easy to process further, both for humans and for downstream systems.
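The template and the 'Answer:' convention can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the names `build_structured_prompt` and `extract_answer` are invented for this example, no model is called, and the sample output is hand-written rather than real model output.

```python
import re
from typing import Optional

# The template text is taken from the paragraph above; only the "Task:" slot is added.
STRUCTURED_COT_TEMPLATE = (
    "Solve the following task. First list the given quantities. "
    "Then describe the required steps one by one. Carry out each calculation. "
    "Only at the very end, output the final result on a separate line, "
    "prefixed with 'Answer:'.\n\n"
    "Task: {task}"
)


def build_structured_prompt(task: str) -> str:
    """Insert the concrete task into the structured CoT template."""
    return STRUCTURED_COT_TEMPLATE.format(task=task)


def extract_answer(model_output: str) -> Optional[str]:
    """Return the text after the last line that starts with 'Answer:', or None."""
    matches = re.findall(r"^Answer:[ \t]*(.+)$", model_output, flags=re.MULTILINE)
    return matches[-1].strip() if matches else None


# Hand-written stand-in for a model response.
sample_output = (
    "Given: discounted price 48 euros, discount 20 percent.\n"
    "Step 1: 48 euros is 80 percent of the original.\n"
    "Step 2: 48 / 80 * 100 = 60.\n"
    "Answer: 60 euros"
)
print(extract_answer(sample_output))  # 60 euros
```

Returning `None` when the marker is missing lets a downstream system detect that the model ignored the format instead of silently passing on the whole reasoning text.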
The three levels at a glance
| Variant | When to use | How to phrase it |
|---|---|---|
| Zero-shot CoT | Quick everyday questions, simple logic | "Think step by step." |
| Few-shot CoT | Format fidelity, consistent method | Prepend one to three worked examples |
| Structured CoT | Production pipelines, auditability | Name the steps, output the result separately |
Few-shot CoT pays off when you need answers that are not just correct but uniform. You show the model one to three full examples with reasoning, and it adopts not only the logic but also the presentation. That produces a consistent pattern across many requests, which is crucial when the output is processed automatically.
A complete example
"You are a careful analyst. Task: A team of 4 people handles 60 tickets per day. Two people drop out, and one new person joins who is only half as fast. How many tickets does the team handle now? Think step by step, show every calculation, and output the result at the end with 'Answer:'." Such prompts combine role, task, and the CoT instruction — the building blocks from the [prompt engineering fundamentals](/magazin/prompt-engineering-fundamentals) interlock seamlessly here.
Few-shot CoT in concrete terms
For few-shot CoT you give the model complete exemplars. A classification example with reasoning looks like this: "Example 1: Request: 'My server has been down for an hour and customers are complaining.' Reasoning: production outage with direct business impact, affects multiple users. Urgency: high. Example 2: Request: 'Could you adjust the logo in the footer at some point?' Reasoning: cosmetic, no time pressure, one user. Urgency: low. New request: [...]. Reasoning:". The model adopts not only the logic but the form of the reasoning and the vocabulary of the levels. That very uniformity is why few-shot CoT is often the first choice in production systems: it delivers not just correct but predictably formatted answers that can be processed automatically.
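Assembled programmatically, the urgency prompt above might look like the following sketch. The two exemplars come from the text; the `Exemplar` class, the function name `build_few_shot_prompt`, the exact line layout, and the new request in the usage line are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    request: str
    reasoning: str
    urgency: str


# The two worked examples from the paragraph above.
EXEMPLARS = [
    Exemplar(
        request="My server has been down for an hour and customers are complaining.",
        reasoning="production outage with direct business impact, affects multiple users.",
        urgency="high",
    ),
    Exemplar(
        request="Could you adjust the logo in the footer at some point?",
        reasoning="cosmetic, no time pressure, one user.",
        urgency="low",
    ),
]


def build_few_shot_prompt(exemplars: list, new_request: str) -> str:
    """Render the exemplars in one fixed layout and end on an open 'Reasoning:' line."""
    parts = []
    for number, ex in enumerate(exemplars, start=1):
        parts.append(
            f"Example {number}:\n"
            f"Request: '{ex.request}'\n"
            f"Reasoning: {ex.reasoning}\n"
            f"Urgency: {ex.urgency}."
        )
    # The prompt stops after "Reasoning:" so the model continues in the same pattern.
    parts.append(f"New request: '{new_request}'\nReasoning:")
    return "\n\n".join(parts)


# Hypothetical new request, invented for this usage example.
print(build_few_shot_prompt(EXEMPLARS, "The export button is greyed out for our whole team."))
```

Keeping the exemplars as data rather than as one long string makes it easy to swap or add examples without accidentally changing the layout the model is supposed to imitate.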
CoT and delimiters
Once the prompt grows longer, delimiters help keep the model from confusing the instruction, the examples, and the new task. Mark the examples clearly as examples, place the actual task in its own clearly fenced block, and end the prompt with the CoT instruction and the result format. This cleanliness stops the model from mistaking the worked examples for the task to solve — a typical error in densely packed few-shot prompts. Structured inputs are not only more readable for you but measurably increase the reliability of the chain across many requests.
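One way to apply this is to wrap each part of the prompt in its own tagged block. The sketch below uses XML-style tags as delimiters, which is an assumption: the article does not prescribe a delimiter style, and the tag names and the helper names `fenced` and `build_delimited_prompt` are chosen here purely for illustration.

```python
def fenced(tag: str, body: str) -> str:
    """Wrap a piece of text in an opening and a closing tag."""
    return f"<{tag}>\n{body.strip()}\n</{tag}>"


def build_delimited_prompt(examples: list, task: str) -> str:
    """Order: short framing, fenced examples, fenced task, CoT instruction with result format."""
    example_block = "\n\n".join(fenced("example", example) for example in examples)
    return "\n\n".join(
        [
            "The examples block contains illustrations only. Solve only the request in the task block.",
            fenced("examples", example_block),
            fenced("task", task),
            "Think step by step, then output the result on a separate line prefixed with 'Answer:'.",
        ]
    )


print(
    build_delimited_prompt(
        examples=[
            "Request: server down for an hour\nReasoning: production outage\nUrgency: high",
            "Request: adjust the footer logo\nReasoning: cosmetic\nUrgency: low",
        ],
        task="Request: invoices are sent twice since this morning",  # hypothetical request
    )
)
```

Because the task sits in its own block and the instruction comes last, the worked examples cannot easily be mistaken for the request to solve.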
If you want more related methods — such as self-consistency, prompt chaining, or tree-of-thought — you will find them compactly in the overview [15 prompt engineering techniques](/magazin/15-prompt-engineering-techniques), which places CoT in the larger toolbox.
Self-consistency: amplifying CoT
A powerful extension is self-consistency. Instead of letting the model reason once, you have it solve the same task several times with some variation, so that each run produces its own chain of thought, and then take the answer that occurs most often. The idea: on a hard problem, different correct reasoning paths converge on the same answer, while errors scatter in different directions. The majority therefore corrects individual slips. Wang et al. showed in 2022 that self-consistency raises accuracy on benchmarks such as GSM8K markedly beyond plain CoT. The price is multiple runs and thus higher cost, which is why the technique pays off mainly where correctness is critical and the extra effort is justified.
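The majority vote itself is simple to sketch. In the example below, `sample` stands for any function that sends the prompt to a model with some sampling randomness and returns the full text; that function, the canned outputs, and the 'Answer:' line convention are assumptions for illustration, not the setup used by Wang et al.

```python
import re
from collections import Counter
from typing import Callable, Optional


def extract_answer(output: str) -> Optional[str]:
    """Return the text after the last line that starts with 'Answer:', or None."""
    matches = re.findall(r"^Answer:[ \t]*(.+)$", output, flags=re.MULTILINE)
    return matches[-1].strip() if matches else None


def self_consistent_answer(sample: Callable[[str], str], prompt: str, runs: int = 5) -> Optional[str]:
    """Sample several chains for the same prompt and return the most frequent final answer."""
    answers = []
    for _ in range(runs):
        answer = extract_answer(sample(prompt))
        if answer is not None:          # runs without an 'Answer:' line are ignored
            answers.append(answer)
    if not answers:
        return None
    # On a tie, Counter keeps the answer that was seen first.
    return Counter(answers).most_common(1)[0][0]


# Canned stand-ins for three sampled chains; a real `sample` would call a model.
canned_outputs = iter(
    [
        "48 euros is 80 percent. 48 / 0.8 = 60.\nAnswer: 60",
        "Add 20 percent back: 48 * 1.2 = 57.6.\nAnswer: 57.6",
        "One percent is 0.60 euros, so 100 percent is 60.\nAnswer: 60",
    ]
)


def fake_sample(prompt: str) -> str:
    return next(canned_outputs)


print(self_consistent_answer(fake_sample, "What was the original price?", runs=3))  # 60
```

The comparison is on exact strings, so in practice the answers usually need normalizing first (units, decimal places) before they are counted.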
Common mistakes when writing CoT
Three pitfalls trip up beginners again and again. First, failing to separate the final result from the reasoning. Without a clear instruction like "output the result separately at the end," the model blends rationale and answer, which complicates automatic processing. Second, combining CoT with a tight length limit. "Think step by step, answer in one sentence" is a contradiction — the brevity smothers the chain. Third, throwing CoT at tasks that need no reasoning at all and then being surprised by bloated answers. Avoid these three mistakes and you extract the technique's full value without creating new problems.
When should you not use it?
You should not use chain-of-thought when the task is simple, requires no reasoning, or calls for a short, immediate answer. For a factual question like "What is the capital of France?" or a simple classification, CoT produces only needless text, costs more tokens, increases latency, and can even distract. For those tasks a direct zero-shot prompt is faster and just as correct.
There are three further caveats. First, CoT costs money and time: longer outputs mean more tokens and slower responses, a real factor in high-throughput production systems. Second, the written-out reasoning is not guaranteed to reflect the model's true internal logic; Anthropic's 2025 work on the "faithfulness" of reasoning showed that the visible chain is occasionally rationalized after the fact to fit the answer. Third, modern reasoning models such as OpenAI's o-series already work through many steps internally, so an extra "think step by step" instruction adds little there and can disrupt the format.
Reasoning models change the calculation
One development deserves special attention. Since 2024 there has been a distinct class of reasoning models that run a detailed chain of thought internally before answering, whether visible or hidden. On these models the manual "think step by step" instruction is often redundant and can even disrupt the desired answer format, because the model is already reasoning anyway. The rule of thumb therefore depends on the tool: on classic chat models, explicit CoT remains a strong lever; on dedicated reasoning models, you state the task clearly and leave the reasoning to the model. When in doubt, check the respective provider's documentation: Anthropic, OpenAI, and Google sometimes explicitly state for their reasoning models that you should skip manual CoT instructions.
The rule of thumb
Use CoT whenever a task involves math, multi-step logic, planning, or weighing several conditions. Skip it for factual lookups, short classifications, translations, and anywhere speed or brevity matters. When in doubt, test both variants on three real examples and compare quality against cost — that very discipline of comparing is what separates professional prompt engineering from trial and error. Record your findings: once you know a particular class of task benefits from CoT, you save the test next time and reach straight for the right variant.
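The comparison of both variants can be kept honest with a tiny harness. What follows is a rough sketch under assumptions: `ask` is a placeholder for whatever function calls your model, word count is only a crude stand-in for token cost, and answers are compared as exact strings.

```python
from typing import Callable

COT_SUFFIX = "Think step by step and state the result separately at the end, prefixed with 'Answer:'."


def final_answer(output: str) -> str:
    """Take the last non-empty line and drop an optional 'Answer:' prefix."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        return ""
    return lines[-1].split("Answer:")[-1].strip()


def evaluate(ask: Callable[[str], str], cases: list, suffix: str = "") -> dict:
    """Run every (question, expected) pair through `ask`; report accuracy and average length."""
    correct = 0
    total_words = 0
    for question, expected in cases:
        output = ask(f"{question} {suffix}".strip())
        total_words += len(output.split())            # crude proxy for output cost
        correct += final_answer(output) == expected   # True counts as 1
    return {"accuracy": correct / len(cases), "avg_words": total_words / len(cases)}


# Usage with a real model function (hypothetical name `call_model`):
# direct = evaluate(call_model, cases)
# with_cot = evaluate(call_model, cases, suffix=COT_SUFFIX)
```

Running both calls on the same three real cases gives one accuracy figure and one length figure per variant, which is enough to decide whether the extra output is worth it for that class of task.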
Building CoT into your daily work
To keep CoT from staying a one-off experiment, a small process pays off. Collect the task types that recur in your work — quote calculations, log-file debugging, or prioritizing tasks by multiple criteria. Write one clean CoT template per type, with a clear step structure and a separated final result, test it on three real cases, and store the winners centrally. That turns a technique into a reusable routine that delivers consistent results across a team. A maintained prompt library such as Prompt2Love makes exactly this possible: instead of reinventing "think step by step" every time, you pull up your proven template. Across many requests, this discipline advantage outweighs any single model jump — because it makes good results reproducible rather than accidental.
Conclusion
Chain-of-thought prompting is one of the most powerful and at the same time simplest techniques in prompt engineering. A single instruction to reason step by step turns a model that guesses on logic into one that computes cleanly — and makes the path auditable along the way. The price is extra tokens and latency, which is why you reserve CoT deliberately for demanding, multi-step tasks and skip it on simple lookups.
The next step is practice: take a real task with a math or logic component, prepend the instruction "Think step by step and state the result separately at the end," and compare the result with the direct question. Save the better variant in your prompt library — for example in Prompt2Love — so that every successful chain becomes a reusable template for you and your team.
CoT is just one building block of a larger repertoire. Set role, context, format, and constraints cleanly, combine them with the right reasoning technique, and you extract the maximum from any model. The systematic foundation for this comes from the [prompt engineering fundamentals](/magazin/prompt-engineering-fundamentals) — they show how CoT interlocks with the five building blocks of a good prompt and when which combination makes sense.
The three-point summary
1. What for: CoT makes models markedly more accurate on logic, math, and multi-step tasks because they write out their solution path.
2. How: Append "think step by step," or for few-shot show one to three worked examples and output the result separately at the end.
3. When not: On simple factual questions, short classifications, and on modern reasoning models that already reason internally.
