Reasoning Through Problems
Learning Objectives
- You understand what chain-of-thought prompting is and how it improves reasoning.
- You know techniques for eliciting step-by-step reasoning from language models.
- You understand what reasoning models are and how they differ from standard language models.
- You can identify when reasoning techniques are appropriate for a task.
When tasks require reasoning
Many tasks require reasoning that involves more than pattern matching or information retrieval: breaking down problems, applying logical steps, and working through solutions systematically. Mathematical calculations, logical puzzles, multi-step planning, and complex decision-making all benefit from explicit reasoning processes rather than direct pattern-based responses.
Older large language models often failed at relatively simple arithmetic tasks (e.g., 23 + 47 + 4 - 2) because they could not perform multi-step calculations reliably. Modern models have improved but still struggle with complex reasoning without guidance.
Consider a practical example: baking chocolate cookies when you’re short on ingredients. Suppose a recipe that produces 50 cookies requires 200g of chocolate, 150g of butter, 2 dl of sugar, 1 egg, 3 dl of wheat flour, and a pinch of salt. If you only have 75g of butter, how many cookies can you make?
Let’s try asking the model using few-shot prompting to constrain the output format:
A chocolate cookie recipe that produces 50 cookies requires 200g of chocolate, 150g of butter, 2 dl of sugar, 1 egg, 3 dl of wheat flour, and a pinch of salt.
#####
I have all the ingredients, but I only have 100 grams of chocolate. How many cookies can I produce?
#####
Answer: 25
#####
I have all the ingredients, but I only have 0.2 dl of sugar. How many cookies can I produce?
#####
Answer: 5
#####
I have all the ingredients, but I only have 75 grams of butter. How many cookies can I produce?
#####
Answer:
12.5
The answer should be 25 cookies (since 75g is half of 150g, yielding half of 50 cookies). The model has failed to reason correctly about the proportion, despite the examples provided.
This illustrates a common challenge: while language models can perform many tasks remarkably well, they sometimes struggle with tasks requiring multi-step reasoning, particularly when the reasoning process isn’t explicitly demonstrated in the prompt.
Chain-of-thought prompting
Chain-of-thought (CoT) prompting is a technique where models are guided to show their reasoning process step-by-step before reaching a conclusion. Rather than jumping directly to an answer, the model articulates intermediate reasoning steps, which often leads to more accurate results on complex problems.
The key insight behind CoT prompting is that language models are better at reasoning when they externalize their thinking process. By generating the reasoning steps as text, the model can “use” those steps to inform subsequent token generation, similar to how humans work through problems on paper or verbally explain their thinking.
Few-shot chain-of-thought
Let’s revise our cookie problem to include explicit reasoning in the examples:
A chocolate cookie recipe that produces 50 cookies requires 200g of chocolate, 150g of butter, 2 dl of sugar, 1 egg, 3 dl of wheat flour, and a pinch of salt.
#####
I have all the ingredients, but I only have 100 grams of chocolate. How many cookies can I produce?
#####
Answer: As 200 grams of chocolate yields 50 cookies, 100 grams of chocolate yields 25 cookies.
#####
I have all the ingredients, but I only have 0.2 dl of sugar. How many cookies can I produce?
#####
Answer: As 2 deciliters of sugar yields 50 cookies, 0.2 deciliters of sugar yields 5 cookies.
#####
I have all the ingredients, but I only have 75 grams of butter. How many cookies can I produce?
#####
Answer:
As the original recipe calls for 150 grams of butter to produce 50 cookies, having only 75 grams of butter would allow you to produce 25 cookies.
By showing the reasoning process in the examples, we guide the model to reason about the proportional relationship rather than pattern-match to incorrect answers. The model now correctly calculates that 75g (half of 150g) produces 25 cookies (half of 50).
Zero-shot chain-of-thought
Surprisingly, chain-of-thought reasoning can be elicited without providing any examples, simply by adding a phrase that prompts step-by-step thinking. The most well-known is "Let's think step by step," introduced in the 2022 paper "Large Language Models are Zero-Shot Reasoners". Variants like "Let's work through this systematically" or "Let's break this down" also work effectively.
A chocolate cookie recipe that produces 50 cookies requires 200g of chocolate, 150g of butter, 2 dl of sugar, 1 egg, 3 dl of wheat flour, and a pinch of salt.
I have only 75g of butter. How many cookies can I produce?
Let's think step by step
First, we need to calculate the ratio of the ingredients to determine how many cookies can be made with 75g of butter.
The original recipe calls for 150g of butter to produce 50 cookies.
To find out how many cookies can be made with 75g of butter, we can set up a proportion:
150g butter is to 50 cookies as 75g butter is to x cookies
150g/50 cookies = 75g/x cookies
Cross multiplying, we get:
150g * x = 75g * 50
150x = 3750
x = 3750 / 150
x = 25
So, with 75g of butter, you can produce 25 cookies.
The phrase “Let’s think step by step” triggered detailed reasoning, including setting up a proportion and showing the mathematical steps explicitly. This demonstrates that modern language models have learned during training to associate certain phrases with more careful, explicit reasoning processes.
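In practice this just means appending the trigger phrase to the prompt. Here is a minimal sketch using the OpenAI Python client; the model name is a placeholder, and any chat-capable model or provider would work the same way:

from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

question = (
    "A chocolate cookie recipe that produces 50 cookies requires 200g of "
    "chocolate, 150g of butter, 2 dl of sugar, 1 egg, 3 dl of wheat flour, "
    "and a pinch of salt.\n"
    "I have only 75g of butter. How many cookies can I produce?"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; substitute whichever model you use
    messages=[{"role": "user", "content": question + "\nLet's think step by step."}],
)
print(response.choices[0].message.content)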
Try asking the AI chatbot to perform mathematics. Start with simple arithmetic like 7 + 7, then gradually build up to more complex calculations. The chatbot is powered by a fairly powerful model, so it should handle basic math correctly. If you find a scenario where it makes a mistake, try adding "Let's think step by step" to see if that helps.
Advanced reasoning techniques
Researchers have developed several extensions to basic chain-of-thought prompting that improve performance on particularly challenging problems.
Self-consistency
Self-consistency generates multiple reasoning paths for the same problem and selects the most common answer through majority voting. Instead of accepting the first response, the system:
- Generates several responses (typically 5-10) using chain-of-thought prompting with sampling temperature > 0 to produce diversity
- Extracts the final answer from each response
- Takes the majority vote among the answers as the final output
This approach improves accuracy on complex reasoning tasks because incorrect reasoning paths are less likely to converge on the same wrong answer, while correct paths tend to reach the same conclusion even if they take different routes. The method exploits the intuition that there are more ways to reason incorrectly than correctly, so correct answers emerge through consensus.
For example, asking a model to solve a complex math problem multiple times might yield:
- Response 1: [reasoning steps] → Answer: 42
- Response 2: [different reasoning steps] → Answer: 42
- Response 3: [yet different reasoning steps] → Answer: 38
- Response 4: [more reasoning steps] → Answer: 42
- Response 5: [final reasoning steps] → Answer: 42
Self-consistency would select 42 as the final answer based on the majority vote (4 out of 5 responses).
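A minimal sketch of this voting loop in Python, assuming a hypothetical sample_response function that stands in for whatever model API you use:

import re
from collections import Counter

def extract_answer(text):
    """Pull the last number out of a chain-of-thought response."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None

def self_consistent_answer(prompt, n_samples=5):
    """Sample several reasoning paths and majority-vote on their final answers."""
    answers = []
    for _ in range(n_samples):
        # sample_response is a hypothetical stand-in for a model API call,
        # made with temperature > 0 so the reasoning paths differ
        text = sample_response(prompt + "\nLet's think step by step.", temperature=0.8)
        answer = extract_answer(text)
        if answer is not None:
            answers.append(answer)
    return Counter(answers).most_common(1)[0][0] if answers else None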
The trade-off is computational cost: generating multiple responses requires multiple model calls, multiplying API costs (and latency, unless the calls run in parallel). However, for high-stakes decisions or particularly difficult problems where accuracy is paramount, this investment can be worthwhile.
Tree of thoughts and graph of thoughts
Tree of thoughts extends chain-of-thought by exploring multiple reasoning paths simultaneously rather than committing to a single path. The approach:
- Generates several possible next reasoning steps from the current state
- Evaluates which steps seem most promising using a scoring mechanism
- Explores multiple branches in parallel
- Backtracks if a path seems unproductive based on evaluation
- Continues until reaching satisfactory solutions
This mirrors how humans approach complex problems — trying different approaches, evaluating progress at intermediate stages, and switching strategies if necessary rather than following a single path to completion.
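In code, the core search might look like the following toy sketch, where propose_steps and score_state are hypothetical stand-ins for model calls that generate candidate next steps and rate how promising a partial solution looks:

def tree_of_thoughts(problem, max_depth=4, branching=3, beam_width=2):
    frontier = [problem]  # partial reasoning states currently being explored
    for _ in range(max_depth):
        candidates = []
        for state in frontier:
            # propose several possible next reasoning steps from each state
            for step in propose_steps(state, n=branching):
                candidates.append(state + "\n" + step)
        # keep only the most promising branches; dropping low-scoring states
        # is what implements backtracking
        candidates.sort(key=score_state, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=score_state)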
Graph of thoughts generalizes this further by allowing reasoning paths to merge and diverge arbitrarily rather than following a strict tree structure. This enables more flexible exploration of solution spaces where insights from different reasoning paths can be combined.
These techniques are primarily research directions rather than widely-deployed practical tools, but they illustrate ongoing work to improve reasoning capabilities beyond linear chains of thought.
Tool-augmented reasoning
Tool-augmented reasoning recognizes that language models shouldn’t solve all problems purely through text generation. For certain tasks — precise calculations, database queries, web searches, code execution — external tools are more reliable than the model’s internal reasoning capabilities.
In tool-augmented systems, the model:
- Recognizes when a tool would be helpful for solving part of the problem
- Generates a properly-formatted tool call with appropriate parameters
- Receives the tool’s output
- Incorporates that output into its ongoing reasoning
- Continues until the problem is solved
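A toy sketch of this loop, assuming a hypothetical call_model function that returns either a final answer or a structured tool call, plus hypothetical run_python and web_search tool implementations:

TOOLS = {"python": run_python, "search": web_search}  # hypothetical tool functions

def solve_with_tools(question, max_turns=5):
    history = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = call_model(history)      # hypothetical model API call
        if reply.tool_call is None:      # the model answered directly
            return reply.content
        history.append(reply)            # record the model's tool request
        # dispatch the requested tool and feed its output back to the model
        result = TOOLS[reply.tool_call.name](**reply.tool_call.arguments)
        history.append({"role": "tool", "content": str(result)})
    return "No final answer within the turn budget."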
For example, when asked “What’s the 47th Fibonacci number?”, a tool-augmented model might:
- Recognize this requires precise calculation beyond its reliable capabilities
- Generate Python code to calculate Fibonacci numbers efficiently
- Execute that code using a code interpreter tool
- Receive the result: 2,971,215,073
- Return this precise answer with confidence
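The generated code itself could be as simple as the following iterative implementation (shown for illustration; the exact code a model writes will vary):

def fibonacci(n):
    """Compute the n-th Fibonacci number iteratively (F(1) = F(2) = 1)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fibonacci(47))  # 2971215073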
This approach combines the model’s language understanding and reasoning with the precision of computational tools, yielding more reliable results than pure text generation for many technical tasks.
Reasoning models
Recent developments have introduced a new class of models designed specifically for reasoning. These reasoning models use extended inference-time computation to solve complex problems more reliably.
How reasoning models differ
Standard language models generate responses relatively quickly, using similar computational resources regardless of problem difficulty. A simple question and a complex mathematical proof receive roughly the same computational budget. Reasoning models, by contrast, “think longer” on harder problems by:
- Producing extensive chain-of-thought (or similar) reasoning that may not be shown to users but guides the final response. This internal reasoning can span thousands of tokens.
- Using self-consistency techniques to generate multiple reasoning paths and select the most consistent answer.
- Using self-verification techniques to double- and triple-check their own reasoning, catching errors before producing a final answer.
Overall, these techniques mean that the model invests more computation on difficult problems, improving accuracy at the cost of increased latency and expense.
Practical guidance for reasoning tasks
When approaching tasks that require reasoning, seek to balance accuracy, cost, and latency:
- Try direct prompting first: Many problems don’t require explicit reasoning techniques. Start with clear, direct prompts.
- Add “Let’s think step by step”: If direct prompting fails or produces uncertain results, zero-shot CoT is a low-cost next step.
- Use few-shot CoT: If zero-shot CoT is insufficient, provide explicit reasoning examples demonstrating the desired problem-solving approach.
- Consider reasoning models: For particularly complex problems where accuracy matters more than speed or cost.
- Apply advanced techniques: Self-consistency or tool augmentation for critical decisions requiring high confidence.
Regardless of the technique you end up with, always verify the outputs carefully.