Large language models have introduced a remarkable new programming paradigm: prompting. Rather than writing code that explicitly defines behavior, we write natural language instructions that the model interprets and executes. The same model, with different prompts, can be a translator, a coding assistant, a creative writer, or a logical reasoner.
Even more striking is in-context learning: the ability of LLMs to learn new tasks from a few examples provided in the prompt, without any parameter updates. Show a model three examples of English-to-French translation, and it can translate a fourth. This emergent capability—discovered rather than designed—has transformed how we interact with AI.
Prompting is not merely an interface concern. The difference between a poor prompt and an excellent one can mean the difference between a useless response and a brilliant insight. Understanding prompting is essential for anyone working with modern language models.
This page covers the art and science of prompting: from basic prompt structure to advanced techniques like chain-of-thought reasoning, retrieval-augmented generation, and prompt optimization. You will understand how to extract maximum capability from any language model.
At its core, prompting provides context that guides the model's next-token prediction. Understanding the mechanics helps explain why certain techniques work.
When you send a prompt to an LLM, the prompt establishes the "context" in which generation occurs. Everything the model knows about what should come next derives from this context.
Think of the prompt as conditioning the model's output distribution:
$$P(\text{response} \mid \text{prompt}) \neq P(\text{response})$$
Different prompts activate different "modes" of the model by constraining the likely continuations.
| Component | Effect on Distribution | Example |
|---|---|---|
| System prompt | Sets overall role/behavior mode | "You are a Python expert..." |
| Instructions | Constrains task and format | "Explain in 3 bullet points..." |
| Examples | Defines input-output pattern | "Input: 2+2, Output: 4" |
| Context | Provides relevant information | "Given the following article..." |
| Query | Specifies current request | "What is the main argument?" |
A well-structured prompt typically includes:
[System/Role Definition]
You are an expert data scientist helping with machine learning tasks.
[Task Description]
Your task is to explain the following concept clearly and concisely.
[Constraints/Format]
Provide your explanation in 3 paragraphs:
1. Intuitive overview
2. Technical details
3. Practical implications
[Examples] (optional)
Example:
Concept: Overfitting
Explanation: [detailed example explanation]
[Input]
Concept: Regularization
[Output Primer] (optional)
Explanation:
The output primer (partial response to complete) is particularly powerful—it forces the model to continue in the specified format.
Information at the beginning and end of prompts has the strongest influence due to attention patterns. Critical instructions should appear early (strong attention) and be repeated at the end (recency effect). Long contexts can cause 'lost in the middle' effects.
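As an illustration, the template above can be assembled mechanically. The `build_prompt` helper and its argument names are hypothetical conveniences, not a standard API:

```python
def build_prompt(role: str, task: str, constraints: str,
                 examples: str = "", user_input: str = "",
                 output_primer: str = "") -> str:
    """Assemble prompt sections in a fixed, predictable order."""
    sections = [role, task, constraints, examples, user_input, output_primer]
    # Drop empty optional sections; separate the rest with blank lines
    return "\n\n".join(s for s in sections if s)

prompt = build_prompt(
    role="You are an expert data scientist helping with machine learning tasks.",
    task="Your task is to explain the following concept clearly and concisely.",
    constraints="Provide your explanation in 3 paragraphs.",
    user_input="Concept: Regularization",
    output_primer="Explanation:",
)
```

Keeping section order fixed makes prompts easier to diff and test; the output primer always lands last, where the model will continue from it.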
In-context learning (ICL) is the ability of LLMs to learn new tasks from examples in the prompt without gradient updates. This is a remarkable emergent capability of scaled models.
Zero-Shot: Task description only, no examples
Translate the following sentence to French:
"Hello, how are you?"
Few-Shot (1-5 examples): Task demonstrated through examples
Translate English to French:
English: Hello
French: Bonjour
English: Thank you
French: Merci
English: How are you?
French:
Many-Shot (10-100+ examples): Extensive demonstration
# Zero-shot classification
zero_shot_prompt = """Classify the sentiment of the following movie review as Positive or Negative.

Review: "This film was an absolute waste of time. The acting was wooden and the plot made no sense."

Sentiment:"""

# Few-shot classification
few_shot_prompt = """Classify movie review sentiment as Positive or Negative.

Review: "I loved every minute of this masterpiece!"
Sentiment: Positive

Review: "Boring and predictable. I wanted my money back."
Sentiment: Negative

Review: "An incredible journey that left me in tears. Must see!"
Sentiment: Positive

Review: "This film was an absolute waste of time. The acting was wooden and the plot made no sense."
Sentiment:"""

# Many-shot with structured examples
def create_many_shot_prompt(examples: list[dict], query: str) -> str:
    """Create a many-shot prompt from labeled examples."""
    prompt = "Classify movie review sentiment.\n\n"
    for ex in examples:
        prompt += f"Review: {ex['text']}\nSentiment: {ex['label']}\n\n"
    prompt += f"Review: {query}\nSentiment:"
    return prompt

# Research finding: few-shot often outperforms zero-shot significantly,
# but adding more examples has diminishing returns after 5-10.
# Quality and diversity of examples matter more than quantity.

The mechanism behind ICL is an active research area. Leading theories:
1. Implicit Fine-tuning Attention layers perform something analogous to gradient descent in their forward pass. Examples in context create "meta-gradients" that temporarily adjust behavior.
2. Task Recognition The model recognizes the task from examples and retrieves relevant pre-trained capabilities. Examples serve as a "task identifier" rather than training data.
3. Induction Heads Specialized attention patterns (induction heads) copy patterns from earlier in context. [A][B]...[A] → [B]. Examples establish [input][output] patterns that are copied.
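The induction-head copying behavior can be mimicked with a toy lookup. A real induction head is a learned attention pattern, not a dictionary scan, but the input-output behavior is the same: find the most recent earlier occurrence of the current token and predict what followed it.

```python
def induction_predict(tokens: list[str]):
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the final token ([A][B]...[A] -> [B])."""
    last = tokens[-1]
    successor = None
    for i in range(len(tokens) - 1):
        if tokens[i] == last:
            successor = tokens[i + 1]   # remember what followed [A]
    return successor

# Few-shot examples establish [input][output] pairs that get copied:
seq = ["English:", "Hello", "French:", "Bonjour", "English:", "Thanks", "French:"]
induction_predict(seq)  # -> "Bonjour" (the token that followed the earlier "French:")
```

The toy copies blindly; actual induction heads operate on learned representations, which is why the copied continuation can adapt to the new input rather than repeating it verbatim.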
Empirical Findings:
| Finding | Implication |
|---|---|
| ICL emerges at scale (~10B+ params) | Smaller models don't reliably do ICL |
| Example format matters more than content | Correct structure > correct labels |
| Label space defined by examples | Novel labels OK if exemplified |
| Performance varies with example selection | Random selection suboptimal |
ICL cannot teach fundamentally new capabilities—it can only activate capabilities the model already has from pre-training. If the model can't do a task zero-shot (even poorly), examples won't help. ICL also consumes context, reducing space for actual content.
Chain-of-thought (CoT) prompting is one of the most important prompting techniques, dramatically improving performance on reasoning, math, and multi-step tasks.
Standard prompting asks for an answer directly:
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each.
How many balls does he have now?
A: 11
Chain-of-thought prompting asks the model to show its work:
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each.
How many balls does he have now?
A: Roger starts with 5 balls. He buys 2 cans with 3 balls each.
So he gets 2 × 3 = 6 new balls.
Total: 5 + 6 = 11 balls.
Why it works: by generating intermediate steps, the model allocates more computation to the problem (more tokens means more forward passes), conditions each step on the steps before it, and exposes its reasoning so errors can be caught before the final answer.
# Standard prompting
standard_prompt = """Q: A farmer has 17 sheep. All but 8 run away. How many are left?
A:"""
# Model often outputs: "9" (wrong! The answer is 8)

# Zero-shot Chain-of-Thought
zero_shot_cot_prompt = """Q: A farmer has 17 sheep. All but 8 run away. How many are left?
A: Let's think step by step."""
# Model: "Let's think step by step. 'All but 8 run away' means that
# 8 sheep remain. So the farmer has 8 sheep left."

# Few-shot Chain-of-Thought
few_shot_cot_prompt = """Q: There are 15 trees in the grove. Grove workers plant trees today. After they are done, there will be 21 trees. How many trees did they plant?
A: There are 15 trees originally. Then there were 21 trees after planting. So they planted 21 - 15 = 6 trees. The answer is 6.

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: There are originally 3 cars. 2 more cars arrive. So there are 3 + 2 = 5 cars. The answer is 5.

Q: A farmer has 17 sheep. All but 8 run away. How many are left?
A:"""
# Model produces correct step-by-step reasoning

# Self-Consistency: sample multiple chains, take majority vote
from collections import Counter

def self_consistent_cot(model, prompt, n_samples=5, temperature=0.7):
    """Generate multiple reasoning chains and vote on the answer."""
    answers = []
    for _ in range(n_samples):
        response = model.generate(prompt, temperature=temperature)
        answer = extract_final_answer(response)  # parse e.g. "The answer is 6."
        answers.append(answer)
    return Counter(answers).most_common(1)[0][0]  # majority vote

Self-Consistency
Generate multiple reasoning chains (with temperature > 0), extract answers, return majority vote:
| Method | Accuracy (GSM8K) |
|---|---|
| Standard prompting | 18% |
| CoT prompting | 56% |
| CoT + Self-Consistency (10 chains) | 74% |
Tree of Thought (ToT)
Explicit search over reasoning paths: generate several candidate thoughts at each step, evaluate them, and expand only the most promising branches, backtracking when a path dead-ends.
Least-to-Most Prompting
Decompose before solving: first ask the model to break the problem into simpler subproblems, then solve them in order, feeding each answer into the next.
Analogical Reasoning
Recall relevant examples before answering:
First, recall a similar problem you know how to solve.
Then, use that approach for the current problem.
CoT provides the largest gains on multi-step reasoning: math word problems, logic puzzles, code debugging. For simple retrieval ("What is the capital of France?") or pattern matching, CoT may not help and can even hurt by introducing errors in the chain.
Effective prompting combines principles from cognitive science, software engineering, and empirical experimentation.
Vague prompts produce vague outputs. Be explicit about what you want:
Weak:
Write about machine learning.
Strong:
Write a 500-word blog post explaining gradient descent to software engineers
who know programming but not ML. Include a code example in Python.
Tone: Technically accurate but accessible.
Defining a role activates relevant knowledge and behavior:
You are a senior software architect at a Fortune 500 company
with 20 years of experience in distributed systems.
Review the following system design and identify potential issues.
Effective roles are specific rather than generic: they name a domain, a level of expertise, and a working context that match the task at hand.
# Pattern 1: Structured Output
structured_output_prompt = """Extract information from the following job posting.
Return your answer as JSON with these fields:
{
  "title": "job title",
  "company": "company name",
  "location": "location or 'Remote'",
  "salary_min": number or null,
  "salary_max": number or null,
  "required_skills": ["skill1", "skill2", ...]
}

Job posting:
###
{job_posting_text}
###

JSON:"""

# Pattern 2: Step-by-Step with Verification
verification_prompt = """Task: Determine if the following code has any bugs.

Approach:
1. First, describe what the code is intended to do
2. Trace through the code with a sample input
3. Identify any potential issues
4. For each issue, explain why it's a problem
5. Finally, state: BUGS FOUND or NO BUGS FOUND

Code:
{code}

Analysis:"""

# Pattern 3: Perspective Prompting
perspective_prompt = """Analyze this business proposal from three perspectives:

1. **Optimist**: What could go well? Best-case scenarios?
2. **Pessimist**: What could go wrong? Risks and weaknesses?
3. **Pragmatist**: Most likely outcome? Key uncertainties?

For each perspective, provide 3-4 specific points.

Proposal:
{proposal_text}

Analysis:"""

# Pattern 4: Self-Critique
self_critique_prompt = """{initial_response}

Now critically review your response above:
1. What assumptions did you make?
2. What might be incorrect or oversimplified?
3. What important considerations did you miss?
4. Provide an improved response incorporating these critiques.

Improved response:"""

Prompting is empirical. Small changes can dramatically affect output. Test systematically: vary one aspect at a time, evaluate on diverse examples, and keep a log of what works. The best prompt is the one that works on your data, not the one that looks most clever.
LLMs have knowledge cutoffs and can hallucinate facts. Retrieval-Augmented Generation (RAG) addresses this by providing relevant documents in the prompt.
User Query → Retriever → Relevant Documents → LLM → Response
                                  ↓
                  [docs injected into prompt]
Components: an embedder that maps documents and queries to vectors, a vector index for similarity search, a retriever that returns the top-k matches, and the LLM that generates an answer from the retrieved context.
from sentence_transformers import SentenceTransformer
import faiss

class SimpleRAG:
    def __init__(self, documents: list[str], llm_client):
        self.documents = documents
        self.llm = llm_client
        # Initialize embedder
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        # Create document embeddings
        self.doc_embeddings = self.embedder.encode(documents)
        # Build FAISS index
        dimension = self.doc_embeddings.shape[1]
        self.index = faiss.IndexFlatL2(dimension)
        self.index.add(self.doc_embeddings.astype('float32'))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        """Retrieve k most relevant documents for query."""
        query_embedding = self.embedder.encode([query])
        distances, indices = self.index.search(
            query_embedding.astype('float32'), k
        )
        return [self.documents[i] for i in indices[0]]

    def generate(self, query: str, k: int = 3) -> str:
        """Generate response using retrieved context."""
        retrieved_docs = self.retrieve(query, k)
        context = "\n".join(
            f"[{i+1}] {doc}" for i, doc in enumerate(retrieved_docs)
        )
        prompt = f"""Answer the question using the provided context.
If the context doesn't contain relevant information, say so.

Context:
{context}

Question: {query}

Answer (cite sources using [1], [2], etc.):"""
        return self.llm.generate(prompt)

# Usage
rag = SimpleRAG(documents=my_document_corpus, llm_client=llm)
answer = rag.generate("What are the symptoms of condition X?")

| Decision | Options | Tradeoffs |
|---|---|---|
| Chunk size | 100-2000 tokens | Smaller = precise, larger = more context |
| Chunk overlap | 0-50% | More overlap = redundancy but fewer boundary issues |
| Number of chunks (k) | 1-20 | More = more context, but dilutes relevance |
| Retrieval method | Dense, sparse, hybrid | Dense = semantic, sparse = keyword, hybrid = both |
| Reranking | None, cross-encoder | Improves relevance but adds latency |
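A minimal sketch of the chunking decision in the first two rows of the table, approximating tokens with whitespace-split words (a real pipeline would count with the model's tokenizer):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size word windows with a fixed overlap."""
    words = text.split()
    step = chunk_size - overlap              # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break                            # last window reached the end
    return chunks

doc = " ".join(f"w{i}" for i in range(500))
chunks = chunk_text(doc, chunk_size=200, overlap=50)
# Windows start at word 0, 150, 300 -> 3 chunks; the last 50 words
# of each chunk repeat as the first 50 of the next.
```

The overlap trades redundancy (the same words are embedded twice) for robustness: a sentence split by a chunk boundary still appears whole in one of the two overlapping windows.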
Query Transformation
Rewrite or expand the query before retrieval, for example by generating paraphrases or a hypothetical answer to embed, so the query better matches how the documents are phrased.
Hierarchical Retrieval
Retrieve coarse-to-fine: first select relevant documents or sections (often via summaries), then retrieve specific chunks within them.
Self-Reflective RAG
Let the model assess its own retrieval: decide whether retrieval is needed, critique the relevance of retrieved passages, and re-retrieve when the context is insufficient.
RAG can fail silently: if retrieved documents are irrelevant, the LLM may incorporate them anyway, producing confident-sounding nonsense. Always include instructions for the model to indicate when context is insufficient, and consider confidence calibration.
Beyond basic prompting, advanced techniques enable complex, multi-step AI applications.
Modern LLMs can be taught to invoke external tools:
tools = [
{
"name": "calculator",
"description": "Perform mathematical calculations",
"parameters": {
"expression": {"type": "string", "description": "Math expression"}
}
},
{
"name": "web_search",
"description": "Search the web for current information",
"parameters": {
"query": {"type": "string", "description": "Search query"}
}
}
]
prompt = """
You have access to the following tools: {tools}
To use a tool, respond with: <tool>tool_name(param=value)</tool>
Question: What is the current stock price of AAPL multiplied by 1.15?
"""
The agent loop: the model generates until it emits a tool call; the runtime parses and executes the call; the result is appended to the context as an observation; generation then resumes, repeating until the model produces a final answer instead of a tool call.
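A minimal sketch of this loop, assuming the `<tool>...</tool>` convention from the prompt above. `llm_generate` and the `tools` registry are placeholders; production systems use structured function-calling APIs rather than regex parsing.

```python
import re

def run_agent(llm_generate, tools: dict, prompt: str, max_steps: int = 5) -> str:
    """Generate -> detect tool call -> execute -> append observation -> repeat."""
    transcript = prompt
    output = ""
    for _ in range(max_steps):
        output = llm_generate(transcript)
        transcript += "\n" + output
        match = re.search(r"<tool>(\w+)\((.*?)\)</tool>", output)
        if match is None:
            return output                          # no tool call: final answer
        name, arg = match.group(1), match.group(2)
        observation = tools[name](arg)             # run the requested tool
        transcript += f"\nObservation: {observation}"
    return output                                  # step budget exhausted

# Toy run with a scripted "model" and a calculator tool
script = iter(["<tool>calculator(2+3)</tool>", "Answer: 5"])
tools = {"calculator": lambda expr: eval(expr)}   # toy only; never eval untrusted input
run_agent(lambda _: next(script), tools, "Q: What is 2+3?")  # -> "Answer: 5"
```

The `max_steps` cap matters: without it, a model that keeps emitting tool calls loops forever.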
Maintaining context across turns:
conversation = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"},
{"role": "assistant", "content": "Machine learning is..."},
{"role": "user", "content": "How does gradient descent work in it?"}
]
# Context window limits require summarization or retrieval
def manage_context(conversation, max_tokens=4000):
"""Keep recent turns, summarize older context."""
if estimate_tokens(conversation) > max_tokens:
# Summarize older turns
summary = llm.summarize(conversation[1:-4]) # Keep system + recent
return [
conversation[0], # System
{"role": "system", "content": f"Previous context: {summary}"},
*conversation[-4:] # Recent turns
]
return conversation
Combines chain-of-thought reasoning with tool use:
Question: What is the population of the hometown of the current president of France?
Thought: I need to find who the current president of France is.
Action: web_search(query="current president of France 2024")
Observation: Emmanuel Macron is the current president of France.
Thought: Now I need to find Macron's hometown.
Action: web_search(query="Emmanuel Macron hometown")
Observation: Emmanuel Macron was born in Amiens, France.
Thought: Now I need to find the population of Amiens.
Action: web_search(query="Amiens France population")
Observation: Amiens has a population of approximately 135,000.
Thought: I now have all the information needed.
Answer: Amiens, the hometown of French President Emmanuel Macron,
has a population of approximately 135,000.
Agentic systems compound LLM unreliability. A 90% accurate model making 10 sequential decisions has only 35% chance of all being correct. Build in verification, fallbacks, and human oversight for high-stakes applications.
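The 35% figure is simply independent per-step reliability compounded across steps:

```python
# Probability that 10 independent 90%-reliable steps all succeed
p_step, n_steps = 0.9, 10
p_all = p_step ** n_steps
print(f"{p_all:.0%}")  # -> 35%
```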
For production systems, prompts should be systematically optimized rather than hand-crafted.
class PromptEvaluator:
    def __init__(self, test_cases: list[dict], metrics: list[callable]):
        self.test_cases = test_cases  # [{input, expected, ...}]
        self.metrics = metrics        # [accuracy_fn, f1_fn, ...]

    def evaluate(self, prompt_template: str, model) -> dict:
        results = []
        for case in self.test_cases:
            prompt = prompt_template.format(**case)
            output = model.generate(prompt)
            scores = {m.__name__: m(output, case['expected'])
                      for m in self.metrics}
            results.append(scores)
        # Aggregate: mean of each metric across all test cases
        return {name: sum(r[name] for r in results) / len(results)
                for name in results[0]}
Evaluation workflow: assemble a representative test set, define metrics (exact match, similarity, or an LLM judge), score each prompt variant on the full set, compare results, and iterate on the failure cases.
| Technique | Description | When to Use |
|---|---|---|
| Manual iteration | Human refinement based on failure analysis | Initial development, understanding failure modes |
| A/B testing | Compare variants on traffic/test set | Production optimization |
| DSPy | Automatic prompt optimization via LLM | Scalable, reproducible optimization |
| OPRO | LLM generates and scores prompt variants | When many iterations affordable |
| Gradient-based | Soft prompt tuning, continuous optimization | When model access available |
DSPy treats prompts as programs that can be automatically optimized:
import dspy

class RAGSignature(dspy.Signature):
    """Answer questions using retrieved context."""
    context = dspy.InputField(desc="retrieved documents")
    question = dspy.InputField(desc="user question")
    answer = dspy.OutputField(desc="answer based on context")

class RAGModule(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(RAGSignature)

    def forward(self, context, question):
        return self.generate(context=context, question=question)

# Compile (optimize) the module
from dspy.teleprompt import BootstrapFewShot

teleprompter = BootstrapFewShot(metric=accuracy_metric)
optimized_rag = teleprompter.compile(RAGModule(), trainset=examples)
# Optimized module has learned effective prompts
Key insight: Rather than manually crafting prompts, define the structure (inputs, outputs, modules) and let optimization find the best prompts.
Treat prompts as code: document them, test them, version them, review changes. A 'quick prompt fix' in production can cause cascading failures. Maintain a test suite for your prompts just as you would for software.
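One way to sketch this discipline: a versioned prompt template plus a tiny regression suite. The template, the test cases, and the `model_generate` callable are illustrative placeholders.

```python
SENTIMENT_PROMPT_V2 = (
    "Classify movie review sentiment as Positive or Negative.\n"
    "Review: {review}\nSentiment:"
)

TEST_CASES = [
    {"review": "I loved every minute of this masterpiece!", "expected": "Positive"},
    {"review": "Boring and predictable. I wanted my money back.", "expected": "Negative"},
]

def run_prompt_suite(model_generate, template: str, cases: list[dict]) -> float:
    """Return the fraction of regression cases the prompt still passes."""
    passed = 0
    for case in cases:
        output = model_generate(template.format(review=case["review"]))
        passed += case["expected"].lower() in output.lower()
    return passed / len(cases)

# Gate prompt changes in CI, e.g.:
# assert run_prompt_suite(llm, SENTIMENT_PROMPT_V2, TEST_CASES) >= 0.95
```

Running such a suite before shipping any prompt edit catches the cascading failures a "quick prompt fix" can introduce.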
Prompting is the primary interface to large language models—the craft of translating human intent into AI behavior. Mastery of prompting techniques unlocks the full potential of these powerful systems.
Module Complete:
You have now completed the Large Language Models module. From transformer scaling through pre-training objectives, instruction tuning, RLHF alignment, and prompting techniques—you understand the complete modern LLM pipeline.
This knowledge enables you to work with every stage of the modern LLM pipeline: scaling decisions, pre-training objectives, instruction tuning, RLHF alignment, and prompting.
The field continues to evolve rapidly, but these fundamentals provide a foundation for understanding new developments as they emerge.