What's few-shot vs. zero-shot prompting?

Question

Accepted Answer

## Few-Shot vs Zero-Shot Prompting

**Prompting strategy** is one of the first decisions you make when deploying an LLM. Choosing between zero-shot and few-shot affects accuracy, cost, and latency.

### Zero-Shot Prompting

Gives the model **only the task description**, with no examples.

```python
from openai import OpenAI
client = OpenAI()

zero_shot_prompt = '''Classify the sentiment of the following customer review as Positive, Negative, or Neutral.

Review: "The product arrived on time but the packaging was damaged."
Sentiment:'''

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a sentiment classifier. Respond with only: Positive, Negative, or Neutral."},
        {"role": "user", "content": zero_shot_prompt}
    ],
    temperature=0
)
print(response.choices[0].message.content)  # e.g. "Neutral" (actual output may vary)
```

**When to use:** The model already understands the task well, the output format is simple, or the token budget is tight.

### Few-Shot Prompting

Provides **2–10 input/output examples** before the actual query.

```python
few_shot_prompt = '''Classify the sentiment of customer reviews as Positive, Negative, or Neutral.

Review: "Absolutely love this! Works perfectly."
Sentiment: Positive

Review: "Terrible quality. Broke after one use."
Sentiment: Negative

Review: "It's okay, nothing special."
Sentiment: Neutral

Review: "The product arrived on time but the packaging was damaged."
Sentiment:'''

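# Reuses the `client` created in the zero-shot example above
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": few_shot_prompt}],
    temperature=0
)
print(response.choices[0].message.content)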
```

**When to use:** The task needs a specific output format the model might miss, the outputs are domain-specific, or there are edge cases the examples can cover.
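
With chat models, few-shot examples can also be passed as alternating user/assistant turns instead of one long string. The sketch below only builds the `messages` list; the helper name `build_few_shot_messages` is made up for illustration and is not part of any SDK.

```python
from typing import Dict, List, Tuple


def build_few_shot_messages(
    instruction: str,
    examples: List[Tuple[str, str]],
    query: str,
) -> List[Dict[str, str]]:
    """Turn (input, label) pairs into alternating user/assistant turns."""
    messages = [{"role": "system", "content": instruction}]
    for text, label in examples:
        messages.append({"role": "user", "content": f'Review: "{text}"'})
        messages.append({"role": "assistant", "content": label})
    # The real query goes last so the model continues the established pattern.
    messages.append({"role": "user", "content": f'Review: "{query}"'})
    return messages


messages = build_few_shot_messages(
    "Classify the sentiment of customer reviews as Positive, Negative, or Neutral.",
    [
        ("Absolutely love this! Works perfectly.", "Positive"),
        ("Terrible quality. Broke after one use.", "Negative"),
        ("It's okay, nothing special.", "Neutral"),
    ],
    "The product arrived on time but the packaging was damaged.",
)
print(len(messages))  # 1 system + 3 example pairs (6 turns) + 1 query = 8
```

The resulting list can be passed as the `messages` argument of the same `client.chat.completions.create(...)` call used above. Whether this beats a single-string prompt depends on the model and task, so it is worth testing both.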

### Comparison

| Factor | Zero-Shot | Few-Shot |
|--------|-----------|----------|
| **Tokens used** | Low | Higher (examples add tokens) |
| **Cost** | Lower | Higher |
| **Latency** | Lower | Higher |
| **Accuracy** | Good for general tasks | Better for specific formats/domains |
| **Setup effort** | None | Needs curated examples |
| **Example dependency** | None | Poor examples hurt accuracy |
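
To get a feel for the token overhead in the table, here is a back-of-the-envelope sketch. It assumes roughly 4 characters per token, which is only a rule of thumb for English text; use a real tokenizer for your model when you need exact counts. The function name `estimate_tokens` is hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate. Assumption: ~4 characters per token."""
    return max(1, round(len(text) / chars_per_token))


instruction = (
    "Classify the sentiment of customer reviews as Positive, Negative, or Neutral.\n\n"
)
query = 'Review: "The product arrived on time but the packaging was damaged."\nSentiment:'
examples = (
    'Review: "Absolutely love this! Works perfectly."\nSentiment: Positive\n\n'
    'Review: "Terrible quality. Broke after one use."\nSentiment: Negative\n\n'
    'Review: "It\'s okay, nothing special."\nSentiment: Neutral\n\n'
)

zero_shot_tokens = estimate_tokens(instruction + query)
few_shot_tokens = estimate_tokens(instruction + examples + query)

# The difference is paid on every request, so it scales with traffic.
overhead = few_shot_tokens - zero_shot_tokens
print(f"zero-shot ~{zero_shot_tokens}, few-shot ~{few_shot_tokens}, overhead ~{overhead}")
```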

### Few-Shot Example Selection Tips

The quality of few-shot examples matters enormously:

```python
# ❌ Bad: all examples from the same class
examples = [
    ("Great product!", "Positive"),
    ("Amazing quality!", "Positive"),
    ("Loved it!", "Positive"),
    # Risk: the model may over-predict Positive
]

# ✅ Good: balanced, covering edge cases
examples = [
    ("Great product!", "Positive"),           # Clear positive
    ("Terrible quality, broke immediately.", "Negative"),  # Clear negative
    ("It's okay, nothing special.", "Neutral"), # Neutral
    ("Late delivery but product is good.", "Neutral"),  # Ambiguous case
]
```
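
A quick sanity check can catch the unbalanced case before the examples reach a prompt. This is a minimal sketch; `missing_labels` is a hypothetical helper name, and it only checks label coverage, not example quality.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def missing_labels(
    examples: Sequence[Tuple[str, str]],
    expected_labels: Sequence[str],
) -> List[str]:
    """Return the expected labels that have no example at all."""
    counts = Counter(label for _, label in examples)
    return [label for label in expected_labels if counts[label] == 0]


labels = ["Positive", "Negative", "Neutral"]

bad_examples = [
    ("Great product!", "Positive"),
    ("Amazing quality!", "Positive"),
    ("Loved it!", "Positive"),
]
good_examples = [
    ("Great product!", "Positive"),
    ("Terrible quality, broke immediately.", "Negative"),
    ("It's okay, nothing special.", "Neutral"),
    ("Late delivery but product is good.", "Neutral"),
]

print(missing_labels(bad_examples, labels))   # ['Negative', 'Neutral']
print(missing_labels(good_examples, labels))  # []
```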

### Dynamic Few-Shot Selection

In production, retrieve the most relevant examples for each query using embeddings:

```python
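from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Library of labeled (text, label) pairs to draw examples from.
# Placeholder entries; replace with your own curated examples.
example_library = [
    ("Great product!", "Positive"),
    ("Terrible quality, broke immediately.", "Negative"),
    ("It's okay, nothing special.", "Neutral"),
    ("Late delivery but product is good.", "Neutral"),
]

# Pre-embed the example library once, not on every query
example_texts = [text for text, _ in example_library]
example_embeddings = model.encode(example_texts)

def get_relevant_examples(query: str, k: int = 3) -> list:
    """Return the k library examples most similar to the query."""
    query_embedding = model.encode([query])
    # cosine_similarity returns shape (1, n_examples); take the single row
    similarities = cosine_similarity(query_embedding, example_embeddings)[0]
    # argsort is ascending, so reverse it and keep the first k indices
    top_k_indices = np.argsort(similarities)[::-1][:k]
    return [example_library[i] for i in top_k_indices]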
```
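
Once examples are retrieved, they still need to be assembled into a prompt. The sketch below shows one way to wire that up. `build_dynamic_prompt` and `fake_retrieve` are hypothetical names; the stub retriever stands in for an embedding-based function such as `get_relevant_examples` so the snippet runs without downloading a model.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]


def build_dynamic_prompt(
    query: str,
    retrieve: Callable[[str, int], List[Example]],
    k: int = 3,
) -> str:
    """Fetch k examples for the query and format them as a few-shot prompt."""
    lines = [
        "Classify the sentiment of customer reviews as Positive, Negative, or Neutral.",
        "",
    ]
    for text, label in retrieve(query, k):
        lines.append(f'Review: "{text}"')
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The query goes last, with the label left blank for the model to fill in.
    lines.append(f'Review: "{query}"')
    lines.append("Sentiment:")
    return "\n".join(lines)


def fake_retrieve(query: str, k: int) -> List[Example]:
    """Stub retriever (assumption): returns the first k items, ignoring the query."""
    library = [
        ("Late delivery but product is good.", "Neutral"),
        ("Terrible quality, broke immediately.", "Negative"),
        ("Great product!", "Positive"),
    ]
    return library[:k]


prompt = build_dynamic_prompt("The packaging was damaged.", fake_retrieve, k=2)
print(prompt)
```

In real use you would pass the embedding-based retriever in place of `fake_retrieve`; any callable with a `(query, k)` signature works.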

> **Production tip:** Start zero-shot. If accuracy is insufficient, add few-shot examples. If it is still insufficient, consider fine-tuning. Always measure the accuracy delta on a held-out labeled set: sometimes zero-shot with a better system prompt beats few-shot.
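
Measuring the accuracy delta can be as simple as running both prompt variants over the same labeled set. The sketch below is only an illustration: the two classifier functions are hard-coded stand-ins, not real model calls, and the tiny `eval_set` is made up, so the numbers say nothing about actual model behavior.

```python
from typing import Callable, List, Tuple


def accuracy(classify: Callable[[str], str], labeled: List[Tuple[str, str]]) -> float:
    """Fraction of examples where the predicted label matches the gold label."""
    correct = sum(1 for text, gold in labeled if classify(text).strip() == gold)
    return correct / len(labeled)


# Stand-ins (assumption): in practice each would wrap an LLM call
# using the zero-shot or few-shot prompt respectively.
def zero_shot_stub(text: str) -> str:
    return "Neutral"


def few_shot_stub(text: str) -> str:
    return "Negative" if "broke" in text.lower() else "Neutral"


eval_set = [
    ("Broke after one use.", "Negative"),
    ("It's okay.", "Neutral"),
    ("Arrived on time, box was dented.", "Neutral"),
    ("Love it!", "Positive"),
]

zero_acc = accuracy(zero_shot_stub, eval_set)
few_acc = accuracy(few_shot_stub, eval_set)
delta = few_acc - zero_acc
print(f"zero-shot={zero_acc:.2f} few-shot={few_acc:.2f} delta={delta:+.2f}")
```

Keep the evaluation set separate from the few-shot examples themselves; otherwise the few-shot score is inflated by examples the model has already been shown.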
