What's few-shot vs. zero-shot prompting?
Answer
Few-Shot vs Zero-Shot Prompting
Prompting strategy is one of the first decisions you make when deploying an LLM. Choosing between zero-shot and few-shot affects accuracy, cost, and latency.
Zero-Shot Prompting
Zero-shot prompting gives the model only the task description, with no examples.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

zero_shot_prompt = '''Classify the sentiment of the following customer review as Positive, Negative, or Neutral.

Review: "The product arrived on time but the packaging was damaged."
Sentiment:'''

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a sentiment classifier. Respond with only: Positive, Negative, or Neutral."},
        {"role": "user", "content": zero_shot_prompt},
    ],
    temperature=0,
)

print(response.choices[0].message.content)  # e.g. "Neutral"
```

When to use: Task is well-understood by the model, output format is simple, tokens are limited.
Few-Shot Prompting
Few-shot prompting provides a handful of input/output examples (typically 2–10) before the actual query.

```python
few_shot_prompt = '''Classify the sentiment of customer reviews as Positive, Negative, or Neutral.

Review: "Absolutely love this! Works perfectly."
Sentiment: Positive

Review: "Terrible quality. Broke after one use."
Sentiment: Negative

Review: "It's okay, nothing special."
Sentiment: Neutral

Review: "The product arrived on time but the packaging was damaged."
Sentiment:'''

# Reuses the `client` created in the zero-shot example above.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": few_shot_prompt}],
    temperature=0,
)

print(response.choices[0].message.content)
```

When to use: Task has a specific format the model might miss, domain-specific outputs, edge cases to cover.
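
Hand-written few-shot prompts get hard to maintain once the example set changes often. A minimal sketch of assembling the same prompt layout from a list of (review, label) pairs follows; the helper name `build_few_shot_prompt` and its parameters are illustrative assumptions, not part of any library.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(
    instruction: str,
    examples: Sequence[Tuple[str, str]],
    query: str,
) -> str:
    """Format (review, label) pairs, then the unlabeled query."""
    blocks = [instruction]
    for text, label in examples:
        blocks.append(f'Review: "{text}"\nSentiment: {label}')
    # The final block leaves the label empty for the model to complete.
    blocks.append(f'Review: "{query}"\nSentiment:')
    return "\n\n".join(blocks)


examples = [
    ("Absolutely love this! Works perfectly.", "Positive"),
    ("Terrible quality. Broke after one use.", "Negative"),
    ("It's okay, nothing special.", "Neutral"),
]

prompt = build_few_shot_prompt(
    "Classify the sentiment of customer reviews as Positive, Negative, or Neutral.",
    examples,
    "The product arrived on time but the packaging was damaged.",
)
print(prompt)
```

Keeping the examples as data rather than as a string literal makes it easy to swap, reorder, or retrieve them per query.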
Comparison
| Factor | Zero-Shot | Few-Shot |
|---|---|---|
| Tokens used | Low | Higher (examples add tokens) |
| Cost | Lower | Higher |
| Latency | Lower | Higher |
| Accuracy | Good for general tasks | Better for specific formats/domains |
| Setup effort | None | Needs curated examples |
| Example dependency | None | Poor examples hurt accuracy |
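
To make the token row concrete, here is a rough sketch that estimates the extra input tokens few-shot examples add to each request. It assumes roughly 4 characters per token for English text, which is only a rule of thumb; real counts depend on the model's tokenizer. The function name `rough_token_count` is hypothetical.

```python
def rough_token_count(text: str, chars_per_token: float = 4.0) -> int:
    """Rough estimate only; real counts depend on the model's tokenizer."""
    return max(1, round(len(text) / chars_per_token))


instruction = (
    "Classify the sentiment of customer reviews as Positive, Negative, or Neutral."
)
query = (
    'Review: "The product arrived on time but the packaging was damaged."\n'
    "Sentiment:"
)
shots = [
    'Review: "Absolutely love this! Works perfectly."\nSentiment: Positive',
    'Review: "Terrible quality. Broke after one use."\nSentiment: Negative',
    'Review: "It\'s okay, nothing special."\nSentiment: Neutral',
]

zero_shot = "\n\n".join([instruction, query])
few_shot = "\n\n".join([instruction, *shots, query])

zero_tokens = rough_token_count(zero_shot)
few_tokens = rough_token_count(few_shot)
overhead = few_tokens - zero_tokens  # extra input tokens paid on every request

print(f"zero-shot ~{zero_tokens} tokens, few-shot ~{few_tokens} tokens")
print(f"overhead ~{overhead} tokens per request")
```

The overhead is paid on every call, so it scales linearly with request volume.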
Few-Shot Example Selection Tips
The quality of few-shot examples matters enormously:

```python
# ❌ Bad: all examples from the same class
examples = [
    ("Great product!", "Positive"),
    ("Amazing quality!", "Positive"),
    ("Loved it!", "Positive"),  # Model over-predicts Positive
]

# ✅ Good: balanced, covering edge cases
examples = [
    ("Great product!", "Positive"),                       # Clear positive
    ("Terrible quality, broke immediately.", "Negative"),  # Clear negative
    ("It's okay, nothing special.", "Neutral"),            # Neutral
    ("Late delivery but product is good.", "Neutral"),     # Ambiguous case
]
```

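
A balance check like this can be automated before an example set ships. The sketch below is one possible heuristic, not an established rule: the function name `check_example_balance` and the 50% `max_share` threshold are assumptions chosen for illustration.

```python
from collections import Counter

LABELS = ("Positive", "Negative", "Neutral")


def check_example_balance(examples, labels=LABELS, max_share=0.5):
    """Return a list of human-readable problems; empty means no issue found."""
    counts = Counter(label for _, label in examples)
    total = len(examples)
    problems = []
    # Every class should appear at least once.
    for label in labels:
        if counts[label] == 0:
            problems.append(f"no example for {label}")
    # No single class should dominate the set.
    for label, n in counts.items():
        if n / total > max_share:
            problems.append(f"{label} is {n}/{total} of the examples")
    return problems


skewed = [
    ("Great product!", "Positive"),
    ("Amazing quality!", "Positive"),
    ("Loved it!", "Positive"),
]
balanced = [
    ("Great product!", "Positive"),
    ("Terrible quality, broke immediately.", "Negative"),
    ("It's okay, nothing special.", "Neutral"),
    ("Late delivery but product is good.", "Neutral"),
]

print(check_example_balance(skewed))
print(check_example_balance(balanced))
```

The check only looks at label counts; it cannot tell whether the examples cover the edge cases that matter for a given domain.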
Dynamic Few-Shot Selection
In production, retrieve the most relevant examples for each query using embeddings:
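
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Illustrative library of (text, label) pairs; replace with your own labeled data.
example_library = [
    ("Great product!", "Positive"),
    ("Terrible quality, broke immediately.", "Negative"),
    ("It's okay, nothing special.", "Neutral"),
    ("Late delivery but product is good.", "Neutral"),
]

# Pre-embed your example library once, not per query
example_texts = [e[0] for e in example_library]
example_embeddings = model.encode(example_texts)


def get_relevant_examples(query: str, k: int = 3) -> list:
    """Return the k library examples most similar to the query."""
    query_embedding = model.encode([query])
    similarities = cosine_similarity(query_embedding, example_embeddings)[0]
    # argsort is ascending, so reverse it to get the highest similarities first
    top_k_indices = np.argsort(similarities)[::-1][:k]
    return [example_library[i] for i in top_k_indices]
```
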
Production tip: Start zero-shot. If accuracy is insufficient, add few-shot examples. If it is still insufficient, consider fine-tuning. Always measure the accuracy delta on a labeled evaluation set; sometimes zero-shot with a better system prompt beats few-shot.
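
Measuring the accuracy delta can be as simple as running both strategies over the same labeled set. In the sketch below, `zero_shot_classify` and `few_shot_classify` are hypothetical keyword-rule stand-ins so the code runs offline; in practice each would wrap a model call, and the numbers printed here say nothing about real model behavior.

```python
from typing import Callable, Sequence, Tuple

Classifier = Callable[[str], str]


def accuracy(classify: Classifier, labeled: Sequence[Tuple[str, str]]) -> float:
    """Fraction of (text, label) pairs the classifier gets exactly right."""
    if not labeled:
        raise ValueError("evaluation set is empty")
    correct = sum(classify(text).strip() == label for text, label in labeled)
    return correct / len(labeled)


# Hypothetical stand-ins for the two prompting strategies (toy rules, no model).
def zero_shot_classify(text: str) -> str:
    return "Positive" if "love" in text.lower() else "Negative"


def few_shot_classify(text: str) -> str:
    lowered = text.lower()
    if "love" in lowered:
        return "Positive"
    if "okay" in lowered:
        return "Neutral"
    return "Negative"


eval_set = [
    ("Absolutely love this!", "Positive"),
    ("Terrible quality.", "Negative"),
    ("It's okay, nothing special.", "Neutral"),
    ("Broke after one use.", "Negative"),
]

zero_acc = accuracy(zero_shot_classify, eval_set)
few_acc = accuracy(few_shot_classify, eval_set)
delta = few_acc - zero_acc

print(f"zero-shot: {zero_acc:.2f}, few-shot: {few_acc:.2f}, delta: {delta:+.2f}")
```

Using the same evaluation set for every variant keeps the comparison fair, and the set should not overlap with the few-shot examples themselves.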