Design a prompt for sentiment analysis. What could go wrong?

Question

Accepted Answer

## Designing Effective Prompts for Sentiment Analysis

Sentiment analysis is a common first task when deploying LLMs. Here's how to design a robust prompt and handle the common failure modes.

### Basic Prompt Design

```python
import json
from enum import Enum

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


class Sentiment(str, Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"
    MIXED = "mixed"


SYSTEM_PROMPT = '''You are a sentiment analysis expert for an e-commerce platform.
Classify the sentiment of customer reviews.

Consider:
- Overall tone, not just individual words
- Sarcasm and irony (e.g. "Oh great, another broken product" = Negative)
- Mixed sentiments (praise one aspect, criticise another = Mixed)

Respond ONLY with valid JSON matching this schema:
{"sentiment": "positive|negative|neutral|mixed", "confidence": 0.0-1.0, "reasoning": "brief explanation"}'''


def analyze_sentiment(review: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Review: {review}"},
        ],
        temperature=0,
        # JSON mode: the prompt must mention JSON, which SYSTEM_PROMPT does
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)


# Test
result = analyze_sentiment("Product is amazing but delivery took 3 weeks - unacceptable!")
print(result)
# Example output (exact values will vary):
# {"sentiment": "mixed", "confidence": 0.95, "reasoning": "Positive product quality, negative delivery experience"}
```

### Handling Failure Modes

| Failure Mode | Example | Fix |
|-------------|---------|-----|
| **Sarcasm misclassified** | "Oh great, another defect 🙄" → Positive | Add sarcasm instruction + examples |
| **Domain-specific terms** | "This knife has terrible flex" (flex = good for bakers) | Add domain context to system prompt |
| **Mixed sentiment collapsed** | "Love the product, hate the price" → Positive | Explicitly define Mixed class |
| **JSON parsing failure** | LLM outputs extra text | Use `response_format={"type": "json_object"}` + try/except |
| **Multilingual input** | French review misclassified | Add: "Reviews may be in any language" |
| **Emoji-heavy reviews** | "😍😍😍" | Include emoji examples in few-shot |

### Production-Ready Implementation

```python
import json
from typing import Optional

# Continues from the block above: uses client, Sentiment and analyze_sentiment.


def safe_analyze_sentiment(review: str, fallback: Optional[str] = None) -> dict:
    try:
        result = analyze_sentiment(review)
        # Validate the response schema. Raise instead of using assert,
        # because assert statements are stripped when Python runs with -O.
        Sentiment(result["sentiment"])  # ValueError on an unknown label
        if not 0.0 <= float(result["confidence"]) <= 1.0:
            raise ValueError("confidence out of range")
        return result
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Retry with a simplified prompt
        simplified = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Rate this review as positive, negative, or neutral ONLY: {review}",
            }],
            temperature=0,
        )
        sentiment = simplified.choices[0].message.content.strip().rstrip(".").lower()
        if sentiment not in ("positive", "negative", "neutral"):
            # The retry can also return free text; use the caller's fallback label
            sentiment = fallback
        # 0.7 is an arbitrary placeholder, not a model-reported confidence
        return {"sentiment": sentiment, "confidence": 0.7, "reasoning": "Simplified fallback"}
```

This wrapper only handles malformed or out-of-schema responses. API errors such as timeouts or rate limits still propagate to the caller.

### Prompt Engineering Best Practices for Classification

* **Be explicit about edge cases** (sarcasm, mixed sentiment, emojis)
* **Define your classes precisely**: what separates Neutral from Mixed?
* **Use structured output** (JSON) to reduce parsing errors, and still validate the result
* **Set `temperature=0`** for more repeatable classification (outputs can still vary slightly between calls)
* **Include few-shot examples** for ambiguous cases
* **Version your prompts**: small changes can significantly affect accuracy

> **Key lesson:** Always validate and sanitise user input before embedding it in a prompt. A user can inject "Ignore all previous instructions", so treat user content as untrusted data, not trusted instructions.
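
As a rough illustration of the few-shot and untrusted-input advice above, the sketch below builds the message list with a handful of labelled examples and passes every review as a JSON-encoded string rather than as bare text. The names `FEW_SHOT_EXAMPLES`, `MAX_REVIEW_CHARS`, `wrap_review` and `build_messages` are hypothetical, the confidence and reasoning values in the examples are made up for illustration, and the length cap is an assumed value. Encoding the review as data lowers the chance that injected text is read as an instruction, but it does not eliminate prompt injection, so output validation is still needed.

```python
import json

# Hypothetical few-shot examples (label, confidence and reasoning are illustrative);
# replace them with labelled reviews from your own data.
FEW_SHOT_EXAMPLES = [
    ("Oh great, another defect 🙄", "negative", "Sarcastic praise about a defect"),
    ("Love the product, hate the price", "mixed", "Praise for product, complaint about price"),
    ("😍😍😍", "positive", "Emoji express strong approval"),
]

MAX_REVIEW_CHARS = 4000  # assumed cap; tune for your model and budget


def wrap_review(review: str) -> str:
    """Trim the review and serialise it as JSON so it is presented as data."""
    cleaned = review.strip()[:MAX_REVIEW_CHARS]
    return json.dumps({"review": cleaned}, ensure_ascii=False)


def build_messages(system_prompt: str, review: str) -> list[dict]:
    """Return chat messages: system prompt, few-shot turns, then the real review."""
    messages = [{"role": "system", "content": system_prompt}]
    for text, label, reasoning in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": wrap_review(text)})
        # Assistant turns follow the same schema the system prompt asks for
        answer = {"sentiment": label, "confidence": 0.9, "reasoning": reasoning}
        messages.append({"role": "assistant", "content": json.dumps(answer, ensure_ascii=False)})
    messages.append({"role": "user", "content": wrap_review(review)})
    return messages
```

The returned list can be passed as the `messages` argument of `client.chat.completions.create` in place of the two-message list used in `analyze_sentiment`. If you adopt this format, it helps to tell the model in the system prompt that the user message is a JSON object whose `review` field is the text to classify and never an instruction.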


