Concept #149 | Medium | extended-ai-concepts

How do you prepare a dataset for training an LLM?

#dataset #fine-tuning #data-preparation #llm #training

Answer

How to Prepare a Dataset for Training an LLM

Dataset quality is usually the most important factor in fine-tuning: a small, clean, consistently formatted dataset often outperforms a much larger, noisy one.


Dataset Types by Training Goal

| Training Type | Dataset Format | Use Case |
| --- | --- | --- |
| Instruction tuning | Instruction + Input + Response | Chat, Q&A, task following |
| Continued pre-training | Raw text | Domain adaptation |
| RLHF (reward model) | Prompt + chosen + rejected | Preference alignment |
| DPO | Prompt + chosen + rejected | Direct preference optimization |
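To make the preference formats concrete, here is a minimal sketch of one prompt + chosen + rejected record together with a basic sanity check. The field names `prompt`, `chosen`, and `rejected` mirror the table; the helper `is_valid_preference` and the example text are hypothetical, and your training library may expect different key names.

```python
import json

def is_valid_preference(record: dict) -> bool:
    """Return True if prompt/chosen/rejected are non-empty strings
    and the two responses actually differ."""
    required = ("prompt", "chosen", "rejected")
    if not all(isinstance(record.get(k), str) and record[k].strip() for k in required):
        return False
    # A pair with identical responses carries no preference signal
    return record["chosen"].strip() != record["rejected"].strip()

# Hypothetical example record (illustrative text only)
preference_example = {
    "prompt": "Explain what a validation set is in one sentence.",
    "chosen": "A validation set is held-out data used to check how well a model generalizes during training.",
    "rejected": "It is data.",
}

print(is_valid_preference(preference_example))
print(json.dumps(preference_example, indent=2))
```

The same record shape serves both reward-model training and DPO; what differs is how the trainer consumes the pair.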

Step 1 — Choose Format (Instruction Tuning)

A widely used instruction-tuning format has three fields: instruction, an optional input, and output (the response):

json
{
  "instruction": "Summarize the following customer complaint.",
  "input": "I ordered a laptop 3 weeks ago and it still hasn't arrived...",
  "output": "Customer reports a delayed laptop order (3+ weeks) and is requesting status update."
}

For chat models, use the conversation format:

json
{
  "messages": [
    {"role": "system",    "content": "You are a helpful Gen AI assistant."},
    {"role": "user",      "content": "What is LoRA?"},
    {"role": "assistant", "content": "LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method..."}
  ]
}
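If your data is already in the instruction + input + output format, it can be converted to the conversation format mechanically. The sketch below is one way to do it; the function name `to_messages` and the choice to join instruction and input with a blank line are assumptions, not a fixed convention.

```python
from typing import Optional

def to_messages(example: dict, system_prompt: Optional[str] = None) -> dict:
    """Convert an instruction/input/output record into the chat 'messages' format."""
    user_content = example["instruction"]
    if example.get("input"):
        # Put the optional input below the instruction, separated by a blank line
        user_content += "\n\n" + example["input"]

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": example["output"]})
    return {"messages": messages}

record = {
    "instruction": "Summarize the following customer complaint.",
    "input": "I ordered a laptop 3 weeks ago and it still hasn't arrived...",
    "output": "Customer reports a delayed laptop order (3+ weeks) and is requesting status update.",
}
chat_record = to_messages(record, system_prompt="You are a helpful Gen AI assistant.")
print([m["role"] for m in chat_record["messages"]])
```

Passing several different system prompts across the dataset (or none for some examples) is one way to avoid tying the model to a single exact system prompt.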

Step 2 — Collect and Clean Data

python
import re
import json

def clean_text(text: str) -> str:
    text = text.strip()
    text = re.sub(r'\s+', ' ', text)          # normalize whitespace
    text = re.sub(r'[^\x00-\x7F]+', '', text) # remove non-ASCII if needed
    return text

def is_valid_example(example: dict) -> bool:
    # Filter out low-quality examples
    instruction = example.get("instruction", "")
    output = example.get("output", "")
    if len(instruction) < 10 or len(output) < 20:
        return False
    if len(output) > 4000:   # character count, not tokens; a rough guard against truncation
        return False
    return True

raw_data = [...]  # your raw list of examples
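cleaned = []
for ex in raw_data:
    # Clean string fields first so the length checks see the normalized text
    ex = {k: clean_text(v) if isinstance(v, str) else v for k, v in ex.items()}
    if is_valid_example(ex):
        cleaned.append(ex)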

print(f"Kept {len(cleaned)}/{len(raw_data)} examples after cleaning")
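Stripping every non-ASCII character also deletes accented letters and all non-English text, and collapsing all whitespace flattens code and lists in the outputs. Where that matters, a gentler variant is to normalize Unicode instead of removing it. The sketch below is one such option; `clean_text_unicode` is a hypothetical name, and the exact rules (NFKC, dropping control and format characters, keeping at most one blank line) are assumptions to adapt to your data.

```python
import re
import unicodedata

def clean_text_unicode(text: str) -> str:
    """Normalize Unicode and whitespace without discarding non-ASCII characters."""
    # Fold compatibility forms (e.g. non-breaking spaces, full-width letters)
    text = unicodedata.normalize("NFKC", text)
    # Drop control (Cc) and format (Cf) characters, but keep newlines and tabs
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()

# Contains an accented letter, a non-breaking space, a zero-width space, extra newlines
sample = "Caf\u00e9\u00a0 menu\u200b:\n\n\n\nespresso"
print(repr(clean_text_unicode(sample)))
```

One trade-off to be aware of: the Cf category includes zero-width joiners, which some scripts and emoji sequences rely on, so restrict the filter to Cc if your data contains them.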

Step 3 — Deduplicate

python
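def deduplicate(data: list[dict], keys: tuple[str, ...] = ("instruction", "input")) -> list[dict]:
    # Exact-match dedup on the normalized instruction + input pair, so the same
    # instruction applied to different inputs is not collapsed into one example.
    seen = set()
    unique = []
    for ex in data:
        sig = tuple(ex.get(k, "").lower().strip() for k in keys)
        if sig not in seen:
            seen.add(sig)
            unique.append(ex)
    return unique

cleaned = deduplicate(cleaned)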
print(f"After dedup: {len(cleaned)} examples")
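Exact matching misses examples that differ only in punctuation or a few words. A minimal sketch of near-duplicate filtering with Jaccard similarity over word 3-grams follows; the function names, the 0.8 threshold, and the choice to compare instruction plus input are all assumptions. The comparison is quadratic, so for large datasets a MinHash-based approach would be the usual substitute.

```python
import re

def shingles(text: str, n: int = 3) -> set:
    """Return the set of word n-grams of a lowercased, punctuation-stripped text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def drop_near_duplicates(data, fields=("instruction", "input"), threshold=0.8):
    """Keep an example only if it is below the similarity threshold
    against every example kept so far."""
    kept, kept_shingles = [], []
    for ex in data:
        text = " ".join(ex.get(f, "") for f in fields)
        s = shingles(text)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(ex)
            kept_shingles.append(s)
    return kept

examples = [
    {"instruction": "Summarize the following customer complaint in one sentence."},
    {"instruction": "Summarize the following customer complaint in one sentence!"},
    {"instruction": "Translate the following sentence into French."},
]
print(len(drop_near_duplicates(examples)))
```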

Step 4 — Format and Tokenize Check

Make sure examples fit within your model's context window:

python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
MAX_LENGTH = 2048

def format_prompt(ex: dict) -> str:
    return (
        f"### Instruction:\n{ex['instruction']}\n\n"
        f"### Input:\n{ex.get('input', '')}\n\n"
        f"### Response:\n{ex['output']}"
    )

valid = []
too_long = 0
for ex in cleaned:
    text = format_prompt(ex)
    # Count token IDs directly; with fast tokenizers, return_length=True yields a list, not an int
    tokens = len(tokenizer(text)["input_ids"])
    if tokens <= MAX_LENGTH:
        valid.append({"text": text})
    else:
        too_long += 1

print(f"Valid: {len(valid)}, Too long (skipped): {too_long}")
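Before fixing a maximum length, it helps to look at the length distribution so the cutoff drops only the long tail. The sketch below summarizes lengths for any token-counting function; `length_report` is a hypothetical helper, and the whitespace split in the example is only a stand-in. With the real tokenizer you would pass something like `lambda t: len(tokenizer(t)["input_ids"])`.

```python
import math
import statistics
from typing import Callable

def length_report(texts: list, count_tokens: Callable[[str], int], max_length: int) -> dict:
    """Summarize token lengths and count how many texts exceed max_length."""
    if not texts:
        raise ValueError("texts must not be empty")
    lengths = sorted(count_tokens(t) for t in texts)
    # Nearest-rank 95th percentile
    p95_index = max(0, math.ceil(0.95 * len(lengths)) - 1)
    return {
        "count": len(lengths),
        "median": statistics.median(lengths),
        "p95": lengths[p95_index],
        "max": lengths[-1],
        "over_limit": sum(1 for n in lengths if n > max_length),
    }

texts = ["one two three", "one two", "one two three four five six"]
# Stand-in token counter (whitespace split), used here only for illustration
report = length_report(texts, count_tokens=lambda t: len(t.split()), max_length=5)
print(report)
```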

Step 5 — Train / Validation Split

python
import random

random.seed(42)
random.shuffle(valid)

split = int(len(valid) * 0.9)
train_data = valid[:split]
val_data   = valid[split:]

print(f"Train: {len(train_data)}, Val: {len(val_data)}")

# Save as JSONL (most common format for LLM training)
with open("train.jsonl", "w") as f:
    for ex in train_data:
        f.write(json.dumps(ex) + "\n")

with open("val.jsonl", "w") as f:
    for ex in val_data:
        f.write(json.dumps(ex) + "\n")

Step 6 — Load with HuggingFace Datasets

python
from datasets import load_dataset

dataset = load_dataset("json", data_files={
    "train": "train.jsonl",
    "validation": "val.jsonl"
})

print(dataset)
# Example output (row counts depend on your data; shown here for 1,000 examples):
# DatasetDict({
#     train: Dataset({features: ['text'], num_rows: 900})
#     validation: Dataset({features: ['text'], num_rows: 100})
# })

Dataset Size Guidelines

| Task Complexity | Min Examples | Ideal Examples |
| --- | --- | --- |
| Simple classification | 500 | 2,000+ |
| Instruction following | 1,000 | 10,000+ |
| Domain adaptation | 5,000 | 50,000+ |
| Full pre-training | 1B tokens | 1T+ tokens |

Common Mistakes to Avoid

  • Duplicate examples: inflate validation metrics without improving generalization
  • Inconsistent formatting: the model learns format variance instead of the task
  • Train/validation leakage: validation examples also appear in the training set
  • Overly short outputs: the model learns to be terse even when detail is needed
  • No system prompt variety: the model may only work well with one exact system prompt
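The leakage item in this list is easy to check mechanically once the split exists. A minimal sketch, assuming each row has a `text` field as in the steps above; `find_leaks` and the sample rows are hypothetical, and the check only catches matches that are exact after lowercasing and whitespace normalization.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse all whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def find_leaks(train: list, val: list, key: str = "text") -> list:
    """Return the validation rows whose normalized text also occurs in train."""
    train_sigs = {normalize(ex[key]) for ex in train}
    return [ex for ex in val if normalize(ex[key]) in train_sigs]

# Hypothetical rows for illustration
train_rows = [
    {"text": "### Instruction:\nWhat is LoRA?"},
    {"text": "### Instruction:\nDefine RLHF."},
]
val_rows = [
    {"text": "### instruction:\nwhat is  lora?"},
    {"text": "### Instruction:\nWhat is DPO?"},
]

leaks = find_leaks(train_rows, val_rows)
print(f"{len(leaks)} of {len(val_rows)} validation examples also appear in train")
```

Deduplicating before the split removes most of these cases; the check is still cheap insurance when the two files come from different sources.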

Rule of thumb: 1,000 high-quality, diverse examples often beat 100,000 scraped, noisy ones. Invest time in curation, not just collection.