How to prepare a dataset for training an LLM?
#dataset#fine-tuning#data-preparation#llm#training
Answer
How to Prepare a Dataset for Training an LLM
Dataset quality is usually the most important factor in fine-tuning. A small, clean, well-formatted dataset often outperforms a large, noisy one.
Dataset Types by Training Goal
| Training Type | Dataset Format | Use Case |
|---|---|---|
| Instruction tuning | Instruction + Input + Response | Chat, Q&A, task following |
| Continued pre-training | Raw text | Domain adaptation |
| RLHF (reward model) | Prompt + chosen + rejected | Preference alignment |
| DPO | Prompt + chosen + rejected | Direct preference optimization |
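The table lists the preference format used for reward models and DPO, but no record of that shape appears elsewhere in this answer. Here is a minimal sketch of one such record plus a basic sanity check. The field names `prompt`, `chosen`, and `rejected` mirror the table; the helper `is_valid_preference` and its rules are illustrative assumptions, not a requirement of any particular trainer.

```python
import json

# Hypothetical preference record; field names follow the table above.
preference_example = {
    "prompt": "What is LoRA?",
    "chosen": "LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method.",
    "rejected": "I don't know.",
}

def is_valid_preference(ex: dict) -> bool:
    """Return True if all three fields are non-empty strings and the
    chosen and rejected responses differ."""
    fields = ("prompt", "chosen", "rejected")
    if not all(isinstance(ex.get(f), str) and ex[f].strip() for f in fields):
        return False
    # A pair with identical responses carries no preference signal.
    return ex["chosen"].strip() != ex["rejected"].strip()

if is_valid_preference(preference_example):
    print(json.dumps(preference_example))
```

Check the exact column names your training library expects before writing the file, since they can differ from the ones assumed here.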
Step 1 — Choose Format (Instruction Tuning)
The most common fine-tuning format is instruction + input + output:
```json
{
  "instruction": "Summarize the following customer complaint.",
  "input": "I ordered a laptop 3 weeks ago and it still hasn't arrived...",
  "output": "Customer reports a delayed laptop order (3+ weeks) and is requesting status update."
}
```
For chat models, use the conversation format:
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful Gen AI assistant."},
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method..."}
  ]
}
```
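If your data is already in the instruction + input + output format, it can be converted to the conversation format mechanically. The sketch below is one possible mapping; the function name `to_messages` and the choice to join instruction and input with a blank line are assumptions for illustration.

```python
from typing import Optional

def to_messages(ex: dict, system_prompt: Optional[str] = None) -> dict:
    """Convert an instruction/input/output record to the chat "messages" format."""
    user_content = ex["instruction"]
    if ex.get("input"):
        # Assumption: the input is appended to the instruction after a blank line.
        user_content += "\n\n" + ex["input"]
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": ex["output"]})
    return {"messages": messages}

record = {
    "instruction": "Summarize the following customer complaint.",
    "input": "I ordered a laptop 3 weeks ago and it still hasn't arrived...",
    "output": "Customer reports a delayed laptop order (3+ weeks).",
}
chat = to_messages(record, system_prompt="You are a helpful assistant.")
print([m["role"] for m in chat["messages"]])
```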
Step 2 — Collect and Clean Data
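```python
import re
import json  # used later when saving JSONL

def clean_text(text: str, ascii_only: bool = False) -> str:
    """Trim and normalize whitespace; optionally drop non-ASCII characters."""
    text = text.strip()
    text = re.sub(r"\s+", " ", text)  # normalize whitespace (also collapses newlines)
    if ascii_only:
        text = re.sub(r"[^\x00-\x7F]+", "", text)  # remove non-ASCII if needed
    return text

def is_valid_example(example: dict) -> bool:
    """Filter out low-quality examples."""
    instruction = example.get("instruction", "")
    output = example.get("output", "")
    if len(instruction) < 10 or len(output) < 20:
        return False
    if len(output) > 4000:  # avoid truncation issues
        return False
    return True

# Placeholder: replace with your raw list of examples
raw_data = [
    {
        "instruction": "Summarize the following customer complaint.",
        "input": "I ordered a laptop 3 weeks ago and it still hasn't arrived...",
        "output": "Customer reports a delayed laptop order (3+ weeks) and is requesting status update.",
    },
]

cleaned = [
    # Clean string fields only; leave other value types untouched
    {k: clean_text(v) if isinstance(v, str) else v for k, v in ex.items()}
    for ex in raw_data
    if is_valid_example(ex)
]
print(f"Kept {len(cleaned)}/{len(raw_data)} examples after cleaning")
```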
Step 3 — Deduplicate
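```python
def deduplicate(
    data: list[dict], keys: tuple[str, ...] = ("instruction", "input")
) -> list[dict]:
    """Keep the first example for each case-insensitive combination of `keys`.

    Keying on the instruction alone would drop examples that share an
    instruction (e.g. "Summarize ...") but have different inputs.
    """
    seen = set()
    unique = []
    for ex in data:
        sig = tuple(ex.get(k, "").lower().strip() for k in keys)
        if sig not in seen:
            seen.add(sig)
            unique.append(ex)
    return unique

cleaned = deduplicate(cleaned)
print(f"After dedup: {len(cleaned)} examples")
```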
Step 4 — Format and Tokenize Check
Make sure examples fit within your model's context window:
```python
from transformers import AutoTokenizer

# Gated model: requires accepting the license on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
MAX_LENGTH = 2048

def format_prompt(ex: dict) -> str:
    return (
        f"### Instruction:\n{ex['instruction']}\n\n"
        f"### Input:\n{ex.get('input', '')}\n\n"
        f"### Response:\n{ex['output']}"
    )

valid = []
too_long = 0
for ex in cleaned:
    text = format_prompt(ex)
    # Count tokens via input_ids; behaves the same for fast and slow tokenizers
    n_tokens = len(tokenizer(text)["input_ids"])
    if n_tokens <= MAX_LENGTH:
        valid.append({"text": text})
    else:
        too_long += 1

print(f"Valid: {len(valid)}, Too long (skipped): {too_long}")
```
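Before settling on a maximum length, it can help to look at how token counts are distributed, since a limit that is too low silently discards a large share of the data. The sketch below assumes you have already collected one token count per example (for instance from the tokenizer loop above). The helper `length_report` and the sample numbers are made up for illustration.

```python
import statistics

def length_report(token_counts: list[int], max_length: int) -> dict:
    """Summarize token counts and the share of examples longer than max_length."""
    ordered = sorted(token_counts)
    # With n=100, quantiles() returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(ordered, n=100, method="inclusive")[94]
    over = sum(1 for n in ordered if n > max_length)
    return {
        "count": len(ordered),
        "median": statistics.median(ordered),
        "p95": p95,
        "max": ordered[-1],
        "share_over_limit": over / len(ordered),
    }

counts = [120, 340, 560, 900, 2500]  # made-up token counts
report = length_report(counts, max_length=2048)
print(report)
```

If the share over the limit is large, consider raising the limit or shortening the examples rather than dropping them.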
Step 5 — Train / Validation Split
```python
import json
import random

random.seed(42)
random.shuffle(valid)

split = int(len(valid) * 0.9)
train_data = valid[:split]
val_data = valid[split:]
print(f"Train: {len(train_data)}, Val: {len(val_data)}")

# Save as JSONL (most common format for LLM training): one JSON object per line
with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in train_data:
        f.write(json.dumps(ex) + "\n")

with open("val.jsonl", "w", encoding="utf-8") as f:
    for ex in val_data:
        f.write(json.dumps(ex) + "\n")
```
Step 6 — Load with HuggingFace Datasets
```python
from datasets import load_dataset

dataset = load_dataset("json", data_files={
    "train": "train.jsonl",
    "validation": "val.jsonl"
})
print(dataset)
# DatasetDict({
#     train: Dataset({features: ['text'], num_rows: 900})
#     validation: Dataset({features: ['text'], num_rows: 100})
# })
```
Dataset Size Guidelines
| Task | Minimum Size | Ideal Size |
|---|---|---|
| Simple classification | 500 examples | 2,000+ examples |
| Instruction following | 1,000 examples | 10,000+ examples |
| Domain adaptation | 5,000 examples | 50,000+ examples |
| Full pre-training | 1B tokens | 1T+ tokens |
Common Mistakes to Avoid
- Duplicate examples: inflate metrics without improving generalization
- Inconsistent formatting: the model learns format variance instead of the task
- Train/validation leakage: validation examples also appear in the training set
- Overly short outputs: the model learns to be terse even when detail is needed
- No system prompt variety: the model only works well with one exact system prompt
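Overlap between the training and validation sets is easy to check for after splitting. The sketch below flags validation examples whose normalized text also occurs in the training data. The helper names `normalize` and `find_leaks`, the `text` key, and the sample records are assumptions for illustration. It only catches exact matches after normalization, not paraphrases.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def find_leaks(train: list[dict], val: list[dict], key: str = "text") -> list[dict]:
    """Return validation examples whose normalized text also occurs in training."""
    train_signatures = {normalize(ex[key]) for ex in train}
    return [ex for ex in val if normalize(ex[key]) in train_signatures]

# Made-up sample records
train_sample = [{"text": "What is LoRA?"}, {"text": "Define RLHF."}]
val_sample = [{"text": "what is  lora?"}, {"text": "Explain DPO."}]

leaks = find_leaks(train_sample, val_sample)
print(f"{len(leaks)} validation example(s) also appear in the training data")
```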
Rule of thumb: 1,000 high-quality, diverse examples often beat 100,000 scraped, noisy ones. Invest time in curation, not just collection.