How to prepare a dataset for training an LLM?

Question

Accepted Answer

## How to Prepare a Dataset for Training an LLM

Dataset quality is usually the most important factor in fine-tuning. A small, clean, well-formatted dataset often outperforms a large, noisy one.

---

## Dataset Types by Training Goal

| Training Type | Dataset Format | Use Case |
|--------------|----------------|----------|
| **Instruction tuning** | Instruction + Input + Response | Chat, Q&A, task following |
| **Continued pre-training** | Raw text | Domain adaptation |
| **RLHF (reward model)** | Prompt + chosen + rejected | Preference alignment |
| **DPO** | Prompt + chosen + rejected | Direct preference optimization |

---

## Step 1: Choose a Format (Instruction Tuning)

The most common fine-tuning format is **instruction + input + output**:

```json
{
  "instruction": "Summarize the following customer complaint.",
  "input": "I ordered a laptop 3 weeks ago and it still hasn't arrived...",
  "output": "Customer reports a delayed laptop order (3+ weeks) and is requesting status update."
}
```

For chat models, use the **conversation format**:

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful Gen AI assistant."},
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method..."}
  ]
}
```

---

## Step 2: Collect and Clean Data

```python
import re
import json

def clean_text(text: str) -> str:
    text = text.strip()
    text = re.sub(r'\s+', ' ', text)            # normalize whitespace (also collapses newlines)
    text = re.sub(r'[^\x00-\x7F]+', '', text)   # remove non-ASCII if needed
    return text

def is_valid_example(example: dict) -> bool:
    # Filter out low-quality examples
    instruction = example.get("instruction", "")
    output = example.get("output", "")
    if len(instruction) < 10 or len(output) < 20:
        return False
    if len(output) > 4000:  # avoid truncation issues
        return False
    return True

raw_data = [...]  # your raw list of examples

cleaned = [
    # Only string fields are cleaned; other values pass through unchanged
    {k: clean_text(v) if isinstance(v, str) else v for k, v in ex.items()}
    for ex in raw_data
    if is_valid_example(ex)
]
print(f"Kept {len(cleaned)}/{len(raw_data)} examples after cleaning")
```

---

## Step 3: Deduplicate

```python
def deduplicate(data: list[dict], keys: tuple[str, ...] = ("instruction", "input")) -> list[dict]:
    seen = set()
    unique = []
    for ex in data:
        # The signature covers every key, so examples that share an
        # instruction but have different inputs are kept
        sig = tuple(ex.get(k, "").lower().strip() for k in keys)
        if sig not in seen:
            seen.add(sig)
            unique.append(ex)
    return unique

cleaned = deduplicate(cleaned)
print(f"After dedup: {len(cleaned)} examples")
```

---

## Step 4: Format and Check Token Length

Make sure examples fit within your model's context window:

```python
from transformers import AutoTokenizer

# Use the tokenizer of the model you plan to fine-tune
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
MAX_LENGTH = 2048

def format_prompt(ex: dict) -> str:
    return (
        f"### Instruction:\n{ex['instruction']}\n\n"
        f"### Input:\n{ex.get('input', '')}\n\n"
        f"### Response:\n{ex['output']}"
    )

valid = []
too_long = 0
for ex in cleaned:
    text = format_prompt(ex)
    n_tokens = len(tokenizer(text)["input_ids"])  # token count, including special tokens
    if n_tokens <= MAX_LENGTH:
        valid.append({"text": text})
    else:
        too_long += 1

print(f"Valid: {len(valid)}, Too long (skipped): {too_long}")
```

---

## Step 5: Train / Validation Split

```python
import json
import random

random.seed(42)
random.shuffle(valid)

split = int(len(valid) * 0.9)
train_data = valid[:split]
val_data = valid[split:]
print(f"Train: {len(train_data)}, Val: {len(val_data)}")

# Save as JSONL (one JSON object per line), the most common format for LLM training
with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in train_data:
        f.write(json.dumps(ex) + "\n")

with open("val.jsonl", "w", encoding="utf-8") as f:
    for ex in val_data:
        f.write(json.dumps(ex) + "\n")
```

---

## Step 6: Load with HuggingFace Datasets

```python
from datasets import load_dataset

dataset = load_dataset("json", data_files={
    "train": "train.jsonl",
    "validation": "val.jsonl"
})
print(dataset)
# Example output for a 1,000-example dataset:
# DatasetDict({
#     train: Dataset({features: ['text'], num_rows: 900})
#     validation: Dataset({features: ['text'], num_rows: 100})
# })
```

---

## Dataset Size Guidelines

| Task Complexity | Min Examples | Ideal Examples |
|----------------|-------------|----------------|
| Simple classification | 500 | 2,000+ |
| Instruction following | 1,000 | 10,000+ |
| Domain adaptation | 5,000 | 50,000+ |
| Full pre-training | 1B tokens | 1T+ tokens |

---

## Common Mistakes to Avoid

* **Duplicate examples**: inflate metrics without improving generalization
* **Inconsistent formatting**: the model learns format variance instead of the task
* **Data leakage**: validation examples also appear in the training set, so validation scores look better than they are
* **Outputs that are too short**: the model learns to be terse even when detail is needed
* **No system prompt variety**: the model only works with one exact system prompt

> **Rule of thumb:** 1,000 high-quality, diverse examples often beat 100,000 scraped, noisy ones. Invest time in curation, not just collection.
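
---

## Checking for Train / Validation Overlap

The leakage item in the list of common mistakes can be checked mechanically before training. The following is a minimal sketch, not part of the original answer: the helper names `normalize`, `fingerprint`, and `find_leaks` are hypothetical, and it assumes each example is a dict with a `"text"` field, as produced by the formatting step. It only catches exact matches after lowercasing and whitespace normalization; paraphrased or near-duplicate examples would need a fuzzier method.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def fingerprint(text: str) -> str:
    """Stable SHA-256 fingerprint of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_leaks(train: list[dict], val: list[dict], key: str = "text") -> list[int]:
    """Return indices of validation examples whose normalized text also appears in train."""
    train_prints = {fingerprint(ex[key]) for ex in train}
    return [i for i, ex in enumerate(val) if fingerprint(ex[key]) in train_prints]


# Tiny hypothetical records standing in for train_data / val_data
train_demo = [{"text": "What is LoRA?"}, {"text": "Summarize this complaint."}]
val_demo = [{"text": "what is  lora?"}, {"text": "Explain DPO."}]

leaks = find_leaks(train_demo, val_demo)
print(f"Leaked validation examples: {leaks}")
```

With real data, you would call `find_leaks(train_data, val_data)` and drop the returned indices from the validation set. Deduplicating before the split should already prevent most overlap, so a non-empty result usually points to a bug earlier in the pipeline.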
