What are billions of parameter trained models (e.g., 7B, 120B)?

Question

Accepted Answer

## Billions of Parameters in AI Models (7B, 70B, 120B, etc.)

**Parameters** are the learned numerical values in a neural network — specifically the weights and biases that determine how the model transforms inputs to outputs. "7B" means 7 billion such numbers.

### What Are Parameters?

In a simple linear layer:
```python
import torch.nn as nn

# A linear layer with 1000 input features → 500 output features
layer = nn.Linear(1000, 500)

# Parameters = weight matrix (stored as 500×1000) + bias vector (500)
expected = 1000 * 500 + 500  # = 500,500 parameters
actual = sum(p.numel() for p in layer.parameters())
print(f"Parameters: {actual:,} (expected {expected:,})")
```

A model with 7 billion parameters has 7,000,000,000 such numbers, spread across a few hundred weight matrices in dozens of stacked layers (Llama 2 7B, for example, has 32 transformer layers).
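To see where a "7B" figure comes from, here is a minimal sketch that counts the parameters of a Llama-style decoder from its dimensions. The function name is hypothetical, and the dimensions (vocabulary of 32,000, hidden size 4096, 32 layers, MLP width 11,008, no biases, untied embeddings, standard multi-head attention) are assumptions based on the publicly documented Llama 2 7B configuration, not something stated above.

```python
def count_decoder_params(vocab_size: int, hidden: int, n_layers: int, mlp_hidden: int) -> int:
    """Count parameters of a Llama-style decoder (assumes no biases, untied embeddings)."""
    embeddings = vocab_size * hidden   # token embedding table
    attention = 4 * hidden * hidden    # Q, K, V and output projections
    mlp = 3 * hidden * mlp_hidden      # gate, up and down projections
    norms = 2 * hidden                 # two normalization weight vectors per layer
    per_layer = attention + mlp + norms
    final_norm = hidden                # normalization before the output head
    lm_head = vocab_size * hidden      # projection back to the vocabulary
    return embeddings + n_layers * per_layer + final_norm + lm_head


# Assumed Llama-2-7B-style dimensions
total = count_decoder_params(vocab_size=32_000, hidden=4096, n_layers=32, mlp_hidden=11_008)
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")  # 6,738,415,616 parameters (~6.74B)
```

Almost all of the count sits in the attention and MLP matrices of the repeated layers, which is why "7B" models are usually a little under or over 7 billion exactly.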

### Why Parameter Count Matters

| More Parameters → | Effect |
|-------------------|--------|
| More knowledge | Can store more facts and patterns |
| Better reasoning | More capacity for complex computation |
| Higher compute | More GPU memory and processing needed |
| Slower inference | More calculations per token |
| More expensive | Higher hardware and API costs |
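The "more calculations per token" row can be made concrete with a common rule of thumb, used here as an assumption: a dense model performs roughly 2 floating-point operations (one multiply and one add) per parameter for each token it processes. The sketch below ignores the extra cost of attention over long contexts and does not apply to mixture-of-experts models, where only part of the weights is active per token.

```python
def forward_flops_per_token(params: float) -> float:
    """Rule of thumb (assumption): ~2 FLOPs per parameter per token for a dense model."""
    return 2.0 * params


for name, n_params in [("7B", 7e9), ("70B", 70e9)]:
    gflops = forward_flops_per_token(n_params) / 1e9
    print(f"{name}: ~{gflops:.0f} GFLOPs per token")
```

Under this approximation a 70B model does about ten times the arithmetic of a 7B model for every token, which is the main reason larger models are slower and more expensive to serve.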

### Common Model Sizes

| Size | Approx. VRAM for weights (8-bit to 16-bit) | Example Models |
|------|-------------------|---------------|
| **1-3B** | 2-6 GB | Llama 3.2 1B, Gemma 2B, Llama 3.2 3B |
| **7-8B** | 8-16 GB | Llama 3.1 8B, Mistral 7B, Gemma 7B |
| **13-14B** | 16-28 GB | Llama 2 13B, Phi-3 Medium |
| **34-35B** | 40-70 GB | CodeLlama 34B |
| **70-72B** | 80-140 GB | Llama 3.1 70B, Qwen 72B |
| **120-180B** | 2-5× A100 80 GB | Mistral Large 2 (123B), Falcon 180B |
| **400B+** | Multi-node GPU cluster | Llama 3.1 405B, GPT-4 (est.) |
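As a rough way to read the larger rows of this table, the sketch below estimates how many GPUs are needed just to hold the weights. The function name, the 80 GB card size, and the 90% usable-memory fraction are illustrative assumptions; real deployments also need room for the KV cache and activations.

```python
import math


def gpus_needed(params_billions: float, bytes_per_param: float = 2.0,
                gpu_mem_gb: float = 80.0, usable_fraction: float = 0.9) -> int:
    """Minimum GPU count to hold the weights alone (all defaults are assumptions)."""
    weights_gb = params_billions * bytes_per_param  # 1e9 params x bytes, in decimal GB
    return math.ceil(weights_gb / (gpu_mem_gb * usable_fraction))


print(gpus_needed(7))             # 7B at 16-bit
print(gpus_needed(70))            # 70B at 16-bit
print(gpus_needed(405))           # 405B at 16-bit
print(gpus_needed(70, 0.5))       # 70B at 4-bit
```

Lowering `bytes_per_param` (quantization) is usually the cheapest way to reduce the GPU count, at some cost in output quality.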

### Parameter Count vs Actual Storage

Each parameter is typically stored as:
- **FP32**: 4 bytes → 7B model = ~28 GB
- **FP16/BF16**: 2 bytes → 7B model = ~14 GB
- **INT8 quantized**: 1 byte → 7B model = ~7 GB
- **INT4 quantized**: 0.5 bytes → 7B model = ~3.5 GB

```python
# Estimate the storage needed for the weights, in decimal GB (1 GB = 1e9 bytes).
# Dividing by 1024**3 instead gives GiB, e.g. about 13.0 GiB for 7B at FP16.
def estimate_model_size_gb(params_billions: float, bytes_per_param: float = 2) -> float:
    return params_billions * 1e9 * bytes_per_param / 1e9

print(f"7B at FP16:  {estimate_model_size_gb(7):.1f} GB")       # 14.0 GB
print(f"70B at FP16: {estimate_model_size_gb(70):.1f} GB")      # 140.0 GB
print(f"7B at INT4:  {estimate_model_size_gb(7, 0.5):.1f} GB")  # 3.5 GB
```
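Weights are not the only thing held in memory at inference time: the key/value (KV) cache grows with context length. The sketch below estimates its size under stated assumptions: a Llama-2-7B-style layout (32 layers, 32 KV heads, head dimension 128), 16-bit cache values, and a hypothetical helper name. Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int, seq_len: int,
                   batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for every layer and position."""
    # Factor 2: one key tensor plus one value tensor per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Assumed Llama-2-7B-style dimensions with a 4096-token context
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(f"KV cache: {size / 1024**3:.1f} GiB")  # KV cache: 2.0 GiB
```

This is why practical VRAM requirements sit somewhat above the raw weight size, especially with long contexts or many concurrent requests.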

### Scaling Laws

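Empirical scaling-law studies find that model quality improves predictably with scale:
- **More parameters** → better performance, with diminishing returns when the training data stays fixed
- **More training data** → better performance
- **Compute-optimal training** balances parameters and tokens (Chinchilla result: roughly 20 training tokens per parameter)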
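The 20-tokens-per-parameter guideline can be turned into a quick estimate of how much data and compute a model size implies. The training-cost formula used here (about 6 FLOPs per parameter per training token) is a widely used approximation and an assumption of this sketch, not something stated above; the function names are illustrative.

```python
def chinchilla_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal training tokens under the ~20 tokens/parameter guideline."""
    return params * tokens_per_param


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute (assumption): ~6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


for name, n_params in [("7B", 7e9), ("70B", 70e9)]:
    tokens = chinchilla_tokens(n_params)
    flops = training_flops(n_params, tokens)
    print(f"{name}: ~{tokens / 1e12:.2f}T tokens, ~{flops:.2e} training FLOPs")
```

Many recent models are deliberately trained on far more tokens than this guideline suggests, trading extra training compute for a smaller model that is cheaper to run.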

### Quality vs Size Tradeoff

```
Size         │ Quality   │ Speed     │ Cost
─────────────┼───────────┼───────────┼──────────
1-3B         │ Basic     │ Fast      │ Low
7-8B         │ Good      │ Fast      │ Low
13-14B       │ Better    │ Medium    │ Medium
70B          │ Excellent │ Slow      │ High
400B+        │ Best      │ Very slow │ Very high
```

### Running Models Locally

```bash
# Ollama - easily run models by size
ollama pull llama3.1:8b    # 8B — runs on 8GB RAM
ollama pull llama3.1:70b   # 70B — needs 64GB+ RAM
ollama pull phi3:mini       # 3.8B — great for laptops
```

The 7B-8B range is a popular sweet spot for local inference: good quality with practical hardware requirements.

