How to increase the performance of a Python application? Explain key optimization techniques with examples.

Question

Accepted Answer

## How to Increase Python Application Performance

Python is powerful but inherently slower than compiled languages. Optimizing a Python app requires a **systematic approach**: profile first, then apply the right technique at the right layer.

---

### Performance Optimization Hierarchy

```mermaid
graph TD
    A["1. Algorithm & Data Structure<br/>Biggest impact"] --> B["2. Built-in Functions & Libraries<br/>Use C-optimized internals"]
    B --> C["3. Concurrency & Parallelism<br/>threading / multiprocessing / asyncio"]
    C --> D["4. Caching & Memoization<br/>Avoid redundant computation"]
    D --> E["5. Memory Optimization<br/>Reduce allocations & GC pressure"]
    E --> F["6. C Extensions & JIT<br/>Cython / Numba / PyPy"]
    F --> G["7. Infrastructure<br/>Scaling, CDN, load balancing"]
    style A fill:#fee2e2,stroke:#dc2626
    style B fill:#fef3c7,stroke:#d97706
    style C fill:#dbeafe,stroke:#2563eb
    style D fill:#d1fae5,stroke:#059669
    style E fill:#f3e8ff,stroke:#9333ea
    style F fill:#fce7f3,stroke:#db2777
    style G fill:#f0fdf4,stroke:#16a34a
```

---

### 1. Profile Before Optimizing

**Never guess.** Always measure first to find the actual bottleneck.
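As a quick first measurement, the standard library's `timeit` module can compare two candidate implementations before reaching for a full profiler. The sketch below is illustrative only: the function names (`sum_squares_loop`, `sum_squares_builtin`, `best_time`) and the input size are assumptions made for this example, not part of the original answer.

```python
import timeit


def sum_squares_loop(n: int) -> int:
    """Sum of squares with an explicit Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_builtin(n: int) -> int:
    """Same result using the built-in sum() over a generator."""
    return sum(i * i for i in range(n))


def best_time(func, n: int, repeat: int = 3, number: int = 5) -> float:
    """Return the best per-call time in seconds across several repeats."""
    timings = timeit.repeat(lambda: func(n), repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    for func in (sum_squares_loop, sum_squares_builtin):
        print(f"{func.__name__}: {best_time(func, 100_000) * 1e3:.2f} ms per call")
```

Taking the minimum over several repeats reduces noise from other processes; the absolute numbers will differ from machine to machine.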
```python
import cProfile
import time
from functools import wraps

# ─── Quick timer decorator ─────────────────────────────
def timer(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__}: {elapsed:.4f}s")
        return result
    return wrapper

# ─── cProfile for detailed breakdown ───────────────────
def slow_function():
    total = 0
    for i in range(1_000_000):
        total += i ** 2
    return total

cProfile.run('slow_function()', sort='cumulative')

# ─── line_profiler for line-by-line analysis ───────────
# pip install line_profiler
# @profile   <-- add this decorator to the function of interest
# kernprof -l -v script.py
```

| Profiling Tool | What It Measures | Best For |
|---------------|-----------------|----------|
| `time.perf_counter()` | Wall clock time | Quick checks |
| `cProfile` | Per-function call counts and cumulative time | Finding slow functions |
| `line_profiler` | Line-by-line execution time | Pinpointing exact slow lines |
| `memory_profiler` | Memory usage per line | Finding memory leaks |
| `py-spy` | Sampling profiler (no code change) | Production profiling |
| `tracemalloc` | Memory allocation tracking | Tracking allocation sources |

---

### 2. Algorithm & Data Structure Optimization

The **single biggest impact**: choosing the right data structure can make an operation 100x–1000x faster on large inputs.

```python
import time

# ─── BAD: O(n) lookup in a list ────────────────────────
large_list = list(range(1_000_000))
start = time.perf_counter()
result = 999_999 in large_list  # Scans the entire list
print(f"List lookup: {time.perf_counter() - start:.6f}s")
# ~0.015s (machine-dependent)

# ─── GOOD: O(1) lookup in a set ────────────────────────
large_set = set(range(1_000_000))
start = time.perf_counter()
result = 999_999 in large_set  # Hash-based lookup
print(f"Set lookup: {time.perf_counter() - start:.6f}s")
# ~0.000001s (roughly 10,000x faster)
```

| Operation | `list` | `set` | `dict` | `deque` |
|-----------|--------|-------|--------|--------|
| **Lookup** (`in`) | O(n) | O(1) | O(1) | O(n) |
| **Append** | O(1) | O(1) | O(1) | O(1) |
| **Insert at front** | O(n) | N/A | N/A | O(1) |
| **Delete by value** | O(n) | O(1) | O(1) | O(n) |
| **Sort** | O(n log n) | N/A | N/A | N/A |

The `set` and `dict` figures are average-case; for `dict`, lookup and deletion are by key.

---

### 3. Use Built-in Functions & C-Optimized Libraries

Python's built-ins are implemented in C, so prefer them over manual loops where they fit.

```python
import time

import numpy as np

numbers = list(range(1_000_000))

# ─── SLOW: Python loop ────────────────────────────────
start = time.perf_counter()
total = 0
for n in numbers:
    total += n
print(f"Python loop: {time.perf_counter() - start:.4f}s")

# ─── FAST: Built-in sum() (C implementation) ──────────
start = time.perf_counter()
total = sum(numbers)
print(f"Built-in sum: {time.perf_counter() - start:.4f}s")
# Often 5-10x faster

# ─── FASTEST: NumPy (vectorized C/Fortran) ─────────────
arr = np.arange(1_000_000)
start = time.perf_counter()
total = np.sum(arr)
print(f"NumPy sum: {time.perf_counter() - start:.4f}s")
# Often 50-100x faster than the Python loop
```

| Instead of | Use | Typical speedup |
|------------|-----|--------|
| `for` loop to sum | `sum()` | 5–10x |
| `for` loop to filter | `filter()` / list comprehension | 2–5x |
| `for` loop to map | `map()` / list comprehension | 2–5x |
| Manual string concat (`+=`) | `''.join(list)` | 10–100x |
| Manual math on lists | `numpy` vectorized ops | 50–100x |
| `json.loads/dumps` | `orjson` or `ujson` | 3–10x |
| `re.match` in a loop | Pre-compile with `re.compile()` | 2–5x |

These figures are rough and vary with workload and Python version; measure before relying on them.

---

### 4. List Comprehensions & Generator Expressions

```python
import sys

# ─── List comprehension (faster than loop + append) ────
squares = [x ** 2 for x in range(1_000_000)]
print(f"List size: {sys.getsizeof(squares) / 1024 / 1024:.1f} MB")
# ~8 MB for the list itself; the int objects it points to take additional memory

# ─── Generator expression (memory efficient) ──────────
squares_gen = (x ** 2 for x in range(1_000_000))
print(f"Generator size: {sys.getsizeof(squares_gen)} bytes")
# ~200 bytes: values are computed lazily on demand

# Use generators for large datasets you iterate once
total = sum(x ** 2 for x in range(1_000_000))  # No list created
```

---

### 5. Caching & Memoization

Avoid recomputing the same result.

```python
from functools import lru_cache
import time

# ─── Without caching ───────────────────────────────────
def fibonacci_slow(n: int) -> int:
    if n < 2:
        return n
    return fibonacci_slow(n - 1) + fibonacci_slow(n - 2)

# fibonacci_slow(35) takes a few seconds (exponential number of calls)

# ─── With lru_cache (memoization) ──────────────────────
@lru_cache(maxsize=256)
def fibonacci_fast(n: int) -> int:
    if n < 2:
        return n
    return fibonacci_fast(n - 1) + fibonacci_fast(n - 2)

start = time.perf_counter()
result = fibonacci_fast(100)  # Near-instant: cached results are reused
print(f"fib(100): {time.perf_counter() - start:.6f}s")

# ─── Cache for API responses ──────────────────────────
@lru_cache(maxsize=1000)
def get_embedding_cached(text: str) -> tuple:
    """Cache embedding API calls to avoid re-computation."""
    # The expensive API call happens only once per unique text
    import openai
    client = openai.OpenAI()
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    # Return a tuple so callers cannot mutate the cached value
    return tuple(response.data[0].embedding)
```

---

### 6. Concurrency & Parallelism

Choose the right model based on your bottleneck.
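To see why the choice matters for I/O-bound work, here is a minimal sketch that needs no network access: `asyncio.sleep` stands in for a slow API call. The names `fake_api_call`, `run_sequential`, `run_concurrent`, and `timed` are hypothetical helpers invented for this illustration.

```python
import asyncio
import time


async def fake_api_call(i: int, delay: float = 0.1) -> int:
    """Simulate an I/O-bound call; the sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return i * 2


async def run_sequential(n: int) -> list[int]:
    # Each call waits for the previous one to finish: about n * delay in total
    return [await fake_api_call(i) for i in range(n)]


async def run_concurrent(n: int) -> list[int]:
    # All calls wait at the same time: about one delay in total
    return list(await asyncio.gather(*(fake_api_call(i) for i in range(n))))


def timed(coro_fn, n: int) -> tuple[list[int], float]:
    """Run a coroutine function and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(n))
    return result, time.perf_counter() - start


if __name__ == "__main__":
    seq_result, seq_time = timed(run_sequential, 5)
    con_result, con_time = timed(run_concurrent, 5)
    print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

Both versions return the same values in the same order; only the waiting overlaps. CPU-bound work would not benefit this way, because nothing yields control while it computes.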
```python
import asyncio
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

import aiohttp
import requests

urls = ["https://httpbin.org/delay/1"] * 10

# ─── I/O-bound: asyncio (best for API calls) ──────────
async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as response:  # Releases the connection on exit
        return await response.text()

async def fetch_all_async(urls: list[str]) -> list[str]:
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

# 10 requests in ~1s instead of ~10s

# ─── CPU-bound: ProcessPoolExecutor ────────────────────
def cpu_task(n: int) -> int:
    return sum(i * i for i in range(n))

# ─── I/O-bound: ThreadPoolExecutor ─────────────────────
def requests_get_wrapper(url: str) -> str:
    # Good for blocking I/O libraries that don't support async
    return requests.get(url, timeout=10).text

# The guard is needed wherever worker processes are spawned (Windows, macOS)
if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(cpu_task, [5_000_000] * 4))
    # Up to ~4x speedup on 4 cores

    with ThreadPoolExecutor(max_workers=10) as pool:
        results = list(pool.map(requests_get_wrapper, urls))
```

| Bottleneck | Solution | Example Use Case |
|-----------|----------|------------------|
| **I/O-bound (async lib)** | `asyncio` | LLM API calls, web scraping |
| **I/O-bound (sync lib)** | `ThreadPoolExecutor` | Database queries, file reads |
| **CPU-bound** | `ProcessPoolExecutor` | Embedding generation, data processing |
| **Mixed** | `asyncio` + `ProcessPoolExecutor` | RAG pipeline (API + preprocessing) |

---

### 7. Memory Optimization

```python
import sys
from array import array

# ─── HEAVY: Regular class ──────────────────────────────
class PointHeavy:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

# ─── LIGHT: __slots__ (no __dict__, fixed attributes) ──
class PointLight:
    __slots__ = ['x', 'y', 'z']

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

heavy = PointHeavy(1, 2, 3)
light = PointLight(1, 2, 3)
print(f"Regular: {sys.getsizeof(heavy) + sys.getsizeof(heavy.__dict__)} bytes")
print(f"Slots: {sys.getsizeof(light)} bytes")
# Slots instances use less memory (around 40%; exact savings vary by Python version)

# ─── Use generators for large data pipelines ──────────
def process_large_file(filepath: str):
    """Process line by line; never loads the entire file."""
    with open(filepath) as f:
        for line in f:           # One line in memory at a time
            yield line.strip()   # Lazy evaluation

# ─── Use array instead of list for numeric data ────────
nums_list = list(range(1_000_000))            # ~8 MB for the pointer array alone
nums_array = array('i', range(1_000_000))     # ~4 MB (typed, compact)
```

---

### 8. String Optimization

```python
import time

names = ["Alice", "Bob", "Charlie"] * 100_000

# ─── SLOW: String concatenation in loop ────────────────
start = time.perf_counter()
result = ""
for name in names:
    result += name + ", "  # May copy the growing string on each iteration
print(f"Concat: {time.perf_counter() - start:.4f}s")

# ─── FAST: join() (single allocation) ─────────────────
start = time.perf_counter()
result = ", ".join(names)  # One allocation, C-optimized (no trailing separator)
print(f"Join: {time.perf_counter() - start:.4f}s")
# Often 10-100x faster for large strings

# ─── f-strings (usually the fastest formatting) ────────
name, age = "Alice", 30
# Concatenation: "Name: " + name + " Age: " + str(age)
# %-formatting:  "Name: %s Age: %d" % (name, age)
# str.format():  "Name: {} Age: {}".format(name, age)
# f-string:      f"Name: {name} Age: {age}"   # Usually fastest and most readable
```

---

### 9. Faster Libraries (Drop-in Replacements)

| Standard Library | Faster Alternative | Speedup | Install |
|-----------------|-------------------|---------|--------|
| `json` | `orjson` | 3–10x | `pip install orjson` |
| `json` | `ujson` | 2–5x | `pip install ujson` |
| `requests` | `httpx` (async) | 5–20x (concurrent) | `pip install httpx` |
| `csv` | `polars` | 10–50x | `pip install polars` |
| `pandas` | `polars` | 2–10x | `pip install polars` |
| `re` | `regex` | 1.5–3x | `pip install regex` |
| `pickle` | `msgpack` | 2–5x | `pip install msgpack` |
| `datetime` | `pendulum` | Faster parsing | `pip install pendulum` |

Not all of these are strict drop-in replacements: `polars`, `msgpack`, and async `httpx` have different APIs or data formats, so expect some code changes.

```python
# ─── orjson: faster JSON ──────────────────────────────
import json

import orjson

data = {"embeddings": [0.1] * 1536, "model": "text-embedding-3-small"}

# Standard json (returns str, so encode to get bytes)
json_bytes = json.dumps(data).encode()

# orjson (returns bytes directly, much faster)
orjson_bytes = orjson.dumps(data)
parsed = orjson.loads(orjson_bytes)
```

---

### 10. JIT Compilation with Numba

For numerical code, Numba compiles Python to machine code at runtime.
```python
import time

import numpy as np
from numba import njit

# ─── Pure Python (slow) ───────────────────────────────
def cosine_similarity_python(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x ** 2 for x in a) ** 0.5
    norm_b = sum(x ** 2 for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# ─── Numba JIT (near C speed) ─────────────────────────
@njit
def cosine_similarity_numba(a, b):
    dot = 0.0
    norm_a = 0.0
    norm_b = 0.0
    for i in range(len(a)):
        dot += a[i] * b[i]
        norm_a += a[i] ** 2
        norm_b += b[i] ** 2
    return dot / (norm_a ** 0.5 * norm_b ** 0.5)

a = np.random.rand(1536)  # Typical embedding dimension
b = np.random.rand(1536)

# Warm up JIT (the first call includes compilation time)
cosine_similarity_numba(a, b)

# Convert once, so the list conversion is not part of the timed loop
a_list, b_list = a.tolist(), b.tolist()

start = time.perf_counter()
for _ in range(10_000):
    cosine_similarity_python(a_list, b_list)
print(f"Python: {time.perf_counter() - start:.4f}s")

start = time.perf_counter()
for _ in range(10_000):
    cosine_similarity_numba(a, b)
print(f"Numba: {time.perf_counter() - start:.4f}s")
# Numba is typically 50-100x faster here
```

---

### Performance Optimization Cheat Sheet

| Technique | Impact | Effort | When to Use |
|-----------|--------|--------|-------------|
| **Profile first** | Critical | Low | Always, before any optimization |
| **Better algorithm/data structure** | 10–1000x | Medium | When complexity is wrong (O(n²) → O(n)) |
| **Built-in functions** | 5–10x | Low | Replace manual loops with `sum`, `map`, `filter` |
| **List comprehensions** | 2–5x | Low | Replace `for` loop + `append` |
| **Generators** | Memory savings | Low | Large datasets iterated once |
| **`lru_cache`** | Huge (avoids recompute) | Low | Pure functions called with same args |
| **`asyncio`** | 5–50x for I/O | Medium | Multiple API calls, network requests |
| **`multiprocessing`** | Linear with cores | Medium | CPU-heavy tasks (embeddings, parsing) |
| **`__slots__`** | 30–40% memory | Low | Many instances of same class |
| **`orjson`** | 3–10x JSON speed | Low | JSON-heavy applications |
| **NumPy vectorization** | 50–100x | Medium | Numerical computations |
| **Numba JIT** | 50–100x | Medium | Tight numerical loops |
| **Cython** | 10–100x | High | Performance-critical modules |

> **Golden Rule:** Profile → fix the bottleneck → measure again. Don't optimize code that isn't slow. In Gen AI applications, the biggest wins usually come from **async I/O for API calls**, **caching embeddings**, and **batch processing** rather than micro-optimizing Python code.

**Resources:**

- [Python Performance Tips](https://docs.python.org/3/howto/performance.html)
- [cProfile documentation](https://docs.python.org/3/library/profile.html)
- [Numba documentation](https://numba.readthedocs.io/)
- [High Performance Python (O'Reilly)](https://www.oreilly.com/library/view/high-performance-python/9781492055013/)


