How to increase the performance of a Python application? Explain key optimization techniques with examples.
Answer
How to Increase Python Application Performance
Python is powerful but generally slower than compiled languages for CPU-bound work. Optimizing a Python app requires a systematic approach: profile first, then apply the right technique at the right layer.
Performance Optimization Hierarchy
1. Profile Before Optimizing
Never guess; always measure first to find the actual bottleneck.

```python
import cProfile
import time
from functools import wraps


# --- Quick timer decorator ---
def timer(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__}: {elapsed:.4f}s")
        return result
    return wrapper


# --- cProfile for detailed breakdown ---
def slow_function():
    total = 0
    for i in range(1_000_000):
        total += i ** 2
    return total


cProfile.run('slow_function()', sort='cumulative')

# --- line_profiler for line-by-line analysis ---
# pip install line_profiler
# @profile   <-- add this decorator
# kernprof -l -v script.py
```
| Profiling Tool | What It Measures | Best For |
|---|---|---|
| `time.perf_counter()` | Wall clock time | Quick checks |
| `cProfile` | Function-level CPU time | Finding slow functions |
| `line_profiler` | Line-by-line execution time | Pinpointing exact slow lines |
| `memory_profiler` | Memory usage per line | Finding memory leaks |
| `py-spy` | Sampling profiler (no code change) | Production profiling |
| `tracemalloc` | Memory allocation tracking | Tracking allocation sources |
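For the memory allocation tracking row, the standard library's `tracemalloc` module is one option. The sketch below is only an illustration: `build_table` is a hypothetical workload, and the snapshot is grouped by source line so the heaviest allocation sites come first.

```python
import tracemalloc


def build_table(n: int) -> list[str]:
    """Hypothetical workload: allocate n short strings."""
    return [f"row-{i}" for i in range(n)]


def top_allocations(n: int = 100_000, limit: int = 3) -> list[tracemalloc.Statistic]:
    """Run the workload under tracemalloc and return the largest allocation sites."""
    tracemalloc.start()
    try:
        rows = build_table(n)  # keep a reference so the memory is still live
        snapshot = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    del rows
    # Statistics are sorted from largest to smallest total size.
    return snapshot.statistics("lineno")[:limit]


if __name__ == "__main__":
    for stat in top_allocations():
        print(stat)  # file:line, total size, and number of allocations
```

Tracing adds overhead, so it is usually enabled around a suspect code path rather than for the whole run.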
2. Algorithm & Data Structure Optimization
This usually has the single biggest impact: choosing the right data structure can make an operation 100x–1000x faster.

```python
import time

# --- BAD: O(n) lookup in a list ---
large_list = list(range(1_000_000))
start = time.perf_counter()
result = 999_999 in large_list  # Scans the entire list
print(f"List lookup: {time.perf_counter() - start:.6f}s")
# ~0.015s

# --- GOOD: O(1) lookup in a set ---
large_set = set(range(1_000_000))
start = time.perf_counter()
result = 999_999 in large_set  # Hash-based lookup
print(f"Set lookup: {time.perf_counter() - start:.6f}s")
# ~0.000001s (roughly 10,000x faster; exact numbers vary by machine)
```
| Operation | `list` | `set` | `dict` | `deque` |
|---|---|---|---|---|
| Lookup (`in`) | O(n) | O(1) | O(1) | O(n) |
| Append | O(1) | O(1) | O(1) | O(1) |
| Insert at front | O(n) | N/A | N/A | O(1) |
| Delete by value | O(n) | O(1) | O(1) | O(n) |
| Sort | O(n log n) | N/A | N/A | N/A |
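To make the "Insert at front" row concrete, here is a small sketch comparing `list.insert(0, x)` with `collections.deque.appendleft(x)`. The function names and sizes are illustrative assumptions, and the timings it prints will vary by machine.

```python
from collections import deque
from timeit import timeit


def fill_front_list(n: int) -> list[int]:
    items: list[int] = []
    for i in range(n):
        items.insert(0, i)  # O(n): every existing element shifts right
    return items


def fill_front_deque(n: int) -> deque[int]:
    items: deque[int] = deque()
    for i in range(n):
        items.appendleft(i)  # O(1): no shifting needed
    return items


if __name__ == "__main__":
    n = 10_000
    print(f"list.insert(0, x):   {timeit(lambda: fill_front_list(n), number=3):.4f}s")
    print(f"deque.appendleft(x): {timeit(lambda: fill_front_deque(n), number=3):.4f}s")
```

Both functions build the same sequence; only the cost of each front insertion differs, and the gap widens as `n` grows.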
3. Use Built-in Functions & C-Optimized Libraries
Python's built-ins are implemented in C, so prefer them over manual loops whenever one fits.

```python
import time

import numpy as np

numbers = list(range(1_000_000))

# --- SLOW: Python loop ---
start = time.perf_counter()
total = 0
for n in numbers:
    total += n
print(f"Python loop: {time.perf_counter() - start:.4f}s")

# --- FAST: Built-in sum() (C implementation) ---
start = time.perf_counter()
total = sum(numbers)
print(f"Built-in sum: {time.perf_counter() - start:.4f}s")
# 5-10x faster

# --- FASTEST: NumPy (vectorized C/Fortran) ---
arr = np.arange(1_000_000)
start = time.perf_counter()
total = np.sum(arr)
print(f"NumPy sum: {time.perf_counter() - start:.4f}s")
# 50-100x faster than the Python loop
```
| Instead of | Use | Speedup |
|---|---|---|
| Manual `for` loop summation | `sum()` | 5–10x |
| text | text | 2–5x |
| text | text | 2–5x |
| Manual string concat (`+=`) | `"".join()` | 10–100x |
| Manual math on lists | `numpy` | 50–100x |
| text | text text | 3–10x |
| text | Pre-compile with `re.compile()` | 2–5x |
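The last row mentions pre-compiling. Assuming it refers to regular expressions and `re.compile`, a minimal sketch looks like this. The date pattern and the `extract_dates` helper are hypothetical, chosen only to show the pattern being compiled once and reused.

```python
import re

# Hypothetical pattern: ISO-style dates such as 2024-05-17.
# Compiling once at import time avoids repeated pattern lookups inside the loop.
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def extract_dates(lines: list[str]) -> list[tuple[str, str, str]]:
    """Return (year, month, day) string tuples found in the given lines."""
    found: list[tuple[str, str, str]] = []
    for line in lines:
        found.extend(DATE_RE.findall(line))
    return found


if __name__ == "__main__":
    print(extract_dates(["released 2024-05-17", "no date here"]))
```

The `re` module also caches recently used patterns internally, so the gain from pre-compiling is often modest; it matters most in tight loops, which is one more reason to measure.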
4. List Comprehensions & Generator Expressions
```python
import sys

# --- List comprehension (faster than an equivalent loop) ---
squares = [x ** 2 for x in range(1_000_000)]
print(f"List size: {sys.getsizeof(squares) / 1024 / 1024:.1f} MB")
# ~8 MB for the list's pointer array alone; every value is also held in memory

# --- Generator expression (memory efficient) ---
squares_gen = (x ** 2 for x in range(1_000_000))
print(f"Generator size: {sys.getsizeof(squares_gen)} bytes")
# ~200 bytes; values are computed lazily on demand

# Use generators for large datasets you iterate once
total = sum(x ** 2 for x in range(1_000_000))  # No list created
```
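One caveat behind "iterate once" is that a generator is exhausted after a single pass. The short sketch below, using a hypothetical `squares` generator function, shows what happens on a second pass and when to materialize a list instead.

```python
def squares(limit: int):
    """Yield squares lazily; nothing is computed until the caller iterates."""
    for x in range(limit):
        yield x * x


gen = squares(5)
first_pass = sum(gen)   # consumes the generator: 0 + 1 + 4 + 9 + 16 = 30
second_pass = sum(gen)  # the generator is exhausted, so this is 0

# Materialize a list when the data must be traversed more than once.
as_list = list(squares(5))

if __name__ == "__main__":
    print(first_pass, second_pass, as_list)
```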
5. Caching & Memoization
Avoid recomputing the same result.
```python
import time
from functools import lru_cache


# --- Without caching ---
def fibonacci_slow(n: int) -> int:
    if n < 2:
        return n
    return fibonacci_slow(n - 1) + fibonacci_slow(n - 2)
# fibonacci_slow(35) takes ~3 seconds (exponential number of calls)


# --- With lru_cache (memoization) ---
@lru_cache(maxsize=256)
def fibonacci_fast(n: int) -> int:
    if n < 2:
        return n
    return fibonacci_fast(n - 1) + fibonacci_fast(n - 2)


start = time.perf_counter()
result = fibonacci_fast(100)  # Near-instant: cached results are reused
print(f"fib(100): {time.perf_counter() - start:.6f}s")


# --- Cache for API responses ---
@lru_cache(maxsize=1000)
def get_embedding_cached(text: str) -> tuple:
    """Cache embedding API calls to avoid re-computation."""
    # The expensive API call happens only once per unique text
    import openai
    client = openai.OpenAI()
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return tuple(response.data[0].embedding)
```
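Before relying on a cache it helps to check that it is actually being hit. The sketch below uses `cache_info()` and `cache_clear()`, which `functools.lru_cache` attaches to the wrapped function; `normalize` is a hypothetical stand-in for an expensive pure function.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def normalize(text: str) -> str:
    """Hypothetical stand-in for an expensive pure function."""
    return " ".join(text.lower().split())


# Two distinct arguments, one of them repeated.
for query in ["Hello  World", "hello world", "Hello  World"]:
    normalize(query)

info = normalize.cache_info()  # named tuple: hits, misses, maxsize, currsize

if __name__ == "__main__":
    print(info)
    normalize.cache_clear()  # drop all entries, e.g. after the underlying data changes
```

Arguments must be hashable, so passing a list raises `TypeError`; that is why the embedding example returns and accepts immutable values.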
6. Concurrency & Parallelism
Choose the right model based on your bottleneck.
```python
import asyncio
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

import aiohttp
import requests

urls = ["https://httpbin.org/delay/1" for _ in range(10)]


# --- I/O-bound: asyncio (best for API calls) ---
async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as response:  # releases the connection on exit
        return await response.text()


async def fetch_all_async(urls: list[str]) -> list[str]:
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))
# 10 requests in ~1s instead of ~10s


# --- CPU-bound: ProcessPoolExecutor ---
def cpu_task(n: int) -> int:
    return sum(i * i for i in range(n))


# --- I/O-bound: ThreadPoolExecutor ---
def requests_get_wrapper(url: str) -> str:
    # Good for blocking I/O libraries that don't support async
    return requests.get(url, timeout=10).text


if __name__ == "__main__":  # needed for process pools on spawn-based platforms
    pages = asyncio.run(fetch_all_async(urls))

    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(cpu_task, [5_000_000] * 4))
    # Up to ~4x speedup on 4 cores

    with ThreadPoolExecutor(max_workers=10) as pool:
        pages = list(pool.map(requests_get_wrapper, urls))
```
| Bottleneck | Solution | Example Use Case |
|---|---|---|
| I/O-bound (async lib) | `asyncio` | LLM API calls, web scraping |
| I/O-bound (sync lib) | `ThreadPoolExecutor` | Database queries, file reads |
| CPU-bound | `ProcessPoolExecutor` | Embedding generation, data processing |
| Mixed | `asyncio` + `ProcessPoolExecutor` | RAG pipeline (API + preprocessing) |
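For the I/O-bound rows, firing every request at once can overwhelm a rate-limited API. A common refinement, sketched here under assumptions, is to cap the number of in-flight calls with `asyncio.Semaphore`. `fake_api_call` is a hypothetical stand-in that sleeps instead of touching the network, and the peak counter exists only to show that the limit holds.

```python
import asyncio


async def fake_api_call(i: int, delay: float = 0.01) -> int:
    """Hypothetical stand-in for a network request."""
    await asyncio.sleep(delay)
    return i * 2


async def gather_bounded(n: int, limit: int = 5) -> tuple[list[int], int]:
    """Run n calls with at most `limit` in flight; also return the peak concurrency seen."""
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def worker(i: int) -> int:
        nonlocal in_flight, peak
        async with sem:  # waits here while `limit` calls are already running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_api_call(i)
            finally:
                in_flight -= 1

    # gather() returns results in the order the awaitables were passed in.
    results = await asyncio.gather(*(worker(i) for i in range(n)))
    return list(results), peak


if __name__ == "__main__":
    results, peak = asyncio.run(gather_bounded(20, limit=5))
    print(f"{len(results)} results, peak concurrency {peak}")
```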
7. Memory Optimization
```python
import sys
from array import array


# --- HEAVY: Regular class ---
class PointHeavy:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z


# --- LIGHT: __slots__ (no __dict__, fixed attributes) ---
class PointLight:
    __slots__ = ['x', 'y', 'z']

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z


heavy = PointHeavy(1, 2, 3)
light = PointLight(1, 2, 3)
print(f"Regular: {sys.getsizeof(heavy) + sys.getsizeof(heavy.__dict__)} bytes")
print(f"Slots: {sys.getsizeof(light)} bytes")
# Slots typically uses ~40% less memory per instance


# --- Use generators for large data pipelines ---
def process_large_file(filepath: str):
    """Process line by line; never loads the entire file."""
    with open(filepath) as f:
        for line in f:  # File iteration keeps one line in memory
            yield line.strip()  # Lazy evaluation


# --- Use array instead of list for numeric data ---
nums_list = list(range(1_000_000))          # ~8 MB of pointers, plus the int objects
nums_array = array('i', range(1_000_000))   # ~4 MB (typed, compact)
```
8. String Optimization
```python
import time

names = ["Alice", "Bob", "Charlie"] * 100_000

# --- SLOW: String concatenation in a loop ---
start = time.perf_counter()
result = ""
for name in names:
    result += name + ", "  # May copy the growing string on each iteration
print(f"Concat: {time.perf_counter() - start:.4f}s")

# --- FAST: join() (single allocation) ---
start = time.perf_counter()
result = ", ".join(names)  # One allocation, C-optimized
print(f"Join: {time.perf_counter() - start:.4f}s")
# Can be 10-100x faster for large strings; CPython sometimes optimizes +=
# in place, so measure on your own workload

# --- FAST: f-strings (usually the fastest formatting) ---
name, age = "Alice", 30
# Concatenation: "Name: " + name + " Age: " + str(age)
# %-formatting:  "Name: %s Age: %d" % (name, age)
# str.format():  "Name: {} Age: {}".format(name, age)
# f-string:      f"Name: {name} Age: {age}"   (usually fastest and most readable)
```
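The relative speed of the formatting styles depends on the Python version and the values involved, so it is worth measuring rather than assuming a ranking. A minimal sketch with `timeit` follows; the `candidates` dict and `benchmark` helper are illustrative names.

```python
from timeit import timeit

name, age = "Alice", 30

# Each candidate builds the same string in a different style.
candidates = {
    "concat": lambda: "Name: " + name + " Age: " + str(age),
    "percent": lambda: "Name: %s Age: %d" % (name, age),
    "format": lambda: "Name: {} Age: {}".format(name, age),
    "f-string": lambda: f"Name: {name} Age: {age}",
}


def benchmark(number: int = 200_000) -> dict[str, float]:
    """Return total seconds taken by `number` calls of each candidate."""
    return {label: timeit(fn, number=number) for label, fn in candidates.items()}


if __name__ == "__main__":
    for label, seconds in sorted(benchmark().items(), key=lambda kv: kv[1]):
        print(f"{label:>9}: {seconds:.4f}s")
```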
9. Faster Libraries (Drop-in Replacements)
| Standard Library | Faster Alternative | Speedup | Install |
|---|---|---|---|
| `json` | `orjson` | 3–10x | `pip install orjson` |
| text | text | 2–5x | text |
| text | text | 5–20x (concurrent) | text |
| text | text | 10–50x | text |
| text | text | 2–10x | text |
| text | text | 1.5–3x | text |
| text | text | 2–5x | text |
| text | text | Faster parsing | text |
```python
# --- orjson: often several times faster than the standard json module ---
import json

import orjson

data = {"embeddings": [0.1] * 1536, "model": "text-embedding-3-small"}

# Standard json: dumps() returns str, so encode to get bytes
json_bytes = json.dumps(data).encode()

# orjson: dumps() returns bytes directly
orjson_bytes = orjson.dumps(data)
parsed = orjson.loads(orjson_bytes)
```
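Because `orjson` is a third-party dependency, some projects wrap it so the code still runs where it is not installed. The sketch below is one way to do that; `dumps_bytes` and `loads_any` are hypothetical helper names, not part of either library.

```python
import json

try:
    import orjson  # third-party; assumed to be installed with `pip install orjson`
except ImportError:
    orjson = None


def dumps_bytes(obj) -> bytes:
    """Serialize to compact UTF-8 JSON bytes, using orjson when available."""
    if orjson is not None:
        return orjson.dumps(obj)
    return json.dumps(obj, separators=(",", ":")).encode("utf-8")


def loads_any(data):
    """Parse JSON from bytes or str with whichever library is available."""
    if orjson is not None:
        return orjson.loads(data)
    return json.loads(data)


if __name__ == "__main__":
    payload = dumps_bytes({"model": "text-embedding-3-small", "values": [0.1, 0.2]})
    print(payload, loads_any(payload))
```

The two back ends can differ in output details, for example in how non-ASCII characters are escaped, so compare parsed values rather than raw bytes.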
10. JIT Compilation with Numba
For numerical code, Numba compiles a subset of Python and NumPy code to machine code at runtime.

```python
import time

import numpy as np
from numba import njit


# --- Pure Python (slow) ---
def cosine_similarity_python(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x ** 2 for x in a) ** 0.5
    norm_b = sum(x ** 2 for x in b) ** 0.5
    return dot / (norm_a * norm_b)


# --- Numba JIT (near C speed) ---
@njit
def cosine_similarity_numba(a, b):
    dot = 0.0
    norm_a = 0.0
    norm_b = 0.0
    for i in range(len(a)):
        dot += a[i] * b[i]
        norm_a += a[i] ** 2
        norm_b += b[i] ** 2
    return dot / (norm_a ** 0.5 * norm_b ** 0.5)


a = np.random.rand(1536)  # Typical embedding dimension
b = np.random.rand(1536)

# Warm up the JIT so compilation time is not measured
cosine_similarity_numba(a, b)

# Convert once, outside the timed loop, so only the computation is measured
a_list, b_list = a.tolist(), b.tolist()

start = time.perf_counter()
for _ in range(10_000):
    cosine_similarity_python(a_list, b_list)
print(f"Python: {time.perf_counter() - start:.4f}s")

start = time.perf_counter()
for _ in range(10_000):
    cosine_similarity_numba(a, b)
print(f"Numba: {time.perf_counter() - start:.4f}s")
# Numba is often 50-100x faster on loops like this
```
Performance Optimization Cheat Sheet
| Technique | Impact | Effort | When to Use |
|---|---|---|---|
| Profile first | Critical | Low | Always, before any optimization |
| Better algorithm/data structure | 10–1000x | Medium | When complexity is wrong (O(n²) → O(n)) |
| Built-in functions | 5–10x | Low | Replace manual loops with built-ins such as `sum()` |
| List comprehensions | 2–5x | Low | Replace `for` + `append` loops |
| Generators | Memory savings | Low | Large datasets iterated once |
| `lru_cache` | Huge (avoids recompute) | Low | Pure functions called with same args |
| `asyncio` | 5–50x for I/O | Medium | Multiple API calls, network requests |
| `ProcessPoolExecutor` | Up to linear with cores | Medium | CPU-heavy tasks (embeddings, parsing) |
| `__slots__` | 30–40% memory | Low | Many instances of same class |
| `orjson` | 3–10x JSON speed | Low | JSON-heavy applications |
| NumPy vectorization | 50–100x | Medium | Numerical computations |
| Numba JIT | 50–100x | Medium | Tight numerical loops |
| Cython | 10–100x | High | Performance-critical modules |
Golden Rule: Profile → fix the bottleneck → measure again. Don't optimize code that isn't slow. In Gen AI applications, the biggest wins usually come from async I/O for API calls, caching embeddings, and batch processing rather than micro-optimizing Python code.
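Batch processing, mentioned above, usually starts with splitting inputs into fixed-size groups so that each API call carries many items. A minimal sketch, assuming a hypothetical `chunked` helper (Python 3.12 also ships `itertools.batched` for the same purpose):

```python
from collections.abc import Iterable, Iterator
from itertools import islice


def chunked(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most `size` items without loading everything at once."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls up to `size` items; an empty list means the input is exhausted.
    while batch := list(islice(iterator, size)):
        yield batch


if __name__ == "__main__":
    texts = [f"document {i}" for i in range(7)]
    for batch in chunked(texts, 3):
        # A real pipeline would send each batch in a single embedding request here.
        print(len(batch), batch)
```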