Concept #196 | Medium | python-for-gen-ai | important

How to increase the performance of a Python application? Explain key optimization techniques with examples.

#python #performance #optimization #profiling #caching #concurrency #memory #numba #asyncio #multiprocessing

Answer

How to Increase Python Application Performance

Python is powerful, but CPython is generally slower than compiled languages for CPU-bound work. Optimizing a Python app requires a systematic approach: profile first, then apply the right technique at the right layer.


Performance Optimization Hierarchy


1. Profile Before Optimizing

Never guess — always measure first to find the actual bottleneck.

python
import cProfile
import time
from functools import wraps

# ─── Quick timer decorator ─────────────────────────────
def timer(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__}: {elapsed:.4f}s")
        return result
    return wrapper

# ─── cProfile for detailed breakdown ───────────────────
def slow_function():
    total = 0
    for i in range(1_000_000):
        total += i ** 2
    return total

cProfile.run('slow_function()', sort='cumulative')

# ─── line_profiler for line-by-line analysis ───────────
# pip install line_profiler
# @profile    <-- add this decorator
# kernprof -l -v script.py
| Profiling Tool | What It Measures | Best For |
| --- | --- | --- |
| `time.perf_counter()` | Wall clock time | Quick checks |
| `cProfile` | Function-level call counts and cumulative time | Finding slow functions |
| `line_profiler` | Line-by-line execution time | Pinpointing exact slow lines |
| `memory_profiler` | Memory usage per line | Finding memory leaks |
| `py-spy` | Sampling profiler (no code change) | Production profiling |
| `tracemalloc` | Memory allocation tracking | Tracking allocation sources |
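
When the full `cProfile.run` dump is too noisy, the report can be trimmed with the standard-library `pstats` module. The sketch below is one way to do that; `profile_call` is a hypothetical helper name, and the loop size is reduced so the example runs quickly.

```python
import cProfile
import io
import pstats


def slow_function() -> int:
    """Deliberately slow: sums squares in a pure-Python loop."""
    total = 0
    for i in range(200_000):
        total += i ** 2
    return total


def profile_call(func, *args, top: int = 5, **kwargs):
    """Run func under cProfile and return (result, text report).

    Hypothetical helper: keeps only the `top` entries, sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_call(slow_function)
print(report)
```

Returning the report as a string (instead of printing inside the helper) makes it easy to log or attach to a bug report.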

2. Algorithm & Data Structure Optimization

The single biggest impact — choosing the right data structure can be 100x–1000x faster.

python
import time

# ─── BAD: O(n) lookup in a list ────────────────────────
large_list = list(range(1_000_000))
start = time.perf_counter()
result = 999_999 in large_list  # Scans entire list
print(f"List lookup: {time.perf_counter() - start:.6f}s")
# ~0.015s

# ─── GOOD: O(1) lookup in a set ────────────────────────
large_set = set(range(1_000_000))
start = time.perf_counter()
result = 999_999 in large_set  # Hash-based instant lookup
print(f"Set lookup: {time.perf_counter() - start:.6f}s")
# ~0.000001s  (10,000x faster!)
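| Operation | `list` | `set` | `dict` | `deque` |
| --- | --- | --- | --- | --- |
| Lookup (`in`) | O(n) | O(1) | O(1) | O(n) |
| Append | O(1) | O(1) | O(1) | O(1) |
| Insert at front | O(n) | N/A | N/A | O(1) |
| Delete by value | O(n) | O(1) | O(1) | O(n) |
| Sort | O(n log n) | N/A | N/A | N/A |

For `dict`, lookup and delete refer to keys, not values. The O(1) figures are average-case (amortized for appends).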
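
The "insert at front" row applies equally to removing from the front, which is the usual queue pattern. The sketch below compares draining a `list` with `pop(0)` against a `deque` with `popleft()`. The size `N` is an arbitrary choice for illustration, and the timings will differ by machine.

```python
import timeit
from collections import deque

N = 20_000  # Arbitrary size, large enough to make the difference visible


def drain_list(n: int) -> int:
    """Pop from the front of a list: each pop(0) shifts every remaining item."""
    queue = list(range(n))
    processed = 0
    while queue:
        queue.pop(0)
        processed += 1
    return processed


def drain_deque(n: int) -> int:
    """Pop from the front of a deque: popleft() is O(1)."""
    queue = deque(range(n))
    processed = 0
    while queue:
        queue.popleft()
        processed += 1
    return processed


list_time = timeit.timeit(lambda: drain_list(N), number=3)
deque_time = timeit.timeit(lambda: drain_deque(N), number=3)
print(f"list.pop(0):     {list_time:.4f}s")
print(f"deque.popleft(): {deque_time:.4f}s")
```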

3. Use Built-in Functions & C-Optimized Libraries

Python's built-ins are implemented in C — always prefer them over manual loops.

python
import time

numbers = list(range(1_000_000))

# ─── SLOW: Python loop ────────────────────────────────
start = time.perf_counter()
total = 0
for n in numbers:
    total += n
print(f"Python loop: {time.perf_counter() - start:.4f}s")

# ─── FAST: Built-in sum() (C implementation) ──────────
start = time.perf_counter()
total = sum(numbers)
print(f"Built-in sum: {time.perf_counter() - start:.4f}s")
# 5-10x faster

# ─── FASTEST: NumPy (vectorized C/Fortran) ─────────────
import numpy as np
arr = np.arange(1_000_000)
start = time.perf_counter()
total = np.sum(arr)
print(f"NumPy sum: {time.perf_counter() - start:.4f}s")
# 50-100x faster than Python loop
| Instead of | Use | Speedup |
| --- | --- | --- |
| `for` loop to sum | `sum()` | 5–10x |
| `for` loop to filter | `filter()` / list comprehension | 2–5x |
| `for` loop to map | `map()` / list comprehension | 2–5x |
| Manual string concat (`+=`) | `''.join(list)` | 10–100x |
| Manual math on lists | `numpy` vectorized ops | 50–100x |
| `json.loads/dumps` | `orjson` or `ujson` | 3–10x |
| `re.match` in a loop | Pre-compile with `re.compile()` | 2–5x |
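
As an illustration of the last two kinds of rows, the sketch below combines a pre-compiled pattern with a list comprehension to filter log lines. The log format, `LOG_PATTERN`, and `slow_error_paths` are made up for this example.

```python
import re

# Compiled once at import time and reused for every line.
# The log format here is hypothetical: "<LEVEL> <latency>ms <path>".
LOG_PATTERN = re.compile(r"^(?P<level>[A-Z]+)\s+(?P<latency_ms>\d+)ms\s+(?P<path>\S+)$")

log_lines = [
    "INFO 12ms /health",
    "ERROR 250ms /embed",
    "malformed line",
    "WARN 90ms /search",
    "ERROR 410ms /chat",
]


def slow_error_paths(lines: list[str], threshold_ms: int) -> list[str]:
    """Return paths of ERROR lines at or above the latency threshold."""
    matches = (LOG_PATTERN.match(line) for line in lines)  # Lazy: no intermediate list
    return [
        m["path"]
        for m in matches
        if m and m["level"] == "ERROR" and int(m["latency_ms"]) >= threshold_ms
    ]


print(slow_error_paths(log_lines, 300))  # ['/chat']
```

Note that the `re` module also keeps a small internal cache of recently compiled patterns, so the gain from pre-compiling depends on how many distinct patterns the program uses.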

4. List Comprehensions & Generator Expressions

python
import sys

# ─── List comprehension (faster than loop) ─────────────
squares = [x ** 2 for x in range(1_000_000)]
print(f"List size: {sys.getsizeof(squares) / 1024 / 1024:.1f} MB")
# ~8 MB — all values in memory

# ─── Generator expression (memory efficient) ──────────
squares_gen = (x ** 2 for x in range(1_000_000))
print(f"Generator size: {sys.getsizeof(squares_gen)} bytes")
# ~200 bytes — values computed lazily on demand

# Use generators for large datasets you iterate once
total = sum(x ** 2 for x in range(1_000_000))  # No list created
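
Generators also compose into pipelines in which each stage pulls one item at a time from the previous stage. The following is a minimal sketch with made-up stage names (`read_records`, `count_words`) and an in-memory list standing in for a large file.

```python
from typing import Iterable, Iterator


def read_records(lines: Iterable[str]) -> Iterator[str]:
    """Stage 1: strip whitespace and skip blank lines."""
    for line in lines:
        stripped = line.strip()
        if stripped:
            yield stripped


def count_words(records: Iterable[str]) -> Iterator[int]:
    """Stage 2: turn each record into its word count."""
    for record in records:
        yield len(record.split())


# Stand-in for a file object; any iterable of strings works the same way.
raw = ["  hello world \n", "\n", "one two three\n", "   \n", "single\n"]

word_counts = count_words(read_records(raw))  # Nothing has run yet
total_words = sum(word_counts)                # Items flow through one at a time
print(total_words)  # 6
```

A generator can be consumed only once. After `sum()` finishes, `word_counts` is exhausted and must be rebuilt to iterate again.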

5. Caching & Memoization

Avoid recomputing the same result.

python
from functools import lru_cache, cache
import time

# ─── Without caching ───────────────────────────────────
def fibonacci_slow(n: int) -> int:
    if n < 2:
        return n
    return fibonacci_slow(n - 1) + fibonacci_slow(n - 2)

# fibonacci_slow(35) takes ~3 seconds (exponential calls)

# ─── With lru_cache (memoization) ──────────────────────
@lru_cache(maxsize=256)
def fibonacci_fast(n: int) -> int:
    if n < 2:
        return n
    return fibonacci_fast(n - 1) + fibonacci_fast(n - 2)

start = time.perf_counter()
result = fibonacci_fast(100)  # Instant! Cached results reused
print(f"fib(100): {time.perf_counter() - start:.6f}s")

# ─── Cache for API responses ──────────────────────────
@lru_cache(maxsize=1000)
def get_embedding_cached(text: str) -> tuple[float, ...]:
    """Cache embedding API calls to avoid re-computation."""
    # Expensive API call happens only once per unique text
    import openai  # pip install openai; the client reads OPENAI_API_KEY from the environment
    client = openai.OpenAI()
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    # Return a tuple so callers cannot mutate the cached value
    return tuple(response.data[0].embedding)
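
To check whether a cache is actually paying off, `lru_cache` exposes `cache_info()` and `cache_clear()` on the decorated function. The sketch below uses `fake_embedding`, a hypothetical stand-in for the expensive call, so it runs without network access.

```python
from functools import lru_cache

call_count = 0  # Counts how often the "expensive" body really runs


@lru_cache(maxsize=128)
def fake_embedding(text: str) -> tuple[float, ...]:
    """Hypothetical stand-in for an expensive embedding call."""
    global call_count
    call_count += 1
    return tuple(float(ord(ch)) for ch in text[:4])


queries = ["hello", "world", "hello", "hello", "world"]
vectors = [fake_embedding(q) for q in queries]

info = fake_embedding.cache_info()
print(info)        # CacheInfo(hits=3, misses=2, maxsize=128, currsize=2)
print(call_count)  # 2

fake_embedding.cache_clear()  # Drop all entries, e.g. after switching models
```

Arguments must be hashable, so pass strings or tuples rather than lists or dicts.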

6. Concurrency & Parallelism

Choose the right model based on your bottleneck.

python
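import asyncio
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

import aiohttp   # pip install aiohttp
import requests  # pip install requests

urls = ["https://httpbin.org/delay/1"] * 10

# ─── I/O-bound: asyncio (best for API calls) ──────────
async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    # The context manager releases the connection once the body is read
    async with session.get(url) as response:
        return await response.text()

async def fetch_all_async(urls: list[str]) -> list[str]:
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

# 10 requests in ~1s instead of ~10s

# ─── CPU-bound: ProcessPoolExecutor ────────────────────
def cpu_task(n: int) -> int:
    return sum(i * i for i in range(n))

# ─── I/O-bound: ThreadPoolExecutor ─────────────────────
def requests_get_wrapper(url: str) -> str:
    # Blocking call; threads overlap the time spent waiting on the network
    return requests.get(url, timeout=10).text

# The guard is required on platforms that spawn worker processes (Windows, macOS)
if __name__ == "__main__":
    pages = asyncio.run(fetch_all_async(urls))

    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(cpu_task, [5_000_000] * 4))
    # Up to ~4x speedup on 4 cores

    with ThreadPoolExecutor(max_workers=10) as pool:
        # Good for blocking I/O libraries that don't support async
        pages = list(pool.map(requests_get_wrapper, urls))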
| Bottleneck | Solution | Example Use Case |
| --- | --- | --- |
| I/O-bound (async lib) | `asyncio` | LLM API calls, web scraping |
| I/O-bound (sync lib) | `ThreadPoolExecutor` | Database queries, file reads |
| CPU-bound | `ProcessPoolExecutor` | Embedding generation, data processing |
| Mixed | `asyncio` + `ProcessPoolExecutor` | RAG pipeline (API + preprocessing) |
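
When many I/O-bound calls target a rate-limited API, unbounded `asyncio.gather` can send too many requests at once. A common remedy is an `asyncio.Semaphore`. The sketch below simulates the API with `asyncio.sleep`; `fake_api_call`, `run_bounded`, and the limit of 3 are assumptions for illustration.

```python
import asyncio


async def fake_api_call(item: int, semaphore: asyncio.Semaphore, state: dict) -> int:
    """Hypothetical stand-in for a network call; sleeps instead of doing I/O."""
    async with semaphore:  # At most `limit` coroutines are inside this block
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)
        state["active"] -= 1
        return item * 2


async def run_bounded(items: list[int], limit: int) -> tuple[list[int], int]:
    """Run all calls concurrently, but never more than `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(fake_api_call(item, semaphore, state) for item in items)
    )
    return list(results), state["peak"]


results, peak = asyncio.run(run_bounded(list(range(10)), limit=3))
print(results)  # Results come back in input order
print(f"Peak concurrency: {peak}")
```

`asyncio.gather` preserves input order, so the results line up with the inputs even though the calls finish at different times.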

7. Memory Optimization

python
import sys
from dataclasses import dataclass

# ─── HEAVY: Regular class ──────────────────────────────
class PointHeavy:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

# ─── LIGHT: __slots__ (no __dict__, fixed attributes) ──
class PointLight:
    __slots__ = ['x', 'y', 'z']
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

heavy = PointHeavy(1, 2, 3)
light = PointLight(1, 2, 3)

print(f"Regular: {sys.getsizeof(heavy) + sys.getsizeof(heavy.__dict__)} bytes")
print(f"Slots:   {sys.getsizeof(light)} bytes")
# Slots uses ~40% less memory per instance

# ─── Use generators for large data pipelines ──────────
def process_large_file(filepath: str):
    """Process line by line — never loads entire file."""
    with open(filepath) as f:
        for line in f:          # Generator — one line in memory
            yield line.strip()  # Lazy evaluation

# ─── Use array instead of list for numeric data ────────
from array import array

nums_list = list(range(1_000_000))          # ~8 MB
nums_array = array('i', range(1_000_000))   # ~4 MB (typed, compact)

8. String Optimization

python
import time

names = ["Alice", "Bob", "Charlie"] * 100_000

# ─── SLOW: String concatenation in loop ────────────────
start = time.perf_counter()
result = ""
for name in names:
    result += name + ", "  # Creates new string each time!
print(f"Concat: {time.perf_counter() - start:.4f}s")

# ─── FAST: join() (single allocation) ─────────────────
start = time.perf_counter()
result = ", ".join(names)  # One allocation, C-optimized
print(f"Join: {time.perf_counter() - start:.4f}s")
# 10-100x faster for large strings

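# ─── FAST: f-strings (usually the fastest formatting) ──
name, age = "Alice", 30
# Concatenation:  "Name: " + name + " Age: " + str(age)
# %-formatting:   "Name: %s Age: %d" % (name, age)
# str.format():   "Name: {} Age: {}".format(name, age)
# f-string:       f"Name: {name} Age: {age}"   (usually fastest and most readable)
# The relative order of the first three varies by Python version; measure with timeit.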
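
Formatting speed differs between Python versions, so it is worth measuring rather than assuming. The sketch below times the four styles with `timeit` and prints them from fastest to slowest. The labels and iteration counts are arbitrary choices, and no particular ordering is guaranteed.

```python
import timeit

name, age = "Alice", 30

# Each candidate builds the same string in a different style
candidates = {
    "concat": lambda: "Name: " + name + " Age: " + str(age),
    "percent": lambda: "Name: %s Age: %d" % (name, age),
    "format": lambda: "Name: {} Age: {}".format(name, age),
    "f-string": lambda: f"Name: {name} Age: {age}",
}

# Sanity check: all styles must produce identical output
outputs = {label: build() for label, build in candidates.items()}
assert len(set(outputs.values())) == 1

# Take the best of three runs to reduce noise from other processes
timings = {
    label: min(timeit.repeat(build, number=100_000, repeat=3))
    for label, build in candidates.items()
}

for label, seconds in sorted(timings.items(), key=lambda pair: pair[1]):
    print(f"{label:<10} {seconds:.4f}s")
```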

9. Faster Libraries (Drop-in Replacements)

| Standard Library | Faster Alternative | Speedup | Install |
| --- | --- | --- | --- |
| `json` | `orjson` | 3–10x | `pip install orjson` |
| `json` | `ujson` | 2–5x | `pip install ujson` |
| `requests` | `httpx` (async) | 5–20x (concurrent) | `pip install httpx` |
| `csv` | `polars` | 10–50x | `pip install polars` |
| `pandas` | `polars` | 2–10x | `pip install polars` |
| `re` | `regex` | 1.5–3x | `pip install regex` |
| `pickle` | `msgpack` | 2–5x | `pip install msgpack` |
| `datetime` | `pendulum` | Faster parsing | `pip install pendulum` |
python
# ─── orjson: typically 3–10x faster JSON ──────────────
import orjson
import json

data = {"embeddings": [0.1] * 1536, "model": "text-embedding-3-small"}

# Standard json
json_bytes = json.dumps(data).encode()

# orjson (returns bytes directly, much faster)
orjson_bytes = orjson.dumps(data)
parsed = orjson.loads(orjson_bytes)
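
Because `orjson` is a third-party package, one possible pattern is to use it when installed and fall back to the standard `json` module otherwise. The wrapper names `dumps_bytes` and `loads_any` are hypothetical. The sketch relies only on `orjson.dumps` returning `bytes` and `orjson.loads` accepting `bytes`.

```python
import json

try:
    import orjson  # Optional dependency: pip install orjson

    def dumps_bytes(obj) -> bytes:
        """Serialize with orjson, which returns bytes directly."""
        return orjson.dumps(obj)

    def loads_any(data):
        """Parse bytes or str with orjson."""
        return orjson.loads(data)

except ImportError:

    def dumps_bytes(obj) -> bytes:
        """Fallback: standard json, encoded to bytes for the same return type."""
        return json.dumps(obj, separators=(",", ":")).encode("utf-8")

    def loads_any(data):
        """Fallback: json.loads also accepts bytes."""
        return json.loads(data)


payload = {"model": "text-embedding-3-small", "embedding": [0.1, 0.2, 0.3]}
encoded = dumps_bytes(payload)
decoded = loads_any(encoded)
print(decoded == payload)  # True
```

Keeping both branches behind the same two functions means the rest of the code does not need to know which library is in use.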

10. JIT Compilation with Numba

For numerical code, Numba compiles Python to machine code at runtime.

python
from numba import njit
import numpy as np
import time

# ─── Pure Python (slow) ───────────────────────────────
def cosine_similarity_python(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x ** 2 for x in a) ** 0.5
    norm_b = sum(x ** 2 for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# ─── Numba JIT (near C speed) ─────────────────────────
@njit
def cosine_similarity_numba(a, b):
    dot = 0.0
    norm_a = 0.0
    norm_b = 0.0
    for i in range(len(a)):
        dot += a[i] * b[i]
        norm_a += a[i] ** 2
        norm_b += b[i] ** 2
    return dot / (norm_a ** 0.5 * norm_b ** 0.5)

a = np.random.rand(1536)  # Typical embedding dimension
b = np.random.rand(1536)

# Warm up JIT
cosine_similarity_numba(a, b)

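# Convert once so the timing measures the computation, not the conversion
a_list, b_list = a.tolist(), b.tolist()

start = time.perf_counter()
for _ in range(10_000):
    cosine_similarity_python(a_list, b_list)
print(f"Python: {time.perf_counter() - start:.4f}s")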

start = time.perf_counter()
for _ in range(10_000):
    cosine_similarity_numba(a, b)
print(f"Numba: {time.perf_counter() - start:.4f}s")
# Numba is 50-100x faster
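
Before reaching for a JIT, it is often enough to express the computation as NumPy array operations, especially when one query is compared against many vectors at once. This sketch is an alternative to the Numba loop above, not a replacement for it. `cosine_similarity_batch` is a made-up name, and the random data stands in for real embeddings.

```python
import numpy as np


def cosine_similarity_batch(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and every row of matrix.

    Assumes no row (and not the query) is the zero vector.
    """
    query_norm = np.linalg.norm(query)
    row_norms = np.linalg.norm(matrix, axis=1)
    return (matrix @ query) / (row_norms * query_norm)


rng = np.random.default_rng(seed=0)
documents = rng.random((1_000, 1536))  # 1,000 stand-in embeddings
query = documents[42].copy()           # Query identical to document 42

scores = cosine_similarity_batch(query, documents)
best = int(np.argmax(scores))
print(best)  # 42
```

A single matrix-vector product replaces 1,000 separate function calls, which is where most of the gain comes from.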

Performance Optimization Cheat Sheet

| Technique | Impact | Effort | When to Use |
| --- | --- | --- | --- |
| Profile first | Critical | Low | Always, before any optimization |
| Better algorithm/data structure | 10–1000x | Medium | When complexity is wrong (O(n²) → O(n)) |
| Built-in functions | 5–10x | Low | Replace manual loops with `sum`, `map`, `filter` |
| List comprehensions | 2–5x | Low | Replace `for` loop + `append` |
| Generators | Memory savings | Low | Large datasets iterated once |
| `lru_cache` | Huge (avoids recompute) | Low | Pure functions called with same args |
| `asyncio` | 5–50x for I/O | Medium | Multiple API calls, network requests |
| `multiprocessing` | Linear with cores | Medium | CPU-heavy tasks (embeddings, parsing) |
| `__slots__` | 30–40% memory | Low | Many instances of same class |
| `orjson` | 3–10x JSON speed | Low | JSON-heavy applications |
| NumPy vectorization | 50–100x | Medium | Numerical computations |
| Numba JIT | 50–100x | Medium | Tight numerical loops |
| Cython | 10–100x | High | Performance-critical modules |

Golden Rule: Profile → fix the bottleneck → measure again. Don't optimize code that isn't slow. In Gen AI applications, the biggest wins usually come from async I/O for API calls, caching embeddings, and batch processing rather than micro-optimizing Python code.
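
The batch-processing point can be sketched as follows: group inputs into fixed-size chunks so that each API request carries several items. `embed_batch` is a hypothetical stand-in for a real batched endpoint, and the batch size of 3 is arbitrary.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def embed_batch(texts: list[str]) -> list[int]:
    """Hypothetical stand-in for one batched API request."""
    return [len(text) for text in texts]


texts = [f"doc {i}" for i in range(7)]

request_count = 0
embeddings: list[int] = []
for batch in batched(texts, size=3):
    request_count += 1  # One request per batch instead of one per text
    embeddings.extend(embed_batch(batch))

print(request_count)    # 3
print(len(embeddings))  # 7
```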
