What are all the Python libraries used for AI Engineering (Agent Development, Fine-Tuning LLM, etc.) and what are they used for?

Question

Accepted Answer

## Python Libraries for AI Engineering

The AI engineering ecosystem is built on a rich set of Python libraries. Here is a comprehensive breakdown by use case.

---

## 1. LLM API Clients

Libraries for calling hosted LLM APIs directly.

| Library | Install | What It Does |
|---------|---------|--------------|
| `openai` | `pip install openai` | Official client for GPT-4o, o1, Whisper, DALL-E, Embeddings |
| `anthropic` | `pip install anthropic` | Official client for Claude 3.5/4 models |
| `google-generativeai` | `pip install google-generativeai` | Official client for Gemini models (legacy SDK; Google now recommends `google-genai`) |
| `boto3` | `pip install boto3` | AWS SDK — access Claude, Llama, Titan via Amazon Bedrock |
| `litellm` | `pip install litellm` | Unified interface for 100+ LLM providers with one API |

```python
# litellm: call any provider with the same code.
# Assumes OPENAI_API_KEY, ANTHROPIC_API_KEY, and GEMINI_API_KEY are set in the environment.
from litellm import completion

messages = [{"role": "user", "content": "Hello"}]

response = completion(model="gpt-4o", messages=messages)
response = completion(model="anthropic/claude-3-5-sonnet-20240620", messages=messages)
response = completion(model="gemini/gemini-1.5-pro", messages=messages)
print(response.choices[0].message.content)
```

---

## 2. LLM Orchestration Frameworks

High-level frameworks for chaining LLM calls, tools, and data sources.

| Library | What It Does |
|---------|--------------|
| `langchain` | Most popular framework — chains, agents, RAG, tools, memory |
| `llama-index` | Optimised for data ingestion and RAG pipelines over documents |
| `haystack` | Production-grade NLP pipelines with components for RAG and search |
| `dspy` | Replaces prompt engineering with compiled, optimised prompt programs |
| `guidance` | Structured generation — enforces output format constraints (JSON, regex) |

```python
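# LangChain: simple RAG chain over a previously saved FAISS index
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Must be the same embedding model that was used to build the index
embeddings = OpenAIEmbeddings()

# Recent versions require an explicit opt-in before loading the pickled index
vectorstore = FAISS.load_local("index", embeddings, allow_dangerous_deserialization=True)

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o"),
    retriever=vectorstore.as_retriever(),
)
answer = qa_chain.invoke({"query": "What is LoRA?"})
print(answer["result"])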
```

```python
# LlamaIndex — document Q&A
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the key findings")
```

---

## 3. Agent Development

Libraries for building autonomous, multi-step AI agents.

| Library | What It Does |
|---------|--------------|
| `langgraph` | Graph-based agent framework — nodes, edges, state machines for complex workflows |
| `autogen` | Microsoft's multi-agent framework — agents collaborate via conversation |
| `crewai` | Role-based multi-agent teams with task delegation and memory |
| `smolagents` | HuggingFace's lightweight agent library — code-executing agents |
| `agno` | Fast multi-modal agent framework with built-in memory and knowledge stores |
| `openai-agents` | OpenAI's official agent SDK with handoffs, tracing, and guardrails |

```python
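# LangGraph: stateful two-node graph (LLM step, then tool step)
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict):
    messages: list
    tool_output: str

def call_llm(state: AgentState):
    # Placeholder: replace with a real LLM call on state["messages"]
    llm_response = {"role": "assistant", "content": "call the search tool"}
    return {"messages": state["messages"] + [llm_response]}

def use_tool(state: AgentState):
    # Placeholder: run whichever tool the LLM asked for
    tool_result = "search results"
    return {"tool_output": tool_result}

graph = StateGraph(AgentState)
graph.add_node("llm", call_llm)
graph.add_node("tool", use_tool)
graph.add_edge(START, "llm")   # entry point; the graph will not compile without one
graph.add_edge("llm", "tool")
graph.add_edge("tool", END)
agent = graph.compile()

result = agent.invoke({"messages": [{"role": "user", "content": "Hi"}], "tool_output": ""})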
```

```python
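# CrewAI: role-based multi-agent team (backstory and expected_output are required fields)
from crewai import Agent, Task, Crew

researcher = Agent(role="Researcher", goal="Find relevant info",
                   backstory="Tracks LLM fine-tuning research.", llm="gpt-4o")
writer = Agent(role="Writer", goal="Write clear summaries",
               backstory="Explains ML concepts plainly.", llm="gpt-4o")

research = Task(description="Research PEFT methods",
                expected_output="Bullet-point notes on PEFT methods", agent=researcher)
summary = Task(description="Summarise the research notes",
               expected_output="A one-paragraph summary", agent=writer)

crew = Crew(agents=[researcher, writer], tasks=[research, summary])
result = crew.kickoff()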
```

---

## 4. Fine-Tuning LLMs

Libraries for training and adapting LLM weights.

| Library | What It Does |
|---------|--------------|
| `torch` | PyTorch — core deep learning framework for tensor ops, model building, and training |
| `torch.cuda` | CUDA GPU support built into PyTorch — `.to("cuda")` moves tensors/models to GPU |
| `transformers` | HuggingFace — load, train, and inference any open-source LLM |
| `peft` | Parameter-Efficient Fine-Tuning — LoRA, QLoRA, prefix tuning, adapters |
| `trl` | TRL (Transformer Reinforcement Learning) — SFT, DPO, RLHF, PPO training |
| `bitsandbytes` | 4-bit / 8-bit quantization — enables QLoRA on consumer GPUs |
| `unsloth` | Optimised LoRA/QLoRA fine-tuning; the project reports roughly 2x faster training with up to 70% less VRAM |
| `axolotl` | Config-driven fine-tuning wrapper around HuggingFace + PEFT |
| `datasets` | HuggingFace datasets library — load, process, and stream training data |

```python
# LoRA supervised fine-tuning with TRL + PEFT
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from trl import SFTTrainer, SFTConfig

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
# SFTTrainer loads the matching tokenizer from the checkpoint when none is passed

# Any dataset with a "text" column works; Alpaca is used here as an example
dataset = load_dataset("tatsu-lab/alpaca", split="train")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="./output", num_train_epochs=3),
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```
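To see why LoRA is called parameter-efficient, it helps to count what a single adapter trains. The sketch below uses the standard LoRA decomposition (a `d_out x r` matrix times an `r x d_in` matrix) with `r=16` as in the config above. The hidden size of 3072 and the helper name `lora_params` are illustrative assumptions, not values read from any model config.

```python
def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters in one LoRA adapter: B is (d_out x r), A is (r x d_in)."""
    return d_out * r + r * d_in

d_model = 3072   # assumed hidden size, for illustration only
rank = 16        # matches r=16 in the LoraConfig above

full = d_model * d_model                    # parameters in one frozen square projection
lora = lora_params(d_model, d_model, rank)  # parameters the adapter actually trains

print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

For a square projection the ratio works out to `2r / d_model`, so doubling the rank doubles the trainable parameters while the frozen weights stay untouched.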

```python
# PyTorch + CUDA: device-agnostic setup (assumes `model` and `input_ids` already exist)
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")              # e.g. cuda
if device == "cuda":
    print(torch.cuda.get_device_name(0))      # e.g. NVIDIA A100-SXM4-80GB

# Move model and tensors to the selected device
model = model.to(device)
input_ids = input_ids.to(device)

# Mixed precision (faster training, less VRAM)
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    # Causal LMs from transformers return a loss only when labels are passed
    outputs = model(input_ids, labels=input_ids)
    loss = outputs.loss
```

```python
# Unsloth — faster fine-tuning
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True
)
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32)
```
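The reason 4-bit loading matters on consumer GPUs is simple arithmetic on weight storage. The following is a back-of-the-envelope sketch, assuming an 8-billion-parameter model; it counts weights only and ignores quantization constants, activations, optimizer state, and the KV cache, so real usage will be higher. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 8_000_000_000  # assumed model size, for illustration

bf16_gb = weight_memory_gb(n_params, 16)  # 2 bytes per weight
int8_gb = weight_memory_gb(n_params, 8)   # 1 byte per weight
nf4_gb = weight_memory_gb(n_params, 4)    # half a byte per weight

print(f"bf16: {bf16_gb:.1f} GB, 8-bit: {int8_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take a quarter of the bf16 footprint, which is what leaves room for LoRA adapters and activations on a single smaller card.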

---

## 5. Embeddings & Vector Stores

Libraries for generating embeddings and storing/searching them.

| Library | What It Does |
|---------|--------------|
| `sentence-transformers` | Generate dense text embeddings locally (all-MiniLM, BGE, E5 models) |
| `tiktoken` | OpenAI's fast tokenizer — count tokens before API calls |
| `faiss-cpu` / `faiss-gpu` | Meta's high-speed vector similarity search library |
| `chromadb` | Open-source local vector database — easy setup, metadata filtering |
| `pinecone` (formerly `pinecone-client`) | Managed cloud vector database for production scale |
| `weaviate-client` | Open-source vector DB with hybrid search (BM25 + semantic) |
| `qdrant-client` | High-performance vector DB with filtering and payload support |

```python
# sentence-transformers + FAISS local vector search
from sentence_transformers import SentenceTransformer
import faiss, numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["LoRA is a PEFT method", "RAG retrieves context", "Transformers use attention"]

embeddings = model.encode(docs).astype(np.float32)
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

query_vec = model.encode(["what is LoRA?"]).astype(np.float32)
distances, indices = index.search(query_vec, k=2)
print([docs[i] for i in indices[0]])
```
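Conceptually, a flat L2 index like the one above does an exhaustive nearest-neighbour scan: compute the squared Euclidean distance from the query to every stored vector and keep the `k` smallest. Here is a dependency-free sketch of that idea with made-up 2-D vectors; the function names are hypothetical, and real libraries do the same work with vectorised, optimised kernels.

```python
from typing import List, Sequence, Tuple

def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors: List[Sequence[float]], query: Sequence[float],
                   k: int) -> List[Tuple[float, int]]:
    """Brute-force search: return (distance, index) pairs for the k closest vectors."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    return scored[:k]

# Toy 2-D "embeddings" (illustrative values, not real model output)
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
hits = flat_l2_search(toy_vectors, [0.9, 1.2], k=2)
nearest_ids = [i for _, i in hits]
print(nearest_ids)
```

The scan is linear in the number of stored vectors, which is why larger collections usually move to approximate indexes or a dedicated vector database.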

---

## 6. Model Serving & Inference

Libraries for deploying and serving LLMs in production.

| Library | What It Does |
|---------|--------------|
| `vllm` | High-throughput LLM serving with PagedAttention — OpenAI-compatible API |
| `ollama` | Run quantized GGUF models locally via a simple REST API |
| `llama-cpp-python` | Python bindings for llama.cpp — CPU/GPU inference of GGUF models |
| `bentoml` | Build and containerize ML model APIs with GPU support |
| `ray[serve]` | Distributed model serving and autoscaling across multiple nodes |

```python
# vLLM: offline batch inference (run `vllm serve <model>` instead for the OpenAI-compatible server)
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-3B")
outputs = llm.generate(
    ["Explain attention mechanisms"],
    SamplingParams(temperature=0.7, max_tokens=200)
)
print(outputs[0].outputs[0].text)
```
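Because the vLLM server speaks an OpenAI-compatible HTTP API, any HTTP client can query it. The sketch below only builds the request with the standard library. It assumes a server was started separately (for example with `vllm serve meta-llama/Llama-3.2-3B`) and is listening on the default `http://localhost:8000`. The helper name `build_completion_request` is hypothetical.

```python
import json
import urllib.request

def build_completion_request(prompt: str, model: str,
                             base_url: str = "http://localhost:8000/v1") -> urllib.request.Request:
    """Build a POST request for the OpenAI-style /completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 200, "temperature": 0.7}
    return urllib.request.Request(
        url=f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_completion_request("Explain attention mechanisms", "meta-llama/Llama-3.2-3B")

# Sending it requires a running server, so it is left commented out here:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```

The same request shape works with the official `openai` client by pointing its `base_url` at the local server.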

---

## 7. Evaluation & Observability

Libraries for testing, evaluating, and monitoring LLM applications.

| Library | What It Does |
|---------|--------------|
| `ragas` | RAG evaluation — faithfulness, answer relevancy, context recall |
| `deepeval` | LLM unit testing framework — assert hallucination, toxicity, correctness |
| `langsmith` | LangChain's observability platform — trace every LLM call, debug chains, dataset evals |
| `langfuse` | Open-source LLM observability — tracing, cost tracking, prompt versioning |
| `wandb` | Weights & Biases — track fine-tuning runs, log metrics, compare experiments |
| `mlflow` | MLOps platform — experiment tracking, model registry, serving |

```python
# LangSmith — trace every LLM call automatically
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "my-rag-app"

# All LangChain calls are now automatically traced in LangSmith dashboard
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")
response = llm.invoke("What is LoRA?")
# → Trace appears in LangSmith with latency, tokens, cost, inputs/outputs
```

```python
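# RAGAS: evaluate RAG pipeline quality
# (column names follow the ragas 0.1.x schema; the metrics call an LLM judge, so an API key is needed)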
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from datasets import Dataset

eval_data = Dataset.from_dict({
    "question": ["What is LoRA?"],
    "answer": ["LoRA is a PEFT method using low-rank matrices."],
    "contexts": [["LoRA adds trainable low-rank matrices to frozen attention layers."]],
    "ground_truth": ["LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method."]
})

results = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_recall])
print(results)
```

```python
# DeepEval — LLM unit tests
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is LoRA?",
    actual_output="LoRA is a PEFT method.",
    context=["LoRA adds trainable matrices to frozen layers."]
)
assert_test(test_case, [HallucinationMetric(threshold=0.5), AnswerRelevancyMetric(threshold=0.7)])
```

---

## 8. Data Processing for AI

Libraries for preparing training data and working with tokens.

| Library | What It Does |
|---------|--------------|
| `datasets` | HuggingFace datasets — load 50K+ public ML datasets, stream large files |
| `tokenizers` | HuggingFace fast tokenizer library — BPE, WordPiece, Unigram |
| `tiktoken` | OpenAI's tokenizer — estimate token counts for GPT models |
| `docling` | IBM's document parser — PDFs, DOCX, HTML to clean markdown for RAG |
| `unstructured` | Extract text from any file type (PDF, email, images, slides) for ingestion |

```python
# datasets: load and preprocess training data
from datasets import load_dataset
from transformers import AutoTokenizer

alpaca = load_dataset("tatsu-lab/alpaca", split="train")                  # 52K instruction examples
custom = load_dataset("json", data_files="custom.jsonl", split="train")  # your own JSONL file

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

# Tokenize the "text" column for fine-tuning
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = alpaca.map(tokenize, batched=True)
```
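The `custom.jsonl` file loaded above is plain JSON Lines: one JSON object per line, each with the fields your training code expects. A minimal sketch of writing and re-reading such a file with the standard library is shown here; the records, the `text` field, and the helper names are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path

def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Illustrative records; real data would come from your own corpus
records = [
    {"text": "### Instruction: Define LoRA.\n### Response: A low-rank adaptation method."},
    {"text": "### Instruction: Define RAG.\n### Response: Retrieval-augmented generation."},
]

out_path = Path(tempfile.mkdtemp()) / "custom.jsonl"
write_jsonl(records, out_path)
loaded = read_jsonl(out_path)
print(len(loaded), "records written to", out_path)
```

Newlines inside a field are escaped by `json.dumps`, so each record still occupies exactly one line of the file.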

---

## Quick Reference by Task

| Task | Recommended Libraries |
|------|-----------------------|
| **Call GPT/Claude/Gemini** | `openai`, `anthropic`, `litellm` |
| **Build RAG pipelines** | `langchain`, `llama-index`, `haystack` |
| **Build agents** | `langgraph`, `crewai`, `autogen`, `smolagents` |
| **Core deep learning / GPU** | `torch`, `torch.cuda` |
| **Fine-tune LLMs** | `transformers` + `peft` + `trl` |
| **Fine-tune fast / low VRAM** | `unsloth` |
| **Generate embeddings** | `sentence-transformers` |
| **Vector search** | `faiss`, `chromadb`, `qdrant-client` |
| **Serve models** | `vllm`, `ollama`, `bentoml` |
| **Evaluate RAG** | `ragas`, `deepeval` |
| **Track experiments** | `wandb`, `mlflow`, `langfuse`, `langsmith` |
| **Prepare datasets** | `datasets`, `unstructured`, `docling` |

> **Interview tip:** Know the distinction between **orchestration** (`langchain`, `llama-index`) and **agent** (`langgraph`, `crewai`) libraries — they are often confused. Also know that `trl` + `peft` is the standard fine-tuning stack, and `vllm` is the go-to serving framework for production.
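One way to make the orchestration-versus-agent distinction concrete: a chain runs a fixed sequence of steps, while an agent loops and lets a decision function pick the next action at run time. The following is a library-free sketch under that framing. `decide` stands in for an LLM call and the calculator tool is a toy; neither reflects any framework's actual API.

```python
from typing import Callable, Dict, List, Tuple

def run_chain(value: str, steps: List[Callable[[str], str]]) -> str:
    """Orchestration: the step order is fixed by the developer."""
    for step in steps:
        value = step(value)
    return value

def run_agent(question: str, decide: Callable[[List[str]], Tuple[str, str]],
              tools: Dict[str, Callable[[str], str]], max_steps: int = 5) -> str:
    """Agent: a decision function chooses the next tool (or 'finish') each turn."""
    history = [question]
    for _ in range(max_steps):
        action, argument = decide(history)
        if action == "finish":
            return argument
        history.append(tools[action](argument))
    return history[-1]  # step budget exhausted; return the last observation

# Toy tool and toy policy standing in for a real LLM decision
tools = {"calculator": lambda expr: str(sum(int(x) for x in expr.split("+")))}

def decide(history: List[str]) -> Tuple[str, str]:
    if len(history) == 1:                 # nothing computed yet: call the tool
        return "calculator", "2+3"
    return "finish", f"The answer is {history[-1]}"

chain_result = run_chain("  what is lora?  ", [str.strip, str.upper])
agent_result = run_agent("What is 2+3?", decide, tools)
print(chain_result, "|", agent_result)
```

The `max_steps` cap is the kind of guardrail agent frameworks add, since a loop driven by model decisions has no guaranteed stopping point.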

Learn more at [HuggingFace Docs](https://huggingface.co/docs), [LangChain Docs](https://docs.langchain.com), and [LangGraph Docs](https://langchain-ai.github.io/langgraph/).
