Concept #157 · Medium · python-for-gen-ai

What are all the Python libraries used for AI Engineering (Agent Development, Fine-Tuning LLM, etc.) and what are they used for?

#python #libraries #langchain #llama-index #peft #trl #langgraph #crewai #vllm #ragas #transformers #torch #cuda #langsmith

Answer

Python Libraries for AI Engineering

The AI engineering ecosystem is built on a rich set of Python libraries. Here is a comprehensive breakdown by use case.


1. LLM API Clients

Libraries for calling hosted LLM APIs directly.

| Library | Install | What It Does |
| --- | --- | --- |
| `openai` | `pip install openai` | Official client for GPT-4o, o1, Whisper, DALL-E, Embeddings |
| `anthropic` | `pip install anthropic` | Official client for Claude 3.5/4 models |
| `google-generativeai` | `pip install google-generativeai` | Google's client for Gemini models (legacy SDK; Google now recommends `google-genai`) |
| `boto3` | `pip install boto3` | AWS SDK; gives access to Claude, Llama, and Titan models via Amazon Bedrock |
| `litellm` | `pip install litellm` | Unified interface for 100+ LLM providers with one API |
python
# litellm — call any provider with the same code
from litellm import completion

response = completion(model="gpt-4o", messages=[{"role": "user", "content": "Hello"}])
response = completion(model="anthropic/claude-3-5-sonnet-20240620", messages=[{"role": "user", "content": "Hello"}])
response = completion(model="gemini/gemini-1.5-pro", messages=[{"role": "user", "content": "Hello"}])
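
Because every provider is reached through the same call shape, falling back to a second model when the first one fails takes only a small loop. The sketch below assumes a `complete` callable that accepts the same `model` and `messages` keyword arguments as `litellm.completion`. `complete_with_fallback` and the stub `fake_complete` are hypothetical names used for illustration and are not part of litellm.

```python
from typing import Callable, Sequence


def complete_with_fallback(
    complete: Callable[..., object],
    models: Sequence[str],
    messages: list[dict],
) -> tuple[str, object]:
    """Try each model in order; return (model_name, response) for the first success."""
    errors = []
    for model in models:
        try:
            return model, complete(model=model, messages=messages)
        except Exception as exc:  # provider-specific errors vary, so catch broadly
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


# Stub standing in for litellm.completion so the sketch runs offline.
def fake_complete(model: str, messages: list[dict]) -> str:
    if model == "gpt-4o":
        raise TimeoutError("simulated outage")
    return f"[{model}] reply to: {messages[-1]['content']}"


used_model, reply = complete_with_fallback(
    fake_complete,
    ["gpt-4o", "gemini/gemini-1.5-pro"],
    [{"role": "user", "content": "Hello"}],
)
print(used_model, reply)
```

In real use, `litellm.completion` would be passed as `complete`. litellm also documents its own routing and fallback options, which are likely a better fit than a hand-rolled loop in production.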

2. LLM Orchestration Frameworks

High-level frameworks for chaining LLM calls, tools, and data sources.

| Library | What It Does |
| --- | --- |
| `langchain` | Most popular framework: chains, agents, RAG, tools, memory |
| `llama-index` | Optimised for data ingestion and RAG pipelines over documents |
| `haystack` | Production-grade NLP pipelines with components for RAG and search |
| `dspy` | Replaces prompt engineering with compiled, optimised prompt programs |
| `guidance` | Structured generation; enforces output format constraints (JSON, regex) |
python
# LangChain — simple RAG chain
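from langchain.chains import RetrievalQA  # legacy chain, deprecated in newer LangChain releases
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# The saved index must have been built with the same embedding model.
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.load_local(
    "index", embeddings,
    allow_dangerous_deserialization=True,  # needed by recent versions; only load indexes you trust
)

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o"),
    retriever=vectorstore.as_retriever(),
)
answer = qa_chain.invoke({"query": "What is LoRA?"})  # the text is in answer["result"]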
python
# LlamaIndex — document Q&A
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the key findings")

3. Agent Development

Libraries for building autonomous, multi-step AI agents.

| Library | What It Does |
| --- | --- |
| `langgraph` | Graph-based agent framework: nodes, edges, state machines for complex workflows |
| `autogen` | Microsoft's multi-agent framework; agents collaborate via conversation |
| `crewai` | Role-based multi-agent teams with task delegation and memory |
| `smolagents` | HuggingFace's lightweight agent library for code-executing agents |
| `agno` | Fast multi-modal agent framework with built-in memory and knowledge stores |
| `openai-agents` | OpenAI's official agent SDK with handoffs, tracing, and guardrails |
python
# LangGraph — stateful agent with tools
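from langgraph.graph import StateGraph, START, END
from typing import TypedDict

class AgentState(TypedDict):
    messages: list
    tool_output: str

def call_llm(state: AgentState):
    # Placeholder: replace with a real LLM call on state["messages"]
    llm_response = {"role": "assistant", "content": "stub reply"}
    return {"messages": state["messages"] + [llm_response]}

def use_tool(state: AgentState):
    # Placeholder: replace with a real tool call chosen by the LLM
    tool_result = "stub tool result"
    return {"tool_output": tool_result}

graph = StateGraph(AgentState)
graph.add_node("llm", call_llm)
graph.add_node("tool", use_tool)
graph.add_edge(START, "llm")   # entry point; the graph cannot compile without one
graph.add_edge("llm", "tool")
graph.add_edge("tool", END)
agent = graph.compile()
result = agent.invoke({"messages": [{"role": "user", "content": "Hi"}], "tool_output": ""})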
python
# CrewAI — role-based multi-agent team
from crewai import Agent, Task, Crew

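researcher = Agent(role="Researcher", goal="Find relevant info",
                   backstory="Experienced ML research analyst", llm="gpt-4o")
writer = Agent(role="Writer", goal="Write clear summaries",
               backstory="Technical writer for ML audiences", llm="gpt-4o")

# Agent needs a backstory and Task needs an expected_output; both are required fields
research = Task(description="Research PEFT methods",
                expected_output="Bullet-point notes on PEFT methods", agent=researcher)
summary = Task(description="Summarise the research notes",
               expected_output="A one-paragraph summary", agent=writer)
crew = Crew(agents=[researcher, writer], tasks=[research, summary])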
result = crew.kickoff()
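
The node/edge/state idea behind graph-based agents can be sketched without any framework. The toy executor below is an illustration only and is not how LangGraph is implemented: `run_graph`, `route`, and the stub nodes are hypothetical names, and the "LLM" is a hard-coded function that asks for a tool once and then answers.

```python
from typing import Callable

State = dict
END = "__end__"


def llm_node(state: State) -> State:
    # Stub LLM: request the tool once, then answer using the tool output.
    if state.get("tool_output") is None:
        return {**state, "next_action": "tool"}
    return {**state, "next_action": "finish",
            "answer": f"The result is {state['tool_output']}"}


def tool_node(state: State) -> State:
    # Stub tool: add the two numbers stored in the state.
    a, b = state["numbers"]
    return {**state, "tool_output": a + b}


def route(state: State) -> str:
    # Conditional edge: decide where to go after the LLM node.
    return "tool" if state["next_action"] == "tool" else END


def run_graph(state: State, max_steps: int = 10) -> State:
    nodes: dict[str, Callable[[State], State]] = {"llm": llm_node, "tool": tool_node}
    current = "llm"
    for _ in range(max_steps):
        state = nodes[current](state)
        # After the tool runs, control always returns to the LLM.
        current = route(state) if current == "llm" else "llm"
        if current == END:
            return state
    raise RuntimeError("step limit reached without finishing")


final_state = run_graph({"numbers": (2, 3), "tool_output": None})
print(final_state["answer"])
```

The loop between the LLM node and the tool node is what separates an agent from a fixed chain: the path through the graph depends on the state, not on a predetermined sequence.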

4. Fine-Tuning LLMs

Libraries for training and adapting LLM weights.

| Library | What It Does |
| --- | --- |
| `torch` | PyTorch: core deep learning framework for tensor ops, model building, and training |
| `torch.cuda` | CUDA GPU support built into PyTorch; `.to("cuda")` moves tensors/models to GPU |
| `transformers` | HuggingFace library to load, train, and run inference with open-weight LLMs |
| `peft` | Parameter-Efficient Fine-Tuning: LoRA, QLoRA, prefix tuning, adapters |
| `trl` | TRL (Transformer Reinforcement Learning): SFT, DPO, RLHF, PPO training |
| `bitsandbytes` | 4-bit / 8-bit quantization; enables QLoRA on consumer GPUs |
| `unsloth` | Faster LoRA/QLoRA fine-tuning; the project advertises about 2x speed and up to 70% less VRAM |
| `axolotl` | Config-driven fine-tuning wrapper around HuggingFace + PEFT |
| `datasets` | HuggingFace datasets library: load, process, and stream training data |
python
# End-to-end LoRA fine-tuning pipeline with TRL + PEFT
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

# Alpaca ships a ready-made "text" column, which SFTTrainer reads by default
dataset = load_dataset("tatsu-lab/alpaca", split="train")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="./output", num_train_epochs=3),
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
)
trainer.train()
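
To see why LoRA counts as parameter-efficient, it helps to count parameters. For a weight matrix of shape d_out × d_in, LoRA trains two small matrices, B (d_out × r) and A (r × d_in), and scales their product by `lora_alpha / r`. The sketch below applies that arithmetic to a hypothetical 4096 × 4096 projection; the layer size is an assumption for illustration, not a property of any specific model.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)  # A is r x d_in, B is d_out x r


d_in = d_out = 4096          # assumed layer size, for illustration only
r, lora_alpha = 16, 32       # same values as the LoraConfig above

full = d_in * d_out
lora = lora_param_count(d_in, d_out, r)
scaling = lora_alpha / r

print(f"full matrix:  {full:,} params")
print(f"LoRA adapter: {lora:,} params ({lora / full:.2%} of full)")
print(f"scaling factor lora_alpha / r = {scaling}")
```

Under these assumptions the adapter holds well under 1% of the parameters of the matrix it modifies, which is where the memory savings during training come from.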
python
# PyTorch + CUDA — device-agnostic training setup
import torch

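# `model` and `input_ids` are assumed to exist already (e.g. from the example above)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")              # e.g. cuda
if device == "cuda":
    print(torch.cuda.get_device_name(0))      # e.g. NVIDIA A100-SXM4-80GB

# Move model and tensors to the selected device
model = model.to(device)
input_ids = input_ids.to(device)

# Mixed precision (faster training, less VRAM on supported GPUs)
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    outputs = model(input_ids, labels=input_ids)  # labels are needed for outputs.loss
    loss = outputs.loss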
python
# Unsloth — faster fine-tuning
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True
)
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32)
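
A rough weights-only estimate shows why 4-bit loading matters on consumer GPUs: memory ≈ parameter count × bits per parameter / 8. The sketch below ignores activations, optimizer state, KV cache, and quantization overhead such as scaling constants, so real usage will be higher. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 8e9  # an 8B-parameter model, matching the Unsloth example above

for label, bits in [("fp32", 32), ("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(n_params, bits):5.1f} GB")
```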

5. Embeddings & Vector Stores

Libraries for generating embeddings and storing/searching them.

| Library | What It Does |
| --- | --- |
| `sentence-transformers` | Generate dense text embeddings locally (all-MiniLM, BGE, E5 models) |
| `tiktoken` | OpenAI's fast tokenizer; count tokens before API calls |
| `faiss-cpu` / `faiss-gpu` | Meta's high-speed vector similarity search library |
| `chromadb` | Open-source local vector database with easy setup and metadata filtering |
| `pinecone-client` | Client for Pinecone, a managed cloud vector database for production scale (newer releases are published as `pinecone`) |
| `weaviate-client` | Client for Weaviate, an open-source vector DB with hybrid search (BM25 + semantic) |
| `qdrant-client` | Client for Qdrant, a high-performance vector DB with filtering and payload support |
python
# sentence-transformers + FAISS local vector search
from sentence_transformers import SentenceTransformer
import faiss, numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["LoRA is a PEFT method", "RAG retrieves context", "Transformers use attention"]

embeddings = model.encode(docs).astype(np.float32)
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

query_vec = model.encode(["what is LoRA?"]).astype(np.float32)
distances, indices = index.search(query_vec, k=2)
print([docs[i] for i in indices[0]])
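
An exact "flat" L2 index is conceptually a brute-force scan: compute the squared distance from the query to every stored vector and keep the k smallest. The pure-Python sketch below shows that idea on toy 2-D vectors. It is an illustration rather than FAISS's implementation, and `flat_l2_search` is a hypothetical name.

```python
import heapq


def flat_l2_search(vectors, query, k):
    """Return (squared_distance, index) pairs for the k nearest vectors."""
    scored = (
        (sum((v - q) ** 2 for v, q in zip(vec, query)), i)
        for i, vec in enumerate(vectors)
    )
    return heapq.nsmallest(k, scored)


vectors = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (1.0, 0.0)]
hits = flat_l2_search(vectors, query=(1.0, 0.5), k=2)
print(hits)
```

The scan costs one distance computation per stored vector, which is why approximate indexes become attractive as collections grow.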

6. Model Serving & Inference

Libraries for deploying and serving LLMs in production.

| Library | What It Does |
| --- | --- |
| `vllm` | High-throughput LLM serving with PagedAttention and an OpenAI-compatible API |
| `ollama` | Python client for Ollama, which runs quantized GGUF models locally behind a simple REST API |
| `llama-cpp-python` | Python bindings for llama.cpp; CPU/GPU inference of GGUF models |
| `bentoml` | Build and containerize ML model APIs with GPU support |
| `ray[serve]` | Distributed model serving and autoscaling across multiple nodes |
python
# vLLM: offline batch inference (run `vllm serve <model>` for the OpenAI-compatible server)
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-3B")
outputs = llm.generate(
    ["Explain attention mechanisms"],
    SamplingParams(temperature=0.7, max_tokens=200)
)
print(outputs[0].outputs[0].text)
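
When vLLM runs as a server instead (for example via `vllm serve meta-llama/Llama-3.2-3B`), clients talk to it over the OpenAI-style HTTP API. The sketch below only builds the request with the standard library and does not send it. The base URL assumes a local server on vLLM's default port 8000, and `build_completion_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to an OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "temperature": 0.7, "max_tokens": 200}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request(
    "http://localhost:8000",            # assumed address of a local vLLM server
    "meta-llama/Llama-3.2-3B",
    "Explain attention mechanisms",
)
print(request.full_url)
# To send it against a running server: urllib.request.urlopen(request)
```

Because the endpoint follows the OpenAI wire format, the `openai` client can usually be pointed at the same base URL instead of hand-building requests.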

7. Evaluation & Observability

Libraries for testing, evaluating, and monitoring LLM applications.

| Library | What It Does |
| --- | --- |
| `ragas` | RAG evaluation: faithfulness, answer relevancy, context recall |
| `deepeval` | LLM unit testing framework; assert hallucination, toxicity, correctness |
| `langsmith` | LangChain's observability platform: trace every LLM call, debug chains, run dataset evals |
| `langfuse` | Open-source LLM observability: tracing, cost tracking, prompt versioning |
| `wandb` | Weights & Biases: track fine-tuning runs, log metrics, compare experiments |
| `mlflow` | MLOps platform: experiment tracking, model registry, serving |
python
# LangSmith — trace every LLM call automatically
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "my-rag-app"

# All LangChain calls are now automatically traced in LangSmith dashboard
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")
response = llm.invoke("What is LoRA?")
# → Trace appears in LangSmith with latency, tokens, cost, inputs/outputs
python
# RAGAS — evaluate RAG pipeline quality
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from datasets import Dataset

eval_data = Dataset.from_dict({
    "question": ["What is LoRA?"],
    "answer": ["LoRA is a PEFT method using low-rank matrices."],
    "contexts": [["LoRA adds trainable low-rank matrices to frozen attention layers."]],
    "ground_truth": ["LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method."]
})

results = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_recall])
print(results)
python
# DeepEval — LLM unit tests
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is LoRA?",
    actual_output="LoRA is a PEFT method.",
    context=["LoRA adds trainable matrices to frozen layers."]
)
assert_test(test_case, [HallucinationMetric(threshold=0.5), AnswerRelevancyMetric(threshold=0.7)])
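
To build intuition for what a metric like context recall measures, the sketch below computes a crude lexical proxy: the fraction of ground-truth tokens that appear anywhere in the retrieved contexts. This is not how RAGAS or DeepEval compute their metrics (those rely on LLM judgments); `lexical_context_recall` is a hypothetical helper meant only to illustrate the idea.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_context_recall(ground_truth: str, contexts: list[str]) -> float:
    """Fraction of ground-truth tokens that appear somewhere in the retrieved contexts."""
    truth = tokens(ground_truth)
    if not truth:
        return 0.0
    found = tokens(" ".join(contexts))
    return len(truth & found) / len(truth)


score = lexical_context_recall(
    "LoRA adds low-rank matrices",
    ["LoRA adds trainable low-rank matrices to frozen attention layers."],
)
print(score)
```

A token-overlap score like this misses paraphrases entirely, which is one reason evaluation libraries use model-based judgments instead.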

8. Data Processing for AI

Libraries for preparing training data and working with tokens.

| Library | What It Does |
| --- | --- |
| `datasets` | HuggingFace datasets: load public datasets from the HuggingFace Hub, stream large files |
| `tokenizers` | HuggingFace fast tokenizer library: BPE, WordPiece, Unigram |
| `tiktoken` | OpenAI's tokenizer; estimate token counts for GPT models |
| `docling` | IBM's document parser; converts PDFs, DOCX, and HTML to clean markdown for RAG |
| `unstructured` | Extract text from many file types (PDF, email, images, slides) for ingestion |
python
# datasets: load and preprocess training data
from datasets import load_dataset
from transformers import AutoTokenizer

alpaca = load_dataset("tatsu-lab/alpaca")                   # 52K instruction examples
custom = load_dataset("json", data_files="custom.jsonl")    # your own JSON Lines file

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

# Tokenize for fine-tuning (the dataset is assumed to have a "text" column)
def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

tokenized = alpaca.map(tokenize, batched=True)
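
The `json` loader above expects one JSON object per line. The sketch below writes a tiny file in that shape using only the standard library. The `text` field name matches what the `tokenize` function reads, and the two records are made-up examples.

```python
import json
from pathlib import Path

records = [
    {"text": "Instruction: Define LoRA.\nResponse: A low-rank adapter method."},
    {"text": "Instruction: Define RAG.\nResponse: Retrieval-augmented generation."},
]

path = Path("custom.jsonl")
with path.open("w", encoding="utf-8") as f:
    for record in records:
        # json.dumps escapes the embedded newline, so each record stays on one line
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read it back to confirm each line parses as a standalone JSON object.
loaded = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]
print(len(loaded), "records written to", path)
```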

Quick Reference by Task

| Task | Recommended Libraries |
| --- | --- |
| Call GPT/Claude/Gemini | `openai`, `anthropic`, `litellm` |
| Build RAG pipelines | `langchain`, `llama-index`, `haystack` |
| Build agents | `langgraph`, `crewai`, `autogen`, `smolagents` |
| Core deep learning / GPU | `torch`, `torch.cuda` |
| Fine-tune LLMs | `transformers` + `peft` + `trl` |
| Fine-tune fast / low VRAM | `unsloth` |
| Generate embeddings | `sentence-transformers` |
| Vector search | `faiss`, `chromadb`, `qdrant-client` |
| Serve models | `vllm`, `ollama`, `bentoml` |
| Evaluate RAG | `ragas`, `deepeval` |
| Track experiments | `wandb`, `mlflow`, `langfuse`, `langsmith` |
| Prepare datasets | `datasets`, `unstructured`, `docling` |
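
Before starting a project, it can be handy to check which of these packages are already installed. The sketch below uses only the standard library; the package list is a small sample from the table, and `installed_versions` is a hypothetical helper.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["openai", "langchain", "peft", "trl", "vllm", "ragas"])
for name, version in report.items():
    print(f"{name:<12} {version or 'not installed'}")
```

The lookup uses distribution names (what you pass to `pip install`), which can differ from import names, for example `llama-index` versus `llama_index`.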

Interview tip: Know the distinction between orchestration (`langchain`, `llama-index`) and agent (`langgraph`, `crewai`) libraries; they are often confused. Also know that `trl` + `peft` is the standard fine-tuning stack, and `vllm` is the go-to serving framework for production.

Learn more at HuggingFace Docs, LangChain Docs, and LangGraph Docs.