What are all the Python libraries used for AI Engineering (Agent Development, Fine-Tuning LLM, etc.) and what are they used for?
#python#libraries#langchain#llama-index#peft#trl#langgraph#crewai#vllm#ragas#transformers#torch#cuda#langsmith
Answer
Python Libraries for AI Engineering
The AI engineering ecosystem is built on a rich set of Python libraries. Here is a comprehensive breakdown by use case.
1. LLM API Clients
Libraries for calling hosted LLM APIs directly.
| Library | Install | What It Does |
|---|---|---|
| `openai` | `pip install openai` | Official client for GPT-4o, o1, Whisper, DALL-E, Embeddings |
| `anthropic` | `pip install anthropic` | Official client for Claude 3.5/4 models |
| `google-genai` | `pip install google-genai` | Official client for Gemini models |
| `boto3` | `pip install boto3` | AWS SDK: access Claude, Llama, Titan via Amazon Bedrock |
| `litellm` | `pip install litellm` | Unified interface for 100+ LLM providers with one API |
```python
# litellm: call any provider with the same code
# Each provider reads its API key from the environment (e.g. OPENAI_API_KEY).
from litellm import completion

messages = [{"role": "user", "content": "Hello"}]

response = completion(model="gpt-4o", messages=messages)
response = completion(model="anthropic/claude-3-5-sonnet-20240620", messages=messages)
response = completion(model="gemini/gemini-1.5-pro", messages=messages)

# Responses follow the OpenAI format regardless of provider
print(response.choices[0].message.content)
```
2. LLM Orchestration Frameworks
High-level frameworks for chaining LLM calls, tools, and data sources.
| Library | What It Does |
|---|---|
| `langchain` | Widely used general-purpose framework: chains, agents, RAG, tools, memory |
| `llama-index` | Optimised for data ingestion and RAG pipelines over documents |
| `haystack-ai` | Production-grade NLP pipelines with components for RAG and search |
| `dspy` | Replaces hand-written prompts with compiled, optimised prompt programs |
| `outlines` | Structured generation: enforces output format constraints (JSON, regex) |
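```python
# LangChain: simple RAG chain over a previously saved FAISS index
# Note: RetrievalQA is marked as legacy in recent LangChain releases.
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Must be the same embedding model that was used to build "index"
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.load_local(
    "index",
    embeddings,
    allow_dangerous_deserialization=True,  # only for index files you created yourself
)

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o"),
    retriever=vectorstore.as_retriever(),
)
answer = qa_chain.invoke({"query": "What is LoRA?"})
print(answer["result"])
```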
```python
# LlamaIndex: document Q&A
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
response = query_engine.query("Summarize the key findings")
print(response)
```
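Both frameworks above wrap the same two-step pattern: retrieve the most relevant text, then place it in the prompt sent to the model. The sketch below shows that pattern with the standard library only. It is an illustration, not how LangChain or LlamaIndex are implemented: word overlap stands in for embedding similarity, and `tokenize`, `retrieve`, and `build_prompt` are hypothetical helper names.

```python
# Conceptual RAG sketch: retrieve, then build a grounded prompt.
# Assumption: word-overlap scoring replaces real embedding similarity.

def tokenize(text: str) -> set[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return {w.strip(".,?!").lower() for w in text.split()}

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents that share the most words with the query."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the prompt that would be sent to an LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

docs = [
    "LoRA adds trainable low-rank matrices to frozen layers.",
    "vLLM serves models with PagedAttention.",
    "RAG retrieves context before generation.",
]
top = retrieve("What is LoRA?", docs, k=1)
prompt = build_prompt("What is LoRA?", top)
print(prompt)
```

In a real pipeline the `retrieve` step would be a vector store lookup and the prompt would be passed to a chat model.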
3. Agent Development
Libraries for building autonomous, multi-step AI agents.
| Library | What It Does |
|---|---|
| `langgraph` | Graph-based agent framework: nodes, edges, state machines for complex workflows |
| `autogen` | Microsoft's multi-agent framework: agents collaborate via conversation |
| `crewai` | Role-based multi-agent teams with task delegation and memory |
| `smolagents` | HuggingFace's lightweight agent library for code-executing agents |
| `agno` | Fast multi-modal agent framework with built-in memory and knowledge stores |
| `openai-agents` | OpenAI's official agent SDK with handoffs, tracing, and guardrails |
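```python
# LangGraph: stateful graph with an LLM node and a tool node
from typing import TypedDict

from langgraph.graph import END, START, StateGraph

class AgentState(TypedDict):
    messages: list
    tool_output: str

def call_llm(state: AgentState) -> dict:
    # Placeholder: replace with a real LLM call on state["messages"]
    llm_response = {"role": "assistant", "content": "I should call the tool."}
    return {"messages": state["messages"] + [llm_response]}

def use_tool(state: AgentState) -> dict:
    # Placeholder: replace with the tool chosen by the LLM
    tool_result = "tool result"
    return {"tool_output": tool_result}

graph = StateGraph(AgentState)
graph.add_node("llm", call_llm)
graph.add_node("tool", use_tool)
graph.add_edge(START, "llm")  # entry point; the graph will not compile without one
graph.add_edge("llm", "tool")
graph.add_edge("tool", END)
agent = graph.compile()

result = agent.invoke(
    {"messages": [{"role": "user", "content": "Hi"}], "tool_output": ""}
)
```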
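To make the "nodes, edges, state" idea concrete, here is a minimal pure-Python sketch of what a graph runner does: call the current node, merge the partial update it returns into the shared state, and follow the outgoing edge. This is a conceptual illustration under simplifying assumptions (one fixed edge per node, no conditional routing), not LangGraph's implementation; `run_graph` and the node functions are hypothetical names.

```python
# Conceptual sketch of a state-graph runner (not LangGraph's actual code).
from typing import Callable

State = dict
Node = Callable[[State], dict]

END = "__end__"  # sentinel marking the end of the graph

def run_graph(nodes: dict[str, Node], edges: dict[str, str],
              start: str, state: State) -> State:
    """Run nodes in edge order, merging each node's partial update into state."""
    current = start
    while current != END:
        update = nodes[current](state)   # a node returns only the keys it changes
        state = {**state, **update}      # later values overwrite earlier ones
        current = edges[current]         # follow the single outgoing edge
    return state

def llm_node(state: State) -> dict:
    return {"messages": state["messages"] + ["assistant: use the tool"]}

def tool_node(state: State) -> dict:
    return {"tool_output": "42"}

final = run_graph(
    nodes={"llm": llm_node, "tool": tool_node},
    edges={"llm": "tool", "tool": END},
    start="llm",
    state={"messages": ["user: hi"], "tool_output": ""},
)
print(final)
```

Real agent graphs add conditional edges, so the LLM's output decides whether to call a tool again or stop.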
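```python
# CrewAI: role-based multi-agent team
from crewai import Agent, Crew, Task

# Each agent needs a backstory in addition to its role and goal
researcher = Agent(role="Researcher", goal="Find relevant info",
                   backstory="An analyst who tracks LLM research.", llm="gpt-4o")
writer = Agent(role="Writer", goal="Write clear summaries",
               backstory="A technical writer.", llm="gpt-4o")

# Each task needs an expected_output describing the deliverable
research = Task(description="Research PEFT methods",
                expected_output="Bullet-point notes on the main PEFT methods",
                agent=researcher)
summary = Task(description="Summarise the research notes",
               expected_output="A one-paragraph summary",
               agent=writer)

crew = Crew(agents=[researcher, writer], tasks=[research, summary])
result = crew.kickoff()
print(result)
```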
4. Fine-Tuning LLMs
Libraries for training and adapting LLM weights.
| Library | What It Does |
|---|---|
| `torch` | PyTorch: core deep learning framework for tensor ops, model building, and training |
| `torch.cuda` | CUDA GPU support built into PyTorch; check availability with `torch.cuda.is_available()` |
| `transformers` | HuggingFace: load, train, and run inference with open-source LLMs |
| `peft` | Parameter-Efficient Fine-Tuning: LoRA, QLoRA, prefix tuning, adapters |
| `trl` | TRL (Transformer Reinforcement Learning): SFT, DPO, RLHF, PPO training |
| `bitsandbytes` | 4-bit / 8-bit quantization; enables QLoRA on consumer GPUs |
| `unsloth` | Faster LoRA/QLoRA fine-tuning with lower VRAM use (the project claims about 2x speed and 70% less VRAM) |
| `axolotl` | Config-driven fine-tuning wrapper around HuggingFace + PEFT |
| `datasets` | HuggingFace datasets library: load, process, and stream training data |
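The table notes that 4-bit / 8-bit quantization makes QLoRA feasible on consumer GPUs. A rough back-of-the-envelope calculation shows why. The sketch below estimates memory for the weights only, assuming an 8-billion-parameter model as an illustrative size; it ignores activations, optimizer state, the KV cache, and quantization overhead, so real usage is higher. `weight_memory_gb` is a hypothetical helper.

```python
# Rough estimate of weight memory at different precisions.
# Assumption: weights only; activations, optimizer state, and overhead are ignored.

def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params = 8e9  # illustrative 8B-parameter model
estimates = {bits: weight_memory_gb(params, bits) for bits in (16, 8, 4)}

for bits, gb in estimates.items():
    print(f"{bits}-bit weights: {gb:.1f} GB")
```

Under these assumptions the weights shrink from 16 GB at 16-bit precision to 4 GB at 4-bit, which is the difference between needing a data-centre GPU and fitting on a consumer card.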
```python
# LoRA supervised fine-tuning with TRL + PEFT
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

# Gated model: requires accepting the licence on the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

# Alpaca ships a preformatted "text" column, which SFTTrainer reads by default
dataset = load_dataset("tatsu-lab/alpaca", split="train")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="./output", num_train_epochs=3),
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)  # with no tokenizer passed, SFTTrainer loads the one matching the model
trainer.train()
```
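The `r=16, lora_alpha=32` settings above are easier to reason about with the arithmetic written out. LoRA freezes a weight matrix of shape `d_out x d_in` and trains two small matrices of shapes `d_out x r` and `r x d_in`, scaled by `lora_alpha / r`. The sketch below counts parameters for a single square projection; the 4096 hidden size is an assumed illustrative value, not a property of the model used above.

```python
# LoRA parameter arithmetic for one weight matrix.
# Assumption: a square 4096 x 4096 projection, chosen only for illustration.

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by LoRA: B (d_out x r) plus A (r x d_in)."""
    return r * (d_in + d_out)

d_in = d_out = 4096
r, lora_alpha = 16, 32

full_params = d_in * d_out                      # frozen original weights
lora_params = lora_param_count(d_in, d_out, r)  # trainable adapter weights
fraction = lora_params / full_params
scaling = lora_alpha / r                        # multiplier applied to the update B @ A

print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"trainable fraction: {fraction:.2%}  scaling: {scaling}")
```

For this matrix the adapter trains under 1% of the parameters that full fine-tuning would update, which is where the memory savings come from.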
```python
# PyTorch + CUDA: device-agnostic setup
# Assumes `model` (a causal LM) and `input_ids` (a LongTensor) already exist.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(torch.cuda.get_device_name(0))  # e.g. NVIDIA A100-SXM4-80GB

# Move model and tensors to the selected device
model = model.to(device)
input_ids = input_ids.to(device)

# Mixed precision (faster training, less VRAM)
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    outputs = model(input_ids, labels=input_ids)  # labels are required for a loss
    loss = outputs.loss
```
```python
# Unsloth: faster fine-tuning
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
# Attach LoRA adapters to the 4-bit base model
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32)
```
5. Embeddings & Vector Stores
Libraries for generating embeddings and storing/searching them.
| Library | What It Does |
|---|---|
| `sentence-transformers` | Generate dense text embeddings locally (all-MiniLM, BGE, E5 models) |
| `tiktoken` | OpenAI's fast tokenizer: count tokens before API calls |
| `faiss-cpu` / `faiss-gpu` | Meta's high-speed vector similarity search library |
| `chromadb` | Open-source local vector database: easy setup, metadata filtering |
| `pinecone` | Managed cloud vector database for production scale |
| `weaviate-client` | Open-source vector DB with hybrid search (BM25 + semantic) |
| `qdrant-client` | High-performance vector DB with filtering and payload support |
```python
# sentence-transformers + FAISS: local vector search
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["LoRA is a PEFT method", "RAG retrieves context", "Transformers use attention"]

# FAISS expects float32 vectors
embeddings = model.encode(docs).astype(np.float32)
index = faiss.IndexFlatL2(embeddings.shape[1])  # exact (brute-force) L2 index
index.add(embeddings)

query_vec = model.encode(["what is LoRA?"]).astype(np.float32)
distances, indices = index.search(query_vec, k=2)
print([docs[i] for i in indices[0]])
```
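`IndexFlatL2` performs an exact search: it compares the query against every stored vector and returns the `k` smallest squared L2 distances. The pure-Python sketch below reproduces that behaviour on toy 2-D vectors so the returned `distances` and `indices` are easier to interpret. It is a conceptual stand-in, not FAISS code, and `flat_l2_search` is a hypothetical name.

```python
# Brute-force nearest-neighbour search by squared L2 distance.
# Conceptual stand-in for an exact flat index; toy 2-D vectors for readability.

def squared_l2(a: list[float], b: list[float]) -> float:
    """Sum of squared coordinate differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors: list[list[float]], query: list[float],
                   k: int) -> tuple[list[float], list[int]]:
    """Return the k smallest distances and the positions of those vectors."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
distances, indices = flat_l2_search(vectors, [1.0, 1.0], k=2)
print(distances, indices)
```

Because every vector is scanned, cost grows linearly with collection size; approximate indexes trade some accuracy for speed on large collections.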
6. Model Serving & Inference
Libraries for deploying and serving LLMs in production.
| Library | What It Does |
|---|---|
| `vllm` | High-throughput LLM serving with PagedAttention and an OpenAI-compatible API |
| `ollama` | Run quantized GGUF models locally via a simple REST API |
| `llama-cpp-python` | Python bindings for llama.cpp: CPU/GPU inference of GGUF models |
| `bentoml` | Build and containerize ML model APIs with GPU support |
| `ray[serve]` | Distributed model serving and autoscaling across multiple nodes |
```python
# vLLM: offline batch inference
# (for an OpenAI-compatible HTTP server, run `vllm serve <model>` instead)
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-3B")
outputs = llm.generate(
    ["Explain attention mechanisms"],
    SamplingParams(temperature=0.7, max_tokens=200),
)
print(outputs[0].outputs[0].text)
```
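The PagedAttention idea mentioned in the table can be illustrated without a GPU. Instead of reserving one large contiguous KV-cache region per request, the cache is split into fixed-size blocks, and each sequence keeps a block table that maps its tokens to whichever physical blocks were free. The sketch below models only that bookkeeping; it is a simplified illustration, not vLLM's implementation, and the `PagedKVCache` class and its methods are hypothetical.

```python
# Conceptual sketch of paged KV-cache bookkeeping (allocation only, no tensors).

class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))     # physical block ids
        self.block_tables: dict[str, list[int]] = {}   # sequence -> its blocks
        self.seq_lens: dict[str, int] = {}             # sequence -> token count

    def append_token(self, seq_id: str) -> int:
        """Reserve a slot for one more token; return the physical block used."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("KV cache is out of blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1
        return table[-1]

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("a")            # three tokens need two blocks
cache.append_token("b")                # a second sequence takes the next free block

a_blocks = list(cache.block_tables["a"])
b_blocks = list(cache.block_tables["b"])
free_before = len(cache.free_blocks)
cache.free("a")                        # sequence "a" finishes
free_after = len(cache.free_blocks)
print(a_blocks, b_blocks, free_before, free_after)
```

Allocating on demand in small blocks means memory is wasted only in the last, partially filled block of each sequence, which lets more requests share the same GPU.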
7. Evaluation & Observability
Libraries for testing, evaluating, and monitoring LLM applications.
| Library | What It Does |
|---|---|
| `ragas` | RAG evaluation: faithfulness, answer relevancy, context recall |
| `deepeval` | LLM unit testing framework: assert on hallucination, toxicity, correctness |
| `langsmith` | LangChain's observability platform: trace every LLM call, debug chains, run dataset evals |
| `langfuse` | Open-source LLM observability: tracing, cost tracking, prompt versioning |
| `wandb` | Weights & Biases: track fine-tuning runs, log metrics, compare experiments |
| `mlflow` | MLOps platform: experiment tracking, model registry, serving |
```python
# LangSmith: trace every LLM call automatically
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "my-rag-app"

# All LangChain calls are now traced in the LangSmith dashboard
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")
response = llm.invoke("What is LoRA?")
# The trace appears in LangSmith with latency, tokens, cost, inputs/outputs
```
```python
# RAGAS: evaluate RAG pipeline quality
# Column names follow the ragas 0.1.x schema; newer releases use
# user_input / response / retrieved_contexts / reference.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_recall, faithfulness

eval_data = Dataset.from_dict({
    "question": ["What is LoRA?"],
    "answer": ["LoRA is a PEFT method using low-rank matrices."],
    "contexts": [["LoRA adds trainable low-rank matrices to frozen attention layers."]],
    "ground_truth": ["LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method."],
})

results = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_recall])
print(results)
```
```python
# DeepEval: LLM unit tests
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is LoRA?",
    actual_output="LoRA is a PEFT method.",
    context=["LoRA adds trainable matrices to frozen layers."],
)

# Fails the test if any metric falls outside its threshold
assert_test(
    test_case,
    [HallucinationMetric(threshold=0.5), AnswerRelevancyMetric(threshold=0.7)],
)
```
8. Data Processing for AI
Libraries for preparing training data and working with tokens.
| Library | What It Does |
|---|---|
| `datasets` | HuggingFace datasets: load 50K+ public ML datasets, stream large files |
| `tokenizers` | HuggingFace fast tokenizer library: BPE, WordPiece, Unigram |
| `tiktoken` | OpenAI's tokenizer: estimate token counts for GPT models |
| `docling` | IBM's document parser: PDFs, DOCX, HTML to clean markdown for RAG |
| `unstructured` | Extract text from many file types (PDF, email, images, slides) for ingestion |
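```python
# datasets: load and preprocess training data
from datasets import load_dataset
from transformers import AutoTokenizer

alpaca = load_dataset("tatsu-lab/alpaca", split="train")  # 52K instruction examples
custom = load_dataset("json", data_files="custom.jsonl", split="train")  # your own data

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

# Tokenize for fine-tuning; assumes each record has a "text" field
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = alpaca.map(tokenize, batched=True)
```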
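The `custom.jsonl` file loaded above is plain JSON Lines: one JSON object per line. A minimal sketch of producing such a file with the standard library follows. The `text` field matches what the tokenize step reads; the example records and the temporary directory are illustrative assumptions.

```python
# Write training examples as JSON Lines (one JSON object per line).
import json
import tempfile
from pathlib import Path

# Illustrative records; a real dataset would hold your own prompts and answers
examples = [
    {"text": "### Instruction: Define LoRA.\n### Response: A low-rank adapter method."},
    {"text": "### Instruction: Define RAG.\n### Response: Retrieval before generation."},
]

path = Path(tempfile.mkdtemp()) / "custom.jsonl"
with path.open("w", encoding="utf-8") as f:
    for example in examples:
        # json.dumps escapes embedded newlines, so each record stays on one line
        f.write(json.dumps(example, ensure_ascii=False) + "\n")

# Read the file back to confirm it round-trips
loaded = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]
print(len(loaded), "records written to", path)
```

Keeping one record per line lets loaders stream large files without parsing the whole file at once.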
Quick Reference by Task
| Task | Recommended Libraries |
|---|---|
| Call GPT/Claude/Gemini | `openai` `anthropic` `litellm` |
| Build RAG pipelines | `langchain` `llama-index` `haystack-ai` |
| Build agents | `langgraph` `crewai` `autogen` `smolagents` |
| Core deep learning / GPU | `torch` `torch.cuda` |
| Fine-tune LLMs | `transformers` `peft` `trl` |
| Fine-tune fast / low VRAM | `unsloth` |
| Generate embeddings | `sentence-transformers` |
| Vector search | `faiss-cpu` `chromadb` `qdrant-client` |
| Serve models | `vllm` `ollama` `llama-cpp-python` |
| Evaluate RAG | `ragas` `deepeval` |
| Track experiments | `langsmith` `langfuse` `wandb` `mlflow` |
| Prepare datasets | `datasets` `tokenizers` `docling` |
Interview tip: Know the distinction between orchestration (`langchain`, `llama-index`) and agent (`langgraph`, `crewai`) libraries; they are often confused. Also know that `trl` + `peft` is the standard fine-tuning stack, and `vllm` is the go-to serving framework for production.
Learn more at HuggingFace Docs, LangChain Docs, and LangGraph Docs.