What are all the model serving frameworks that a fine tuned model can be added and accessed across?
Answer
Model Serving Frameworks for Fine-Tuned LLMs
After fine-tuning an LLM, you need a serving framework to load the model and expose it as an API that applications can call. The major options are compared below, with setup examples for the most common ones and guidance on when to use each.
Framework Comparison at a Glance
| Framework | Best For | Protocol | GPU Support | Ease of Use |
|---|---|---|---|---|
| vLLM | High-throughput production | OpenAI-compatible REST | ✅ Multi-GPU | ⭐⭐⭐⭐ |
| TGI (HuggingFace) | HuggingFace models | REST + gRPC | ✅ Multi-GPU | ⭐⭐⭐⭐ |
| Ollama | Local / dev use | REST | ✅ CPU + GPU | ⭐⭐⭐⭐⭐ |
| BentoML | Custom ML pipelines | REST + gRPC | ✅ Multi-GPU | ⭐⭐⭐⭐ |
| Ray Serve | Distributed / scaled | REST | ✅ Multi-GPU | ⭐⭐⭐ |
| Triton | NVIDIA GPU clusters | gRPC + HTTP | ✅ NVIDIA only | ⭐⭐ |
| TorchServe | PyTorch models | REST | ✅ GPU | ⭐⭐⭐ |
| LiteLLM | Proxy / multi-model gateway | OpenAI-compatible | ✅ Any backend | ⭐⭐⭐⭐⭐ |
| FastAPI (custom) | Full control | Any | ✅ GPU | ⭐⭐⭐⭐ |
| Seldon Core | Kubernetes / enterprise | REST + gRPC | ✅ Multi-GPU | ⭐⭐ |
1. vLLM — Best for High-Throughput Production
One of the most widely used serving frameworks for LLMs. It uses PagedAttention to manage KV-cache memory in fixed-size blocks, which reduces fragmentation, and pairs this with continuous batching to reach very high throughput.
```bash
# Install
pip install vllm

# Serve your fine-tuned model (OpenAI-compatible API)
python -m vllm.entrypoints.openai.api_server \
  --model ./merged-fine-tuned-model \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2  # use 2 GPUs
```
```python
# Access via OpenAI SDK (drop-in compatible)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="./merged-fine-tuned-model",
    messages=[{"role": "user", "content": "Explain LoRA fine-tuning"}],
)
print(response.choices[0].message.content)
```
Supports: LoRA adapters directly (no merging needed)
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B \
  --enable-lora \
  --lora-modules my-adapter=./lora-adapters
```
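To make the PagedAttention idea from this section concrete, here is a toy block allocator in plain Python. It only sketches the bookkeeping (fixed-size blocks handed out on demand and returned when a sequence finishes) and is not vLLM's implementation; the class name `PagedKVCache`, the block size, and the method names are all made up for illustration.

```python
class PagedKVCache:
    """Toy allocator with hypothetical names; not vLLM's real data structures."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.seq_lens = {}      # sequence id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve space for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because memory is claimed one block at a time rather than up front for the maximum sequence length, short requests never hold memory they do not use, which is what leaves room for larger batches.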
2. TGI (Text Generation Inference) — HuggingFace's Official Server
HuggingFace's production-grade serving framework. Best when your fine-tuned model is on the HuggingFace Hub.
```bash
# Run with Docker (easiest)
# An absolute host path is the most portable form for the bind mount
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v "$(pwd)/fine-tuned-model:/model" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /model \
  --num-shard 1
```
```python
# Access via Python client
from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://localhost:8080")

response = client.text_generation(
    "What is the capital of France?",
    max_new_tokens=100,
    temperature=0.7,
)
print(response)
```
Key features: Flash Attention 2, continuous batching, quantization (GPTQ, AWQ, bitsandbytes), streaming.
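Streaming is the feature most visible to client code, so a small sketch may help. It assumes `client` is a `huggingface_hub.InferenceClient` pointed at the TGI server, and relies on `text_generation(..., stream=True)` returning an iterator of text pieces; the helper names `stream_text` and `print_stream` are illustrative only.

```python
from typing import Iterator


def stream_text(client, prompt: str, max_new_tokens: int = 100) -> Iterator[str]:
    """Yield generated text pieces as the server produces them."""
    # With stream=True (and details left at its default of False),
    # text_generation returns an iterator of token strings.
    for token in client.text_generation(
        prompt, max_new_tokens=max_new_tokens, stream=True
    ):
        yield token


def print_stream(client, prompt: str) -> str:
    """Print tokens as they arrive and return the full completion."""
    pieces = []
    for token in stream_text(client, prompt):
        print(token, end="", flush=True)
        pieces.append(token)
    print()
    return "".join(pieces)
```

With the server from the Docker command running, `print_stream(InferenceClient(base_url="http://localhost:8080"), "What is the capital of France?")` would print the answer incrementally.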
3. Ollama — Easiest for Local & Dev Use
The simplest way to run fine-tuned models locally. Ollama loads models in GGUF format (usually quantized), so a fine-tuned checkpoint typically needs to be exported to GGUF first. Great for development, testing, and local apps.
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Create a Modelfile for your fine-tuned model (GGUF format)
cat > Modelfile <<EOF
FROM ./fine-tuned-model-Q4_K_M.gguf
SYSTEM "You are a specialized Gen AI learning assistant."
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create and run the model
ollama create my-fine-tuned-model -f Modelfile
ollama run my-fine-tuned-model
```
```python
# Access via Python
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "my-fine-tuned-model",
        "messages": [{"role": "user", "content": "What is RAG?"}],
        "stream": False,
    },
)
print(response.json()["message"]["content"])

# Or use OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="my-fine-tuned-model",
    messages=[{"role": "user", "content": "What is RAG?"}],
)
```
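The request above sets `"stream": False`. With streaming enabled, Ollama's `/api/chat` sends newline-delimited JSON objects instead of one response, and a parser along these lines could reassemble the text. This is a minimal sketch; the function name `iter_chat_chunks` is an assumption, and the field names follow the non-streaming response shown above.

```python
import json
from typing import Iterable, Iterator


def iter_chat_chunks(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a streamed /api/chat response.

    `lines` is any iterable of raw NDJSON lines, such as the result of
    requests' `response.iter_lines()`.
    """
    for raw in lines:
        if not raw:  # skip blank keep-alive lines
            continue
        chunk = json.loads(raw)
        content = chunk.get("message", {}).get("content", "")
        if content:
            yield content
        if chunk.get("done"):  # the final object marks the end of the stream
            break
```

To use it, post the same JSON body with `"stream": True`, pass `stream=True` to `requests.post`, and print each fragment from `iter_chat_chunks(response.iter_lines())` as it arrives.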
4. BentoML — Custom ML Pipelines
Best when you need custom preprocessing, post-processing, or chaining multiple models together alongside your fine-tuned LLM.
```python
# service.py
import bentoml
from transformers import pipeline


@bentoml.service(
    resources={"gpu": 1},
    traffic={"timeout": 60},
)
class FineTunedLLMService:
    def __init__(self):
        # Load the model once when the service starts
        self.pipe = pipeline(
            "text-generation",
            model="./fine-tuned-model",
            device=0,
        )

    @bentoml.api
    def generate(self, prompt: str) -> str:
        # do_sample=True is needed for temperature to take effect
        result = self.pipe(
            prompt, max_new_tokens=200, do_sample=True, temperature=0.7
        )
        return result[0]["generated_text"]
```
```bash
# Serve locally
bentoml serve service:FineTunedLLMService

# Build and containerize
bentoml build
bentoml containerize my_llm_service:latest
```
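The custom preprocessing and post-processing mentioned for BentoML could look like the following pair of helpers. They are a sketch under assumptions: the prompt template and both function names are hypothetical. The post-processing step exists because the `text-generation` pipeline returns the prompt followed by the completion by default.

```python
def build_prompt(question: str, system: str = "You are a helpful assistant.") -> str:
    """Preprocessing: wrap the user's question in a simple, hypothetical template."""
    return f"{system}\n\nQuestion: {question.strip()}\nAnswer:"


def extract_completion(prompt: str, generated_text: str) -> str:
    """Postprocessing: drop the echoed prompt and surrounding whitespace."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()
```

Inside `generate`, one would call `build_prompt` before `self.pipe(...)` and return `extract_completion(prompt, result[0]["generated_text"])`. Passing `return_full_text=False` to the pipeline is an alternative way to avoid the echoed prompt.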
5. Ray Serve — Distributed Scaling
Best for horizontally scaling across multiple nodes or combining LLM serving with other ML models.
```python
from ray import serve
from transformers import pipeline


@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},
)
class FineTunedModel:
    def __init__(self):
        self.pipe = pipeline("text-generation", model="./fine-tuned-model", device=0)

    async def __call__(self, request):
        data = await request.json()
        result = self.pipe(data["prompt"], max_new_tokens=200)
        return {"response": result[0]["generated_text"]}


app = FineTunedModel.bind()

# Run:
# serve run service:app
```
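A client for the deployment above only needs to POST a JSON body with a `prompt` key and read the `response` field back. The sketch below uses the standard library and assumes Ray Serve's default HTTP address of `http://localhost:8000/`; adjust the URL if your deployment uses a different host, port, or route prefix. The helper names are illustrative.

```python
import json
import urllib.request


def build_request(prompt: str, url: str = "http://localhost:8000/") -> urllib.request.Request:
    """Build a POST request whose JSON body matches what __call__ reads."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, url: str = "http://localhost:8000/") -> str:
    """Send the prompt to the deployment and return its "response" field."""
    with urllib.request.urlopen(build_request(prompt, url), timeout=60) as resp:
        return json.load(resp)["response"]
```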
6. LiteLLM — Unified Gateway / Proxy
Best when you want to switch between multiple fine-tuned models or providers without changing your application code. LiteLLM acts as a unified OpenAI-compatible proxy in front of them.
```yaml
# proxy_config.yaml
model_list:
  - model_name: my-fine-tuned-llama
    litellm_params:
      model: openai/fine-tuned-llama
      api_base: http://localhost:8000/v1
      api_key: none
  - model_name: my-fine-tuned-mistral
    litellm_params:
      model: openai/fine-tuned-mistral
      api_base: http://localhost:8001/v1
      api_key: none
```
```bash
# Start proxy
litellm --config proxy_config.yaml --port 4000
```
```python
# Your app always uses the same interface regardless of backend
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="any")

# Swap models without touching app code
response = client.chat.completions.create(
    model="my-fine-tuned-llama",
    messages=[{"role": "user", "content": "Hello"}],
)
```
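Since every model behind the proxy is reached through the same call, comparing two fine-tuned models on one prompt takes only a loop over their aliases. The helper below is a small sketch; `ask_each` is a hypothetical name, and `client` is assumed to be an `openai.OpenAI` instance pointed at the proxy.

```python
def ask_each(client, model_names, prompt: str) -> dict:
    """Send the same prompt to every model alias and collect the replies."""
    answers = {}
    for name in model_names:
        response = client.chat.completions.create(
            model=name,
            messages=[{"role": "user", "content": prompt}],
        )
        answers[name] = response.choices[0].message.content
    return answers
```

For the configuration shown earlier, `ask_each(client, ["my-fine-tuned-llama", "my-fine-tuned-mistral"], "Hello")` would return one reply per alias.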
7. FastAPI — Custom Serving (Full Control)
When you need complete control over the API design, authentication, logging, or custom logic.
```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

app = FastAPI(title="Fine-Tuned LLM API")

# Load model once at startup
tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-model")
model = AutoModelForCausalLM.from_pretrained(
    "./fine-tuned-model",
    torch_dtype=torch.float16,
    device_map="auto",
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)


class ChatRequest(BaseModel):
    prompt: str
    max_tokens: int = 200
    temperature: float = 0.7


@app.post("/v1/chat")
def chat(request: ChatRequest):
    # Plain `def` so FastAPI runs the blocking generation call in its thread pool
    result = pipe(
        request.prompt,
        max_new_tokens=request.max_tokens,
        do_sample=True,  # needed for temperature to take effect
        temperature=request.temperature,
    )
    return {"response": result[0]["generated_text"]}


# Run: uvicorn service:app --host 0.0.0.0 --port 8000
```
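Authentication is one of the reasons given for choosing a custom server, so here is a minimal sketch of the check itself, kept free of framework code. The function name `is_authorized` and the `LLM_API_KEY` environment variable are assumptions for illustration, not part of the service above.

```python
import hmac
import os
from typing import Optional


def is_authorized(provided_key: Optional[str], expected_key: Optional[str] = None) -> bool:
    """Return True only if the caller's key matches the configured one."""
    if expected_key is None:
        expected_key = os.environ.get("LLM_API_KEY", "")  # hypothetical variable name
    if not provided_key or not expected_key:
        return False  # reject when either side is missing
    # compare_digest avoids leaking the match position through timing
    return hmac.compare_digest(provided_key.encode(), expected_key.encode())
```

In the FastAPI app, this could be called from a dependency that reads the key from a request header and raises `HTTPException(status_code=401)` when it returns `False`.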
How to Choose the Right Framework
```text
Are you in development / testing locally?
  → Ollama (simplest, runs GGUF on CPU/GPU)

Do you need OpenAI-compatible API in production?
  → vLLM (best throughput, PagedAttention)

Is your model on HuggingFace Hub?
  → TGI (native HuggingFace support)

Do you need custom preprocessing or multi-model pipelines?
  → BentoML

Do you need to scale across multiple nodes?
  → Ray Serve

Do you need a single gateway for multiple models/providers?
  → LiteLLM

Do you need full control over API logic?
  → FastAPI (custom)

Are you on Kubernetes in enterprise?
  → Seldon Core or Triton
```
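The same questions can be written as a small function that returns the first matching recommendation. This is only a restatement of the list above; the parameter names are invented, and falling back to vLLM when no question applies is an assumption based on the recommended stack, not a rule from the list.

```python
def choose_framework(
    local_dev: bool = False,
    needs_openai_api: bool = False,
    on_hf_hub: bool = False,
    custom_pipeline: bool = False,
    multi_node: bool = False,
    multi_model_gateway: bool = False,
    full_control: bool = False,
    enterprise_k8s: bool = False,
) -> str:
    """Return the recommendation for the first question answered 'yes'."""
    rules = [
        (local_dev, "Ollama"),
        (needs_openai_api, "vLLM"),
        (on_hf_hub, "TGI"),
        (custom_pipeline, "BentoML"),
        (multi_node, "Ray Serve"),
        (multi_model_gateway, "LiteLLM"),
        (full_control, "FastAPI (custom)"),
        (enterprise_k8s, "Seldon Core or Triton"),
    ]
    for answer, framework in rules:
        if answer:
            return framework
    return "vLLM"  # assumed default when nothing else applies
```

The questions are checked in the order they appear, so when several apply the earliest one wins; real projects often combine frameworks instead.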
Serving a LoRA Adapter (Without Merging)
You don't always need to merge LoRA weights back into the base model. Both vLLM and TGI can load adapters on top of a shared base model and select one per request:
```python
# vLLM: load multiple LoRA adapters dynamically
# Start server with --enable-lora flag, then switch adapters per request
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Use specific LoRA adapter by name
response = client.chat.completions.create(
    model="customer-support-adapter",  # name given to --lora-modules
    messages=[{"role": "user", "content": "What is your return policy?"}],
)
```
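Before routing a request to an adapter by name, it can help to check that the server actually exposes it. The sketch below assumes the server lists registered adapters alongside the base model at `/v1/models`, which is worth confirming for your vLLM version; the helper names are hypothetical and `client` is an `openai.OpenAI` instance.

```python
def list_served_models(client) -> list:
    """Return the model ids reported by the server's /v1/models endpoint."""
    return [model.id for model in client.models.list()]


def pick_model(client, preferred: str, fallback: str) -> str:
    """Use the adapter if the server lists it, otherwise fall back to the base model."""
    return preferred if preferred in list_served_models(client) else fallback
```

For example, `pick_model(client, "customer-support-adapter", "meta-llama/Llama-3.2-3B")` would return the adapter name only when it has been registered with `--lora-modules`.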
Recommended stack for most teams: Use Ollama locally during development → deploy with vLLM in production → wrap with LiteLLM if you need multi-model routing.
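Because Ollama, vLLM, and LiteLLM all expose OpenAI-compatible endpoints, moving between these stages can come down to changing one base URL. A minimal sketch, assuming the local ports used in the examples above; the stage names and the `LLM_STAGE` environment variable are invented for illustration.

```python
import os
from typing import Optional

# Local endpoints from the examples in this answer
ENDPOINTS = {
    "dev": "http://localhost:11434/v1",  # Ollama
    "prod": "http://localhost:8000/v1",  # vLLM
    "gateway": "http://localhost:4000",  # LiteLLM proxy
}


def resolve_base_url(stage: Optional[str] = None) -> str:
    """Pick the base URL for a stage, defaulting to the LLM_STAGE variable or 'dev'."""
    stage = stage or os.environ.get("LLM_STAGE", "dev")  # hypothetical variable name
    if stage not in ENDPOINTS:
        raise ValueError(f"unknown stage {stage!r}; expected one of {sorted(ENDPOINTS)}")
    return ENDPOINTS[stage]
```

Application code can then create its client with `OpenAI(base_url=resolve_base_url(), api_key=...)` and stay unchanged across environments.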