Which model serving frameworks can a fine-tuned model be deployed to and accessed through?

Question

Accepted Answer

## Model Serving Frameworks for Fine-Tuned LLMs

After fine-tuning an LLM, you need a **serving framework** to load the model and expose it as an API so applications can access it. The table below compares the major frameworks, and the sections after it show how several of them work and when to use each.

---

## Framework Comparison at a Glance

| Framework | Best For | Protocol | GPU Support | Ease of Use |
|-----------|----------|----------|-------------|-------------|
| **vLLM** | High-throughput production | OpenAI-compatible REST | ✅ Multi-GPU | ⭐⭐⭐⭐ |
| **TGI (HuggingFace)** | HuggingFace models | REST + gRPC | ✅ Multi-GPU | ⭐⭐⭐⭐ |
| **Ollama** | Local / dev use | REST | ✅ CPU + GPU | ⭐⭐⭐⭐⭐ |
| **BentoML** | Custom ML pipelines | REST + gRPC | ✅ Multi-GPU | ⭐⭐⭐⭐ |
| **Ray Serve** | Distributed / scaled | REST | ✅ Multi-GPU | ⭐⭐⭐ |
| **Triton** | NVIDIA GPU clusters | gRPC + HTTP | ✅ NVIDIA only | ⭐⭐ |
| **TorchServe** | PyTorch models | REST | ✅ GPU | ⭐⭐⭐ |
| **LiteLLM** | Proxy / multi-model gateway | OpenAI-compatible | ✅ Any backend | ⭐⭐⭐⭐⭐ |
| **FastAPI (custom)** | Full control | Any | ✅ GPU | ⭐⭐⭐⭐ |
| **Seldon Core** | Kubernetes / enterprise | REST + gRPC | ✅ Multi-GPU | ⭐⭐ |

---

## 1. vLLM: Best for High-Throughput Production

One of the most widely used serving frameworks for LLMs. It uses **PagedAttention** to manage KV cache memory efficiently, which supports continuous batching and very high throughput.
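To make the PagedAttention idea more concrete, here is a toy sketch of block-based KV cache bookkeeping. It is an illustration under simplifying assumptions, not vLLM's actual code: the class name `PagedKVCache`, its methods, and the block sizes are all hypothetical, and no tensors are stored. The point is only that each request holds a table of fixed-size blocks allocated on demand, so memory is not reserved up front for the maximum sequence length.

```python
class PagedKVCache:
    """Toy block allocator that mimics the paging idea (no real tensors)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size                # tokens per block
        self.free_blocks = list(range(num_blocks))  # ids of unused physical blocks
        self.block_tables = {}                      # sequence id -> list of physical block ids
        self.seq_lens = {}                          # sequence id -> tokens stored so far

    def append_token(self, seq_id: str) -> None:
        """Reserve cache space for one more token of the given sequence."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when the last one is full (or none exists yet).
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks left")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(17):                 # 17 tokens need two 16-token blocks
    cache.append_token("req-a")
blocks_used = len(cache.block_tables["req-a"])
cache.free("req-a")                 # blocks become available to other requests
```

Because finished requests return their blocks immediately, a scheduler can admit new requests into a running batch as soon as blocks free up, which is the connection to continuous batching.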
```bash
# Install
pip install vllm

# Serve your fine-tuned model (OpenAI-compatible API)
python -m vllm.entrypoints.openai.api_server \
  --model ./merged-fine-tuned-model \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2  # use 2 GPUs
```

```python
# Access via OpenAI SDK (drop-in compatible)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="./merged-fine-tuned-model",
    messages=[{"role": "user", "content": "Explain LoRA fine-tuning"}],
)
print(response.choices[0].message.content)
```

**Supports:** LoRA adapters directly (no merging needed)

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B \
  --enable-lora \
  --lora-modules my-adapter=./lora-adapters
```

---

## 2. TGI (Text Generation Inference): Hugging Face's Official Server

Hugging Face's production-grade serving framework. It fits best when your fine-tuned model is on the Hugging Face Hub, and it can also serve a local model directory, as in the example below.

```bash
# Run with Docker (easiest); the local model directory is mounted at /model
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v "$PWD/fine-tuned-model":/model \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /model \
  --num-shard 1
```

```python
# Access via Python client
from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://localhost:8080")

response = client.text_generation(
    "What is the capital of France?",
    max_new_tokens=100,
    temperature=0.7,
)
print(response)
```

**Key features:** Flash Attention 2, continuous batching, quantization (GPTQ, AWQ, bitsandbytes), streaming.

---

## 3. Ollama: Easiest for Local & Dev Use

The simplest way to run fine-tuned models locally. It uses the GGUF format (quantized) and is a good fit for development, testing, and local apps.
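Ollama serves a local REST API, by default on port 11434. As a minimal sketch, assuming a model has already been registered with Ollama, a request to its `/api/generate` endpoint could look like the following. The model name `my-fine-tuned-model` and the helper names `build_generate_request` and `generate` are hypothetical; only the standard library is used.

```python
import json
import urllib.request

# Default address of a locally running Ollama server
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text from the JSON reply."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example call (needs a running Ollama server; the model name is a placeholder):
# print(generate("my-fine-tuned-model", "Explain LoRA fine-tuning"))
```

Setting `"stream": False` asks for a single JSON object instead of a stream of partial responses, which keeps the client code short for quick local tests.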
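```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Create a Modelfile for your fine-tuned model (GGUF format)
cat > Modelfile <<EOF
# Placeholder path (an assumption): point FROM at your own GGUF file
FROM ./fine-tuned-model.gguf
EOF
```

[...]

---

## 4. BentoML: Custom ML Pipelines

Best when you need custom preprocessing or multi-model pipelines around the model.

```python
import bentoml
from transformers import pipeline


# Assumed service layout: the class name matches the `bentoml serve` command below,
# while the service name, the `generate` method name, and the model path are placeholders.
@bentoml.service(name="my_llm_service", resources={"gpu": 1})
class FineTunedLLMService:
    def __init__(self):
        # Load the model once per service worker
        self.pipe = pipeline("text-generation", model="./fine-tuned-model", device=0)

    @bentoml.api
    def generate(self, prompt: str) -> str:
        # do_sample=True is required for temperature to take effect
        result = self.pipe(prompt, max_new_tokens=200, do_sample=True, temperature=0.7)
        return result[0]["generated_text"]
```

```bash
# Serve locally
bentoml serve service:FineTunedLLMService

# Build and containerize
bentoml build
bentoml containerize my_llm_service:latest
```

---

## 5. Ray Serve: Distributed Scaling

Best for scaling horizontally across multiple nodes or for combining LLM serving with other ML models.

```python
from ray import serve
from starlette.requests import Request
from transformers import pipeline


@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},
)
class FineTunedModel:
    def __init__(self):
        # Load the model once per replica
        self.pipe = pipeline("text-generation", model="./fine-tuned-model", device=0)

    async def __call__(self, request: Request) -> dict:
        data = await request.json()
        result = self.pipe(data["prompt"], max_new_tokens=200)
        return {"response": result[0]["generated_text"]}


app = FineTunedModel.bind()

# Run with: serve run service:app
```

---

## 6. LiteLLM: Unified Gateway / Proxy

Use LiteLLM if you want to switch between multiple fine-tuned models or providers without changing your application code. It acts as a unified OpenAI-compatible proxy in front of your serving backends.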
```yaml
# proxy_config.yaml
model_list:
  - model_name: my-fine-tuned-llama
    litellm_params:
      model: openai/fine-tuned-llama
      api_base: http://localhost:8000/v1
      api_key: none
  - model_name: my-fine-tuned-mistral
    litellm_params:
      model: openai/fine-tuned-mistral
      api_base: http://localhost:8001/v1
      api_key: none
```

```bash
# Start proxy
litellm --config proxy_config.yaml --port 4000
```

```python
# Your app always uses the same interface regardless of backend
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="any")

# Swap models without touching app code
response = client.chat.completions.create(
    model="my-fine-tuned-llama",
    messages=[{"role": "user", "content": "Hello"}],
)
```

---

## 7. FastAPI: Custom Serving (Full Control)

Use a custom FastAPI service when you need complete control over the API design, authentication, logging, or custom logic.

```python
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

app = FastAPI(title="Fine-Tuned LLM API")

# Load model once at startup
tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-model")
model = AutoModelForCausalLM.from_pretrained(
    "./fine-tuned-model",
    torch_dtype=torch.float16,
    device_map="auto",
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)


class ChatRequest(BaseModel):
    prompt: str
    max_tokens: int = 200
    temperature: float = 0.7


@app.post("/v1/chat")
def chat(request: ChatRequest):
    # A plain `def` lets FastAPI run the blocking generation call in its thread pool
    result = pipe(
        request.prompt,
        max_new_tokens=request.max_tokens,
        do_sample=True,  # required for temperature to take effect
        temperature=request.temperature,
    )
    return {"response": result[0]["generated_text"]}

# Run: uvicorn service:app --host 0.0.0.0 --port 8000
```

---

## How to Choose the Right Framework

```
Are you in development / testing locally?
  → Ollama (simplest, runs GGUF on CPU/GPU)

Do you need OpenAI-compatible API in production?
  → vLLM (best throughput, PagedAttention)

Is your model on HuggingFace Hub?
  → TGI (native HuggingFace support)

Do you need custom preprocessing or multi-model pipelines?
  → BentoML

Do you need to scale across multiple nodes?
  → Ray Serve

Do you need a single gateway for multiple models/providers?
  → LiteLLM

Do you need full control over API logic?
  → FastAPI (custom)

Are you on Kubernetes in enterprise?
  → Seldon Core or Triton
```

---

## Serving a LoRA Adapter (Without Merging)

You don't always need to merge LoRA weights back into the base model. Both vLLM and TGI can serve LoRA adapters on top of the base model and select one per request:

```python
# vLLM: serve multiple LoRA adapters from one base model
# Start the server with the --enable-lora flag, then pick an adapter per request
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Use a specific LoRA adapter by name
response = client.chat.completions.create(
    model="customer-support-adapter",  # name given to --lora-modules
    messages=[{"role": "user", "content": "What is your return policy?"}],
)
```

> **Recommended stack for most teams:** Use **Ollama** locally during development → deploy with **vLLM** in production → wrap with **LiteLLM** if you need multi-model routing.
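Before pointing an application at an OpenAI-compatible server such as vLLM, it can help to confirm that the expected model or adapter name is actually registered. The sketch below assumes the server exposes the standard `/v1/models` listing in the OpenAI response shape; the helper names `fetch_models`, `served_model_ids`, and `is_served` are hypothetical, and the sample payload is made up for illustration.

```python
import json
import urllib.request


def fetch_models(base_url: str, timeout: float = 10.0) -> dict:
    """GET <base_url>/models from an OpenAI-compatible server and parse the JSON."""
    url = f"{base_url.rstrip('/')}/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read())


def served_model_ids(models_payload: dict) -> list:
    """Extract model ids from an OpenAI-style model listing."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def is_served(models_payload: dict, name: str) -> bool:
    """Return True if the given model or adapter name appears in the listing."""
    return name in served_model_ids(models_payload)


# Illustrative payload in the OpenAI listing format (not real server output)
sample_payload = {
    "object": "list",
    "data": [
        {"id": "meta-llama/Llama-3.2-3B", "object": "model"},
        {"id": "customer-support-adapter", "object": "model"},
    ],
}
adapter_ready = is_served(sample_payload, "customer-support-adapter")

# Against a live server (requires it to be running):
# adapter_ready = is_served(fetch_models("http://localhost:8000/v1"), "customer-support-adapter")
```

A check like this is cheap to run at application startup and catches a mistyped adapter name before the first real request fails.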
