Concept #153 · Medium · production-mlops

Which model serving frameworks can host a fine-tuned model and expose it to applications?

#model-serving #vllm #tgi #ollama #bentoml #production #mlops #deployment #fine-tuning

Answer

Model Serving Frameworks for Fine-Tuned LLMs

After fine-tuning an LLM, you need a serving framework to load the model and expose it as an API so applications can call it. The table below compares ten common options; the sections after it walk through seven of them, showing how each works and when to use it.


Framework Comparison at a Glance

| Framework | Best For | Protocol | GPU Support | Ease of Use |
|---|---|---|---|---|
| vLLM | High-throughput production | OpenAI-compatible REST | ✅ Multi-GPU | ⭐⭐⭐⭐ |
| TGI (HuggingFace) | HuggingFace models | REST + gRPC | ✅ Multi-GPU | ⭐⭐⭐⭐ |
| Ollama | Local / dev use | REST | ✅ CPU + GPU | ⭐⭐⭐⭐⭐ |
| BentoML | Custom ML pipelines | REST + gRPC | ✅ Multi-GPU | ⭐⭐⭐⭐ |
| Ray Serve | Distributed / scaled | REST | ✅ Multi-GPU | ⭐⭐⭐ |
| Triton | NVIDIA GPU clusters | gRPC + HTTP | ✅ NVIDIA only | ⭐⭐ |
| TorchServe | PyTorch models | REST | ✅ GPU | ⭐⭐⭐ |
| LiteLLM | Proxy / multi-model gateway | OpenAI-compatible | ✅ Any backend | ⭐⭐⭐⭐⭐ |
| FastAPI (custom) | Full control | Any | ✅ GPU | ⭐⭐⭐⭐ |
| Seldon Core | Kubernetes / enterprise | REST + gRPC | ✅ Multi-GPU | ⭐⭐ |

1. vLLM — Best for High-Throughput Production

The most popular serving framework for LLMs. Uses PagedAttention to efficiently manage KV cache memory, enabling continuous batching and very high throughput.
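
To make the PagedAttention idea concrete, here is a toy sketch of block-based KV-cache allocation. It is an illustration only, not vLLM's implementation: the class name `PagedKVCache`, the block size of 16, and the pool of 8 blocks are all assumptions chosen for the example.

```python
class PagedKVCache:
    """Toy allocator: each sequence receives fixed-size blocks on demand."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # ids of unused physical blocks
        self.block_tables = {}                      # sequence id -> list of block ids
        self.lengths = {}                           # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, allocating a block only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # every allocated block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real scheduler would queue or preempt")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool for other requests."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):                 # 20 tokens need 2 blocks of 16
    cache.append_token("request-a")
for _ in range(5):                  # 5 tokens need 1 block
    cache.append_token("request-b")

print(len(cache.block_tables["request-a"]))  # 2
print(len(cache.free_blocks))                # 5
```

Because memory is reserved one block at a time rather than for the maximum sequence length up front, blocks freed by a finished request can go straight to a newly admitted one, which is what makes continuous batching practical.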

bash
# Install
pip install vllm

# Serve your fine-tuned model (OpenAI-compatible API)
python -m vllm.entrypoints.openai.api_server \
    --model ./merged-fine-tuned-model \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2    # use 2 GPUs

python
# Access via OpenAI SDK (drop-in compatible)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="./merged-fine-tuned-model",
    messages=[{"role": "user", "content": "Explain LoRA fine-tuning"}]
)
print(response.choices[0].message.content)

Supports: LoRA adapters directly (no merging needed)

bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B \
    --enable-lora \
    --lora-modules my-adapter=./lora-adapters

2. TGI (Text Generation Inference) — HuggingFace's Official Server

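HuggingFace's production-grade serving framework. A good fit when your fine-tuned model is on the HuggingFace Hub or saved locally in the standard Transformers format, since `--model-id` accepts either a Hub ID or a local path.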

bash
# Run with Docker (easiest)
# The bind mount uses an absolute host path, which older Docker versions require
docker run --gpus all --shm-size 1g \
    -p 8080:80 \
    -v "$(pwd)/fine-tuned-model:/model" \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id /model \
    --num-shard 1

python
# Access via Python client
from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://localhost:8080")

response = client.text_generation(
    "What is the capital of France?",
    max_new_tokens=100,
    temperature=0.7
)
print(response)

Key features: Flash Attention 2, continuous batching, quantization (GPTQ, AWQ, bitsandbytes), streaming.
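
TGI can also be called over plain HTTP without the `huggingface_hub` client. The sketch below builds a request for TGI's `/generate` endpoint using only the standard library. The helper names `build_generate_request` and `generate` are hypothetical, and the host and port assume the Docker command above.

```python
import json
import urllib.request


def build_generate_request(prompt, base_url="http://localhost:8080",
                           max_new_tokens=100, temperature=0.7):
    """Build (but do not send) a POST request for TGI's /generate endpoint."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": temperature},
    }
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["generated_text"]


# Example call, assuming the TGI container is up:
# print(generate("What is the capital of France?", max_new_tokens=50))
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a GPU server.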


3. Ollama — Easiest for Local & Dev Use

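The simplest way to run fine-tuned models locally. Ollama loads models in GGUF format, so a fine-tuned Transformers checkpoint is typically converted to GGUF (and usually quantized) first, for example with llama.cpp's conversion tooling. Great for development, testing, and local apps.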

bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Create a Modelfile for your fine-tuned model (GGUF format)
cat > Modelfile <<EOF
FROM ./fine-tuned-model-Q4_K_M.gguf
SYSTEM "You are a specialized Gen AI learning assistant."
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create and run the model
ollama create my-fine-tuned-model -f Modelfile
ollama run my-fine-tuned-model

python
# Access via Python
import requests

response = requests.post("http://localhost:11434/api/chat", json={
    "model": "my-fine-tuned-model",
    "messages": [{"role": "user", "content": "What is RAG?"}],
    "stream": False
})
print(response.json()["message"]["content"])

# Or use OpenAI-compatible endpoint
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="my-fine-tuned-model",
    messages=[{"role": "user", "content": "What is RAG?"}]
)
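
When several fine-tuned variants need their own Modelfile, generating the file from Python can be less error-prone than editing heredocs by hand. This is a minimal sketch: `render_modelfile` is a hypothetical helper, it only emits the `FROM`, `SYSTEM`, and `PARAMETER` directives used above, and it assumes the system prompt contains no double quotes.

```python
def render_modelfile(gguf_path: str, system_prompt: str, **parameters) -> str:
    """Return Modelfile text for a local GGUF file plus optional parameters."""
    lines = [f"FROM {gguf_path}", f'SYSTEM "{system_prompt}"']
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    return "\n".join(lines) + "\n"


modelfile = render_modelfile(
    "./fine-tuned-model-Q4_K_M.gguf",
    "You are a specialized Gen AI learning assistant.",
    temperature=0.7,
    num_ctx=4096,
)
print(modelfile)

# To use it, write the text to disk and run `ollama create` as shown above:
# from pathlib import Path
# Path("Modelfile").write_text(modelfile)
```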

4. BentoML — Custom ML Pipelines

Best when you need custom preprocessing, post-processing, or chaining multiple models together alongside your fine-tuned LLM.

python
# service.py
import bentoml
from transformers import pipeline

@bentoml.service(
    resources={"gpu": 1},
    traffic={"timeout": 60}
)
class FineTunedLLMService:

    def __init__(self):
        # Load the model once per worker, on the first GPU
        self.pipe = pipeline(
            "text-generation",
            model="./fine-tuned-model",
            device=0
        )

    @bentoml.api
    def generate(self, prompt: str) -> str:
        # do_sample=True is needed for temperature to take effect
        result = self.pipe(prompt, max_new_tokens=200, do_sample=True, temperature=0.7)
        return result[0]["generated_text"]

bash
# Serve locally
bentoml serve service:FineTunedLLMService

# Build and containerize
bentoml build
# Use the Bento tag printed by `bentoml build`; my_llm_service is a placeholder
bentoml containerize my_llm_service:latest
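
The custom preprocessing and post-processing mentioned above can be sketched as plain functions that wrap the pipeline call. Everything here is illustrative: the prompt template, the helper names, and `fake_pipe` (a stand-in so the example runs without a GPU) are assumptions, not part of BentoML.

```python
# Hypothetical instruction template; use whatever format the model was fine-tuned on.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def preprocess(instruction: str) -> str:
    """Normalize user input and wrap it in the training-time template."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())


def postprocess(generated_text: str, prompt: str) -> str:
    """Drop the echoed prompt (the pipeline returns prompt + completion by default)."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


def generate_with_hooks(pipe, instruction: str, max_new_tokens: int = 200) -> str:
    """Run preprocess -> model -> postprocess around any text-generation callable."""
    prompt = preprocess(instruction)
    result = pipe(prompt, max_new_tokens=max_new_tokens)
    return postprocess(result[0]["generated_text"], prompt)


def fake_pipe(prompt, **kwargs):
    """Stand-in for a transformers pipeline: echoes the prompt plus a fixed answer."""
    return [{"generated_text": prompt + " LoRA trains small low-rank matrices. "}]


print(generate_with_hooks(fake_pipe, "  Explain LoRA  "))
# LoRA trains small low-rank matrices.
```

Inside the service, the body of `generate` could call `generate_with_hooks(self.pipe, prompt)` in place of the direct pipeline call.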

5. Ray Serve — Distributed Scaling

Best for horizontally scaling across multiple nodes or combining LLM serving with other ML models.

python
from ray import serve
from transformers import pipeline

@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={"min_replicas": 1, "max_replicas": 4}
)
class FineTunedModel:
    def __init__(self):
        self.pipe = pipeline("text-generation", model="./fine-tuned-model", device=0)

    async def __call__(self, request):
        data = await request.json()
        result = self.pipe(data["prompt"], max_new_tokens=200)
        return {"response": result[0]["generated_text"]}

app = FineTunedModel.bind()

# Run
# serve run service:app
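
The `min_replicas` and `max_replicas` bounds above constrain how far the deployment can scale. As a rough mental model, the replica count tracks load divided by a per-replica target and is then clamped to those bounds. The function below is a simplified sketch of that idea, not Ray Serve's actual autoscaler (which also smooths and delays decisions); `desired_replicas` and `target_per_replica` are hypothetical names.

```python
import math


def desired_replicas(ongoing_requests: int, target_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 4) -> int:
    """Replicas needed to keep each one near its target load, within bounds."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(ongoing_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


for load in (0, 3, 9, 40):
    print(load, desired_replicas(load, target_per_replica=4))
# 0 1
# 3 1
# 9 3
# 40 4
```

With one GPU per replica, as in the deployment above, `max_replicas` is effectively capped by the number of GPUs in the cluster.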

6. LiteLLM — Unified Gateway / Proxy

If you want to switch between multiple fine-tuned models or providers without changing your application code. Acts as a unified OpenAI-compatible proxy.

yaml
# proxy_config.yaml
model_list:
  - model_name: my-fine-tuned-llama
    litellm_params:
      model: openai/fine-tuned-llama
      api_base: http://localhost:8000/v1
      api_key: none

  - model_name: my-fine-tuned-mistral
    litellm_params:
      model: openai/fine-tuned-mistral
      api_base: http://localhost:8001/v1
      api_key: none

bash
# Start proxy
litellm --config proxy_config.yaml --port 4000

python
# Your app always uses the same interface regardless of backend
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="any")

# Swap models without touching app code
response = client.chat.completions.create(
    model="my-fine-tuned-llama",
    messages=[{"role": "user", "content": "Hello"}]
)
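
Conceptually, the gateway looks up the alias the client sent, swaps in the backend's own model name, and forwards the request to that backend's URL. The sketch below mirrors the two entries from the config above in plain Python to show the routing step. It is an illustration of the idea rather than LiteLLM's code; `Backend`, `MODEL_LIST`, and `route` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    model: str      # model name the upstream server expects
    api_base: str   # upstream OpenAI-compatible base URL


# Mirrors the model_list entries from the proxy config.
MODEL_LIST = {
    "my-fine-tuned-llama": Backend("fine-tuned-llama", "http://localhost:8000/v1"),
    "my-fine-tuned-mistral": Backend("fine-tuned-mistral", "http://localhost:8001/v1"),
}


def route(request_body: dict):
    """Map a client request to (upstream URL, rewritten request body)."""
    alias = request_body["model"]
    if alias not in MODEL_LIST:
        raise ValueError(f"unknown model alias: {alias!r}")
    backend = MODEL_LIST[alias]
    upstream_body = {**request_body, "model": backend.model}  # original body untouched
    return f"{backend.api_base}/chat/completions", upstream_body


url, body = route({
    "model": "my-fine-tuned-mistral",
    "messages": [{"role": "user", "content": "Hello"}],
})
print(url)            # http://localhost:8001/v1/chat/completions
print(body["model"])  # fine-tuned-mistral
```

Because the application only ever sees the alias, pointing `my-fine-tuned-llama` at a different server is a config change rather than a code change.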

7. FastAPI — Custom Serving (Full Control)

When you need complete control over the API design, authentication, logging, or custom logic.

python
from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from pydantic import BaseModel
import torch

app = FastAPI(title="Fine-Tuned LLM API")

# Load model once at startup
tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-model")
model = AutoModelForCausalLM.from_pretrained(
    "./fine-tuned-model",
    torch_dtype=torch.float16,
    device_map="auto"
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

class ChatRequest(BaseModel):
    prompt: str
    max_tokens: int = 200
    temperature: float = 0.7

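@app.post("/v1/chat")
def chat(request: ChatRequest):
    # Plain `def`: FastAPI runs it in a worker thread, so the blocking
    # generation call does not stall the event loop.
    result = pipe(
        request.prompt,
        max_new_tokens=request.max_tokens,
        do_sample=True,  # sampling must be on for temperature to apply
        temperature=request.temperature
    )
    return {"response": result[0]["generated_text"]}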

# Run: uvicorn service:app --host 0.0.0.0 --port 8000
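
Authentication is one of the reasons given for a custom server, so here is a minimal sketch of a bearer-token check using only the standard library. The `LLM_API_KEY` environment variable and the `is_authorized` helper are assumptions for illustration; in the FastAPI app it could be called from a dependency that reads the `Authorization` header and raises `HTTPException(status_code=401)` on failure.

```python
import hmac
import os
from typing import Optional

# Hypothetical environment variable holding the shared secret.
EXPECTED_KEY = os.environ.get("LLM_API_KEY", "change-me")


def is_authorized(authorization_header: Optional[str],
                  expected_key: str = EXPECTED_KEY) -> bool:
    """Accept only 'Bearer <token>' headers whose token matches the expected key."""
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(token.encode("utf-8"), expected_key.encode("utf-8"))


print(is_authorized("Bearer secret", "secret"))  # True
print(is_authorized("Bearer wrong", "secret"))   # False
```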

How to Choose the Right Framework

text
Are you in development / testing locally?
    → Ollama (simplest, runs GGUF on CPU/GPU)

Do you need OpenAI-compatible API in production?
    → vLLM (best throughput, PagedAttention)

Is your model on HuggingFace Hub?
    → TGI (native HuggingFace support)

Do you need custom preprocessing or multi-model pipelines?
    → BentoML

Do you need to scale across multiple nodes?
    → Ray Serve

Do you need a single gateway for multiple models/providers?
    → LiteLLM

Do you need full control over API logic?
    → FastAPI (custom)

Are you on Kubernetes in enterprise?
    → Seldon Core or Triton
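
The checklist above can also be written as a small function, which makes the order of the questions explicit: the first matching answer wins. This is only a restatement of the list; the function name, its flags, and the vLLM fallback (taken from the recommended stack at the end of this answer) are assumptions for illustration.

```python
def choose_framework(*, local_dev=False, needs_openai_api=False, on_hf_hub=False,
                     custom_pipeline=False, multi_node=False,
                     multi_model_gateway=False, full_control=False,
                     enterprise_k8s=False) -> str:
    """Return the first framework whose question is answered 'yes'."""
    rules = [
        (local_dev, "Ollama"),
        (needs_openai_api, "vLLM"),
        (on_hf_hub, "TGI"),
        (custom_pipeline, "BentoML"),
        (multi_node, "Ray Serve"),
        (multi_model_gateway, "LiteLLM"),
        (full_control, "FastAPI (custom)"),
        (enterprise_k8s, "Seldon Core or Triton"),
    ]
    for answer_is_yes, framework in rules:
        if answer_is_yes:
            return framework
    return "vLLM"  # assumed default when no question applies


print(choose_framework(custom_pipeline=True))             # BentoML
print(choose_framework(local_dev=True, multi_node=True))  # Ollama
```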

Serving a LoRA Adapter (Without Merging)

You don't always need to merge LoRA weights back into the base model. Both vLLM and TGI can load adapters on-the-fly:

python
# vLLM: load multiple LoRA adapters dynamically
# Start server with --enable-lora flag, then switch adapters per request

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Use specific LoRA adapter by name
response = client.chat.completions.create(
    model="customer-support-adapter",    # name given to --lora-modules
    messages=[{"role": "user", "content": "What is your return policy?"}]
)
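
Serving an unmerged adapter works because a LoRA update is additive: applying the base weights and the low-rank update separately gives the same output as applying the merged weights. The tiny pure-Python check below illustrates that identity with made-up 2x2 numbers. It omits the usual alpha/r scaling factor and says nothing about how vLLM or TGI implement it internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight (2x2)
down = [[0.5], [-1.0]]         # LoRA down-projection (2x1, rank 1)
up = [[2.0, 0.0]]              # LoRA up-projection (1x2)
x = [[1.0, 1.0]]               # one input row vector

# Merged: fold the low-rank update into the base weight, then apply once.
merged = matmul(x, add(W, matmul(down, up)))

# Unmerged: apply the base weight and the adapter separately, then sum.
on_the_fly = add(matmul(x, W), matmul(matmul(x, down), up))

print(merged)       # [[3.0, 6.0]]
print(on_the_fly)   # [[3.0, 6.0]]
```

Since the adapter path only involves the small `down` and `up` matrices, one base model in GPU memory can serve many adapters, with the adapter chosen per request by name.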

Recommended stack for most teams: Use Ollama locally during development → deploy with vLLM in production → wrap with LiteLLM if you need multi-model routing.