Hardware Deployment in AI
Hardware deployment in AI is the practice of selecting, configuring, and optimizing the physical computing infrastructure that runs AI models in production, from cloud GPUs to edge devices.
Deployment Hardware Options
| Hardware | Use Case | Example Providers / Products |
|---|---|---|
| Cloud GPUs | Training + large inference | AWS, GCP, Azure, Lambda Labs |
| On-premise GPUs | Privacy, cost at scale | NVIDIA DGX, consumer RTX |
| Edge devices | Low latency, offline inference | Raspberry Pi, Jetson, Apple M-series |
| Specialized AI chips | High efficiency inference | Google TPU, Groq LPU, Cerebras |
| NPUs | Mobile/laptop AI | Apple Neural Engine, Qualcomm Hexagon |
| CPUs | Small models, preprocessing | Any server |
Cloud GPU Options
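# AWS SageMaker deployment example
import json
import boto3
sm_client = boto3.client("sagemaker")
# Deploy a model to an endpoint.
# The endpoint config "llama-config" must already exist in the account.
response = sm_client.create_endpoint(
    EndpointName="llama-inference",
    EndpointConfigName="llama-config",
)
# Endpoint creation is asynchronous: wait until the endpoint is InService
sm_client.get_waiter("endpoint_in_service").wait(EndpointName="llama-inference")
# Invoke the endpoint
runtime = boto3.client("sagemaker-runtime")
result = runtime.invoke_endpoint(
    EndpointName="llama-inference",
    ContentType="application/json",
    Body=json.dumps({"inputs": "What is machine learning?"}),
)
# The response body is a stream: read it and parse the JSON payload
print(json.loads(result["Body"].read()))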
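
The `create_endpoint` call above refers to an endpoint configuration named `llama-config`, which has to be created first. The sketch below only builds the keyword arguments for SageMaker's `create_endpoint_config` call, so it runs without AWS credentials. The helper name `endpoint_config_request`, the model name `llama-model`, the variant name, and the instance type `ml.g5.2xlarge` are assumptions chosen for illustration, not values from the original example.

```python
from typing import Any


def endpoint_config_request(
    config_name: str,
    model_name: str,
    instance_type: str = "ml.g5.2xlarge",  # assumption: a single-GPU instance type
    instance_count: int = 1,
) -> dict[str, Any]:
    """Build keyword arguments for SageMaker's create_endpoint_config call."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",  # hypothetical variant name
                "ModelName": model_name,  # must match an existing SageMaker model
                "InstanceType": instance_type,
                "InitialInstanceCount": instance_count,
            }
        ],
    }


# "llama-model" is a hypothetical model name used only for illustration
request = endpoint_config_request("llama-config", "llama-model")
print(request)
```

With credentials configured and a SageMaker model registered under that name, the dictionary could be passed as `boto3.client("sagemaker").create_endpoint_config(**request)` before the endpoint itself is created.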
On-Premise Deployment (NVIDIA GPU)
# Verify that the NVIDIA driver is installed and the GPU is visible
nvidia-smi
# Install and run the inference server (vLLM)
pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90 \
    --port 8000
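# Use vLLM via the OpenAI-compatible API
from openai import OpenAI
# The client requires an API key value; a placeholder works when the server does not enforce one
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)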
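
A large model can take a while to load, so a client that connects immediately after the server starts may get connection errors. The following is a minimal sketch of a readiness check using only the standard library. The function names, the retry count, and the delay are assumptions for illustration; the probe and sleep functions are injectable so the loop can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    url: str,
    probe: Callable[[str], bool] = http_probe,
    attempts: int = 30,  # assumption: illustrative retry budget
    delay_s: float = 2.0,  # assumption: illustrative delay between probes
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # no need to sleep after the final attempt
    return False
```

For the server started above, one option is `wait_until_ready("http://localhost:8000/v1/models")`, since the model listing route is part of the OpenAI-compatible API; issue the first chat request only once it returns `True`.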
Edge Deployment (NVIDIA Jetson)
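# Optimized inference on Jetson (TensorRT)
import tensorrt as trt
def build_trt_engine(onnx_path: str) -> trt.IHostMemory:
    """Build a serialized TensorRT engine (plan) from an ONNX file."""
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        # parse() returns False on failure, so surface the parser errors
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError(f"ONNX parse failed: {errors}")
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB workspace
    config.set_flag(trt.BuilderFlag.FP16)  # Use FP16 on Jetson
    # build_serialized_network returns a serialized plan, not an ICudaEngine;
    # deserialize it with trt.Runtime(logger).deserialize_cuda_engine(plan) before inference
    plan = builder.build_serialized_network(network, config)
    if plan is None:
        raise RuntimeError("TensorRT engine build failed")
    return plan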
Selecting Hardware Based on Model Size
| Model Size | Minimum Hardware | Recommended |
|---|---|---|
| < 3B params | 4 GB VRAM | RTX 3060 / M2 Mac |
| 7-8B params | 8 GB VRAM | RTX 3080 / A100 40GB |
| 13B params | 16 GB VRAM | RTX 4090 / A100 |
| 70B params | 80 GB VRAM | A100 80GB × 2 |
| 405B+ params | 8× A100 cluster | H100 × 8 |
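
The figures in this table follow from simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. At FP16 (2 bytes per parameter) a 70B model needs about 140 GB for weights alone, which is consistent with the two 80 GB cards recommended above, while an 8B model at 4 bits needs about 4 GB, so the minimum column appears to assume quantized weights. The sketch below makes that estimate explicit. The function names and the 20% overhead factor for KV cache and activations are assumptions; real overhead depends on context length and batch size.

```python
def estimate_weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory for model weights only, in GB (10^9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # billions of parameters x bytes per parameter = gigabytes
    return params_billions * bytes_per_param


def estimate_vram_gb(
    params_billions: float,
    bits_per_param: int,
    overhead_fraction: float = 0.2,  # assumption: rough allowance for KV cache and activations
) -> float:
    """Weights plus a flat overhead allowance, in GB."""
    weights = estimate_weight_memory_gb(params_billions, bits_per_param)
    return weights * (1 + overhead_fraction)


for size in (8, 13, 70, 405):
    fp16 = estimate_vram_gb(size, 16)
    int4 = estimate_vram_gb(size, 4)
    print(f"{size}B params: ~{fp16:.0f} GB at FP16, ~{int4:.0f} GB at INT4")
```

Treat the result as a first-pass sizing guide only; long contexts or large batches can push the KV cache well beyond a flat percentage.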
Deployment Patterns
| Pattern | Latency | Throughput | Cost |
|---|---|---|---|
| Dedicated GPU | Low | High | High fixed |
| Shared GPU (serverless) | Variable | Variable | Pay per use |
| CPU inference | High | Low | Low |
| Edge (NPU) | Very low | Low | Device cost |
| Quantized (INT4) | Low | High | Lower (less VRAM needed) |
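
The cost column implies a break-even point between a dedicated GPU (high fixed cost) and a pay-per-use service: below a certain request volume the serverless option is cheaper, above it the dedicated GPU wins. A minimal sketch of that comparison follows. The function names and both prices are hypothetical placeholders, not quotes from any provider.

```python
def break_even_requests(dedicated_monthly_cost: float, cost_per_request: float) -> float:
    """Monthly request volume at which both options cost the same."""
    if cost_per_request <= 0:
        raise ValueError("cost_per_request must be positive")
    return dedicated_monthly_cost / cost_per_request


def cheaper_option(
    requests_per_month: int,
    dedicated_monthly_cost: float,
    cost_per_request: float,
) -> str:
    """Return which pattern costs less at the given monthly volume."""
    serverless_cost = requests_per_month * cost_per_request
    return "dedicated" if dedicated_monthly_cost < serverless_cost else "serverless"


# Hypothetical prices for illustration only
DEDICATED_MONTHLY = 1500.0  # fixed monthly cost of one dedicated GPU
PER_REQUEST = 0.002  # pay-per-use price per request

print(f"break-even: {break_even_requests(DEDICATED_MONTHLY, PER_REQUEST):,.0f} requests/month")
for volume in (100_000, 2_000_000):
    print(volume, cheaper_option(volume, DEDICATED_MONTHLY, PER_REQUEST))
```

The comparison covers cost only; the latency and throughput columns still apply, so a latency-sensitive service may justify a dedicated GPU below the break-even volume.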
Key Tools for Hardware Deployment
| Tool | Purpose |
|---|---|
| vLLM | High-throughput LLM serving |
| Ollama | Local model serving |
| TensorRT | NVIDIA GPU optimization |
| ONNX Runtime | Cross-platform inference |
| llama.cpp | CPU/GPU inference (GGUF) |
| Triton Inference Server | Enterprise model serving |
| BentoML | Model packaging + serving |