Concept #136 · Medium · extended-ai-concepts

What is hardware (H/W) deployment in AI?

#gen-ai #mlops

Answer

Hardware Deployment in AI

Hardware deployment in AI is the practice of selecting, configuring, and optimizing the physical computing infrastructure that runs AI models in production, from cloud GPUs to edge devices.

Deployment Hardware Options

| Hardware | Use Case | Providers / Examples |
|---|---|---|
| Cloud GPUs | Training + large inference | AWS, GCP, Azure, Lambda Labs |
| On-premise GPUs | Privacy, cost at scale | NVIDIA DGX, consumer RTX |
| Edge devices | Low latency, offline inference | Raspberry Pi, Jetson, Apple M-series |
| Specialized AI chips | High efficiency inference | Google TPU, Groq LPU, Cerebras |
| NPUs | Mobile/laptop AI | Apple Neural Engine, Qualcomm Hexagon |
| CPUs | Small models, preprocessing | Any server |
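In application code, the choice among these options often shows up as a runtime device check. The sketch below is one possible way to do that with PyTorch, which is an assumption here (the hardware list above does not name a framework). The helper name `pick_device` is hypothetical, and it only distinguishes NVIDIA GPUs, Apple-silicon GPUs, and CPUs; TPUs and NPUs need their own runtimes.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can see."""
    try:
        import torch  # assumption: PyTorch is the inference framework
    except ImportError:
        return "cpu"  # without PyTorch installed, fall back to CPU

    if torch.cuda.is_available():  # NVIDIA GPU (cloud, on-premise, or Jetson)
        return "cuda"

    # Apple-silicon GPU backend; only present in newer PyTorch builds
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


print(pick_device())
```

The returned string can be passed to `model.to(...)` so the same script runs unchanged on a cloud GPU, a laptop, or a CPU-only server.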

Cloud GPU Options

python
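# AWS SageMaker deployment example
# Assumes the endpoint config "llama-config" (model + GPU instance type) already exists.
import boto3
import json

sm_client = boto3.client("sagemaker")

# Create an endpoint from the existing endpoint config
response = sm_client.create_endpoint(
    EndpointName="llama-inference",
    EndpointConfigName="llama-config"
)

# Endpoint creation is asynchronous; wait until the endpoint is InService
sm_client.get_waiter("endpoint_in_service").wait(EndpointName="llama-inference")

# Invoke the endpoint (the payload schema depends on the serving container)
runtime = boto3.client("sagemaker-runtime")
result = runtime.invoke_endpoint(
    EndpointName="llama-inference",
    ContentType="application/json",
    Body=json.dumps({"inputs": "What is machine learning?"})
)
print(json.loads(result["Body"].read()))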

On-Premise Deployment (NVIDIA GPU)

bash
# Verify that the NVIDIA driver is installed and the GPU is visible
nvidia-smi

# Install and run the inference server (vLLM)
pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90 \
    --port 8000

python
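# Use vLLM via OpenAI-compatible API
from openai import OpenAI

# The client requires an api_key value; vLLM only checks it if the server was started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)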

Edge Deployment (NVIDIA Jetson)

python
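# Optimized inference on Jetson (TensorRT)
import tensorrt as trt

def build_trt_engine(onnx_path: str) -> trt.IHostMemory:
    """Parse an ONNX model and return a serialized FP16 TensorRT engine."""
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    # Explicit-batch network, as required by the ONNX parser in TensorRT 8.x
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    with open(onnx_path, "rb") as f:
        # parse() returns False on failure, so surface the parser errors
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB workspace
    config.set_flag(trt.BuilderFlag.FP16)  # Use FP16 on Jetson

    # Returns the serialized engine bytes (None on failure), not an ICudaEngine;
    # load it with trt.Runtime(logger).deserialize_cuda_engine(...)
    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("TensorRT engine build failed")
    return serialized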

Selecting Hardware Based on Model Size

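| Model Size | Minimum Hardware | Recommended |
|---|---|---|
| < 3B params | 4 GB VRAM | RTX 3060 / M2 Mac |
| 7-8B params | 8 GB VRAM | RTX 3080 / A100 40GB |
| 13B params | 16 GB VRAM | RTX 4090 / A100 |
| 70B params | 80 GB VRAM | A100 80GB × 2 |
| 405B+ params | 8× A100 cluster | H100 × 8 |

The minimum figures only hold for quantized weights (roughly 4 to 8 bits per parameter). Unquantized FP16/BF16 weights need about 2 bytes per parameter, so an 8B model takes about 16 GB for the weights alone.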
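The sizes in the table can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly parameters × bits per parameter / 8. The sketch below is a rough estimator, not a sizing tool. The function name and the 20% overhead allowance for KV cache, activations, and runtime buffers are assumptions made for illustration; real overhead depends on batch size and context length.

```python
def estimate_memory_gb(params_billion: float,
                       bits_per_param: int = 16,
                       overhead_fraction: float = 0.2) -> float:
    """Rough memory estimate in GB (10^9 bytes) for serving a model.

    overhead_fraction is an assumed allowance for KV cache, activations,
    and runtime buffers; it is not a measured value.
    """
    weight_bytes = params_billion * 1e9 * bits_per_param / 8
    return weight_bytes * (1 + overhead_fraction) / 1e9


# Compare FP16 and INT4 footprints for the model sizes in the table
for size in (3, 8, 13, 70, 405):
    fp16 = estimate_memory_gb(size, bits_per_param=16)
    int4 = estimate_memory_gb(size, bits_per_param=4)
    print(f"{size}B params: FP16 ~{fp16:.0f} GB, INT4 ~{int4:.0f} GB")
```

Under these assumptions a 70B model needs well over 80 GB in FP16, which is why it is usually split across several GPUs or quantized.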

Deployment Patterns

| Pattern | Latency | Throughput | Cost |
|---|---|---|---|
| Dedicated GPU | Low | High | High fixed |
| Shared GPU (serverless) | Variable | Variable | Pay per use |
| CPU inference | High | Low | Low |
| Edge (NPU) | Very low | Low | Device cost |
| Quantized (INT4) | Low | High | Less VRAM |
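The trade-off between a dedicated GPU (high fixed cost) and serverless (pay per use) comes down to a break-even request volume. The sketch below shows that arithmetic. All function names are hypothetical and the prices in the usage lines are made-up placeholders, not quotes from any provider.

```python
def monthly_cost_dedicated(hourly_rate: float, hours_per_month: float = 730.0) -> float:
    """Fixed monthly cost of a GPU instance that runs all month."""
    return hourly_rate * hours_per_month


def monthly_cost_serverless(requests_per_month: float,
                            seconds_per_request: float,
                            price_per_second: float) -> float:
    """Pay-per-use cost: billed only for the GPU seconds actually consumed."""
    return requests_per_month * seconds_per_request * price_per_second


def break_even_requests(hourly_rate: float,
                        seconds_per_request: float,
                        price_per_second: float,
                        hours_per_month: float = 730.0) -> float:
    """Monthly request count above which the dedicated GPU is cheaper."""
    fixed = monthly_cost_dedicated(hourly_rate, hours_per_month)
    return fixed / (seconds_per_request * price_per_second)


# Placeholder prices for illustration only
threshold = break_even_requests(hourly_rate=2.0,
                                seconds_per_request=2.0,
                                price_per_second=0.0005)
print(f"Dedicated GPU is cheaper above ~{threshold:,.0f} requests per month")
```

Below the threshold, serverless is cheaper but latency is less predictable because of cold starts and shared capacity, which matches the "Variable" entries in the table.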

Key Tools for Hardware Deployment

| Tool | Purpose |
|---|---|
| vLLM | High-throughput LLM serving |
| Ollama | Local model serving |
| TensorRT | NVIDIA GPU optimization |
| ONNX Runtime | Cross-platform inference |
| llama.cpp | CPU/GPU inference (GGUF) |
| Triton Inference Server | Enterprise model serving |
| BentoML | Model packaging + serving |