
Local LLM

Status: Active

Self-hosted AI inference with Ollama running on 3x NVIDIA Tesla V100 GPUs (96GB total VRAM). Serves an 80B-parameter model at ~42 tokens/sec for development, personal-assistant, and content-generation workloads, with zero cloud dependency.

Tags: Ollama · AI · Self-hosted · NVIDIA V100

Configuration

  • GPUs: 3x Tesla V100-PCIE-32GB
  • Total VRAM: 96GB
  • Speed: ~42 tok/s (80B model)
  • Access: LAN only, not exposed to the internet

Case Study: GPU-Accelerated Local AI Infrastructure

The Challenge

Run production AI inference for a personal assistant (Nova), 4 parallel AI coding workers, blog generation, and voice transcription — without recurring cloud API costs eating into a bootstrapped business budget.

The Solution

  • Deployed 3x NVIDIA Tesla V100 GPUs, with Ollama serving an 80B MoE model split across all three
  • Custom model import pipeline (sharded GGUF download, merge, Ollama import) for models too large for standard pull
  • GPU-accelerated Whisper (large-v3-turbo) for voice transcription on dedicated GPU
  • Integrated with Claude Code workers — Ollama handles ~90% of queries for free
  • Nova personal assistant routes queries through 3 tiers: regex (free) → Ollama (free) → Claude API (paid)
  • Weekly blog posts auto-generated via a llama3.3:70b pipeline
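Nova's three-tier routing can be sketched as a small classifier. This is a hypothetical sketch, not Nova's actual code: the regex patterns and the prompt-length heuristic are illustrative stand-ins for whatever rules the real router applies.

```python
import re

# Hypothetical tier-1 rules: these patterns are illustrative stand-ins
# for Nova's actual canned-answer triggers.
TIER1_PATTERNS = [
    r"^\s*(hi|hello|hey)\b",    # greetings
    r"^\s*what time is it\b",   # clock queries
]

def pick_tier(query: str) -> str:
    """Return which tier should handle the query: 'regex', 'ollama', or 'claude'."""
    for pattern in TIER1_PATTERNS:
        if re.search(pattern, query, re.IGNORECASE):
            return "regex"    # tier 1: canned answer, free and instant
    if len(query) < 500:      # assumed heuristic: short prompts stay local
        return "ollama"       # tier 2: free local inference on the V100s
    return "claude"           # tier 3: paid API for heavy prompts
```

The ordering is the point: each tier is tried only when the cheaper one declines, which is how ~90% of queries end up costing nothing.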

The Results

  • Monthly API savings: $200-400
  • Inference speed (80B model): ~42 tok/s
  • Data kept local: 100%
  • Parallel AI sessions: 4

Available Models

| Model | Parameters | Purpose | VRAM |
|---|---|---|---|
| qwen3-coder-next | 80B MoE | Primary model: coding, reasoning, tool calling, Nova assistant | ~66GB (Q6_K across 3 GPUs) |
| llama3.3:70b | 70B | Long-form content generation, weekly blog posts | ~57GB (Q6_K) |
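Models this size are distributed as sharded GGUF files too large for a standard `ollama pull`, hence the custom import pipeline. A minimal sketch, assuming llama.cpp's `llama-gguf-split` tool is on the PATH; the file names and model name are illustrative:

```python
import pathlib
import subprocess

def build_modelfile(merged_gguf: pathlib.Path) -> str:
    # Minimal Ollama Modelfile: just point FROM at the merged weights
    return f"FROM {merged_gguf}\n"

def import_sharded_gguf(first_shard: pathlib.Path, model_name: str) -> None:
    """Merge ...-00001-of-0000N.gguf shards and register the result with Ollama."""
    merged = first_shard.with_name(f"{model_name}.gguf")
    # llama.cpp's split tool rejoins the downloaded shards into one file
    subprocess.run(
        ["llama-gguf-split", "--merge", str(first_shard), str(merged)],
        check=True,
    )
    modelfile = first_shard.with_name("Modelfile")
    modelfile.write_text(build_modelfile(merged))
    # import the merged weights into Ollama's local model store
    subprocess.run(
        ["ollama", "create", model_name, "-f", str(modelfile)],
        check=True,
    )
```

A real pipeline would add quantization parameters and a chat template to the Modelfile; this shows only the merge-and-import skeleton.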

Benefits

Cost Reduction

Free inference for routine tasks (summarization, status checks, quick questions) that would otherwise consume paid API calls.

Privacy

Sensitive code and data never leaves the local network. No cloud provider sees your prompts or responses.

Speed

Local inference with no network latency. Responses start immediately without waiting for API round-trips.

Availability

Works offline and during API outages. Not dependent on external service availability.

Integrations

  • Claude Code Workers: workers use Ollama for simple tasks before falling back to paid APIs (~90% handled locally)
  • Nova Assistant: personal AI assistant uses Ollama as its primary model, with Claude API fallback
  • Ollama MCP Server: Model Context Protocol server for standardized LLM access
  • Blog Generation: weekly blog posts auto-generated via the llama3.3:70b pipeline
  • Whisper Transcription: GPU-accelerated speech-to-text for voice input via Nova
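Integrations like the blog pipeline talk to Ollama over its plain HTTP API. A sketch of the request side, using Ollama's real `/api/generate` endpoint on its default port; the prompt template and `generate_post` helper are hypothetical:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_blog_request(topic: str) -> urllib.request.Request:
    # The prompt template is a hypothetical stand-in for the real pipeline's.
    payload = {
        "model": "llama3.3:70b",
        "prompt": f"Write a blog post about {topic}.",
        "stream": False,  # return one JSON object instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def generate_post(topic: str) -> str:
    # Only works on the LAN where Ollama is reachable
    with urllib.request.urlopen(build_blog_request(topic)) as resp:
        return json.loads(resp.read())["response"]
```

Because the server is LAN-only, the same code fails fast from outside the network, which doubles as a safety check that no prompt ever leaves the local machines.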
