
Local LLM

Status: Active

Self-hosted AI inference with Ollama running on 3x NVIDIA Tesla V100 GPUs (96GB total VRAM). Serves an 80B-parameter model at ~42 tokens/sec for development, personal-assistant, and content-generation workloads, with zero cloud dependency.

Tags: Ollama · AI · Self-hosted · NVIDIA V100

Configuration

  • GPUs: 3x Tesla V100-PCIE-32GB
  • Total VRAM: 96GB
  • Speed: ~42 tok/s (80B model)
  • Access: LAN only, not exposed to the internet

Case Study: GPU-Accelerated Local AI Infrastructure

The Challenge

Run production AI inference for a personal assistant (Nova), 4 parallel AI coding workers, blog generation, and voice transcription — without recurring cloud API costs eating into a bootstrapped business budget.

The Solution

  • Deployed 3x NVIDIA Tesla V100 GPUs, with Ollama serving an 80B MoE model split across all three
  • Custom model import pipeline (sharded GGUF download, merge, Ollama import) for models too large for standard pull
  • GPU-accelerated Whisper (large-v3-turbo) for voice transcription on dedicated GPU
  • Integrated with Claude Code workers — Ollama handles ~90% of queries for free
  • Nova personal assistant routes queries through 3 tiers: regex (free) → Ollama (free) → Claude API (paid)
  • Weekly blog posts auto-generated via a llama3.3:70b pipeline
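Nova's three-tier routing can be sketched as a small classifier. This is a hypothetical sketch, not Nova's actual code: the regex patterns and the prompt-length heuristic are illustrative stand-ins for whatever rules the real router applies.

```python
import re

# Hypothetical tier-1 rules: these patterns are illustrative stand-ins
# for Nova's actual canned-answer triggers.
TIER1_PATTERNS = [
    r"^\s*(hi|hello|hey)\b",    # greetings
    r"^\s*what time is it\b",   # clock queries
]

def pick_tier(query: str) -> str:
    """Return which tier should handle the query: 'regex', 'ollama', or 'claude'."""
    for pattern in TIER1_PATTERNS:
        if re.search(pattern, query, re.IGNORECASE):
            return "regex"    # tier 1: canned answer, free and instant
    if len(query) < 500:      # assumed heuristic: short prompts stay local
        return "ollama"       # tier 2: free local inference on the V100s
    return "claude"           # tier 3: paid API for heavy prompts
```

The ordering is the point: each tier is tried only when the cheaper one declines, which is how ~90% of queries end up costing nothing.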

The Results

  • Monthly API savings: $200-400
  • Inference speed (80B model): ~42 tok/s
  • Data kept local: 100%
  • Parallel AI sessions: 4

Available Models

| Model | Parameters | Purpose | VRAM |
|---|---|---|---|
| qwen3-coder-next | 80B MoE | Primary model: coding, reasoning, tool calling, Nova assistant | ~66GB (Q6_K across 3 GPUs) |
| llama3.3:70b | 70B | Long-form content generation, weekly blog posts | ~57GB (Q6_K) |
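Models this size are distributed as sharded GGUF files too large for a standard `ollama pull`, hence the custom import pipeline. A minimal sketch, assuming llama.cpp's `llama-gguf-split` tool is on the PATH; the file names and model name are illustrative:

```python
import pathlib
import subprocess

def build_modelfile(merged_gguf: pathlib.Path) -> str:
    # Minimal Ollama Modelfile: just point FROM at the merged weights
    return f"FROM {merged_gguf}\n"

def import_sharded_gguf(first_shard: pathlib.Path, model_name: str) -> None:
    """Merge ...-00001-of-0000N.gguf shards and register the result with Ollama."""
    merged = first_shard.with_name(f"{model_name}.gguf")
    # llama.cpp's split tool rejoins the downloaded shards into one file
    subprocess.run(
        ["llama-gguf-split", "--merge", str(first_shard), str(merged)],
        check=True,
    )
    modelfile = first_shard.with_name("Modelfile")
    modelfile.write_text(build_modelfile(merged))
    # import the merged weights into Ollama's local model store
    subprocess.run(
        ["ollama", "create", model_name, "-f", str(modelfile)],
        check=True,
    )
```

A real pipeline would add quantization parameters and a chat template to the Modelfile; this shows only the merge-and-import skeleton.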

Benefits

Cost Reduction

Free inference for routine tasks (summarization, status checks, quick questions) that would otherwise consume paid API calls.

Privacy

Sensitive code and data never leaves the local network. No cloud provider sees your prompts or responses.

Speed

Local inference with no network latency. Responses start immediately without waiting for API round-trips.

Availability

Works offline and during API outages. Not dependent on external service availability.

Integrations

  • Claude Code Workers: workers use Ollama for simple tasks before falling back to paid APIs (~90% handled locally)
  • Nova Assistant: personal AI assistant uses Ollama as its primary model, with Claude API fallback
  • Ollama MCP Server: Model Context Protocol server for standardized LLM access
  • Blog Generation: weekly blog posts auto-generated via the llama3.3:70b pipeline
  • Whisper Transcription: GPU-accelerated speech-to-text for voice input via Nova
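Integrations like the blog pipeline talk to Ollama over its plain HTTP API. A sketch of the request side, using Ollama's real `/api/generate` endpoint on its default port; the prompt template and `generate_post` helper are hypothetical:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_blog_request(topic: str) -> urllib.request.Request:
    # The prompt template is a hypothetical stand-in for the real pipeline's.
    payload = {
        "model": "llama3.3:70b",
        "prompt": f"Write a blog post about {topic}.",
        "stream": False,  # return one JSON object instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def generate_post(topic: str) -> str:
    # Only works on the LAN where Ollama is reachable
    with urllib.request.urlopen(build_blog_request(topic)) as resp:
        return json.loads(resp.read())["response"]
```

Because the server is LAN-only, the same code fails fast from outside the network, which doubles as a safety check that no prompt ever leaves the local machines.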
