EXP · 02 — Experiment

Local LLMs vs API Calls

Hypothesis

Local LLMs could replace API calls for our use case, reducing costs and improving latency. We wanted to test whether open-source models could match GPT-4 quality for our specific workflows.

The goal was to run everything locally on our M3 Mac Studio (192GB RAM) instead of paying $0.03/1K tokens to OpenAI. If we could cut costs by 80% without sacrificing quality, we'd switch permanently.

What We Tested

We tried 6 different models (Llama 3.1, Mistral 7B, Qwen 2.5, Gemma 2, DeepSeek, CodeLlama) across 3 use cases: code generation, data extraction, and summarization.

For each model, we ran 100 test cases per use case (300 per model), measuring accuracy against human-validated ground truth, generation speed (tokens/sec), and total cost (hardware amortization plus electricity vs API fees).
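The measurement loop above can be sketched as a small harness. This is a minimal sketch, not our actual tooling: `generate` and `score` are hypothetical stand-ins for a model client and a grading function, and whitespace splitting is only a crude token proxy.

```python
import time

def benchmark(generate, test_cases, ground_truth, score):
    """Run test cases through a model and collect accuracy + throughput.

    generate(prompt) -> text and score(output, expected) -> float in [0, 1]
    are hypothetical stand-ins for a real model client and grader.
    """
    total_score = 0.0
    total_tokens = 0
    start = time.perf_counter()
    for prompt, expected in zip(test_cases, ground_truth):
        output = generate(prompt)
        total_tokens += len(output.split())  # crude token count proxy
        total_score += score(output, expected)
    elapsed = time.perf_counter() - start
    return {
        "accuracy": total_score / len(test_cases),
        "tokens_per_sec": total_tokens / elapsed if elapsed > 0 else 0.0,
    }
```

In practice the same harness runs against both the local models and the API client, so the accuracy and tokens/sec numbers are directly comparable.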

What We Learned

Local models matched API quality for 2/3 use cases (code generation + summarization) but fell short on complex data extraction (78% accuracy vs 94% for GPT-4).

Switching to local for code generation and summarization reduced our monthly costs from $1,200 to $240 (80% savings) while maintaining 90%+ quality. Data extraction stayed on GPT-4.
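The savings figure is simple arithmetic, and it also lets us estimate hardware payback. The monthly figures come from this experiment; the Mac Studio price below is a placeholder assumption, not a quoted figure.

```python
# Monthly spend figures from the experiment; the hardware price is
# an assumed placeholder, not a quoted number.
api_monthly = 1200.0      # previous all-API spend ($/month)
local_monthly = 240.0     # spend after moving two workloads local
hardware_cost = 6500.0    # assumed M3 Mac Studio (192GB) price ($)

savings_pct = 1 - local_monthly / api_monthly
payback_months = hardware_cost / (api_monthly - local_monthly)
print(f"savings: {savings_pct:.0%}, payback: {payback_months:.1f} months")
```

At the assumed hardware price, the machine pays for itself in well under a year of API savings.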

Generation speed was actually better locally (120 tokens/sec vs 60 tokens/sec via the API), since we eliminated network round-trips and rate-limit backoffs.

Next Steps

We're now running Qwen 2.5 32B for code generation and Llama 3.1 70B for summarization, both locally. Data extraction remains on GPT-4 via API.
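The split above amounts to a small task router. A minimal sketch, where the model identifiers are illustrative and the backend label would map onto real clients (e.g. a local runtime vs the OpenAI SDK) in actual use:

```python
# Route each task type to the backend chosen in this experiment.
# Model identifiers are illustrative; wiring them to real clients
# (local runtime vs OpenAI API) is left out of this sketch.
ROUTES = {
    "code_generation": ("local", "qwen2.5-32b"),
    "summarization":   ("local", "llama3.1-70b"),
    "data_extraction": ("api",   "gpt-4"),
}

def route(task_type):
    """Return (backend, model) for a task, defaulting to the API."""
    return ROUTES.get(task_type, ("api", "gpt-4"))
```

Defaulting unknown task types to the API keeps quality as the fallback while the local models handle the two workloads they already match.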

Next experiment: fine-tuning Mistral on our historical data extraction tasks to see if we can close the accuracy gap and eliminate the API dependency entirely.
