EXP · 02 — Experiment

Local LLMs vs API Calls

Hypothesis

Local LLMs could replace API calls for our use case, reducing costs and improving latency. We wanted to test whether open-source models could match GPT-4 quality for our specific workflows.

The goal was to run everything locally on our M3 Mac Studio (192GB RAM) instead of paying $0.03/1K tokens to OpenAI. If we could cut costs by 80% without sacrificing quality, we'd switch permanently.

What We Tested

We tried 6 different models (Llama 3.1, Mistral 7B, Qwen 2.5, Gemma 2, DeepSeek, CodeLlama) across 3 use cases: code generation, data extraction, and summarization.

For each model, we ran 100 test cases per use case (300 per model), measuring accuracy against human-validated ground truth, generation speed (tokens/sec), and total cost (hardware amortization plus electricity vs API fees).
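The measurement loop above can be sketched as a small harness. This is a minimal sketch, not our actual tooling: `generate` and `score` are hypothetical stand-ins for a model client and a grading function, and whitespace splitting is only a crude token proxy.

```python
import time

def benchmark(generate, test_cases, ground_truth, score):
    """Run test cases through a model and collect accuracy + throughput.

    generate(prompt) -> text and score(output, expected) -> float in [0, 1]
    are hypothetical stand-ins for a real model client and grader.
    """
    total_score = 0.0
    total_tokens = 0
    start = time.perf_counter()
    for prompt, expected in zip(test_cases, ground_truth):
        output = generate(prompt)
        total_tokens += len(output.split())  # crude token count proxy
        total_score += score(output, expected)
    elapsed = time.perf_counter() - start
    return {
        "accuracy": total_score / len(test_cases),
        "tokens_per_sec": total_tokens / elapsed if elapsed > 0 else 0.0,
    }
```

In practice the same harness runs against both the local models and the API client, so the accuracy and tokens/sec numbers are directly comparable.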

What We Learned

Local models matched API quality for 2/3 use cases (code generation + summarization) but fell short on complex data extraction (78% accuracy vs 94% for GPT-4).

Switching to local for code generation and summarization reduced our monthly costs from $1,200 to $240 (80% savings) while maintaining 90%+ quality. Data extraction stayed on GPT-4.
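The savings figure is simple arithmetic, and it also lets us estimate hardware payback. The monthly figures come from this experiment; the Mac Studio price below is a placeholder assumption, not a quoted figure.

```python
# Monthly spend figures from the experiment; the hardware price is
# an assumed placeholder, not a quoted number.
api_monthly = 1200.0      # previous all-API spend ($/month)
local_monthly = 240.0     # spend after moving two workloads local
hardware_cost = 6500.0    # assumed M3 Mac Studio (192GB) price ($)

savings_pct = 1 - local_monthly / api_monthly
payback_months = hardware_cost / (api_monthly - local_monthly)
print(f"savings: {savings_pct:.0%}, payback: {payback_months:.1f} months")
```

At the assumed hardware price, the machine pays for itself in well under a year of API savings.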

Generation speed was actually better locally (120 tokens/sec vs 60 tokens/sec via the API), since we eliminated network round-trips and rate-limit backoffs.

Next Steps

We're now running Qwen 2.5 32B for code generation and Llama 3.1 70B for summarization, both locally. Data extraction remains on GPT-4 via API.
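The split above amounts to a small task router. A minimal sketch, where the model identifiers are illustrative and the backend label would map onto real clients (e.g. a local runtime vs the OpenAI SDK) in actual use:

```python
# Route each task type to the backend chosen in this experiment.
# Model identifiers are illustrative; wiring them to real clients
# (local runtime vs OpenAI API) is left out of this sketch.
ROUTES = {
    "code_generation": ("local", "qwen2.5-32b"),
    "summarization":   ("local", "llama3.1-70b"),
    "data_extraction": ("api",   "gpt-4"),
}

def route(task_type):
    """Return (backend, model) for a task, defaulting to the API."""
    return ROUTES.get(task_type, ("api", "gpt-4"))
```

Defaulting unknown task types to the API keeps quality as the fallback while the local models handle the two workloads they already match.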

Next experiment: fine-tuning Mistral on our historical data extraction tasks to see if we can close the accuracy gap and eliminate the API dependency entirely.
