Tiny Lab
Fast AI deployments that ship to production and measure real impact. Documenting what works, what fails, and what we learn along the way.
Customer Support Triage Agent
We built an AI agent to automate customer support ticket triage, shipped it live, and measured a 40% reduction in response time. The agent categorizes by urgency, routes to the right team, and suggests responses.
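The categorize-and-route flow can be sketched in a few lines. The team names, urgency labels, and the `classify_ticket` stub below are illustrative stand-ins, not our production code (the real classifier is a model call):

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    subject: str
    body: str

# Hypothetical category -> team mapping for the sketch.
ROUTES = {
    "outage": "on-call",
    "billing": "payments-team",
    "how-to": "support-tier-1",
}

def classify_ticket(ticket: Ticket) -> tuple[str, str]:
    """Stand-in for the model call: returns (category, urgency)."""
    text = f"{ticket.subject} {ticket.body}".lower()
    if "down" in text or "outage" in text:
        return "outage", "high"
    if "invoice" in text or "charge" in text:
        return "billing", "medium"
    return "how-to", "low"

def route(ticket: Ticket) -> str:
    """Triage a ticket: classify, then route to the owning team."""
    category, urgency = classify_ticket(ticket)
    return f"{ROUTES[category]}:{urgency}"
```

Swapping `classify_ticket` for a real model call keeps the routing table and the rest of the pipeline deterministic and testable.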
Why Multi-Agent Coordination Failed
Case study: our multi-agent system looked great in demos but collapsed in production. The problem wasn't the agents—it was the coordination layer. Agents were waiting on each other, creating cascading timeouts. Fixed by making agents fully async and adding circuit breakers.
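A minimal circuit breaker of the kind described above: after a run of consecutive failures the breaker opens, and further calls fail fast instead of queuing behind a stalled agent. The thresholds and naming here are illustrative, not our production values:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures; retry after a cooldown."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures   # consecutive failures before opening
        self.reset_after = reset_after     # seconds before allowing a trial call
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, allow one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success resets the count
        return result
```

Wrapping each inter-agent call in a breaker like this turns a cascading timeout into a fast, local error the caller can handle.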
Sales Outreach Automation
Deployed an AI system that personalizes cold emails based on prospect data scraped from LinkedIn and company websites. Measured a 25% lift in reply rate and a 15% increase in meeting bookings within the first week.
Local LLMs vs API Calls
We spent 3 months testing whether local LLMs could replace API calls for our use case. Tried 6 models across four families (Llama, Mistral, Qwen, Gemma) on code generation, data extraction, and summarization. Local models matched API quality on two of the three use cases while cutting costs by 80%.
3-Step AI Validation Pattern
After running 5 different AI deployments, we noticed they all required a similar 3-step validation pattern: (1) Schema validation before model call, (2) Output format check after model call, (3) Business logic validation before using results. This catches 95% of model errors.
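The three steps above can be sketched as a single pipeline. The field names and the fake `call_model` are assumptions for the demo, not a specific deployment:

```python
import json

def validate_request(payload: dict) -> None:
    # Step 1: schema validation before the model call (catches bad prompts).
    for field in ("user_id", "prompt"):
        if field not in payload:
            raise ValueError(f"missing field: {field}")

def call_model(payload: dict) -> str:
    # Stand-in for the real model call; returns a JSON string.
    return json.dumps({"category": "billing", "confidence": 0.92})

def validate_output(raw: str) -> dict:
    # Step 2: output format check after the model call (catches bad outputs).
    data = json.loads(raw)
    if "category" not in data or "confidence" not in data:
        raise ValueError("malformed model output")
    return data

def apply_business_rules(data: dict) -> dict:
    # Step 3: business logic validation before using the result.
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence out of range")
    return data

def run(payload: dict) -> dict:
    validate_request(payload)
    return apply_business_rules(validate_output(call_model(payload)))
```

The key property is that each step raises early with a specific error, so a failure tells you which layer broke: the prompt, the model, or the downstream assumptions.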
Model Selection Framework
Framework we use to pick which model for which task: (1) Does it need reasoning? Use o1/o3. (2) Is latency critical? Use Haiku. (3) Is accuracy critical? Use Opus. (4) Is cost critical? Use local. Start with this decision tree before optimizing.
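The decision tree above, written as a function. The model names come from the post; the boolean flags and the fallback default are illustrative:

```python
def pick_model(needs_reasoning: bool = False,
               latency_critical: bool = False,
               accuracy_critical: bool = False,
               cost_critical: bool = False) -> str:
    """Walk the decision tree in priority order and return a model tier."""
    if needs_reasoning:
        return "o1/o3"
    if latency_critical:
        return "Haiku"
    if accuracy_critical:
        return "Opus"
    if cost_critical:
        return "local"
    return "Haiku"  # assumed default when no constraint dominates
```

Encoding the tree as code makes the priority order explicit: reasoning wins over latency, latency over accuracy, accuracy over cost.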
Token Limits Break Everything
Quick observation: models consistently fail when given inputs with special characters. Turns out we were hitting token limits we didn't know existed. Emoji and Unicode characters tokenize into way more tokens than expected. Always use tiktoken to count tokens before sending.
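For exact counts, use tiktoken as the post says. As a rough illustration of *why* emoji blow past budgets: BPE tokenizers fall back toward byte-level pieces for rare characters, so a single emoji can cost several tokens while a common English word costs about one. The byte-based estimate and the 4-chars-per-token ratio below are loose heuristics for the demo, not a real tokenizer:

```python
def rough_token_estimate(text: str) -> int:
    """Crude upper-bound token estimate; use tiktoken for exact counts."""
    ascii_chars = sum(1 for c in text if ord(c) < 128)
    non_ascii_bytes = len(text.encode("utf-8")) - ascii_chars
    # ~4 ASCII chars per token; assume ~1 token per non-ASCII byte (worst case).
    return ascii_chars // 4 + non_ascii_bytes

def check_budget(text: str, limit: int) -> None:
    """Fail before sending rather than after a silent truncation."""
    estimate = rough_token_estimate(text)
    if estimate > limit:
        raise ValueError(f"estimated {estimate} tokens exceeds limit {limit}")
```

Note the asymmetry: a 4-byte emoji estimates higher than a 4-letter English word, which is exactly the surprise that bit us in production.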
Always Validate JSON Schema Early
Model output parsing was our biggest source of errors until we added strict JSON schema validation. Now we validate schema before even calling the model (catches bad prompts) and after (catches bad outputs). Cut production errors by 70%.
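A minimal shape check in the spirit of the post. In production you would reach for a real validator (jsonschema, Pydantic); this hand-rolled version and the expected fields are illustrative:

```python
import json

# Hypothetical expected shape: field name -> required type.
EXPECTED = {"category": str, "confidence": float}

def validate_model_output(raw: str) -> dict:
    """Parse model output and fail loudly on any schema violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"not valid JSON: {e}") from e
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, typ in EXPECTED.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], typ):
            raise ValueError(f"{key} should be {typ.__name__}")
    return data
```

The win is that every failure mode (truncated JSON, wrong shape, wrong types) surfaces as one specific `ValueError` at the boundary, instead of a `KeyError` three functions deeper.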
Data Pipeline Builder
Shipped a system that auto-generates ETL pipelines from natural language descriptions. Reduced data pipeline setup time from 3 days to 45 minutes for our analytics team.
Can AI Replace Manual Data Entry?
Tested whether vision models (GPT-4V, Claude Vision, Gemini Pro Vision) could extract structured data from scanned invoices and receipts. Ran 500 test cases across 3 document types. Vision models matched human accuracy (98%) but were 10x slower and 5x more expensive than expected.