Conceptual landing page. A lot needs building. Numbers and feature claims are aspirational, PRFAQ-style.
Run LLM experiments in days, not weeks

The LLM training codebase
your agents wish you had.

Most LLM projects are duct tape — scattered scripts, no tests, configs agents can't follow. Fleaberry is typed, tested, and built for agents: fine-tuning recipes, GPU provisioning across clouds, model registry and storage, annotation UIs, vibe-check playgrounds, and deployment patterns. Go from idea to deployed inference in one codebase.

30+
Optimized recipes
<5m
CI time
0%
Bash scripts
$0
Platform Fees
Terminal
$ git clone <repo> && cd fleaberry
$ claude
You: Find the fastest small model for on-device inference. <2B params, best MMLU in that range, latency benchmarks on iPhone 15 Pro. Compare 5+ candidates.
query
├─ model search scanning HuggingFace...
├─ MMLU benchmarks aggregating scores...
├─ CoreML export converting Qwen3-0.6B...
├─ CoreML export converting SmolLM2-360M...
└─ latency tests waiting for exports...

The pace of your project determines its outcome

ML projects are time-bound. Sequential execution means fewer experiments, fewer insights, fewer course corrections. Parallel execution means you learn faster and adapt before it's too late.

// human_only 3 weeks
Active
Sleep
Meeting
Blocked
$4,200
Project stalls while you take meetings, fiddle with infra, sleep, wait for each other...
// with_agents 5 days
Your time
Sleep
Agent
GPU
$380
Agents work while you sleep. Training runs overnight. You review results over coffee.

See it in action: a real project

Watch how an ML engineer and AI agents collaborate. The agent handles performance tuning, cloud selection, and experiment infrastructure — you focus on the ML.

fleaberry — tool-calling-finetuning

// the_scenario

You're building a SaaS for sales pipeline management. Users ask analytical questions in natural language: "Show me deals closing this quarter over $50k" or "Which reps have the lowest conversion rate?" Your system needs to convert these to the right tool calls with correct parameters. You're using GPT-4.1 but it's only ~70% accurate on your domain-specific tools — users complain about failed queries. It's also costing $8k/month. Time to train a specialized model.

Y
You Day 1, 9:00 AM
GPT-4.1 is only hitting 70% accuracy on our tool-calling task and it's costing $8k/month. Before we fine-tune, I want to create a benchmark with ~300 examples so we can figure out how well different models do at this. Can you set up an annotation interface? Use Qwen-4B to generate candidates — easier for licensing.
A
Agent Day 1, 9:15 AM
Done. The interface is live. You enter queries, I'll generate candidate tool calls using Qwen-4B, and you correct them. Keyboard shortcuts: j/k to navigate, e to edit, Enter to approve. Ready when you are.
Y
Your turn Day 1–3
Label benchmark examples
Examples labeled 300/300
A
Agent Day 3, 4:00 PM
Labeling complete — all 300 examples done. Averaged about 30 seconds per sample, total time about 2.5 hours. Great — now we have a ground truth dataset. Let me benchmark existing models to establish baselines. I'll run your test set against Claude Sonnet 4.5, Gemini Flash, and Qwen3 8B. This will help us understand what accuracy is achievable.
A
Agent Day 3, 4:30 PM
Benchmark complete. Here's what I found:
Model Benchmark Results
Model                Tool    Tool + Params
Claude Sonnet 4.5     94%        91%
Gemini Flash          82%        76%
Qwen3 8B              68%        51%
Qwen3 8B is our best small OSS candidate at 51% tool + params accuracy. We need to close the gap to Claude Sonnet 4.5's 91%. Want me to try prompt optimization first before we commit to fine-tuning?
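The two columns read as nested metrics: did the model pick the right tool, and did it also get every parameter right. A minimal sketch of that scoring logic (function and field names are illustrative, not from the codebase):

```python
import json

def score_call(pred_json: str, gold: dict) -> tuple[bool, bool]:
    """Score one predicted tool call against a gold label.

    Returns (tool_correct, tool_and_params_correct).
    """
    try:
        pred = json.loads(pred_json)
    except json.JSONDecodeError:
        return (False, False)  # unparseable output fails both metrics
    tool_ok = pred.get("tool") == gold["tool"]
    params_ok = tool_ok and pred.get("params") == gold["params"]
    return (tool_ok, params_ok)

def benchmark(preds: list[str], golds: list[dict]) -> tuple[float, float]:
    """Aggregate per-example scores into the two accuracy columns."""
    scores = [score_call(p, g) for p, g in zip(preds, golds)]
    n = len(scores)
    tool_acc = sum(t for t, _ in scores) / n
    full_acc = sum(f for _, f in scores) / n
    return (tool_acc, full_acc)
```

Tool accuracy upper-bounds tool + params accuracy, which is why every row's second column is lower than its first.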
Y
You Day 3, 4:35 PM
Yeah, run DSPy optimization first. In my experience you can sometimes close a 20-point gap with the right prompt structure. If that doesn't work we'll fine-tune.
A
Agent Day 3, 6:00 PM
Ran prompt optimization experiments on Qwen3 8B:
Prompt Optimization Results
Zero-shot baseline 51%
Few-shot (5 examples) 54%
Chain-of-thought 56%
Structured output + CoT 58%
Best result: 58% (+7%). Not enough to close the gap to Claude's 91%. We'll need to fine-tune. I recommend GRPO — it works well for tool-calling tasks. Want me to set up the verifier for reward modeling?
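The four prompt variants differ mainly in how the chat messages are assembled. A hedged sketch of that kind of builder (the instruction wording and example content are invented for illustration):

```python
def build_messages(query: str, fewshot: list[tuple[str, str]],
                   use_cot: bool) -> list[dict]:
    """Assemble a chat prompt: system instruction, optional few-shot
    (query, tool-call) pairs as prior turns, then the user query."""
    system = "Convert the user's question into a JSON tool call."
    if use_cot:
        system += " Think step by step, then emit the JSON on the last line."
    messages = [{"role": "system", "content": system}]
    for q, call in fewshot:
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": call})
    messages.append({"role": "user", "content": query})
    return messages
```

The zero-shot baseline is `build_messages(query, [], False)`; the other rows add few-shot turns, the CoT instruction, or both, with the best variant also constraining decoding to structured output.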
Y
You Day 3, 6:05 PM
Good call on GRPO over DPO. A parsing-correctness verifier isn't perfect, but we have lots of unlabeled input data and not many labels, which suits GRPO — it should generalize better than training on just the labeled set. Make sure you use a KL penalty though, I've seen RL training collapse into degenerate outputs without it. What do you need from me?
A
Agent Day 4, 10:00 AM
Verifier built. Using your gold labels as positives and the model's wrong outputs as negatives. KL penalty is in the config as you suggested. I need you to verify about 50 samples where the verifier disagrees with the label — this calibrates the reward signal.
Verify disagreements 0/50
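A verifier like this boils down to a reward function over sampled completions: parse the call, compare it to the gold label, and grade partial credit so the policy gets signal even when it is only partly right. A minimal sketch, assuming graded scoring (the weights and names are illustrative, not the actual verifier):

```python
import json

def reward(completion: str, gold: dict) -> float:
    """Graded reward for GRPO: 0 for unparseable output, partial credit
    for the right tool, full credit only when all params match."""
    try:
        call = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    if call.get("tool") != gold["tool"]:
        return 0.1  # parseable JSON but wrong tool
    gold_params = gold["params"]
    pred_params = call.get("params", {})
    if pred_params == gold_params:
        return 1.0
    # fraction of gold params reproduced exactly
    hit = sum(1 for k, v in gold_params.items() if pred_params.get(k) == v)
    return 0.3 + 0.6 * hit / max(len(gold_params), 1)
```

The KL penalty lives in the trainer config rather than here; the reward only needs to rank completions sensibly within each sampled group.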
Y
You Day 4, 11:00 AM
Done. Took about an hour. Most of the verifier decisions were correct.
A
Agent Day 4, 11:02 AM
Verifier calibrated. 87% agreement with human judgment — good enough for GRPO. Finding the most cost-effective GPU for training.
Y
You Day 4, 11:05 AM
For the hyperparam search — start with lr around 1e-6 to 5e-6. GRPO is sensitive to learning rate and 8B models can destabilize fast. Also use a warmup scheduler, maybe 100 steps.
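The warmup suggestion amounts to scaling the learning rate linearly over the first N steps before handing off to the main schedule. A dependency-free sketch (constant after warmup; a real schedule would typically decay):

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int = 100) -> float:
    """Linear warmup from ~0 to base_lr over warmup_steps, then constant.

    Keeps early updates small so an 8B policy doesn't destabilize before
    the reward signal settles."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

With PyTorch, the same curve drops into `torch.optim.lr_scheduler.LambdaLR` as a multiplicative factor on the base learning rate.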
A
Agent Day 4, 2:00 PM
Ran identical training jobs across providers. For an 8B model, here's the cost/performance tradeoff:
GPU     Rate        Est. total
H100    $3.99/hr    ~$48
L40S    $0.99/hr    ~$12    Best value ✓
A10     $0.44/hr    ~$27
Recommending L40S at $0.99/hr — best throughput per dollar for this model size. Total estimated cost to train: $12. Starting hyperparameter search now.
A
Agent Day 5, 9:00 AM
Running the hyperparameter search with your suggested lr range, plus a couple of wider points for reference, and 100-step warmup. The first runs already show why you warned me — the model collapsed at 5e-5.
Hyperparameter Search
Run 1: lr=5e-5, kl=0.01 collapsed
Run 2: lr=1e-5, kl=0.01 reward hack
Run 3: lr=2e-6, kl=0.03 stable, 78%
Run 4: lr=2e-6, kl=0.05 stable, 84%
Found stable config: lr=2e-6, kl_coef=0.05. Ready to run full training with 3 seeds. You can monitor from the iOS app.
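The "group" in GRPO: sample several completions per prompt, score each with the verifier, and normalize rewards within the group, so each sample's advantage is relative to its siblings rather than a learned value baseline. A minimal sketch of that normalization:

```python
def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Mean-center and std-normalize verifier rewards within one
    prompt's sample group (the group-relative baseline in GRPO)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Samples above the group mean get positive advantage and are reinforced; the KL term (the kl_coef in the config) then keeps the updated policy close to the reference model.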
Y
You Day 5, 9:30 AM
Let's do it. I'll watch from my phone.
A
Agent Day 5–7
Training in progress. Here's the live view:
Run #6 — seed 3 RUNNING
0.85
Reward
91%
Eval Acc
2.1k
Steps
A
Agent Day 7, 3:00 PM
Training complete. Best checkpoint from seed 3 achieves 91% accuracy on held-out test set — matching Claude Sonnet 4.5. Model pushed to Hugging Face Hub.
Final Results
51% → 91%
Accuracy
~$70
Training Cost
$0.0001
Per Query
900x
Cost Savings
From zero to production-ready in one week, with only a few hours of your hands-on time (labeling plus verification). Ready to deploy?
Y
You Day 7, 4:00 PM
Before we deploy — I'm wondering if we could make the model smaller. Can you try some quantization techniques and see what the accuracy tradeoff looks like?
A
Agent Day 7, 6:00 PM
Tested several quantization approaches on the fine-tuned model:
Quantization Results
FP16 baseline 91% / 16GB
INT8 (bitsandbytes) 90% / 8GB
INT4 (GPTQ) 88% / 4GB
INT4 (AWQ) 89% / 4GB
AWQ INT4 is the best tradeoff — only 2% accuracy drop while cutting memory by 75%. The errors it introduces are mostly on complex multi-parameter tool calls:
Error bucket: multi-param calls (3 examples)
× "deals closing Q1 over 50k by region" → missed region param
× "conversion by rep last 90 days" → wrong date range format
× "pipeline value by stage weighted" → dropped weighted flag
Full analysis in the report: quantization_analysis.html
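The memory column follows directly from bits per weight: an 8B-parameter model at 16 bits is roughly 16 GB, and INT4 cuts that to roughly 4 GB. A quick sanity check of the table's arithmetic (this ignores activation memory and quantization overhead such as scales and zero-points):

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Approximate weight storage: params × bits / 8 bytes, in GB."""
    return n_params * bits / 8 / 1e9

# 8B parameters at each precision in the table
fp16 = weight_memory_gb(8e9, 16)  # 16 GB
int8 = weight_memory_gb(8e9, 8)   # 8 GB
int4 = weight_memory_gb(8e9, 4)   # 4 GB
```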
Y
You Day 7, 6:15 PM
89% is acceptable. Let's go with AWQ. Deploy that version.
A
Agent Day 7, 6:30 PM
Deployed. Here's a preview link to the demo: https://demo.fleaberry.dev/tool-calling
Y
You Day 8, 10:00 AM
Deployed. Working great. I wonder if it could be faster though. Our Python serving code is fine but I'm wondering if a C++ inference server would help on our hardware. Can you try rewriting it and comparing?
A
Agent Day 8, 2:30 PM
Done. Rewrote the inference server in C++ using llama.cpp as the backend. Ran the same benchmark on your hardware:
Inference Benchmark (your hardware)
Server             tok/s    p99 latency
Python (vLLM)        847        142ms
C++ (llama.cpp)    1,016        118ms
+20% throughput and 17% lower p99 latency. The C++ server is in services/inference-cpp/. Both versions share the same model weights and API contract — you can A/B test or switch over completely.
A
Agent Day 8, 2:35 PM
Production REST endpoint deployed: https://api.fleaberry.dev/v1/tool-calling
To integrate into your application, send POST requests with your user queries:
Integration example
POST https://api.fleaberry.dev/v1/tool-calling
Content-Type: application/json

{
  "query": "Show me deals closing this quarter over $50k",
  "context": { "user_id": "..." }
}

→ Returns: { "tool": "get_deals", "params": {...} }
The endpoint handles authentication, rate limiting, and error handling. Full API docs: api.fleaberry.dev/docs
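The same call sketched as a stdlib-only Python client (the URL is the endpoint above; any auth header is omitted since the scheme isn't specified):

```python
import json
import urllib.request

API_URL = "https://api.fleaberry.dev/v1/tool-calling"

def build_request(query: str, user_id: str) -> urllib.request.Request:
    """Build the POST request shown in the integration example."""
    body = json.dumps({"query": query, "context": {"user_id": user_id}})
    return urllib.request.Request(
        API_URL,
        data=body.encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def resolve_tool_call(query: str, user_id: str) -> dict:
    """Send the query; returns a dict like {"tool": ..., "params": ...}."""
    with urllib.request.urlopen(build_request(query, user_id), timeout=10) as r:
        return json.loads(r.read())
```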
$

Everything included

Not a framework. Not a CLI. A complete production system you fork and own. Every checkbox here works on day one.

GPU

Spin up GPUs across RunPod regions
Deploy inference endpoints to Modal
Run 3,300 tok/s inference on your Mac
Auto-terminate pods to prevent runaway costs

Mobile

Monitor training runs from your phone
Watch training metrics update live
Start agent tasks from your phone
Log in with Face ID

ML

Use any training code — PyTorch, TRL, VERL, SkyRL, PrimeRL, anything
Train fast with prebuilt containers
Stream training logs in real-time
Push checkpoints to Hugging Face Hub

AI Agents

Sleep while agents babysit training runs
Run text-to-speech demos
Orchestrate multi-step agent workflows
Query models via OpenAI-compatible API

Backend

Deploy to Cloudflare Workers
Store data in D1 SQLite
Manage state with Durable Objects
WebSocket real-time updates

Build

Build hermetically with Bazel
Cross-compile to Linux x86_64
Query everything via GraphQL
One command to deploy anywhere

Four platforms, one build

Train on cloud GPUs. Monitor from your phone. Deploy services globally. Run inference locally. All from the same codebase.

GPU Training // RunPod + Modal

Provision H100s, L40S, or A100s with one command. Automatic crash recovery, log streaming, and self-terminating pods to prevent runaway costs. Supports GRPO, SFT, DPO training paradigms.

RunPod GraphQL Modal endpoints W&B tracking Heartbeat monitoring
📱

iOS App // SwiftUI

Monitor training runs from anywhere. Real-time metrics charts, event timelines, log streaming, full-text search. One-command deployment to physical devices with passkey authentication.

Swift Charts Auto-refresh Passkey auth Bazel build

Edge Services // Cloudflare

Auth, training metadata, annotation UIs, AI agent orchestration — all deployed as Workers with D1 databases and Durable Objects. Global edge, no servers, scales to zero.

Workers D1 SQLite Durable Objects WebSockets

Local Inference // MLX

Run models locally on Apple Silicon at 3,300+ tokens/sec for Qwen3-0.6B, ~1,000 tok/s for 4B models. Continuous batching, JSONL streaming, OpenAI-compatible chat format.

M-series optimized Batch inference Streaming output Benchmarks

What you actually get

This isn't about the code — it's about what you can do with it.

Train LLMs today

Clone, configure, run. Real GRPO and SFT training on cloud GPUs in minutes, not weeks of setup.

Monitor anywhere

Check training progress from your phone. Get real-time metrics without opening a laptop.

Let agents build

AI coding agents can implement features autonomously via the ticket workflow system.

Own everything

No vendor lock-in. No platform fees. No surprise bills. It's your code forever.

Optimized recipes included

Not hello world tutorials. Production-tested experiments you can adapt.

// experiments/grpo

GRPO Training

Reinforcement learning with policy optimization. Use case: training models to follow tool-calling formats, code generation with test verification.

// experiments/intent_classification

Intent Classification

Fine-tuning for multi-class classification. Use case: customer support routing, command parsing, query categorization.

// demos/tts

Text-to-Speech

Production TTS inference on Modal GPUs. Use case: voice assistants, audiobook generation, accessibility features.

Why agents love it

Types everywhere

Pydantic in Python. Rust's type system. TypeScript for services. Whole classes of bugs caught before runtime. Agents make fewer mistakes when the compiler tells them what's wrong.

No magic

No config files to guess at. No environment variables that change behavior. No implicit state. From the experiment entry point, you can trace every parameter that affects the run.

Fast feedback

Bazel caches test results. Only rerun what changed. An agent can iterate quickly because it doesn't wait 10 minutes to find out it broke something three files away.

One CLI, everything included

Everything you need to manage experiments, deploy services, and debug issues.

flea train

Run experiments

Start training runs, stream logs, check GPU availability, inspect metrics. Built for terminal power users.

flea ios

Device deployment

One command to build, sign, and deploy the iOS app to physical devices. Handles certificates and provisioning.

flea deploy

Service deployment

Deploy Cloudflare Workers with D1 database management. Handles dependency ordering and post-deploy verification.

flea query

API queries

Direct GraphQL queries to agent orchestration. Debug workflows, inspect state, trigger actions.

flea agent

Agent control

Manage agent integration. Create tickets, check agent status, view execution logs.

flea new

New experiments

Generate boilerplate for new training experiments. Proper structure, types, and integration tests from the start.

Start training today

Get started
$ git clone <your-private-repo>
$ flea train --experiment grpo
Provisioning GPU... Training SmolLM2-360M with GRPO...
Get in touch to git clone your head start