Open Source ML Infrastructure

Train LLMs. Build iOS apps. Own everything.

Complete ML training infrastructure with GPU orchestration, real-time mobile monitoring, AI agent workflows, and local inference — all in one Bazel monorepo. Fork it once, never pay for a platform again.

4 languages · 847+ build targets · 3,300 tok/s on M3 Ultra · $0 platform fees
Terminal
$ git clone https://github.com/tensor-valley/fleaberry && cd fleaberry
$ bazel run //apps/home/ios/tools/ios-cli -- deploy
Deploying to iPhone... Training monitor ready.
$ bazel run //experiments/grpo:conductor
Provisioning H100 on RunPod... Training started.

Everything included

Not a framework. Not a CLI. A complete production system you fork and own. Every checkbox here works on day one.

  • Spin up H100s, L40S, A100s on RunPod [GPU]
  • Deploy inference endpoints to Modal [GPU]
  • Run 3,300 tok/s inference on your Mac [GPU]
  • Monitor everything from your phone [MOBILE]
  • Watch metrics update live [MOBILE]
  • Log in with Face ID [MOBILE]
  • Start agent tasks from your phone [AI]
  • Connect Cursor to your infra [AI]
  • Let agents write code for you [AI]
  • Train with GRPO reinforcement learning [ML]
  • Fine-tune models with SFT [ML]
  • Optimize with DPO preferences [ML]
  • Track experiments in W&B [ML]
  • Sleep while agents babysit training runs [AI]
  • Deploy to Cloudflare Workers [INFRA]
  • Store data in D1 SQLite [INFRA]
  • Manage state with Durable Objects [INFRA]
  • Build hermetically with Bazel [INFRA]
  • Cross-compile to Linux x86_64 [INFRA]
  • Ship fast with two-layer containers [INFRA]
  • Label training data with built-in tools [AI]
  • Run text-to-speech demos [AI]
  • Query everything via GraphQL [INFRA]
  • WebSocket real-time updates [INFRA]
  • Idempotent retry logic [INFRA]

Four platforms, one build

Train on cloud GPUs. Monitor from your phone. Deploy services globally. Run inference locally. All from the same codebase.

GPU Training // RunPod + Modal

Provision H100s, L40S, or A100s with one command. Automatic crash recovery, log streaming, and self-terminating pods to prevent runaway costs. Supports GRPO, SFT, DPO training paradigms.

RunPod GraphQL · Modal endpoints · W&B tracking · Heartbeat monitoring
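
The heartbeat monitoring above is what keeps a dead pod from burning money. A minimal sketch of the idea in Python; fetch_last_heartbeat and terminate_pod are hypothetical stand-ins, not the repo's actual API.

import time

HEARTBEAT_TIMEOUT_S = 15 * 60  # assume a run is dead after 15 minutes of silence

def watchdog(fetch_last_heartbeat, terminate_pod, pod_id: str) -> None:
    """Poll the last heartbeat timestamp and self-terminate when training goes quiet."""
    while True:
        last_seen = fetch_last_heartbeat(pod_id)   # unix timestamp of the last metric
        if time.time() - last_seen > HEARTBEAT_TIMEOUT_S:
            terminate_pod(pod_id)                  # stop paying for a runaway pod
            return
        time.sleep(60)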

iOS App // SwiftUI

Monitor training runs from anywhere. Real-time metrics charts, event timelines, log streaming, full-text search. One-command deployment to physical devices with passkey authentication.

Swift Charts · Auto-refresh · Passkey auth · Bazel build

Edge Services // Cloudflare

Auth, training metadata, annotation UIs, AI agent orchestration — all deployed as Workers with D1 databases and Durable Objects. Global edge, no servers, scales to zero.

Workers · D1 SQLite · Durable Objects · WebSockets

Local Inference // MLX

Run models locally on Apple Silicon at 3,300+ tokens/sec for Qwen3-0.6B, ~1,000 tok/s for 4B models. Continuous batching, JSONL streaming, OpenAI-compatible chat format.

M-series optimized · Batch inference · Streaming output · Benchmarks
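
To make the OpenAI-compatible chat format and JSONL streaming concrete, here is a hedged client sketch. The localhost URL, port, and response shape are assumptions, not the repo's documented interface.

import json
import requests  # assumes the requests package

URL = "http://localhost:8080/v1/chat/completions"  # hypothetical local MLX server

payload = {
    "model": "Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Summarize GRPO in one sentence."}],
    "stream": True,
}

# Assume the server emits one JSON object per line (JSONL) while it generates.
with requests.post(URL, json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if line:
            print(json.loads(line))  # each line carries a partial completion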

What you actually get

This isn't about the code — it's about what you can do with it.

Train LLMs today

Clone, configure, run. Real GRPO and SFT training on cloud GPUs in minutes, not weeks of setup.

Monitor anywhere

Check training progress from your phone. Get real-time metrics without opening a laptop.

Let agents build

AI coding agents can implement features autonomously via the ticket workflow system.

Own everything

No vendor lock-in. No platform fees. No surprise bills. It's your code forever.

AI agents that ship code

Built-in Cursor API integration for autonomous code generation. Agents take tickets, plan implementation, and submit PRs.

Workflow: Ticket created → Agent plans → Human approves
Execution: Cursor API · Durable Objects
State: SQLite per agent · Structured logs
Output: PR submitted · WebSocket updates
# Agent workflow: ticket to pull request

# 1. Create a ticket via GraphQL
mutation {
  createTicket(input: {
    title: "Add user settings page"
    description: "..."
  })
}

# 2. Agent plans, human approves
# 3. Agent executes autonomously
# 4. PR ready for review

# Real-time progress via WebSocket
# Full audit trail in SQLite
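
For the real-time progress, a few lines of Python are enough to follow a ticket; this sketch uses the websockets package, and the URL and event shape are assumptions rather than the actual Worker contract.

import asyncio
import json
import websockets  # assumes the websockets package

WS_URL = "wss://example.workers.dev/tickets/123/progress"  # hypothetical endpoint

async def follow_ticket() -> None:
    async with websockets.connect(WS_URL) as ws:
        async for message in ws:            # one JSON event per agent step
            event = json.loads(message)
            print(event.get("status"), event.get("detail"))

asyncio.run(follow_ticket())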

Working demos included

Not hello world tutorials. Real experiments from production use.

// experiments/grpo

GRPO Training

Complete reinforcement learning loop with policy optimization, instrumentation, and integration tests. CPU-compatible for local debugging.

// experiments/banking77

Intent Classification

Full fine-tuning reference for the Banking77 dataset. Shows the complete experiment lifecycle from config to evaluation.

// demos/tts

Text-to-Speech

Production TTS inference via Modal GPUs. Cloudflare Workers frontend, Qwen3-TTS model, audio streaming.

A real ML project, start to finish

See how agents and humans work together to ship a production model. Most of your time goes to labeling and experiment design — not infrastructure.

// The scenario

You're building a SaaS for sales pipeline management. Users ask analytical questions in natural language: "Show me deals closing this quarter over $50k" or "Which reps have the lowest conversion rate?" Your system needs to convert these to the right tool calls with correct parameters. GPT-4 works but costs $0.03/query — at 100k queries/day, that's $90k/month. Time to train a smaller model.

Steps: 1. Label data · 2. Benchmark · 3. Optimize · 4. Reward model · 5. Infra · 6. Stabilize · 7. Train
Each step below is marked as agent autonomous or human involved.
Step 1

Build the annotation UI

Agent builds · Human labels
An agent scaffolds an annotation interface. You enter queries, Claude generates candidate tool calls, you correct them. Keyboard-optimized: j/k to navigate, e to edit, Enter to approve. 400 examples labeled, all model outputs retained.
400 examples · ~6 hrs human time
Example: query "Deals closing Q1 over $50k" → Claude suggests get_deals(q="Q1", min=50k)
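
One way to picture the data coming out of this step is a typed record per example, in the same Pydantic style used elsewhere in the repo. The field names are illustrative, not the actual schema.

from pydantic import BaseModel

class ToolCall(BaseModel):
    tool: str                  # e.g. "get_deals"
    params: dict[str, str]     # e.g. {"q": "Q1", "min": "50k"}

class LabeledExample(BaseModel):
    query: str                 # what the user typed
    model_candidate: ToolCall  # Claude's suggestion, retained as a future negative
    gold: ToolCall             # the human-corrected call

example = LabeledExample(
    query="Deals closing Q1 over $50k",
    model_candidate=ToolCall(tool="get_deals", params={"q": "Q1", "min": "40k"}),
    gold=ToolCall(tool="get_deals", params={"q": "Q1", "min": "50k"}),
)
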
Step 2

Benchmark existing models

Agent runs
Agent runs your test set against GPT-4o, Claude 3.5, Llama 70B, Qwen 7B and more. Provisions endpoints, evaluates, produces report. Now you have baselines.
Baseline accuracy:

Model      Tool    Tool + Params
GPT-4o     96%     94%
Claude     94%     91%
Qwen 7B    68%     51%
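
The numbers above reduce to two checks per example: right tool, and right tool plus right parameters. A hedged sketch of the scoring loop; predict is a hypothetical callable wrapping whichever endpoint is being benchmarked.

def score(examples, predict):
    """Return (tool accuracy, tool + params accuracy) over a labeled test set.

    Each example is a dict with "query", "gold_tool", and "gold_params";
    predict(query) returns a (tool_name, params_dict) pair.
    """
    tool_hits = both_hits = 0
    for ex in examples:
        tool, params = predict(ex["query"])
        if tool == ex["gold_tool"]:
            tool_hits += 1
            if params == ex["gold_params"]:
                both_hits += 1
    n = len(examples)
    return tool_hits / n, both_hits / n

# Usage: run the same 400-example set against each provisioned endpoint.
# tool_acc, full_acc = score(test_set, qwen_7b_predict)
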
Step 3

Try prompt optimization first

Agent optimizes
Agent attempts automatic prompt optimization on Qwen 7B. Tries few-shot, chain-of-thought, structured output. Modest gains (+7%), but not enough.
Accuracy: 51% → 58% (+7% improvement)
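
Most of the gains at this step come from in-context examples. A small sketch of few-shot prompt assembly from the gold labels; the exact wording and format are illustrative only.

def few_shot_prompt(gold_examples, query: str, k: int = 5) -> str:
    """Prepend k labeled examples to the user query as in-context demonstrations."""
    shots = [f"Q: {ex['query']}\nA: {ex['tool_call']}" for ex in gold_examples[:k]]
    shots.append(f"Q: {query}\nA:")
    return "\n\n".join(shots)

# prompt = few_shot_prompt(gold, "Which reps have the lowest conversion rate?")
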
Step 4

Build a reward model for GRPO

Agent builds · Human verifies
Agent creates a verifier that checks if tool calls are correct. Uses gold labels as positives, retained model outputs as negatives. You verify 50 "wrong" samples. 87% verifier agreement.
87% verifier accuracy · ~1 hr human time
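
The verifier doubles as the GRPO reward: a sampled tool call that matches the gold label earns 1.0, anything else earns 0.0. A minimal sketch, assuming a hypothetical parse helper that turns model text into a (tool, params) pair:

def reward(sample: str, gold_tool: str, gold_params: dict, parse) -> float:
    """Score one generated tool call against the gold label."""
    try:
        tool, params = parse(sample)   # hypothetical: raises on malformed output
    except Exception:
        return 0.0                     # unparseable output gets no reward
    return 1.0 if tool == gold_tool and params == gold_params else 0.0
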
Step 5

Find cost-effective infrastructure

Agent benchmarks
Agent runs identical jobs across RunPod, Lambda, Modal. Measures throughput, cost, reliability. L40S at $0.99/hr wins for this model size.
L40S · $0.99/hr · ~$12 per run

GPU      Price/hr
H100     $3.99
L40S     $0.99
A10      $0.44
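
Cost per run is just hourly price times measured wall-clock time. The 12-hour figure below is an inference from the ~$12 number above, not a measured benchmark.

def cost_per_run(price_per_hr: float, wall_clock_hrs: float) -> float:
    return price_per_hr * wall_clock_hrs

# ~$12 per run on the L40S is consistent with roughly a 12-hour job:
print(cost_per_run(0.99, 12.0))  # -> 11.88
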
Step 6

Stabilize training dynamics

Agent iterates
Early runs show reward hacking and collapse. Agent detects patterns, adjusts learning rate and KL penalty. After 4 diagnostic runs, finds stable hyperparameters.
4 diagnostic runs · LR 2e-6 · KL coef 0.05
Reward curve: collapse → stable
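
A hedged sketch of the kind of heuristic an agent can run over logged metrics to flag collapse or reward hacking; the thresholds are illustrative, not the values used in the repo.

def diagnose(rewards: list[float], kl: list[float], window: int = 50) -> str:
    """Flag suspicious training dynamics from recent reward and KL readings."""
    recent_r, recent_kl = rewards[-window:], kl[-window:]
    if max(recent_r) - min(recent_r) < 0.01:
        return "collapse: rewards are flat, lower the learning rate"
    if sum(recent_kl) / len(recent_kl) > 1.0:
        return "reward hacking risk: KL is drifting, raise the KL coefficient"
    return "stable"
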
Step 7

Run full training

Agent runs · Human monitors
Agent launches runs with 3 seeds. You monitor from iOS app — reward curves, eval metrics. Agent iterates, selects best checkpoint. 91% accuracy achieved.
91% accuracy · 6 runs · ~$70 GPU cost
Run #6: reward 0.85, eval 91%
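
The last step is mechanical once everything above is in place: run the same config under several seeds and keep the checkpoint with the best eval accuracy. launch_run and evaluate_checkpoint are hypothetical stand-ins for the repo's conductor.

SEEDS = [0, 1, 2]

def best_checkpoint(launch_run, evaluate_checkpoint, config):
    """Train one run per seed and return the checkpoint with the best eval accuracy."""
    results = []
    for seed in SEEDS:
        ckpt = launch_run(config, seed=seed)  # blocks until the run finishes
        acc = evaluate_checkpoint(ckpt)       # held-out tool-call accuracy
        results.append((acc, ckpt))
    return max(results, key=lambda r: r[0])[1]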

The outcome

From zero to a production-ready model in 2 weeks of calendar time, with roughly 8 hours of focused human work — mostly labeling and experiment design. The agents handled infrastructure, benchmarking, hyperparameter search, and training orchestration. Your model runs locally at 150 tokens/sec or on a $0.20/hr GPU for batch inference.

  • 51% → 91% accuracy improvement
  • ~$70 total training cost
  • $0.0001 per-query inference
  • 300x cost reduction vs GPT-4

Why this architecture

Agents write better code when the codebase helps them. Every decision here makes mistakes harder and fixes faster.

Orchestration: Rust conductors · constellation lib
Training: PyTorch · Pydantic configs
Services: Cloudflare Workers · D1 databases
Mobile: SwiftUI · Keychain
# No YAML. No JSON. Types catch mistakes.
from pydantic import BaseModel

class TrialConfig(BaseModel):
    learning_rate: float = 2e-5
    batch_size: int = 16
    model_name: str = "SmolLM2-360M"

# IDE autocomplete. Validation at parse time.
# Agent can't typo "leraning_rate".
TRIAL = TrialConfig(
    learning_rate=5e-5,
    batch_size=32,
)

Why agents love it

Types everywhere

Pydantic in Python. Rust's type system. TypeScript for services. 80% of bugs caught before runtime. Agents make fewer mistakes when the compiler tells them what's wrong.
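
To make this concrete: with Pydantic v2 and extra="forbid" (an assumption about how the repo declares its configs), the "leraning_rate" typo from the snippet above fails at construction time instead of silently training with the default.

from pydantic import BaseModel, ConfigDict, ValidationError

class TrialConfig(BaseModel):
    model_config = ConfigDict(extra="forbid")  # reject unknown fields
    learning_rate: float = 2e-5
    batch_size: int = 16

try:
    TrialConfig(leraning_rate=5e-5)            # typo'd field name
except ValidationError as err:
    print(err)                                 # caught before any GPU time is spent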

No magic

No config files to guess at. No environment variables that change behavior. No implicit state. From the experiment entry point, you can trace every parameter that affects the run.

Fast feedback

Bazel caches test results. Only rerun what changed. An agent can iterate quickly because it doesn't wait 10 minutes to find out it broke something three files away.

CLI tools included

Everything you need to manage experiments, deploy services, and debug issues.

flea

Run management

List runs, stream logs, check GPU availability, inspect metrics. 860 lines of Rust built for terminal power users.

ios-cli

Device deployment

One command to build, sign, and deploy the iOS app to physical devices. Handles certificates and provisioning.

cf-deploy

Service deployment

Deploy Cloudflare Workers with D1 database management. Handles dependency ordering and post-deploy verification.

graphql-cli

API queries

Direct GraphQL queries to agent orchestration. Debug workflows, inspect state, trigger actions.

cursor-cli

Agent control

Manage Cursor API integration. Create tickets, check agent status, view execution logs.

scaffold

New experiments

Generate boilerplate for new training experiments. Proper structure, types, and integration tests from the start.

An honest pitch

This isn't for everyone. Here's who should — and shouldn't — use Fleaberry.

Good fit

  • + Teams training LLMs who want to own their infrastructure
  • + Startups using AI agents to ship features faster
  • + Engineers tired of paying platform fees for training orchestration
  • + Projects that need iOS apps + ML training + cloud services together
  • + Anyone who wants reproducible, auditable ML experiments
  • + Teams running local inference on Apple Silicon

Not for you if

  • - You want a hosted platform, not code to maintain
  • - Your team only writes Python and wants to keep it that way
  • - You need production multitenancy and billing out of the box
  • - Bazel's learning curve isn't worth it for your project size
  • - You need extensive documentation and tutorials
  • - You're not comfortable with a polyglot codebase

Start training today

Get started
$ git clone https://github.com/tensor-valley/fleaberry
$ cd fleaberry && bazel run //experiments/grpo:conductor
Provisioning GPU... Training SmolLM2-360M with GRPO...