Conceptual landing page. A lot needs building. Numbers and feature claims are aspirational, PRFAQ-style.
Run LLM experiments in days, not weeks

The LLM training codebase
your agents wish you had.

Most LLM projects are duct tape — scattered scripts, no tests, configs agents can't follow. Fleaberry is typed, tested, and built for agents: fine-tuning recipes, GPU provisioning across clouds, model registry and storage, annotation UIs, vibe-check playgrounds, and deployment patterns. Go from idea to deployed inference in one codebase.

30+
Optimized recipes
<5m
CI time
0%
Bash scripts
$0
Platform Fees
Terminal
$ git clone <repo> && cd fleaberry
$ claude
You: Find the fastest small model for on-device inference. <2B params, best MMLU in that range, latency benchmarks on iPhone 15 Pro. Compare 5+ candidates.
query
├─ model search scanning HuggingFace...
├─ MMLU benchmarks aggregating scores...
├─ CoreML export converting Qwen3-0.6B...
├─ CoreML export converting SmolLM2-360M...
└─ latency tests waiting for exports...

The pace of your project determines its outcome

ML projects are time-bound. Sequential execution means fewer experiments, fewer insights, fewer course corrections. Parallel execution means you learn faster and adapt before it's too late.

// human_only 3 weeks
Active
Sleep
Meeting
Blocked
$4,200
Project stalls while you take meetings, fiddle with infra, sleep, wait for each other...
// with_agents 5 days
Your time
Sleep
Agent
GPU
$380
Agents work while you sleep. Training runs overnight. You review results over coffee.

See it in action: a real project

Watch how an ML engineer and AI agents collaborate. The agent handles performance tuning, cloud selection, and experiment infrastructure — you focus on the ML.

fleaberry — tool-calling-finetuning

// the_scenario

You're building a SaaS for sales pipeline management. Users ask analytical questions in natural language: "Show me deals closing this quarter over $50k" or "Which reps have the lowest conversion rate?" Your system needs to convert these to the right tool calls with correct parameters. You're using GPT-4.1 but it's only ~70% accurate on your domain-specific tools — users complain about failed queries. It's also costing $8k/month. Time to train a specialized model.

Y
You Day 1, 9:00 AM
GPT-4.1 is only hitting 70% accuracy on our tool-calling task and it's costing $8k/month. Before we fine-tune, I want to create a benchmark with ~300 examples so we can figure out how well different models do at this. Can you set up an annotation interface? Use Qwen-4B to generate candidates — easier for licensing.
A
Agent Day 1, 9:15 AM
Done. The interface is live. You enter queries, I'll generate candidate tool calls using Qwen-4B, and you correct them. Keyboard shortcuts: j/k to navigate, e to edit, Enter to approve. Ready when you are.
Y
Your turn Day 1–3
Label benchmark examples
Examples labeled 300/300
A
Agent Day 3, 4:00 PM
Labeling complete — all 300 examples done. Averaged about 30 seconds per sample, total time about 2.5 hours. Great — now we have a ground truth dataset. Let me benchmark existing models to establish baselines. I'll run your test set against Claude Sonnet 4.5, Gemini Flash, and Qwen3 8B. This will help us understand what accuracy is achievable.
A
Agent Day 3, 4:30 PM
Benchmark complete. Here's what I found:
Model Benchmark Results
Model                Tool    Tool + Params
Claude Sonnet 4.5     94%        91%
Gemini Flash          82%        76%
Qwen3 8B              68%        51%
Qwen3 8B is our best small OSS candidate at 51% tool + params accuracy. We need to close the gap to Claude Sonnet 4.5's 91%. Want me to try prompt optimization first before we commit to fine-tuning?
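The two columns read as nested metrics: did the model pick the right tool, and did it also get every parameter right. A minimal sketch of that scoring logic (function and field names are illustrative, not from the codebase):

```python
import json

def score_call(pred_json: str, gold: dict) -> tuple[bool, bool]:
    """Score one predicted tool call against a gold label.

    Returns (tool_correct, tool_and_params_correct).
    """
    try:
        pred = json.loads(pred_json)
    except json.JSONDecodeError:
        return (False, False)  # unparseable output fails both metrics
    tool_ok = pred.get("tool") == gold["tool"]
    params_ok = tool_ok and pred.get("params") == gold["params"]
    return (tool_ok, params_ok)

def benchmark(preds: list[str], golds: list[dict]) -> tuple[float, float]:
    """Aggregate per-example scores into the two accuracy columns."""
    scores = [score_call(p, g) for p, g in zip(preds, golds)]
    n = len(scores)
    tool_acc = sum(t for t, _ in scores) / n
    full_acc = sum(f for _, f in scores) / n
    return (tool_acc, full_acc)
```

Tool accuracy upper-bounds tool + params accuracy, which is why every row's second column is lower than its first.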
Y
You Day 3, 4:35 PM
Yeah, run DSPy optimization first. In my experience you can sometimes close a 20-point gap with the right prompt structure. If that doesn't work we'll fine-tune.
A
Agent Day 3, 6:00 PM
Ran prompt optimization experiments on Qwen3 8B:
Prompt Optimization Results
Zero-shot baseline 51%
Few-shot (5 examples) 54%
Chain-of-thought 56%
Structured output + CoT 58%
Best result: 58% (+7%). Not enough to close the gap to Claude's 91%. We'll need to fine-tune. I recommend GRPO — it works well for tool-calling tasks. Want me to set up the verifier for reward modeling?
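The four prompt variants differ mainly in how the chat messages are assembled. A hedged sketch of that kind of builder (the instruction wording and example content are invented for illustration):

```python
def build_messages(query: str, fewshot: list[tuple[str, str]],
                   use_cot: bool) -> list[dict]:
    """Assemble a chat prompt: system instruction, optional few-shot
    (query, tool-call) pairs as prior turns, then the user query."""
    system = "Convert the user's question into a JSON tool call."
    if use_cot:
        system += " Think step by step, then emit the JSON on the last line."
    messages = [{"role": "system", "content": system}]
    for q, call in fewshot:
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": call})
    messages.append({"role": "user", "content": query})
    return messages
```

The zero-shot baseline is `build_messages(query, [], False)`; the other rows add few-shot turns, the CoT instruction, or both, with the best variant also constraining decoding to structured output.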
Y
You Day 3, 6:05 PM
Good call on GRPO over DPO. A parsing-correctness verifier isn't perfect, but we have lots of unlabeled input data and not many labels, which suits GRPO — it should generalize better than training on just the labeled set. Make sure you use a KL penalty though, I've seen RL training collapse into degenerate outputs without it. What do you need from me?
A
Agent Day 4, 10:00 AM
Verifier built. Using your gold labels as positives and the model's wrong outputs as negatives. KL penalty is in the config as you suggested. I need you to verify about 50 samples where the verifier disagrees with the label — this calibrates the reward signal.
Verify disagreements 0/50
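A verifier like this boils down to a reward function over sampled completions: parse the call, compare it to the gold label, and grade partial credit so the policy gets signal even when it is only partly right. A minimal sketch, assuming graded scoring (the weights and names are illustrative, not the actual verifier):

```python
import json

def reward(completion: str, gold: dict) -> float:
    """Graded reward for GRPO: 0 for unparseable output, partial credit
    for the right tool, full credit only when all params match."""
    try:
        call = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    if call.get("tool") != gold["tool"]:
        return 0.1  # parseable JSON but wrong tool
    gold_params = gold["params"]
    pred_params = call.get("params", {})
    if pred_params == gold_params:
        return 1.0
    # fraction of gold params reproduced exactly
    hit = sum(1 for k, v in gold_params.items() if pred_params.get(k) == v)
    return 0.3 + 0.6 * hit / max(len(gold_params), 1)
```

The KL penalty lives in the trainer config rather than here; the reward only needs to rank completions sensibly within each sampled group.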
Y
You Day 4, 11:00 AM
Done. Took about an hour. Most of the verifier decisions were correct.
A
Agent Day 4, 11:02 AM
Verifier calibrated. 87% agreement with human judgment — good enough for GRPO. Finding the most cost-effective GPU for training.
Y
You Day 4, 11:05 AM
For the hyperparam search — start with lr around 1e-6 to 5e-6. GRPO is sensitive to learning rate and 8B models can destabilize fast. Also use a warmup scheduler, maybe 100 steps.
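The warmup suggestion amounts to scaling the learning rate linearly over the first N steps before handing off to the main schedule. A dependency-free sketch (constant after warmup; a real schedule would typically decay):

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int = 100) -> float:
    """Linear warmup from ~0 to base_lr over warmup_steps, then constant.

    Keeps early updates small so an 8B policy doesn't destabilize before
    the reward signal settles."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

With PyTorch, the same curve drops into `torch.optim.lr_scheduler.LambdaLR` as a multiplicative factor on the base learning rate.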
A
Agent Day 4, 2:00 PM
Ran identical training jobs across providers. For an 8B model, here's the cost/performance tradeoff:
GPU     Rate        Est. total
H100    $3.99/hr    ~$48
L40S    $0.99/hr    ~$12    Best value ✓
A10     $0.44/hr    ~$27
Recommending L40S at $0.99/hr — best throughput per dollar for this model size. Total estimated cost to train: $12. Starting hyperparameter search now.
A
Agent Day 5, 9:00 AM
Running the hyperparameter search with your suggested lr range, plus a couple of wider points for reference, and 100-step warmup. The first runs already show why you warned me — the model collapsed at 5e-5.
Hyperparameter Search
Run 1: lr=5e-5, kl=0.01 collapsed
Run 2: lr=1e-5, kl=0.01 reward hack
Run 3: lr=2e-6, kl=0.03 stable, 78%
Run 4: lr=2e-6, kl=0.05 stable, 84%
Found stable config: lr=2e-6, kl_coef=0.05. Ready to run full training with 3 seeds. You can monitor from the iOS app.
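The "group" in GRPO: sample several completions per prompt, score each with the verifier, and normalize rewards within the group, so each sample's advantage is relative to its siblings rather than a learned value baseline. A minimal sketch of that normalization:

```python
def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Mean-center and std-normalize verifier rewards within one
    prompt's sample group (the group-relative baseline in GRPO)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Samples above the group mean get positive advantage and are reinforced; the KL term (the kl_coef in the config) then keeps the updated policy close to the reference model.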
Y
You Day 5, 9:30 AM
Let's do it. I'll watch from my phone.
A
Agent Day 5–7
Training in progress. Here's the live view:
Run #6 — seed 3 RUNNING
0.85
Reward
91%
Eval Acc
2.1k
Steps
A
Agent Day 7, 3:00 PM
Training complete. Best checkpoint from seed 3 achieves 91% accuracy on held-out test set — matching Claude Sonnet 4.5. Model pushed to Hugging Face Hub.
Final Results
51% → 91%
Accuracy
~$70
Training Cost
$0.0001
Per Query
900x
Cost Savings
From zero to production-ready in one week, with only a few hours of your hands-on time (labeling plus verification). Ready to deploy?
Y
You Day 7, 4:00 PM
Before we deploy — I'm wondering if we could make the model smaller. Can you try some quantization techniques and see what the accuracy tradeoff looks like?
A
Agent Day 7, 6:00 PM
Tested several quantization approaches on the fine-tuned model:
Quantization Results
FP16 baseline 91% / 16GB
INT8 (bitsandbytes) 90% / 8GB
INT4 (GPTQ) 88% / 4GB
INT4 (AWQ) 89% / 4GB
AWQ INT4 is the best tradeoff — only 2% accuracy drop while cutting memory by 75%. The errors it introduces are mostly on complex multi-parameter tool calls:
Error bucket: multi-param calls (3 examples)
× "deals closing Q1 over 50k by region" → missed region param
× "conversion by rep last 90 days" → wrong date range format
× "pipeline value by stage weighted" → dropped weighted flag
Full analysis in the report: quantization_analysis.html
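The memory column follows directly from bits per weight: an 8B-parameter model at 16 bits is roughly 16 GB, and INT4 cuts that to roughly 4 GB. A quick sanity check of the table's arithmetic (this ignores activation memory and quantization overhead such as scales and zero-points):

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Approximate weight storage: params × bits / 8 bytes, in GB."""
    return n_params * bits / 8 / 1e9

# 8B parameters at each precision in the table
fp16 = weight_memory_gb(8e9, 16)  # 16 GB
int8 = weight_memory_gb(8e9, 8)   # 8 GB
int4 = weight_memory_gb(8e9, 4)   # 4 GB
```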
Y
You Day 7, 6:15 PM
89% is acceptable. Let's go with AWQ. Deploy that version.
A
Agent Day 7, 6:30 PM
Deployed. Here's a preview link to the demo: https://demo.fleaberry.dev/tool-calling
Y
You Day 8, 10:00 AM
Deployed. Working great. I wonder if it could be faster though. Our Python serving code is fine but I'm wondering if a C++ inference server would help on our hardware. Can you try rewriting it and comparing?
A
Agent Day 8, 2:30 PM
Done. Rewrote the inference server in C++ using llama.cpp as the backend. Ran the same benchmark on your hardware:
Inference Benchmark (your hardware)
Server             tok/s    p99 latency
Python (vLLM)        847        142ms
C++ (llama.cpp)    1,016        118ms
+20% throughput and 17% lower p99 latency. The C++ server is in services/inference-cpp/. Both versions share the same model weights and API contract — you can A/B test or switch over completely.
A
Agent Day 8, 2:35 PM
Production REST endpoint deployed: https://api.fleaberry.dev/v1/tool-calling
To integrate into your application, send POST requests with your user queries:
Integration example
POST https://api.fleaberry.dev/v1/tool-calling
Content-Type: application/json

{
  "query": "Show me deals closing this quarter over $50k",
  "context": { "user_id": "..." }
}

→ Returns: { "tool": "get_deals", "params": {...} }
The endpoint handles authentication, rate limiting, and error handling. Full API docs: api.fleaberry.dev/docs
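The same call sketched as a stdlib-only Python client (the URL is the endpoint above; any auth header is omitted since the scheme isn't specified):

```python
import json
import urllib.request

API_URL = "https://api.fleaberry.dev/v1/tool-calling"

def build_request(query: str, user_id: str) -> urllib.request.Request:
    """Build the POST request shown in the integration example."""
    body = json.dumps({"query": query, "context": {"user_id": user_id}})
    return urllib.request.Request(
        API_URL,
        data=body.encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def resolve_tool_call(query: str, user_id: str) -> dict:
    """Send the query; returns a dict like {"tool": ..., "params": ...}."""
    with urllib.request.urlopen(build_request(query, user_id), timeout=10) as r:
        return json.loads(r.read())
```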
$

Everything included

Not a framework. Not a CLI. A complete production system you fork and own. Every checkbox here works on day one.

GPU

Spin up GPUs across RunPod regions
Deploy inference endpoints to Modal
Run 3,300 tok/s inference on your Mac
Auto-terminate pods to prevent runaway costs

Mobile

Monitor training runs from your phone
Watch training metrics update live
Start agent tasks from your phone
Log in with Face ID

ML

Use any training code — PyTorch, TRL, VERL, SkyRL, PrimeRL, anything
Train fast with prebuilt containers
Stream training logs in real-time
Push checkpoints to Hugging Face Hub

AI Agents

Sleep while agents babysit training runs
Run text-to-speech demos
Orchestrate multi-step agent workflows
Query models via OpenAI-compatible API

Backend

Deploy to Cloudflare Workers
Store data in D1 SQLite
Manage state with Durable Objects
WebSocket real-time updates

Build

Build hermetically with Bazel
Cross-compile to Linux x86_64
Query everything via GraphQL
One command to deploy anywhere

Four platforms, one build

Train on cloud GPUs. Monitor from your phone. Deploy services globally. Run inference locally. All from the same codebase.

GPU Training // RunPod + Modal

Provision H100s, L40S, or A100s with one command. Automatic crash recovery, log streaming, and self-terminating pods to prevent runaway costs. Supports GRPO, SFT, DPO training paradigms.

RunPod GraphQL Modal endpoints W&B tracking Heartbeat monitoring
📱

iOS App // SwiftUI

Monitor training runs from anywhere. Real-time metrics charts, event timelines, log streaming, full-text search. One-command deployment to physical devices with passkey authentication.

Swift Charts Auto-refresh Passkey auth Bazel build

Edge Services // Cloudflare

Auth, training metadata, annotation UIs, AI agent orchestration — all deployed as Workers with D1 databases and Durable Objects. Global edge, no servers, scales to zero.

Workers D1 SQLite Durable Objects WebSockets

Local Inference // MLX

Run models locally on Apple Silicon at 3,300+ tokens/sec for Qwen3-0.6B, ~1,000 tok/s for 4B models. Continuous batching, JSONL streaming, OpenAI-compatible chat format.

M-series optimized Batch inference Streaming output Benchmarks

What you actually get

This isn't about the code — it's about what you can do with it.

Train LLMs today

Clone, configure, run. Real GRPO and SFT training on cloud GPUs in minutes, not weeks of setup.

Monitor anywhere

Check training progress from your phone. Get real-time metrics without opening a laptop.

Let agents build

AI coding agents can implement features autonomously via the ticket workflow system.

Own everything

No vendor lock-in. No platform fees. No surprise bills. It's your code forever.

Optimized recipes included

Not hello world tutorials. Production-tested experiments you can adapt.

// experiments/grpo

GRPO Training

Reinforcement learning with policy optimization. Use case: training models to follow tool-calling formats, code generation with test verification.

// experiments/intent_classification

Intent Classification

Fine-tuning for multi-class classification. Use case: customer support routing, command parsing, query categorization.

// demos/tts

Text-to-Speech

Production TTS inference on Modal GPUs. Use case: voice assistants, audiobook generation, accessibility features.

Why agents love it

Types everywhere

Pydantic in Python. Rust's type system. TypeScript for services. Whole classes of bugs caught before runtime. Agents make fewer mistakes when the compiler tells them what's wrong.

No magic

No config files to guess at. No environment variables that change behavior. No implicit state. From the experiment entry point, you can trace every parameter that affects the run.

Fast feedback

Bazel caches test results. Only rerun what changed. An agent can iterate quickly because it doesn't wait 10 minutes to find out it broke something three files away.

One CLI, everything included

Everything you need to manage experiments, deploy services, and debug issues.

flea train

Run experiments

Start training runs, stream logs, check GPU availability, inspect metrics. Built for terminal power users.

flea ios

Device deployment

One command to build, sign, and deploy the iOS app to physical devices. Handles certificates and provisioning.

flea deploy

Service deployment

Deploy Cloudflare Workers with D1 database management. Handles dependency ordering and post-deploy verification.

flea query

API queries

Direct GraphQL queries to agent orchestration. Debug workflows, inspect state, trigger actions.

flea agent

Agent control

Manage agent integration. Create tickets, check agent status, view execution logs.

flea new

New experiments

Generate boilerplate for new training experiments. Proper structure, types, and integration tests from the start.

Start training today

Get started
$ git clone <your-private-repo>
$ flea train --experiment grpo
Provisioning GPU... Training SmolLM2-360M with GRPO...
Get in touch to git clone your head start