> TL;DR: A practical, reproducible workflow for solo/indie ML engineers to run experiments with AI assistants like Claude Code, devcontainers, and remote GPUs; manage secrets and data safely; and scale from laptop prototypes to multi-GPU training with Accelerate, with guardrails to avoid hallucinated benchmarks and other foot-guns.
(This is meant to be a running rough draft)
The following workflow came from months of trial and error doing ML work outside a big lab. I word it assertively to invoke Cunningham's Law. If something is incorrect or your project does it better, please reach out.
Table of Contents
Development Setup: Reproducible environments and tooling
Dev Containers • Remove sudo • DevPods • Non-DevPods • VS Code-based • Quick note: sanity measures
Secrets and environment configuration: Managing keys and credentials safely
The simple approach • Practice leak recovery • Outgrowing the simple approach • Monitor Git and logs
AI Agents: Using Claude Code and other AI assistants effectively
Claude Code • (Ab)Use agents • Never go full vibe mode • Don't waste time arguing
Data Management: Handling datasets without enterprise infrastructure
Cloudflare R2 • Native cloud storage • When to level up
Scaling: From single GPU to distributed training
Multi-GPU training • Multi-machine runs
Managing experiments: Tracking what works and why
Some remaining paper cuts: Things that still need work
Development Setup
Dev Containers
Start with a Docker container. Yes, rebuilding for every dependency is slower than `pip install`, but ad-hoc changes always come back to bite you. Keep the container fresh and consider CI tests to preserve reproducibility.
Remove sudo
I remove `sudo` from containers to force all changes into the `Dockerfile`. Ad-hoc tweaks get forgotten by PR time. This isn't about security; it's a forcing function for environment hygiene.
DevPods
Use DevPods when possible. I have a modest home lab, but cloud machines work well too.
NonâDevPods
Breaking my own rule: I haven't gotten Lambda Labs to work with my devcontainer yet, so I SSH in and run a setup script instead.
When not using DevPods: start a machine, connect with Cursor or VS Code, and use SSH agent forwarding (`ssh -A`).
VS Code-based
I have been won over by VS Code-based editors. They work on almost any device, including in the browser. The rich plugin ecosystem and web-technology base mean Jupyter notebooks just work. It doesn't really matter which VS Code fork you choose; all the major ones support `settings.json`. Cursor and VS Code both work.
Quick note: sanity measures for multiple machines
During early experiments you may have multiple machines running different configurations or PRs in parallel. As part of a `setup.sh` on a new client, I modify the `activityBar` colors in `.vscode/settings.json`. You can also use the Peacock extension if you don't like the color the hostname hashed to. Prominent visual indicators per window help more than the small hostname in the lower left. If you choose a non-VS Code editor and expect to work across multiple instances, do something similar.
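If you want the hashed-color behavior without Peacock, here's a minimal sketch in Python, run from the repo root during setup. The hashing and color math are arbitrary choices of mine, but `workbench.colorCustomizations` and `activityBar.background` are real VS Code settings:

```python
# Minimal sketch: hash the hostname to a stable color and write it into
# .vscode/settings.json so every window on this machine is visually distinct.
# Assumes settings.json is plain JSON (no comments); adjust if yours differs.
import colorsys
import hashlib
import json
import pathlib
import socket

hostname = socket.gethostname()
# Map the hostname to a hue in [0, 1), then to a muted RGB hex color.
hue = (int(hashlib.sha256(hostname.encode()).hexdigest(), 16) % 360) / 360
r, g, b = (int(c * 255) for c in colorsys.hsv_to_rgb(hue, 0.6, 0.7))
color = f"#{r:02x}{g:02x}{b:02x}"

path = pathlib.Path(".vscode/settings.json")
settings = json.loads(path.read_text()) if path.exists() else {}
settings.setdefault("workbench.colorCustomizations", {})["activityBar.background"] = color
path.parent.mkdir(exist_ok=True)
path.write_text(json.dumps(settings, indent=2))
print(f"{hostname} -> {color}")
```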
Secrets and environment configuration
Start simple. You can postpone a dedicated secrets manager until you have 10+ engineers, multiple services, or complex permission hierarchies.
The simple approach: a .gitignored `.env` file
Add `.env` to `.gitignore` and share the file out-of-band. Managing access can be as simple as a Google Doc/Sheet or a plaintext file behind your SSO. Never store this file in source control, no matter how super-secret the rest of the repo is: Git history is forever, and you will eventually need more granular access control. This can work for small organizations of ~10 engineers without outside collaborators.
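For Python projects, loading the file takes one call with the python-dotenv package; the variable name below is a hypothetical example:

```python
# Minimal sketch: pull secrets from a .gitignored .env file at startup.
# Requires `pip install python-dotenv`; WANDB_API_KEY is a hypothetical name.
import os

from dotenv import load_dotenv

load_dotenv()  # searches for a .env file starting from the current directory
api_key = os.environ["WANDB_API_KEY"]  # a KeyError here beats a silent fallback
```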
Practice what happens when you leak a secret
Assume you have very little time between a leaked key being detected and it being abused. Maintain a runbook for rotating affected keys that every technical person with access has read and rehearsed at least once. Keep the runbook in a shared cloud doc with direct links to each service's dashboard for revoking a key and an email list to notify when you've done so. During onboarding, rotate a non-critical key with the new hire. During offboarding, rotate all keys they had access to (a major reason this approach doesn't scale to larger teams).
Outgrowing the simple approach
If you autoscale multiple services, have more than ~10 engineers, or more than two hierarchies of secrets, you have outgrown the simple approach. Adopt an SSO-integrated secrets store. HashiCorp Vault is good (if complicated); each major cloud also offers a native option. Prefer dynamic, short-lived credentials over long-lived passwords/keys. Evaluate by moving the least-critical secrets first. Keep the runbook; it will shift toward setup documentation.
Monitor Git and logs
Because keys are often leaked in source control (and Git history is forever), set up key scanning in CI/CD, even if your repo is private and all collaborators have access. Most hosts have this built in; on GitHub it's called "Secret Scanning" under Security & Analysis. The second most common source of leaks is logging, so filter and monitor logs for secret-like strings.
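A cheap way to cover the logging side is a redaction filter on your log handlers. This is a sketch; the regexes are illustrative, not a complete list of key formats:

```python
# Minimal sketch: redact secret-looking strings before they hit log output.
# The patterns are illustrative; extend them for the key formats you use.
import logging
import re

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),             # AWS access key IDs
    re.compile(r"(?i)(api[_-]?key|token)=\S+"),  # key=value style leaks
]

class RedactSecrets(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        for pattern in SECRET_PATTERNS:
            msg = pattern.sub("[REDACTED]", msg)
        record.msg, record.args = msg, ()
        return True  # never drop the record, just scrub it

logging.basicConfig(level=logging.INFO)
for handler in logging.getLogger().handlers:
    handler.addFilter(RedactSecrets())

logging.info("connecting with api_key=abc123")  # logs "connecting with [REDACTED]"
```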
AI Agents
These notes reflect using Opus 4/4.1 and Sonnet 4 in Claude Code, plus Gemini 2.5 Pro in Cursor.
Claude Code
Claude Code embodies "worse is better." Fancier tools exist, but the terminal still connects to everything. Cursor Agent shows promise but lacks sub-agents and web search.
(Ab)Use agents
Claude Code has a nifty feature called "agents": sub-prompts you can spawn from the main chat that run with their own context windows and report results back. This is useful for parallel tasks like: "spin up an `@agent-general-purpose` for each file in `NOTES/` and check for stale links or out-of-date docs."
I also use agent personas to cross-check ML literature. For example, spin up several Opus 4.1 agents prompted as expert ML reviewers and ask: "diff this training script with paper X" for a list of papers. They report conceptual or configuration differences back into the main chat where you're using the cheaper, daily-driver Sonnet model, without blowing up the context window.
Never go full vibe mode in research
Hallucinations are well known, but here's a related failure mode: Claude will make up data. I asked it to run simple benchmarks without specifying the runtime budget. After a timeout it deemed the task "infeasible" and presented very plausible data without marking it as placeholder. I only caught this while reviewing the code to write the analysis. If you go full vibe mode, you may never notice. Read all the throwaway test code too.
Don't waste time arguing with Robots
When Claude gets stuck, don't try to convince it; just rewind. In Claude Code, tap Escape twice. In Cursor, scroll up and edit/resend. Fresh context beats arguing every time.
Data Management
For datasets under 1TB, simple scripts beat complex systems. Move files between S3-compatible storage and local disks with basic Bash or Python. Download during setup, upload checkpoints after training.
Cloudflare R2 for affordable storage
I use Cloudflare R2 primarily because of cost. At $0.015/GB/month with no egress charges, it's hard to beat for ML workflows. No bandwidth fees means you can pull datasets to any machine without surprise bills.
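As a sketch of what those basic scripts look like: R2 speaks the S3 API, so boto3 works once you point it at your account's endpoint. The bucket names, object keys, and env-var names below are placeholders:

```python
# Minimal sketch: move files between Cloudflare R2 and local disk via boto3.
# All names here are placeholders; credentials come from the environment.
import os

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=f"https://{os.environ['R2_ACCOUNT_ID']}.r2.cloudflarestorage.com",
    aws_access_key_id=os.environ["R2_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["R2_SECRET_ACCESS_KEY"],
)

# Pull a dataset during machine setup...
s3.download_file("my-datasets", "tokenized/train.bin", "/data/train.bin")
# ...and push checkpoints after training.
s3.upload_file("checkpoints/step_1000.pt", "my-checkpoints", "run42/step_1000.pt")
```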
Native cloud storage when it makes sense
If you're already a GCP/AWS/Azure shop, use their native object stores (S3, GCS, Azure Blob) when you have existing key management and permissions there. Follow your organization's data-location policies; compliance beats cost optimization.
When to level up
This approach works well up to a few hundred gigabytes. Beyond that, download times and bandwidth costs pile up. Streaming solutions and data loaders become necessary; that's a topic for another post.
Scaling
Multi-GPU training runs
I use Accelerate because the changes from single-GPU testing to multi-GPU training are minimal.
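The core pattern is wrapping your training objects in `accelerator.prepare()`; the same script then runs under `accelerate launch` on one GPU or several. A toy sketch, where the model and data are stand-ins:

```python
# Minimal sketch of the Accelerate pattern: prepare() handles device
# placement and process layout, so the loop is identical on 1 or N GPUs.
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = torch.utils.data.DataLoader(torch.randn(1024, 512), batch_size=32)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()  # toy objective
    accelerator.backward(loss)         # replaces loss.backward()
    optimizer.step()
```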
Often you do not need full fine-tuning. Quantization-aware training is fiddly; LoRAs in bf16 strike a good balance between resource limits and stability.
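With the peft library, that looks roughly like the sketch below; the base model name and target modules are illustrative and vary by architecture:

```python
# Minimal sketch: bf16 LoRA adapters via peft instead of full fine-tuning.
# The model name and target_modules are illustrative placeholders.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-base-model", torch_dtype=torch.bfloat16
)
config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a small fraction of weights train
```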
Multi-machine runs
Multi-machine training is another animal and out of scope for this article; I'll try to follow up in a part two. Having a ready-to-train container helps a lot. More nodes also means more machine failures; plan to manage those.
Managing experiments
Markdown-based training logs
Instead of CSV logs or complex tracking systems, I maintain a `TRAINING_LOGBOOK.md` that serves as the central record of all experiments. This approach combines human readability with AI-agent compatibility, critical when Claude Code is helping manage your runs.
Each experiment entry captures the following (a skeleton template follows the list):
- Objective & hypothesis: Why this run matters
- Configuration: Model, hyperparameters, and config files used
- Key changes: What's different from the previous run
- Results: Metrics, losses, and performance indicators
- W&B links: Direct links to runs for detailed metrics
- Key takeaways: What worked, what didn't, and why
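A skeleton entry, where every bracketed value is a placeholder to fill in, might look like:

```markdown
## Run <N>: <date>, <short name>

- Objective & hypothesis: <why this run matters>
- Configuration: <model, hyperparameters, config file path>
- Key changes: <diff from run N-1>
- Results: <metrics, losses, performance indicators>
- W&B link: <run URL>
- Key takeaways: <what worked, what didn't, why>
```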
For complex projects, the logbook links to detailed markdown files for specific experiments or subsystems. For example, `NOTES/KD_IMPLEMENTATION.md` might document the evolution of knowledge distillation approaches, while `NOTES/CURRICULUM_MASKING.md` tracks curriculum learning experiments.
This creates a lightweight, version-controlled knowledge base that's:
- Searchable: Both by humans and AI agents using grep/glob
- Diffable: Git tracks exactly what changed between experiments
- Portable: No vendor lock-in, works anywhere with a text editor
- Context-preserving: The narrative flow helps recall why decisions were made
I still use W&B for real-time metrics and charts, but the markdown log captures the "why" and key insights that are often lost in raw metrics.
Some remaining paper cuts
Need to invest some time in a triggering workflow
Right now I'm asking Claude to run the training job in the background and polling logs after a few sleeps to make sure it's going well.
If it diagnoses a problem, emailing/alerting me would save time. Too often I walk away thinking everything is fine only to learn it died on epoch 2 due to a bug in my eval code.
Update: Claude Code now has first-class support for background jobs, so adding an alerting/notification MCP may be all I need.
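Until that's wired up, a dumb watcher script covers most of the gap. This is a hypothetical sketch; the log path and webhook URL are placeholders for whatever notifier you use:

```python
# Hypothetical sketch: poll a training log; alert if it stalls or errors.
# LOG path and WEBHOOK URL are placeholders; swap in your own notifier.
import pathlib
import time
import urllib.request

LOG = pathlib.Path("logs/train.log")
WEBHOOK = "https://example.com/notify"  # e.g. a Slack/Discord webhook URL

def alert(message: str) -> None:
    req = urllib.request.Request(WEBHOOK, data=message.encode(), method="POST")
    urllib.request.urlopen(req)

last_size = LOG.stat().st_size if LOG.exists() else 0
while True:
    time.sleep(300)  # poll every five minutes
    size = LOG.stat().st_size if LOG.exists() else 0
    if size == last_size:
        alert("Training log went quiet; the run may have died.")
        break
    if b"Traceback" in LOG.read_bytes()[last_size:]:
        alert("Traceback in training log; check the run.")
        break
    last_size = size
```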
Know a better way to do ML research? Please reach out and correct me.