> TL;DR: A practical, reproducible workflow for solo/indie ML engineers to run experiments with AI assistants like Claude Code, devcontainers, and remote GPUs; manage secrets and data safely; and scale from laptop prototypes to multi-GPU training with Accelerate, with guardrails to avoid hallucinated benchmarks and other foot-guns.
(This is meant to be a running rough draft)
The following workflow came from months of trial and error doing ML work outside a big lab. I word it assertively to invoke Cunningham's Law. If something is incorrect or your project does it better, please reach out.
Table of Contents
Development Setup: Reproducible environments and tooling
Dev Containers • Remove sudo • DevPods • Non-DevPods • VS Code-based • Quick note: sanity measures
Secrets and environment configuration: Managing keys and credentials safely
The simple approach • Practice leak recovery • Outgrowing the simple approach • Monitor Git and logs
AI Agents: Using Claude Code and other AI assistants effectively
Claude Code • (Ab)Use agents • Never go full vibe mode • Don't waste time arguing
Data Management: Handling datasets without enterprise infrastructure
Cloudflare R2 • Native cloud storage • When to level up
Scaling: From single GPU to distributed training
Multi-GPU training • Multi-machine runs
Managing experiments: Tracking what works and why
Some remaining paper cuts: Things that still need work
Development Setup
Dev Containers
Start with a Docker container. Yes, rebuilding for every dependency is slower than `pip install`, but ad-hoc changes always come back to bite you. Keep the container fresh and consider CI tests to preserve reproducibility.
Remove sudo
I remove `sudo` from containers to force all changes into the `Dockerfile`. Ad-hoc tweaks get forgotten by PR time. This isn't about security; it's a forcing function for environment hygiene.
DevPods
Use DevPods when possible. I have a modest home lab, but cloud machines work well too.
NonâDevPods
Breaking my own rule: I haven't gotten Lambda Labs to work with my devcontainer yet, so I SSH in and run a setup script instead.
When not using DevPods: start a machine, connect with Cursor or VS Code, and use SSH agent forwarding (`ssh -A`).
VS Code-based
I have been won over by VS Code-based editors. They work on almost any device, including in the browser. The rich plugin ecosystem and web-technology base mean Jupyter notebooks just work. It doesn't really matter which VS Code fork you choose; all the major ones support `settings.json`. Cursor and VS Code both work.
Quick note: sanity measures for multiple machines
During early experiments you may have multiple machines running different configurations or PRs in parallel. As part of a `setup.sh` on a new client, I modify the `activityBar` colors in `.vscode/settings.json`. You can also use the Peacock extension if you don't like the color the hostname hashed to. Prominent visual indicators per window help more than the small hostname in the lower left. If you choose a non-VS Code editor and expect to work across multiple instances, do something similar.
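If you want the hashed-color behavior without Peacock, here's a minimal sketch in Python, run from the repo root during setup. The hashing and color math are arbitrary choices of mine, but `workbench.colorCustomizations` and `activityBar.background` are real VS Code settings:

```python
# Minimal sketch: hash the hostname to a stable color and write it into
# .vscode/settings.json so every window on this machine is visually distinct.
# Assumes settings.json is plain JSON (no comments); adjust if yours differs.
import colorsys
import hashlib
import json
import pathlib
import socket

hostname = socket.gethostname()
# Map the hostname to a hue in [0, 1), then to a muted RGB hex color.
hue = (int(hashlib.sha256(hostname.encode()).hexdigest(), 16) % 360) / 360
r, g, b = (int(c * 255) for c in colorsys.hsv_to_rgb(hue, 0.6, 0.7))
color = f"#{r:02x}{g:02x}{b:02x}"

path = pathlib.Path(".vscode/settings.json")
settings = json.loads(path.read_text()) if path.exists() else {}
settings.setdefault("workbench.colorCustomizations", {})["activityBar.background"] = color
path.parent.mkdir(exist_ok=True)
path.write_text(json.dumps(settings, indent=2))
print(f"{hostname} -> {color}")
```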
Secrets and environment configuration
Start simple. You can postpone a dedicated secrets manager until you have 10+ engineers, multiple services, or complex permission hierarchies.
The simple approach: a .gitignored `.env` file
Add `.env` to `.gitignore` and share the file out-of-band. Managing access can be as simple as a Google Doc/Sheet or a plaintext file behind your SSO. Never store this file in source control, no matter how super-secret the rest of the repo is: Git history is forever, and you will eventually need more granular access control. This can work for small organizations of ~10 engineers without outside collaborators.
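For Python projects, loading the file takes one call with the python-dotenv package; the variable name below is a hypothetical example:

```python
# Minimal sketch: pull secrets from a .gitignored .env file at startup.
# Requires `pip install python-dotenv`; WANDB_API_KEY is a hypothetical name.
import os

from dotenv import load_dotenv

load_dotenv()  # searches for a .env file starting from the current directory
api_key = os.environ["WANDB_API_KEY"]  # a KeyError here beats a silent fallback
```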
Practice what happens when you leak a secret
Assume you have very little time between a leaked key being detected and it being abused. Maintain a runbook for rotating affected keys that every technical person with access has read and rehearsed at least once. Keep the runbook in a shared cloud doc with direct links to each service's dashboard for revoking a key and an email list to notify when you've done so. During onboarding, rotate a non-critical key with the new hire. During offboarding, rotate all keys they had access to (a major reason this approach doesn't scale to larger teams).
Outgrowing the simple approach
If you autoscale multiple services, have more than ~10 engineers, or more than two hierarchies of secrets, you have outgrown the simple approach. Adopt an SSO-integrated secrets store. HashiCorp Vault is good (if complicated); each major cloud also offers a native option. Prefer dynamic, short-lived credentials over long-lived passwords/keys. Evaluate by moving the least-critical secrets first. Keep the runbook; it will shift toward setup documentation.
Monitor Git and logs
Because keys are often leaked in source control (and Git history is forever), set up key scanning in CI/CD, even if your repo is private and all collaborators have access. Most hosts have this built in; on GitHub it's called "Secret Scanning" under Security & Analysis. The second most common source of leaks is logging, so filter and monitor logs for secret-like strings.
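A cheap way to cover the logging side is a redaction filter on your log handlers. This is a sketch; the regexes are illustrative, not a complete list of key formats:

```python
# Minimal sketch: redact secret-looking strings before they hit log output.
# The patterns are illustrative; extend them for the key formats you use.
import logging
import re

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),             # AWS access key IDs
    re.compile(r"(?i)(api[_-]?key|token)=\S+"),  # key=value style leaks
]

class RedactSecrets(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        for pattern in SECRET_PATTERNS:
            msg = pattern.sub("[REDACTED]", msg)
        record.msg, record.args = msg, ()
        return True  # never drop the record, just scrub it

logging.basicConfig(level=logging.INFO)
for handler in logging.getLogger().handlers:
    handler.addFilter(RedactSecrets())

logging.info("connecting with api_key=abc123")  # logs "connecting with [REDACTED]"
```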
AI Agents
These notes reflect using Opus 4/4.1 and Sonnet 4 in Claude Code, plus Gemini 2.5 Pro in Cursor.
Claude Code
Claude Code embodies "worse is better." Fancier tools exist, but the terminal still connects to everything. Cursor Agent shows promise but lacks sub-agents and web search.
(Ab)Use agents
Claude Code has a nifty feature called "agents": sub-prompts you can spawn from the main chat that run with their own context windows and report results back. This is useful for parallel tasks like: "spin up an `@agent-general-purpose` for each file in `NOTES/` and check for stale links or out-of-date docs."
I also use agent personas to cross-check ML literature. For example, spin up several Opus 4.1 agents prompted as expert ML reviewers and ask: "diff this training script with paper X" for a list of papers. They report conceptual or configuration differences back into the main chat where you're using the cheaper, daily-driver Sonnet model, without blowing up the context window.
Never go full vibe mode in research
Hallucinations are well known, but here's a related failure mode: Claude will make up data. I asked it to run simple benchmarks without specifying the runtime budget. After a timeout it deemed the task "infeasible" and presented very plausible data without marking it as placeholder. I only caught this while reviewing the code to write the analysis. If you go full vibe mode, you may never notice. Read all the throwaway test code too.
Don't waste time arguing with Robots
When Claude gets stuck, don't try to convince it; just rewind. In Claude Code, tap Escape twice. In Cursor, scroll up and edit/resend. Fresh context beats arguing every time.
Data Management
For datasets under 1TB, simple scripts beat complex systems. Move files between S3-compatible storage and local disks with basic Bash or Python. Download during setup, upload checkpoints after training.
Cloudflare R2 for affordable storage
I use Cloudflare R2 primarily because of cost. At $0.015/GB/month with no egress charges, it's hard to beat for ML workflows. No bandwidth fees means you can pull datasets to any machine without surprise bills.
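As a sketch of what those basic scripts look like: R2 speaks the S3 API, so boto3 works once you point it at your account's endpoint. The bucket names, object keys, and env-var names below are placeholders:

```python
# Minimal sketch: move files between Cloudflare R2 and local disk via boto3.
# All names here are placeholders; credentials come from the environment.
import os

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=f"https://{os.environ['R2_ACCOUNT_ID']}.r2.cloudflarestorage.com",
    aws_access_key_id=os.environ["R2_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["R2_SECRET_ACCESS_KEY"],
)

# Pull a dataset during machine setup...
s3.download_file("my-datasets", "tokenized/train.bin", "/data/train.bin")
# ...and push checkpoints after training.
s3.upload_file("checkpoints/step_1000.pt", "my-checkpoints", "run42/step_1000.pt")
```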
Native cloud storage when it makes sense
If you're already a GCP/AWS/Azure shop, use their native object stores (S3, GCS, Azure Blob) when you have existing key management and permissions there. Follow your organization's data-location policies; compliance beats cost optimization.
When to level up
This approach works well up to a few hundred gigabytes. Beyond that, download times and bandwidth costs pile up. Streaming solutions and data loaders become necessary; that's a topic for another post.
Scaling
Multi-GPU training runs
I use Accelerate because the changes from single-GPU testing to multi-GPU training are minimal.
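The core pattern is wrapping your training objects in `accelerator.prepare()`; the same script then runs under `accelerate launch` on one GPU or several. A toy sketch, where the model and data are stand-ins:

```python
# Minimal sketch of the Accelerate pattern: prepare() handles device
# placement and process layout, so the loop is identical on 1 or N GPUs.
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = torch.utils.data.DataLoader(torch.randn(1024, 512), batch_size=32)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()  # toy objective
    accelerator.backward(loss)         # replaces loss.backward()
    optimizer.step()
```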
Often you do not need full fine-tuning. Quantization-aware training is fiddly; LoRAs in bf16 strike a good balance between resource limits and stability.
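With the peft library, that looks roughly like the sketch below; the base model name and target modules are illustrative and vary by architecture:

```python
# Minimal sketch: bf16 LoRA adapters via peft instead of full fine-tuning.
# The model name and target_modules are illustrative placeholders.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-base-model", torch_dtype=torch.bfloat16
)
config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a small fraction of weights train
```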
Multi-machine runs
Multi-machine training is another animal and out of scope for this article; I'll try to follow up in a part two. Having a ready-to-train container helps a lot. More nodes also means more machine failures; plan to manage those.
Managing experiments
Markdown-based training logs
Instead of CSV logs or complex tracking systems, I maintain a `TRAINING_LOGBOOK.md` that serves as the central record of all experiments. This approach combines human readability with AI-agent compatibility, critical when Claude Code is helping manage your runs.
Each experiment entry captures the following (a skeleton template follows the list):
- Objective & hypothesis: Why this run matters
- Configuration: Model, hyperparameters, and config files used
- Key changes: What's different from the previous run
- Results: Metrics, losses, and performance indicators
- W&B links: Direct links to runs for detailed metrics
- Key takeaways: What worked, what didn't, and why
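A skeleton entry, where every bracketed value is a placeholder to fill in, might look like:

```markdown
## Run <N>: <date>, <short name>

- Objective & hypothesis: <why this run matters>
- Configuration: <model, hyperparameters, config file path>
- Key changes: <diff from run N-1>
- Results: <metrics, losses, performance indicators>
- W&B link: <run URL>
- Key takeaways: <what worked, what didn't, why>
```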
For complex projects, the logbook links to detailed markdown files for specific experiments or subsystems. For example, `NOTES/KD_IMPLEMENTATION.md` might document the evolution of knowledge distillation approaches, while `NOTES/CURRICULUM_MASKING.md` tracks curriculum learning experiments.
This creates a lightweight, version-controlled knowledge base that's:
- Searchable: Both by humans and AI agents using grep/glob
- Diffable: Git tracks exactly what changed between experiments
- Portable: No vendor lock-in, works anywhere with a text editor
- Context-preserving: The narrative flow helps recall why decisions were made
I still use W&B for real-time metrics and charts, but the markdown log captures the "why" and key insights that are often lost in raw metrics.
Some remaining paper cuts
Need to invest some time in a triggering workflow
Right now I'm asking Claude to run the training job in the background and polling logs after a few sleeps to make sure it's going well.
If it diagnoses a problem, emailing/alerting me would save time. Too often I walk away thinking everything is fine only to learn it died on epoch 2 due to a bug in my eval code.
Update: Claude Code now has first-class support for background jobs, so adding an alerting/notification MCP may be all I need.
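Until that's wired up, a dumb watcher script covers most of the gap. This is a hypothetical sketch; the log path and webhook URL are placeholders for whatever notifier you use:

```python
# Hypothetical sketch: poll a training log; alert if it stalls or errors.
# LOG path and WEBHOOK URL are placeholders; swap in your own notifier.
import pathlib
import time
import urllib.request

LOG = pathlib.Path("logs/train.log")
WEBHOOK = "https://example.com/notify"  # e.g. a Slack/Discord webhook URL

def alert(message: str) -> None:
    req = urllib.request.Request(WEBHOOK, data=message.encode(), method="POST")
    urllib.request.urlopen(req)

last_size = LOG.stat().st_size if LOG.exists() else 0
while True:
    time.sleep(300)  # poll every five minutes
    size = LOG.stat().st_size if LOG.exists() else 0
    if size == last_size:
        alert("Training log went quiet; the run may have died.")
        break
    if b"Traceback" in LOG.read_bytes()[last_size:]:
        alert("Traceback in training log; check the run.")
        break
    last_size = size
```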
Know a better way to do ML research? Please reach out and correct me.