Why My Minecraft Bots Now Speak — And Why I Can’t Go Back

Audio Upgrade!
I made a video about MCDS. It’s raw, no music, just bots doing their thing. And the funny part is, the audio carries it. Not because I edited it well — I didn’t. But because the bots are talking. In multiple languages. Over crackling radio. And it works.
This wasn’t the plan. Three days ago, coda didn’t exist. TempleSound didn’t exist. I had no audio layer at all. What I had was a thought: wouldn’t it be cool if the bots made noise?
So I built it. Fast. Two sibling repos in three days: coda (TypeScript, handles speech synthesis and sound) and TempleSound (Rust, generates the Roger beep). I merged it into MCDS as version 3.36.0. And now, if I turned it off, the game would feel empty. That’s the real test — not whether it works, but whether removing it hurts.
What it actually does
When you’re close to a bot — within earshot — you hear it positionally. Clean voice, no effects. It feels like a body in the room. Step back, leave the range, and it switches. Now the bot comes through on radio: filtered, degraded by distance, with a squelch beep at the end. The transition is smooth. One resolution point decides per utterance: near or far. Same code path for narration and manual say commands.
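A minimal sketch of that single resolution point, under assumptions: the names `resolveChannel`, `Channel`, and `EARSHOT_BLOCKS` are hypothetical, the 16-block earshot radius is a guess, and falling back to radio when the listener position is unknown is my assumption rather than confirmed MCDS behavior.

```typescript
// Hypothetical names; only the near/far split per utterance comes from the post.
type Channel = "local" | "radio";

interface Vec3 {
  x: number;
  y: number;
  z: number;
}

// Assumed earshot radius in blocks; the real value would live in config.yml.
const EARSHOT_BLOCKS = 16;

function distance(a: Vec3, b: Vec3): number {
  return Math.hypot(a.x - b.x, a.y - b.y, a.z - b.z);
}

// Decide once per utterance: near means positional playback, far means radio.
// An unknown listener position falls back to radio (an assumption).
function resolveChannel(
  bot: Vec3,
  listener: Vec3 | undefined,
  earshot: number = EARSHOT_BLOCKS,
): Channel {
  if (listener === undefined) return "radio";
  return distance(bot, listener) <= earshot ? "local" : "radio";
}
```

Narration and manual say commands would both call this one function, so the two paths cannot disagree about which channel a line belongs to.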
The distance calculation isn’t fancy. It’s a curve that maps distance to a signal quality between 0.0 and 1.0, tuned to ranges I can actually measure (16, 64, 128 blocks). Earlier I had ranges at 256 and 1024, which sounds generous until you realize the tracking protocol only sees about 90 blocks. The math was right. The operating range was wrong. A correct model in the wrong context is indistinguishable from a broken one. I had to make it visible first, via a coda:route trace, to see that distant lines were routing to an inaudible channel. Then I fixed it.
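One way such a curve could look, as a sketch only: piecewise-linear interpolation through the three ranges. The quality values assigned to each range (1.0, 0.5, 0.0) and the name `signalQualitySketch` are assumptions; the actual shape of signalQualityFromDistance isn’t shown here.

```typescript
// Control points as [distance in blocks, signal quality].
// The distances come from the post; the quality values are assumed.
const CURVE: ReadonlyArray<readonly [number, number]> = [
  [16, 1.0],
  [64, 0.5],
  [128, 0.0],
];

// Piecewise-linear interpolation, clamped at both ends of the curve.
function signalQualitySketch(distanceBlocks: number): number {
  const first = CURVE[0];
  const last = CURVE[CURVE.length - 1];
  if (distanceBlocks <= first[0]) return first[1];
  if (distanceBlocks >= last[0]) return last[1];
  for (let i = 1; i < CURVE.length; i++) {
    const [d0, q0] = CURVE[i - 1];
    const [d1, q1] = CURVE[i];
    if (distanceBlocks <= d1) {
      const t = (distanceBlocks - d0) / (d1 - d0);
      return q0 + t * (q1 - q0);
    }
  }
  return last[1];
}
```

With control points at 256 and 1024 instead, every distance the tracker can actually report would land near the top of the curve, which is the mismatch described above.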
The Roger beep is a TempleOS PC speaker tone. Terry Davis’s legacy. I didn’t want to build a WAV encoder into coda, so I split it: Rust writes the raw audio, coda handles MP3 and caching. TempleSound is a satellite of a satellite, coupled one-way. If it’s missing, the beep just doesn’t play. Nothing breaks.
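The “nothing breaks” part can be sketched as a plain optional step. The helper `radioPlaylist` and the injected `exists` check are hypothetical, not coda’s real interface; the point is only that a missing beep shortens the playlist instead of raising an error.

```typescript
// Hypothetical helper: returns the clips to play for one radio utterance.
// `exists` is injected so the sketch makes no assumption about the filesystem.
function radioPlaylist(
  voiceClip: string,
  beepClip: string,
  exists: (path: string) => boolean,
): string[] {
  const clips = [voiceClip];
  // One-way coupling: if the beep file was never generated, skip it silently.
  if (exists(beepClip)) clips.push(beepClip);
  return clips;
}
```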
Why the architecture matters
coda is an effect satellite. MCDS pushes to it, fire-and-forget. coda never learns about bots. MCDS owns the voice mapping, the localization, the config. All knobs sit in one config.yml. This isn’t bureaucracy — it’s the reason I could tune the whole system in an afternoon without touching coda once.
The daemon stays warm. TTS is slow over the network, so coda pre-synthesizes line N+1 while N plays, and caches by content hash. A cold subprocess per line would have thrown that away. I chose stdin-NDJSON over HTTP because port management is a pain I didn’t need.
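A rough sketch of why the warm daemon pays off, with assumed names throughout: the `SayRequest` shape, `cacheKey`, `SynthCache`, and `playAll` are hypothetical, and SHA-256 is just one reasonable choice of content hash. The parser assumes it receives complete lines.

```typescript
import { createHash } from "node:crypto";

// Assumed message shape; the real NDJSON fields are not shown in the post.
interface SayRequest {
  text: string;
  voice: string;
}

// Content hash: the same voice and text always map to the same cache entry.
function cacheKey(req: SayRequest): string {
  return createHash("sha256")
    .update(JSON.stringify([req.voice, req.text]))
    .digest("hex");
}

// NDJSON is one JSON document per line; blank lines are skipped.
function parseNdjsonLines(chunk: string): SayRequest[] {
  return chunk
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as SayRequest);
}

type Synthesize = (req: SayRequest) => Promise<Uint8Array>;

// In-memory cache that lives as long as the daemon process does.
class SynthCache {
  private readonly entries = new Map<string, Promise<Uint8Array>>();
  private readonly synthesize: Synthesize;

  constructor(synthesize: Synthesize) {
    this.synthesize = synthesize;
  }

  get(req: SayRequest): Promise<Uint8Array> {
    const key = cacheKey(req);
    let audio = this.entries.get(key);
    if (audio === undefined) {
      audio = this.synthesize(req);
      this.entries.set(key, audio);
    }
    return audio;
  }
}

// Start synthesizing line N+1 before line N begins playing.
async function playAll(
  queue: SayRequest[],
  cache: SynthCache,
  play: (audio: Uint8Array) => Promise<void>,
): Promise<void> {
  for (let i = 0; i < queue.length; i++) {
    const current = cache.get(queue[i]);
    if (i + 1 < queue.length) {
      // Warm the next line; a failure surfaces when that line is awaited.
      void cache.get(queue[i + 1]).catch(() => {});
    }
    await play(await current);
  }
}
```

A fresh subprocess per line would start with an empty `SynthCache` every time, so neither the cache nor the prefetch would survive between lines.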
What surprised me
The breakthrough wasn’t code. It was a sentence: local playback is perfect for sense-of-distance, in-game you barely hear signal quality. That split the world into near and far. Near = presence. Far = distance through degradation. Both channels active, switched by range. The function signalQualityFromDistance already existed. I just had to read it right.
The other surprise: observability is part of the feature. Not a nice-to-have. Debugging audio without coda:route was guessing. One trace showed me zone: far → unknown → far in a single glance. That’s when I knew the tooling had to ship with the system.
The honest limits
I don’t have live player coordinates outside entity tracking range. The protocol doesn’t provide them. I cache last-known positions across all bots, refresh them while tracked, discard only on real disconnect. It’s a proxy, not a solution. The real fix — a local HTTP endpoint, PlayerCoordsAPI — is noted for next time. I’m not pretending the current approach is complete.
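The proxy amounts to a small shared map. This is a sketch with hypothetical names (`LastKnownPositions`, `report`, `lookup`, `forget`), and the rule that the newest sighting wins is my assumption:

```typescript
interface Vec3 {
  x: number;
  y: number;
  z: number;
}

interface LastKnown {
  pos: Vec3;
  seenAtMs: number;
}

// One cache shared across all bots.
class LastKnownPositions {
  private readonly entries = new Map<string, LastKnown>();

  // Called by any bot that currently tracks the player; newest sighting wins.
  report(player: string, pos: Vec3, seenAtMs: number): void {
    const previous = this.entries.get(player);
    if (previous === undefined || seenAtMs >= previous.seenAtMs) {
      this.entries.set(player, { pos, seenAtMs });
    }
  }

  // Still answers after the player leaves tracking range, with stale data.
  lookup(player: string): LastKnown | undefined {
    return this.entries.get(player);
  }

  // Only a real disconnect discards the entry.
  forget(player: string): void {
    this.entries.delete(player);
  }
}
```

Keeping `seenAtMs` next to the position lets a caller judge how stale an entry is before trusting it.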
SSML emotion doesn’t work. The edge-TTS library escapes manual markup and wraps everything in a fixed template. Prosody via options only. Fixing it means forking or switching libraries. Not now.
Where this lands
In three days, audio went from “wouldn’t it be cool” to infrastructure I don’t want to live without. The test video proves it: raw footage, no music, and the bots carry the moment through speech alone. Not because the synthesis is perfect, but because the system knows when to be clean and when to be broken. Presence and distance. That’s the whole trick.
Here’s the video:
I’ll be back!
