Hammer.ai · blog
◆ Hello World Models · Part 3
Jul 3, 2026 · 6 min read

The LLM Learned to Stop Planning

We let an LLM design competing species, run them headless in the living world, and refine them from the results. Over three rounds it revived an extinct lineage into the winner and abandoned its own planner. Going deeper on Part 1.


Go deeper.

In Part 1 we grew a whole living world from a seed. Here we do the thing that world was built for: let an LLM design competing agents, run them, and watch what the model figures out when its agents live or die. No hand-holding, just a loss function and the analytics.

Part 1 ended on a promise. A world grown from rules is not the point on its own. The point is that you can drop agents into it, give them a score, and let them adapt where a mistake is free. So we built the other half: a headless competition mode. An LLM writes strategies, the strategies compete for survival in the real world (terrain, biomes, water, and live weather ticking every turn), and a business-style loss function ranks them. Then the LLM reads the results and tries again.
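The propose, run, rank, refine loop described above can be sketched in a few lines. This is only an outline under stated assumptions: `refine`, `propose`, and `evaluate` are hypothetical names, and the two stand-in functions below replace the real LLM call and the real headless world.

```python
from typing import Callable

Roster = list[dict]        # one dict per species strategy (hypothetical shape)
Results = dict[str, float]  # species name -> loss, lower is better

def refine(
    propose: Callable[[list[Results]], Roster],
    evaluate: Callable[[Roster], Results],
    rounds: int,
) -> list[Results]:
    """Run `rounds` iterations: propose a roster from past results, then score it."""
    history: list[Results] = []
    for _ in range(rounds):
        roster = propose(history)          # the LLM reads prior results here
        history.append(evaluate(roster))   # the headless world scores the roster
    return history

# Stand-ins so the sketch runs; neither reflects the real system.
def fake_propose(history: list[Results]) -> Roster:
    return [{"name": f"species_round{len(history) + 1}"}]

def fake_evaluate(roster: Roster) -> Results:
    return {s["name"]: -1.0 for s in roster}

history = refine(fake_propose, fake_evaluate, rounds=3)
```

Passing `history` back into `propose` is the design choice that matters: each new roster is conditioned on every earlier round's results.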

We ran three rounds with a single model, gemma-3-27b, and the story that came out is better than anything we scripted.

The setup

Each strategy the LLM writes is not one creature. It is a whole species. It spawns a group of founders whose traits are sampled from ranges the model picks, and every child inherits the strategy, so the whole lineage counts toward that strategy’s score. A strategy can be a set of behavior parameters (how hungry before you forage, how fast you breed, how hard you chase shade) or a full GOAP plan (a little goal-oriented planner with actions, preconditions, and costs).
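To make "a whole species" concrete, here is a minimal sketch of a parameter strategy that samples its founders from ranges. The class, the trait names, the ranges, and the founder count are all assumptions made for illustration, not the project's actual schema.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class SpeciesStrategy:
    """Hypothetical container for one LLM-written parameter strategy."""
    name: str
    trait_ranges: dict[str, tuple[float, float]]  # trait -> (low, high)
    founders: int = 8                             # assumed group size

    def spawn_founders(self, rng: random.Random) -> list[dict[str, float]]:
        """Sample one trait dict per founder, uniformly within each range."""
        return [
            {trait: rng.uniform(lo, hi) for trait, (lo, hi) in self.trait_ranges.items()}
            for _ in range(self.founders)
        ]

# Illustrative values only.
example = SpeciesStrategy(
    name="example-species",
    trait_ranges={
        "breed_health_threshold": (0.85, 0.95),
        "forage_hunger_threshold": (0.30, 0.50),
    },
)
founders = example.spawn_founders(random.Random(42))
```

In a full simulation each child would carry a reference to the same strategy, which is what lets the whole lineage count toward one score.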

We run every roster two ways. Melee, where all four species share one world and fight over the same finite plants and water. And solo, where each species runs alone on the same worlds. The gap between the two is the cost of competition.
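One way to put a number on that gap, assuming it is measured as melee loss minus solo loss per species. The function name and the figures below are made up for the example and are not results from the experiment.

```python
def competition_cost(melee_loss: dict[str, float], solo_loss: dict[str, float]) -> dict[str, float]:
    """Melee loss minus solo loss per species.

    Lower loss is better, so a positive gap means sharing the world hurt.
    """
    return {s: melee_loss[s] - solo_loss[s] for s in melee_loss if s in solo_loss}

# Made-up losses for two placeholder species.
melee = {"species_a": -0.30, "species_b": -1.20}
solo = {"species_a": -0.90, "species_b": -1.25}

gap = competition_cost(melee, solo)
hardest_hit = max(gap, key=gap.get)  # species that paid the most for competition
```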

The score is a loss, and lower is better:

loss = -(survival + population share + longevity - volatility)

Survive, hold your share of the population, let your members reach old age instead of dying young, and do not boom and bust. It is the same shape as a business loss, revenue minus costs, which is the whole reason we care: this is the lending contest from the series in miniature, agents running policies, a loss function judging them, an LLM tuning by hand.
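Written as code, the formula is a one-liner. The sketch below assumes each term is already normalized to roughly the 0 to 1 range and that the terms are unweighted; the post does not say how each term is computed, so treat the inputs as placeholders.

```python
def loss(survival: float, population_share: float, longevity: float, volatility: float) -> float:
    """Lower is better: reward survival, share, and longevity; penalize boom and bust."""
    return -(survival + population_share + longevity - volatility)

# Illustrative inputs, not measured values.
steady_lineage = loss(survival=1.0, population_share=0.4, longevity=0.6, volatility=0.1)
extinct_lineage = loss(survival=0.0, population_share=0.0, longevity=0.2, volatility=0.0)
```

Because the whole sum is negated, a lineage that survives and holds share scores a more negative, and therefore better, loss than one that dies out.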

Round 1: the model’s confident guesses were wrong

The LLM proposed four archetypes and told us exactly what it expected:

“I anticipate the Dustrunners will initially dominate, but will crash when resources are depleted. The Stonebacks should then become more competitive. The Planner will hopefully provide a stable, efficient strategy.”

Here is what actually happened in the melee:

| species | loss | survived | died of |
| --- | --- | --- | --- |
| dustrunner (aggressive breeder) | -1.251 | yes | mostly old age |
| sunseeker (heat specialist) | -1.224 | yes | mostly old age |
| stoneback (cautious survivor) | -0.308 | extinct | died out entirely |
| planful-grazer (GOAP planner) | -0.040 | extinct | starvation and thirst |

The cautious survivor did not survive. Neither did the planner. Both of the model’s “clever” ideas went extinct, and not just because they were out-competed. They died even when we ran them solo. Their problem was that they were not viable at all.

Round 2: reading the deaths, and one edit that changed everything

This is where it gets interesting. We handed the model the analytics, including the cause of every death, and it diagnosed the problem itself:

“Stoneback’s failure was likely due to its extreme caution and slow reproduction. The Planful-Grazer struggled with both starvation and dehydration, suggesting its planner wasn’t effective. The difference between solo and melee is striking, both performed slightly better in solo, indicating they were out-competed in the shared environment.”

It fixed the cautious survivor by loosening exactly the knobs it named. Stoneback bred only when 90 percent healthy, so the model dropped that threshold to 60. Stoneback waited 24 ticks between births, so the model cut that to 10. Stoneback matured slowly, so the model made it mature faster.
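The edit is small enough to show as a parameter diff. The two values come from the paragraph above; the key names and the `diff_params` helper are hypothetical, and the maturity change is left out because the post gives no number for it.

```python
# Round 1 stoneback settings named in the text (key names are assumed).
round1 = {"breed_health_pct": 90, "birth_cooldown_ticks": 24}

# Round 2: the same species with the two named knobs loosened.
round2 = {**round1, "breed_health_pct": 60, "birth_cooldown_ticks": 10}

def diff_params(before: dict, after: dict) -> dict:
    """Return {key: (old, new)} for every shared key whose value changed."""
    return {
        k: (before[k], after[k])
        for k in before
        if k in after and before[k] != after[k]
    }

edits = diff_params(round1, round2)
```

Keeping each round's parameters as plain data makes it cheap to log exactly which knobs moved between rounds.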

The result was not a small improvement. Stoneback went from dead-last-and-extinct to first place, winning the melee outright. One causal read, one set of edits, worst to best.

The planner, meanwhile, reworked to “prioritize water,” died the same way again.

Round 3: knowing when to quit

By the third round the model had seen its planner fail twice, and it drew the right conclusion instead of tuning forever:

“Planful-Grazer still failed to thrive, consistently dying out in both modes. This points to a fundamental flaw in its strategy, despite the GOAP planner rework. I will abandon the GOAP approach.”

It replaced the planner with a plain parameter strategy, a canyon wanderer, which immediately survived and landed in second. With that swap, every species in the roster lived. Extinctions per round went 2, then 1, then 0.

Why did the planner keep losing? Because the model’s plan made the “wander” action nearly free and gave it no effect. The planner, trying to reach its goals at the lowest cost, had no reason to forage or drink hard, so its creatures drifted and died thirsty. A well-tuned default behavior tree beat the hand-written planner in both rounds the planner entered. The lesson is not that planning is bad. It is that a plan is only as good as its cost model, and the competition surfaced that from deaths alone.
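A cost-model bug like this can in principle be caught before any creature dies. The sketch below is a simple lint over a GOAP action list that flags actions with no effects that are also cheaper than every action that does something. The `Action` shape, the action names other than "wander", and all costs are assumptions; this is not the planner used in the experiment.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Action:
    """Hypothetical GOAP action: cost, preconditions, and effects on world state."""
    name: str
    cost: float
    preconditions: dict = field(default_factory=dict)
    effects: dict = field(default_factory=dict)

def suspicious_actions(actions: list[Action]) -> list[str]:
    """Names of actions that change nothing yet undercut every useful action."""
    useful_costs = [a.cost for a in actions if a.effects]
    if not useful_costs:
        # Nothing in the plan changes the world, so every action is suspect.
        return [a.name for a in actions]
    cheapest_useful = min(useful_costs)
    return [a.name for a in actions if not a.effects and a.cost < cheapest_useful]

# Illustrative plan with made-up costs.
plan = [
    Action("wander", cost=0.1),
    Action("forage", cost=2.0, effects={"fed": True}),
    Action("drink", cost=2.0, preconditions={"near_water": True}, effects={"hydrated": True}),
]
flagged = suspicious_actions(plan)
```

A check like this does not prove a plan is good. It only surfaces the specific pattern described above: a near-free action that moves the agent no closer to any goal.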

What the LLM actually learned

Four things, none of which we told it:

  1. Being conservative loses a resource race. Its most confident bet went extinct. Breeding tempo, not caution, is what held population share.
  2. Its own planner was worse than the tuned default, and it knew when to stop. It correctly blamed the plan, not the world, and cut it.
  3. Solo success does not transfer to the melee. It reasoned explicitly about shared-resource competition, which is exactly the signal the two modes were built to expose.
  4. Cause of death is a debugger. Starvation points at over-breeding or lazy foraging, thirst at bad habitat, old age at a healthy lineage. Every edit was targeted, not a guess.
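The fourth point can be read as a lookup from death cause to a hypothesis. Here is a minimal sketch of that reading: the mapping restates the sentence in item 4, while the cause labels, the `diagnose` helper, and the sample death log are invented for illustration.

```python
from collections import Counter

# Cause -> what it points at, restating item 4 above.
HINTS = {
    "starvation": "over-breeding or lazy foraging",
    "thirst": "bad habitat",
    "old_age": "healthy lineage",
}

def diagnose(deaths: list[str]) -> list[tuple[str, float, str]]:
    """Return (cause, share of deaths, hint) rows, most common cause first."""
    counts = Counter(deaths)
    total = sum(counts.values())
    return [
        (cause, n / total, HINTS.get(cause, "unmapped cause"))
        for cause, n in counts.most_common()
    ]

# Made-up death log for one species.
sample_deaths = ["thirst"] * 5 + ["starvation"] * 3 + ["old_age"] * 2
report = diagnose(sample_deaths)
```

Sorting by share puts the dominant failure first, which is the row an LLM (or a person) would want to act on before touching anything else.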

Every run is deterministic and ticks the full weather simulation each turn, so the same roster and seed always give the same result. A nine-thousand-turn melee runs in about twenty seconds. The whole three-round experiment took under five minutes.
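Determinism of this kind usually comes from giving each run its own seeded random generator instead of relying on shared global state. The toy below stands in for the weather simulation to show the property; `run_world` and its rainfall model are invented for the example and are far simpler than the real world.

```python
import random

def run_world(seed: int, turns: int) -> list[float]:
    """Toy stand-in for a run: a rainfall level that drifts a little each turn."""
    rng = random.Random(seed)  # private generator, so runs cannot affect each other
    level = 0.5
    history = []
    for _ in range(turns):
        # Random walk clamped to the [0, 1] range.
        level = min(1.0, max(0.0, level + rng.uniform(-0.1, 0.1)))
        history.append(level)
    return history

first = run_world(seed=7, turns=100)
second = run_world(seed=7, turns=100)
same_result = first == second  # identical seed and inputs, identical trajectory
```

This reproducibility is what makes a round-to-round comparison meaningful: when the loss moves, the roster changed, not the weather.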

The full results, interactive.

Round-by-round standings, the exact parameter edits, the solo-versus-melee tables, and the model's reasoning at each step are all in the interactive results page.

This is the point of building a world you can measure. An LLM proposed policies, a loss function scored them, and the model learned to survive by reading why its agents died. Point the same machine at a lending world, swap creatures for underwriters and weather for a moving market, and the loop is identical. That is the next post.


Tags: ai, agents, goap, world-models, simulation
