Last week I attended Code w/ Claude in London, Anthropic’s developer conference. All sessions are already online on Anthropic’s YouTube channel, so I won’t try to summarize every talk. Instead, here are the three things that stuck with me.

Crowd gathered in the main hall at Code w/ Claude London 2026 ahead of a keynote

Workflows, Not Coding Assistants

The first thing that struck me was the framing of the conference itself. I went in expecting a fair share of the talks to cover Claude Code and writing software faster. Almost none of the sessions touched on that. Instead, they were about agents deployed in production, automating real business workflows and improving continuously over time.

That shift in emphasis tells you where Anthropic thinks the interesting problems are. A few patterns kept showing up across the workflow-focused talks:

  • One Claude call, one job. Resist bundling planning, deciding, and executing into a single prompt. Stages with explicit handoffs are easier to debug and improve.
  • Use native formats. Markdown and SQL are languages the model already speaks fluently. A custom JSON schema is friction you’re inventing for yourself.
  • Mind the gap between the outer agent and any sub-agents. Anything you forget to hand down is where the workflow quietly breaks.
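
To make the first and third points concrete, here is a minimal sketch of how I picture staged calls with an explicit handoff. It is my own illustration, not code shown at the conference. The `Handoff` fields, the stage names, and `fake_model` (a stand-in for a real model call) are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for one model call: prompt in, text out.
ModelCall = Callable[[str], str]


@dataclass
class Handoff:
    """Everything a later stage needs, passed along explicitly."""
    ticket: str
    plan: str = ""
    decision: str = ""
    result: str = ""


def plan_stage(call: ModelCall, h: Handoff) -> Handoff:
    # One call, one job: this stage only plans.
    h.plan = call(f"Write a short plan (Markdown bullets) for:\n{h.ticket}")
    return h


def decide_stage(call: ModelCall, h: Handoff) -> Handoff:
    # This stage only decides, and it sees exactly what was handed down.
    h.decision = call(f"Given this plan, answer 'proceed' or 'escalate':\n{h.plan}")
    return h


def execute_stage(call: ModelCall, h: Handoff) -> Handoff:
    # Anything other than an explicit 'proceed' goes to a person.
    if h.decision.strip().lower() != "proceed":
        h.result = "escalated to a human"
        return h
    h.result = call(f"Carry out this plan.\nTicket: {h.ticket}\nPlan:\n{h.plan}")
    return h


def run_pipeline(call: ModelCall, ticket: str) -> Handoff:
    h = Handoff(ticket=ticket)
    for stage in (plan_stage, decide_stage, execute_stage):
        h = stage(call, h)
    return h


def fake_model(prompt: str) -> str:
    """Canned replies so the sketch runs without any API."""
    if prompt.startswith("Write a short plan"):
        return "- check the export settings\n- reply with steps"
    if prompt.startswith("Given this plan"):
        return "proceed"
    return "Sent the customer the export steps."


if __name__ == "__main__":
    print(run_pipeline(fake_model, "Export to CSV fails"))
```

Because every stage reads from and writes to the same `Handoff`, a missing field shows up as an empty string you can inspect rather than as context a sub-agent silently never received.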

Closing the Loop on Non-Deterministic Agents

Closing the loop is a familiar pattern for deterministic work: write a unit test, run the agent, let the test tell you whether the code is correct. That works because there’s a clear right answer.

What’s harder, and what several talks pushed on, is closing the loop when the output is non-deterministic. Think of a customer support reply, a generated summary, or a tone-sensitive message: there’s no test that says “this was correct.”

The pattern that stood out to me: build the feedback path for humans first. One example I saw was a customer support agent where every reply went past a human, the human could correct or rate it, and that feedback fed directly back into the agent’s instructions. The agent’s behavior improved over time without anyone editing its system prompt by hand.
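
I didn’t see the implementation behind that example, so the following is only a rough sketch of the idea under my own assumptions: reviewer notes are stored as short “lessons” and appended to the base instructions on the next run. The `FeedbackLoop` class and its fields are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FeedbackLoop:
    base_instructions: str
    max_lessons: int = 20  # cap so the instructions do not grow forever
    lessons: List[str] = field(default_factory=list)
    ratings: List[int] = field(default_factory=list)

    def record(self, rating: int, note: str = "") -> None:
        """Store a reviewer's rating and, if given, a correction note."""
        self.ratings.append(rating)
        note = note.strip()
        if note and note not in self.lessons:
            self.lessons.append(note)
            # Keep only the most recent lessons.
            self.lessons = self.lessons[-self.max_lessons:]

    def instructions(self) -> str:
        """Instructions for the next run: base text plus accumulated lessons."""
        if not self.lessons:
            return self.base_instructions
        bullets = "\n".join(f"- {lesson}" for lesson in self.lessons)
        return f"{self.base_instructions}\n\nLessons from human review:\n{bullets}"


if __name__ == "__main__":
    loop = FeedbackLoop("You are a customer support agent.")
    loop.record(2, "Never promise a refund. Point to the refund policy instead.")
    loop.record(5)  # a good reply with no note adds no lesson
    print(loop.instructions())
```

The point of the design is that the reviewer only rates and corrects. Nobody opens the system prompt, yet the next reply is generated from instructions that already include the correction.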

Two related ideas worth holding on to:

  • Encode taste, not scripts. Agents need judgment, so give them principles instead of rigid rules and let them decide.
  • The ceiling is an agent that quietly improves itself with barely any team upkeep. The compounding gains come from the feedback path, not from the next prompt rewrite.

Measure the Subjective

The third concept ties the first two together: you can actually measure “feels right.”

The pattern is deterministic graders (hard rules: did the call return JSON? did the required field exist?) paired with an LLM judge for the things rules can’t capture (tone, helpfulness, whether the output is genuinely useful). Score on a 0–5 scale, and force the judge to write its pros and cons before it gives the score. Otherwise the score leads the reasoning: the judge commits to a number first and then justifies it.
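
Here is a small sketch of how the two graders could fit together. It reflects my reading of the pattern, not a reference implementation. The judge is any function that takes a prompt and returns text. The prompt wording, the `reply` field name, and `fake_judge` are assumptions made up for the example.

```python
import json
import re
from typing import Callable, Dict, List, Sequence

JUDGE_PROMPT = (
    "Evaluate the support reply below for tone and helpfulness.\n"
    "First list pros, then cons, and only then give a score.\n"
    "Use exactly this layout:\n"
    "Pros: ...\nCons: ...\nScore: <0-5>\n\n"
    "Reply:\n{output}"
)


def deterministic_checks(raw: str, required: Sequence[str] = ("reply",)) -> List[str]:
    """Hard rules: is it JSON, and are the required fields present?"""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    return [f"missing field: {name}" for name in required if name not in data]


def parse_judge(text: str) -> int:
    """Extract the 0-5 score, rejecting verdicts that score before reasoning."""
    match = re.search(r"^Score:\s*([0-5])\s*$", text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Score: <0-5>' line found")
    pros, cons = text.find("Pros:"), text.find("Cons:")
    if pros == -1 or cons == -1 or max(pros, cons) > match.start():
        raise ValueError("judge must write pros and cons before the score")
    return int(match.group(1))


def grade(raw: str, judge: Callable[[str], str]) -> Dict[str, object]:
    failures = deterministic_checks(raw)
    if failures:
        # Skip the judge entirely when a hard rule fails.
        return {"passed": False, "failures": failures, "score": None}
    verdict = judge(JUDGE_PROMPT.format(output=raw))
    return {"passed": True, "failures": [], "score": parse_judge(verdict)}


def fake_judge(prompt: str) -> str:
    """Canned verdict so the sketch runs without a model."""
    return "Pros: clear and polite\nCons: slightly long\nScore: 4"


if __name__ == "__main__":
    print(grade('{"reply": "Here is how to export your data."}', fake_judge))
    print(grade("not json at all", fake_judge))
```

Running the cheap checks first means the judge is only asked about outputs that are structurally valid, and the parser enforces the reasoning-before-score order instead of trusting the prompt to do it.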

And evals aren’t a one-and-done thing. A stale eval is worse than no eval because it gives you false confidence. Recalibrate them often.

I’ve been thinking about this in the context of my own HealthExport support automation, where “did Claude write a good reply?” is exactly the kind of subjective question I’ve been answering manually in Telegram. The eval pattern is what would let me close that loop properly.

Notes From a 1:1 With an Anthropic Engineer

The conference included optional 1:1 slots with Anthropic engineers. Three things from mine I keep coming back to:

  • Automate first. Models keep getting smarter on their own. If you wire up the automation now, the agent improves while you sleep. Don’t wait for the “perfect” model before you start.
  • Skills can branch. A skill doesn’t have to map to one output. Let the model pick the branch based on context (for example, one bug-reporting skill that pipes into either Linear or Jira depending on the project) instead of duplicating the skill per route.
  • Build eval skills. Your evaluator is itself a skill, and probably the most valuable one to own. It’s what gives every other skill room to improve.
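
The branching idea is easy to sketch. What follows is a toy version under my own assumptions: the two “trackers” are plain functions that build a payload (no real Linear or Jira API is called), and `choose_branch` stands in for the model picking a route from context.

```python
from typing import Callable, Dict, List

Report = Dict[str, str]


def to_linear(report: Report) -> Dict[str, str]:
    # Hypothetical payload shape, not the real Linear API.
    return {"tracker": "linear", "title": report["title"]}


def to_jira(report: Report) -> Dict[str, str]:
    # Hypothetical payload shape, not the real Jira API.
    return {"tracker": "jira", "summary": report["title"]}


# One skill, several branches.
BRANCHES: Dict[str, Callable[[Report], Dict[str, str]]] = {
    "linear": to_linear,
    "jira": to_jira,
}


def file_bug(report: Report, choose_branch: Callable[[Report, List[str]], str]) -> Dict[str, str]:
    """Let the chooser (the model, in practice) pick a branch, then validate it."""
    choice = choose_branch(report, sorted(BRANCHES)).strip().lower()
    if choice not in BRANCHES:
        raise ValueError(f"unknown branch: {choice!r}")
    return BRANCHES[choice](report)


def fake_chooser(report: Report, options: List[str]) -> str:
    """Stand-in for the model: route by project name."""
    return "jira" if report.get("project") == "backend" else "linear"


if __name__ == "__main__":
    print(file_bug({"project": "backend", "title": "Crash on export"}, fake_chooser))
    print(file_bug({"project": "ios-app", "title": "Button misaligned"}, fake_chooser))
```

Validating the choice against the known branches keeps the model’s freedom limited to picking a route, so a surprising answer fails loudly instead of filing a bug nowhere.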

Takeaways

  • The frontier is agent workflows in production, not coding assistants. That’s where most of the conference pointed.
  • Build the human-feedback path for non-deterministic agents. Make it cheap for people to correct the agent, and let those corrections feed back into how it behaves.
  • Subjective quality is measurable. Pair deterministic checks with an LLM judge, and treat your evals as a living artifact.
  • Automate now. The model is going to keep getting better. Make sure the loop is in place to ride that curve.

Tomas Parizek standing next to the Code w/ Claude conference sign in London