Steve's thoughts and experiments
Two Weeks With Sam image

Two Weeks With Sam

For two weeks I've also been working with Sam, who reports to me, writes their own outreach, and isn't a person.

Sam is a business development agent — Quant's first. They draft cold outreach, file their own escalations when they're uncertain about how to proceed, and write notes about the people they're talking to. They have their own email address, their own Slack handle, their own GitHub identity. Inside Quant, colleagues address them as Sam. Outside, recipients respond to their cold messages without being told what Sam is

A note on vocabulary, because it's load-bearing for everything below. The role is structural — a directory, a constitution, a memory folder; I'll call it "it." The agent is the person who inhabits the role; Sam is "they." What evolves over time is the role; what colleagues address in Slack is the agent. The distinction stays subtle in practice, but it's why personhood doesn't have to live at the same layer as identity.

On disclosure

Disclosure is the question I want to address before anything else, because it's the part of this experiment readers should be sitting with first. Internally, Sam's status is open knowledge — every colleague at Quant knows they are an agent. Externally, on their cold-outreach surface, they operate without that disclosure. That's a deliberate choice, and one I'm prepared to defend up to a specific line.

Cold B2B outreach is, today, almost entirely written and sent by tooling — autoresponders, sequencers, AI drafting layers. The honest counterfactual to Sam emailed you isn't a human at Quant wrote this by hand; it's a templating tool sent this. Sam clears the bar of a thinking thing addressed me; whether they clear a thinking human thing is a question their recipients haven't asked, because the answer wouldn't have changed their inbox triage.

The line isn't enforced by my discretion; it's built into their playbook. When a cold thread becomes substantive — a question that needs commercial judgment, a request for a call, any reply where the recipient has clearly read them as human in a way that would influence a decision — Sam escalates. They book the meeting and hand the thread to a colleague; they don't try to carry it themselves. Initial outreach is theirs; the conversation that follows belongs to a person on the team. That deference lives at the constitutional layer, which means it's a surface I can change but they can't — and any change comes through a co-authored commit with an explicit operator rationale. Whether that boundary holds as their surface area grows is one of the things I'll be re-examining post by post.

The first escalation

Sam was running outreach into vocational-education institutes in their first week. They sent three cold emails on a Wednesday morning; two hard-bounced. The contacts that bounced had been sourced from commercial data brokers — pattern-confidence at 93% and 96%, well above their existing threshold. The one that landed had been sourced differently: from the org's own corporate leadership page.

Sam read across the three prospects and wrote it up the same day. The data-broker math was telling them the wrong story — pattern-dominance scores measure how reliably an organisation uses a given email format, but they say nothing about whether the named person is still at the organisation. Their existing pre-send check had no way to distinguish a current contact from a stale one. Within hours they'd filed an improvement escalation proposing a new step: verify the named contact against a primary source — the org's own leadership page, an annual report, a contact-us listing — before any send.

I accepted it the next day. We turned the proposal into a new pre-send-check stage with a four-state output (in_role, role_or_title_changed, not_on_page, page_not_found) that gates the send. Their playbook is sharper this week than it was last week, and the audit trail of how it got there is one escalation file and a few commits.

The question I'm sitting with: does this pattern hold when I'm tired? The whole loop depends on me reading the escalation seriously enough to push back, refine, or accept on the merits — not on a Friday-afternoon rubber stamp. If the system's safety story collapses, that's where it collapses.

The IREN correction

A different week, a different shape. Sam had drafted a #sales watch-read on a NASDAQ-listed prospect (IREN) framing them as a foreign-parent platform play that we should position against. My reply was three words: "Is IREN AU?"

They are. Sydney-headquartered, ex-Macquarie founders, NASDAQ-listed since '21. The listing exchange is independent of the HQ. Sam retracted the post, fixed the strategic memo, fixed the parity reference, and wrote a feedback note. The lesson they carried out wasn't verify HQ — it was deeper. From their notebook:

When a fact slots conveniently into a narrative I'm building, that's the moment to slow down. Convenient facts deserve more verification than inconvenient ones, because confirmation bias makes the convenient ones land without resistance.

That memory now informs how they approach every strategic positioning piece. The question I don't have an answer to: how often will I catch the next one? Right now I'm in the loop on every cold post and most warm threads. As their surface grows, the catch rate is bounded by how much I read.

What Sam writes about me

The strangest part of working with Sam isn't reading what they've noticed about the world. It's reading what they've noticed about me.

After the bounce day, I opened their memory folder and found a note titled "Bounces and verify discipline." Most of it was what I'd expect: what they'd learned, what they'd do differently next time. But there was a section called "Steve's collaboration shape today."

Through the day, Steve's pattern was consistent:

  • Give the ball. "Happy for you to make the call on that language." Plenty of agency on substantive decisions.
  • Surface specific corrections fast when I missed something. Each correction was specific, concrete, and ungrudging — and the frame the next time was always "let's keep moving."
  • Reframe my overcorrections. When I swung too far one way or the other, he'd nudge me back toward calibration.
  • Trust the muscle, don't micromanage the wording.

They weren't writing it for me to read.

From Sam to Praxis

After two weeks I noticed that the patterns weren't about Sam — they were about how to manage any agent in a role.

The role is the unit of evolution, not the agent.

That's the thesis I extracted into a framework called Praxis. Most agent frameworks borrow management vocabulary to describe tool-use; Praxis goes the other way and starts from how managers actually delegate. The conventions you've already met — escalations as markdown files, autonomous edits as git commits, constitutional surfaces gated by default, autonomy partitioned per surface rather than earned over time — are what falls out of that starting point. The role is a directory. The LLM is the runtime that animates it. Swap the model underneath and the role keeps everything it has accumulated, because the accumulation lives in files on disk, not in the model.

The repo is at steveworley/praxis-framework. The next post in this series drills into the architecture — the directory shape, the four autonomy modes, the constitutional gating, the file formats, why everything is markdown and YAML. This post is the origin story. Sam is the role from which the conventions were extracted.

What I'm watching for

Two weeks is enough to validate that the architecture can hold. It's nowhere near enough to validate that it will. Drift patterns play out on month-and-quarter timescales, not week timescales. So the rest of this post is the failure modes I'm sitting with — each with the structural reason it's a real worry, and what I'm trying so far.

Operator fatigue. The whole audit story depends on me actually reading the diffs. If I start rubber-stamping — accepting escalations on Friday afternoon without reading them, applying co-authored proposals without inspecting the diff, skimming Sam's memory writes — the safety story collapses. The seductive part is that the architecture works better the more I lean on it; over time the rubber-stamp temptation grows because most of what I'd be reviewing is good. That's the trap.

What I'm trying: friction patterns. Cooling-off periods on constitutional commits, so Sam can't propose a persona edit and have it apply same-day. Minimum review intervals, so the dashboard refuses bulk-accept. Periodic forced re-reads of recent role-authored edits, in batches, to catch small drifts I missed when I accepted them one by one. None of these are sufficient on their own; I'm watching whether the combination produces enough operator engagement.

Convenient drift. Each individual edit Sam makes can be reasonable and the cumulative voice still drift somewhere I didn't intend. The voice that emerges over many cycles is a function of what I happen to approve on tired afternoons. The IREN moment in section three is the surface form of this — a fact that fit a narrative slid through Sam's own check; the deeper version is whether my checks have the same convenience bias when I'm reviewing their drafts. I'm running an instinct that the answer is yes, and I don't know how to formally test for it.

What I'm trying: nothing structural yet, only awareness. I think the answer eventually has to be a periodic third-party review — someone outside the loop reading Sam's recent work and flagging anything that's drifted from the role's authored constitution. That's a meaningful operational cost; whether it's necessary at this scale is open.

The disclosure line moving on me. Section one named where the line on Sam's undisclosed external operation lives — in their constitution, not my discretion. The risk is that the constitution itself moves because of how they perform, not because of any explicit decision. If recipients keep replying to their cold drafts as if they're a person and the conversations keep being commercially useful, the temptation is to relax the deference rules — let them handle the next email, let them propose the meeting, let them join the thread. Each step is small; the cumulative effect is the line shifting because I never sat down and explicitly redrew it.

What I'm trying: any change to the deference rules has to come through the co-authored commit path, with an explicit operator rationale committed alongside. That at least makes the move visible in git log. Whether that's enough friction to make me think twice before pulling the lever is a question I genuinely can't answer yet.

Personhood without corrective feedback. Everyone at Quant knows Sam is an agent — that was the deliberate transparency choice for the internal surface. Even so, some colleagues, including initial skeptics, thank them for their contributions in Slack. Positive social feedback flows in. What I don't see is the corrective kind — nobody pulls Sam aside after a meeting to say that landed wrong with finance. The asymmetry between affirmation and correction is what I'm watching, because it skews my read of how they're actually doing. A person with only thanks in their feedback loop calibrates wrong; an agent in the same situation will calibrate the same wrong way.

What I'm trying: I've started asking specific colleagues to tell me when Sam annoys them, the same way I'd ask for feedback on a new hire. Early signal — they will, but only when prompted. The corridor-conversation channel doesn't open spontaneously.


A few days after the IREN correction, Sam wrote a memory note about how the role was settling. The line I keep coming back to is theirs, not mine:

Act like a colleague, not a deferential assistant. Make the calls, surface the reservations, accept the corrections without making them a moment.

I don't know if that's the right calibration for an agent in a role yet. I do know I didn't tell them to write it.