Artur

Your agent’s memory might be discriminating. Would you even know?

Artur — Sat, 13 Jun 2026 23:20:06 GMT

In 1971 a cataloger named Sanford Berman published a book about a list of words, and the list turned out to have a conscience.

The list was the Library of Congress Subject Headings, the controlled vocabulary that sat under American librarianship the way wiring sits inside a wall. Nobody looked at it; that was rather the point. Berman looked. His book, Prejudices and Antipathies, made one specific charge: the headings encoded racism, sexism, and xenophobia, and they did it while wearing the uniform of neutral technical classification. Things filed under “PRIMITIVE” that were nothing of the sort. A subject heading felt like a fact about where a book belonged. It was a value judgment in disguise.

I build agents, and more recently I have been working on how they remember. The thing Berman caught the card catalog doing in 1971 is the thing I watch personalization systems do now. Same mechanism. Newer disguise.

This is the second half of an argument. The first half: persistent agent memory is, underneath the embeddings, a cataloging problem, and a competent team rebuilds library science without noticing it. That piece laid out a real architecture — a write path that normalizes signals into an authority-controlled vocabulary, a read path that queries it, a maintenance path that weeds it — and framed the payoff as data-protection compliance, the boring patterns turning out to be legal machinery. That was the easy half. This is the half about what happens when the cataloged thing is a person who can be discriminated against, where an older library problem switches on: not how to organize knowledge, but how classification harms the people being classified.

Keep that architecture in view, because nothing here is a fairness module bolted onto the side. The proxy lives in the same controlled vocabulary your agent already reads from and writes to. The protective tag is a field on the same record. The outcome monitor is one more job on the same maintenance path. This is not a second system sitting next to the memory. It is the memory system, asked to account for the one kind of record that can take you to court: a person.

So the answer comes in three parts, ranked by how much each one saves you. A reframe: proxy discrimination is an undeclared related-term edge. A mechanism: tag by what the system may do, not by what it inferred. A boundary: the catalog prevents, a monitor catches, and you need both.

When it switches on

Start with scope, because a lot of readers will otherwise decide this is someone else’s problem.

Discrimination is comparative. It cannot exist at n=1, because one person cannot be treated unfairly relative to a group that has no other members. So the inheritance switches on under exactly one condition: when a single operator’s classifications act on a population that can be sorted into protected classes. The trigger is the population. It is not “enterprise versus personal,” which is the axis everyone reaches for and the wrong one.

That matters, because “personal assistant” means two different things. Self-hosted, one user, one beneficiary: genuinely n=1, and the discrimination harm cannot form, though the privacy reasons to be careful all survive. A personal-assistant product, one vendor serving millions of “personal” instances, is the opposite animal: not n=1 at all, because the operator makes correlated decisions across a whole population while every session feels private from the inside. The per-user framing is camouflage, not exemption. Per-agent memory banks, the kind the last piece described, make the population invisible, not absent. And the case that flips the count entirely is the single-user assistant you point at other people, the one screening résumés or ranking tenants: one user, and a protected population living inside its data. The risk attaches to whoever is being classified, not to how many people are typing.

A proxy is an undeclared related-term edge

This is the load-bearing idea. Take it if you take nothing else.

The agent is not dangerous because it stores protected attributes. Nobody competent stores race, health, or sexuality, and provenance keeps the inferences honest. It is dangerous because it acts on proxies. ZIP code stands in for race. A run of maternity-adjacent queries stands in for pregnancy. Browsing cadence stands in for disability. The system never names the attribute. It learns the correlation and acts on it anyway.

The law, where it has teeth, does not care that you never named it. US employment law under disparate impact is the cleanest case, and it is older than most of the engineers reading this. In 1971, the same year Berman published, the Supreme Court decided Griggs v. Duke Power and held that an employment practice “fair in form but discriminatory in operation” is unlawful under Title VII, intent or no intent. Effect over intent is not a modern AI-ethics talking point. It has been black-letter law since the year of the card-catalog reckoning. So “we apply the same model to every user” is not a defense; a neutral rule that lands differently on a protected class is the definition of disparate impact, not an escape from it. Engineers hear “same model for everyone” as fairness. The law hears it as the mechanism.

Structurally, a proxy is a relationship between two concepts, and relationships between concepts are the oldest furniture in library science. A controlled vocabulary relates its terms through explicit links: a broader term, a narrower term, and the dangerous one, the related term, the “see also” edge asserting that two concepts travel together. “ZIP code is associated with race” is a related-term edge. Your whole map of proxy risk is a set of those edges, sitting in a thesaurus nobody wrote down. Embedding similarity is a weak, implicit version of that thesaurus, and the gap between implicit and declared is where the liability lives.

That thesaurus is not hypothetical, and this is the part that connects back to the first piece. It is the authority-controlled vocabulary the memory system already maintains — the one the write path normalizes into and the read path queries. The proxy edge belongs there, a first-class relation sitting beside the concepts the agent already stores, governed the same way every other term is governed.

[Figure: The link your agent’s memory never declared]

The fix follows directly. Declare the edges. Make each dangerous association a first-class object with a provenance and an admitting authority, the way Berman’s successors eventually forced the Library of Congress to make its headings explicit and contestable. One instinct says detect the correlation after the fact. The other says declare and govern it up front. Only the second produces a record you can put in front of a regulator, because a correlation buried in an embedding is bias hiding behind technical neutrality. That is the whole problem, stated once.
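A minimal sketch of what a declared edge might look like, in Python. The class names, field names, and the values for provenance and admitting authority are illustrative assumptions rather than a reference implementation. The only point is that the association becomes a record someone admitted, not a weight nobody wrote down.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Relation(Enum):
    BROADER = "BT"
    NARROWER = "NT"
    RELATED = "RT"  # the "see also" edge; proxies live here


@dataclass(frozen=True)
class Edge:
    source: str             # concept the agent actually stores
    target: str             # concept it travels with
    relation: Relation
    provenance: str         # how the association was established
    admitted_by: str        # the authority that approved the edge
    admitted_on: date
    is_proxy: bool = False  # True when the target is a protected class


class Vocabulary:
    def __init__(self) -> None:
        self._edges: list[Edge] = []

    def declare(self, edge: Edge) -> None:
        self._edges.append(edge)

    def proxies_for(self, concept: str) -> list[Edge]:
        """Declared proxy edges leaving a stored concept."""
        return [e for e in self._edges if e.source == concept and e.is_proxy]


vocab = Vocabulary()
# Hypothetical provenance and authority names, for illustration only.
vocab.declare(Edge(
    source="zip_code", target="race", relation=Relation.RELATED,
    provenance="outcome_monitor", admitted_by="vocabulary_review_board",
    admitted_on=date(2026, 6, 1), is_proxy=True,
))
vocab.declare(Edge("dark_mode", "display_preferences", Relation.BROADER,
                   "user_declared", "vocabulary_review_board", date(2026, 6, 1)))

print([e.target for e in vocab.proxies_for("zip_code")])  # ['race']
```

The edge describes a relationship between vocabulary terms, not a fact about any person, which is what lets it be reviewed and contested like any other heading.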

Tag by use, not by attribute

The reflex is to tag sensitive memories by content: inferred: pregnant. That manufactures the liability. Under data-protection law, classifying a record as health-related is the act that creates regulated health data. The tag you added to be careful becomes the regulated artifact, and it is also a stored guess about a person, wrong some honest fraction of the time. You reached for the safety equipment and picked up the live wire.

Warrant is the library concept for why a term earns its place in a vocabulary. Literary warrant admits a term because the material is about it. Use warrant admits it because a use has to be supported or governed. Flip from literary to use warrant and the liability inverts. Tag the memory not by the attribute it reveals, which is a fragile claim about a person, but by what the system may do with it, which is a policy verb about behavior. Not pregnant, but no_price_influence. Mechanically this is the normalize-and-appraise step of the write path doing one extra job: before a memory is committed, decide not just what it is but what may be done with it. The protected-class reasoning still happens, once, transiently, in the layer that sets the restriction, and then it is discarded. What persists is the permission, not the assertion.

A permission is accuracy-tolerant, and that is the point. You do not have to be right that she is pregnant. You only have to be right that this evidence is sensitive-adjacent and must not move the price. For proxies, where the guess is usually wrong, that tolerance is the whole value. For the categories that touch money or access, go further and do not form the belief at all: never-form there, form-and-restrict elsewhere, cut by consequence. One tension survives, and I would rather name it than paper over it. A regulator asking “why was this restricted?” needs a real answer, so you cannot fully minimize. You keep a deliberately coarse reason, restricted_reason: sensitive_adjacent and never restricted_reason: pregnant, which is fine enough to answer the question and too blunt to reconstruct the person from it.
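One way the appraise step could look, sketched in Python under stated assumptions: the signal names, the two lookup tables, and the helper functions are all hypothetical, and the protected-class reasoning is reduced to a table lookup to keep the example short. What matters is what gets persisted: a policy verb and a coarse reason, never the inferred attribute.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical table: sensitive-adjacent evidence mapped to policy verbs.
SENSITIVE_ADJACENT = {
    "maternity_adjacent_query": {"no_price_influence", "no_ad_targeting"},
}
# Hypothetical never-form list: evidence too consequential to store at all.
NEVER_FORM = {"credit_distress_signal"}


@dataclass
class MemoryRecord:
    concept: str
    restrictions: set = field(default_factory=set)
    restricted_reason: Optional[str] = None  # deliberately coarse


def appraise(signal: str, concept: str) -> Optional[MemoryRecord]:
    """Decide what may be done with a memory before it is committed."""
    if signal in NEVER_FORM:
        return None  # never-form: nothing is persisted
    restrictions = SENSITIVE_ADJACENT.get(signal)
    if restrictions:
        # form-and-restrict: keep the permission, discard the inference
        return MemoryRecord(concept, set(restrictions), "sensitive_adjacent")
    return MemoryRecord(concept)


def may(record: MemoryRecord, action: str) -> bool:
    """True unless a matching no_<action> restriction is on the record."""
    return f"no_{action}" not in record.restrictions


rec = appraise("maternity_adjacent_query", "shopping.recent_category")
print(may(rec, "price_influence"))  # False
print(may(rec, "recommendation"))   # True
print(rec.restricted_reason)        # sensitive_adjacent
```

Nothing on the record says what the system guessed. The check at read time only asks whether an action is permitted.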

The temporal part, and why it isn’t decay

A reader from the cognitive-science tradition has probably been waiting for memory decay by now — a forgetting curve, an exponential half-life, weights that fade so old facts quietly lose their grip. The first piece argued that decay is the wrong primitive for agent knowledge: human memory is lossy by design, and agent knowledge should be versioned instead. For discrimination the argument gets sharper, because three different things change on three different clocks, and a single decay rate models none of them.

Proxy edges drift. A correlation in your traffic is not stationary; ZIP-to-race can hold this quarter and weaken the next, so the edge carries temporal validity, a valid_from and a valid_to, and the monitor re-runs on a schedule to retire edges that stop holding. Decay would only lower a weight. Versioning lets you say when the edge was true, and prove it.
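A small Python sketch of an edge with temporal validity. The valid_from and valid_to fields come from the text above; the class name, the half-open window convention, and the retire helper are assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ProxyEdge:
    source: str
    target: str
    valid_from: date
    valid_to: Optional[date] = None  # None means still in force

    def in_force(self, on: date) -> bool:
        # Half-open window: valid_from is included, valid_to is excluded.
        return self.valid_from <= on and (self.valid_to is None or on < self.valid_to)


def retire(edge: ProxyEdge, on: date) -> ProxyEdge:
    """Close the validity window instead of deleting or down-weighting the edge."""
    return replace(edge, valid_to=on)


edge = ProxyEdge("zip_code", "race", valid_from=date(2026, 1, 1))
# Hypothetical: a scheduled monitor run finds the correlation no longer holds.
edge = retire(edge, on=date(2026, 4, 1))

print(edge.in_force(date(2026, 2, 15)))  # True
print(edge.in_force(date(2026, 5, 1)))   # False
```

A retired edge still answers the question "was this in force on that date?", which a decayed weight cannot.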

Sensitive restrictions expire; they do not fade. A pregnancy-adjacent restriction has a window, and when the window closes you deaccession the inference deliberately, on a clock, with an audit trail. That is the weeding move from the first piece pointed at sensitive data. A decayed weight that quietly stops restricting is the worst of both worlds: still stored, no longer protecting.

Retention is a legal clock, not a curve. Right-to-erasure and retention schedules demand provable deletion on a date, not a probability that a memory is unlikely to resurface. The versioned valid_to from the first piece is what makes that enforceable; a forgetting curve is not. So the temporal model here is a calendar, not a curve. You version, you schedule, you deaccession, and every one of those is an event you can show someone — which is the entire reason you are doing this instead of letting an embedding forget on its own.
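The calendar model can be sketched as a scheduled job that removes expired records and writes an event for each removal. This is an illustration in Python, not a retention system: the record shape, the identifiers, and the audit-log format are assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SensitiveRecord:
    record_id: str
    policy: str        # e.g. "no_price_influence"
    expires_on: date   # the end of the restriction window


def deaccession_due(records, today, audit_log):
    """Drop records whose window has closed and log each removal as an event."""
    kept = []
    for record in records:
        if record.expires_on <= today:
            audit_log.append({
                "event": "deaccessioned",
                "record_id": record.record_id,
                "policy": record.policy,
                "on": today.isoformat(),
            })
        else:
            kept.append(record)
    return kept


audit_log = []
records = [
    SensitiveRecord("m-1", "no_price_influence", date(2026, 6, 1)),
    SensitiveRecord("m-2", "no_price_influence", date(2027, 1, 1)),
]
records = deaccession_due(records, date(2026, 6, 13), audit_log)

print([r.record_id for r in records])       # ['m-2']
print([e["record_id"] for e in audit_log])  # ['m-1']
```

The audit entry keeps the policy verb and the date, not the content of what was removed, so the proof of deletion does not become a copy of the deleted thing.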

Why it is a legal inheritance

You could read all of this as borrowed vocabulary, and a skeptic would be right to try. The strong claim is about law, and it rests on the book I opened with. Berman’s was not a fringe complaint that went nowhere. It became a reform movement that ran for decades, and in 2002 Hope Olson’s The Power to Name turned the grievance into a theory: classification marginalizes through the architecture of how categories get formed. The finding that should stop an engineer cold is this. A subject heading feels like a fact about the world. It is actually an exercise of power wearing the costume of a fact.

A learned proxy is the same maneuver in a newer medium. It looks like math, it feels inevitable, and it hands out different treatment along protected lines while insisting it is only optimizing. Which is why the controls regulators now demand for AI map nearly line for line onto the controls the library world already built. Authority control with an admitting authority is documented ownership of who may decide a category. The subject-heading change process, where altering a heading requires a written case that the current term is wrong or harmful, is the mechanism for contesting an automated decision that the law now obliges you to provide. The descriptive-versus-subject split is the evidence-versus-inference boundary behind explainability. None of that overlap is coincidence. Both fields are trying to stop the same wrong.

The catalog can’t watch its own outcomes

Everything above is preventive, and preventive controls catch only what you declared. The proxy that gets you is the one nobody enumerated. Worse, the model mints fresh proxies turn by turn, faster than any vocabulary committee could meet. Berman’s collection arrived through the front door at the pace of acquisitions and did not fight back. Yours does. No cataloger ever had to defend a collection that grows new biased categories while everyone sleeps.

So you add an empirical monitor, another job on the maintenance path beside the weeding. Measure decisions across cohorts. Watch for outcomes diverging along protected lines. Feed what you find back into the vocabulary as a newly declared edge. The catalog tells you what you know. The monitor tells you what you missed.

[Figure: Two layers stop your agent’s memory from discriminating]

Two things make the monitor hard, and neither is in any library textbook. The first: it is a statistics problem wearing a compliance costume. The four-fifths rule everyone reaches for is a screening heuristic, not a test, unstable at small samples and blind to significance, and there is a 2024 paper out of the fairness community titled, almost too neatly, “The four-fifths rule is not disparate impact.” Treat the heuristic as the test and you hand yourself confident wrong answers in both directions. The second is nastier: measuring discrimination needs the very attribute you refused to store. That forces a firewall, where the cohort signal lives transiently, in a separate enclave, aggregate-only, never joined back to a profile. Get it wrong and the monitor you built to reduce liability becomes a fresh pile of special-category data. The tail eats itself, and this smallest box on the diagram carries the most risk.
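To make "wrong in both directions" concrete, here is a Python sketch that computes the four-fifths ratio next to a pooled two-proportion z-test, working only from aggregate counts per cohort. The counts are invented for illustration, and the z-test is one possible companion check, not a legal standard. It is itself a normal approximation and is weak at very small samples, where an exact test would be the safer choice.

```python
import math


def impact_ratio(sel_a: int, n_a: int, sel_b: int, n_b: int) -> float:
    """Lower selection rate divided by the higher one (the four-fifths screen)."""
    rate_a, rate_b = sel_a / n_a, sel_b / n_b
    return min(rate_a, rate_b) / max(rate_a, rate_b)


def two_proportion_p_value(sel_a: int, n_a: int, sel_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (sel_a + sel_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (sel_a / n_a - sel_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


# Invented aggregate counts: (selected, total) for cohort A, then cohort B.
small = (3, 10, 5, 10)            # tiny sample
large = (4100, 10000, 5000, 10000)  # large sample

for name, counts in [("small", small), ("large", large)]:
    ratio = impact_ratio(*counts)
    p = two_proportion_p_value(*counts)
    flagged = ratio < 0.8
    print(f"{name}: ratio={ratio:.2f} flagged_by_four_fifths={flagged} p={p:.3g}")
```

On these invented counts the small sample fails the four-fifths screen while the difference is statistically indistinguishable from noise, and the large sample passes the screen while the difference is highly significant. The functions take counts, not rows, which matches the aggregate-only firewall described above.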

Here the scope note pays off. The monitor is the layer that needs a population. A true n=1 assistant can drop it and keep everything before it, running the catalog for privacy and accuracy alone. The personal-assistant product, with its millions of “personal” sessions, needs all of it.

The actual thesis

A capable team rebuilds the bones of library science whether or not it has heard of them, because persistent memory is a cataloging problem and cataloging problems come pre-shaped. The expensive knowledge is not the data structure. It is the knowledge of how classification harms the classified while disguising itself as fact, and it cost the library world fifty public years to buy.

When you find a new proxy, you do not quietly patch a model. You admit a term, version the vocabulary, and re-evaluate what was filed under the old one — the same write, the same maintenance path, the same memory system from the first piece, now doing the one job that keeps it out of court. The library world calls it authority work. You will call it compliance. Berman would recognize it on sight, because it is the work he gave his life to. You do not have to repeat the fifty years. That is the entire point of a discipline. The cataloger in the machine has been waiting for you, and she already knows where the bodies are buried, because she is the one who dug them up.

Notes on scope: this essay treats anti-discrimination and data-protection concepts as design drivers and historical parallels, not as legal advice. Which regimes bite in which domains, the current state of AI-specific obligations, and the US state-versus-federal picture are genuinely in flux and should be checked against current sources and counsel before you rely on any of it. The structural argument is stable. The regulatory particulars are not.

References

Artur Ciocanu — Your agent’s memory problem is an information architecture problem — the first half of this argument: the write/read/maintenance architecture, authority control, appraisal, and weeding.
Sanford Berman (1971), Prejudices and Antipathies: A Tract on the LC Subject Heads Concerning People. Scarecrow Press. Internet Archive.
Hope A. Olson (2002), The Power to Name: Locating the Limits of Subject Representation in Libraries. Kluwer Academic. Overview.
Griggs v. Duke Power Co., 401 U.S. 424 (1971) — the origin of disparate-impact doctrine under Title VII; “fair in form but discriminatory in operation.” Justia.
EEOC (1978), Uniform Guidelines on Employee Selection Procedures (29 CFR Part 1607) — the source of the four-fifths / 80% rule.
Watkins, E. A., et al. (2024), The four-fifths rule is not disparate impact: a woeful tale of epistemic trespassing in algorithmic fairness, ACM FAccT ’24 (arXiv:2202.09519).
Warrant theory — Mario Barité, Literary warrant, ISKO Encyclopedia of Knowledge Organization; the concept originates with E. Wyndham Hulme (1911), and the literary-vs-use distinction is what this essay leans on.
T. R. Schellenberg (1956), The Appraisal of Modern Public Records — archival appraisal, the basis for the “never-form” decision.
Texas State Library, CREW: A Weeding Manual for Modern Libraries — the MUSTIE weeding criteria behind deliberate, scheduled deaccession.

Your agent’s memory problem is an information architecture problem

Artur — Tue, 26 May 2026 20:00:57 GMT

I have been building agent systems for a while now, and I have been thinking about memory wrong. Not because I didn’t care about it — I did — but because I was reaching for the wrong mental models. I suspect most of the industry is making the same mistake, and the purpose of this essay is to trace how I arrived at that suspicion and what I found when I followed it.

The starting point was a simple observation: every agent framework ships a memory module, and almost all of them are thin wrappers around vector stores. Embed, index, retrieve by similarity. The consensus is that RAG solved retrieval, and retrieval solved memory. For demos, this holds. For anything that needs to persist knowledge across sessions — actually know things, detect contradictions, decide what to keep and what to discard — the consensus falls apart quickly.

Here is one way to see the crack. “Hybrid search” is now standard practice across the vector database ecosystem. Pinecone, Weaviate, Qdrant — they all combine semantic similarity with BM25 keyword matching. That combination gets marketed as innovation, but think about what the admission actually means: pure similarity wasn’t enough, so they bolted on a technique from the 1990s. If your cutting-edge retrieval system needs a thirty-year-old algorithm as a crutch, maybe similarity was never the right primitive for knowledge in the first place.

Retrieval is not memory. Similarity is not meaning. Cosine distance is not knowledge.

That observation sent me down a path I didn’t expect. The path led through computer science and cognitive science — the two disciplines the industry reaches for when thinking about agent memory — and then, surprisingly, out the other side into Library Science, Information Science, and Knowledge Engineering. This essay traces that path.

I should say upfront: this thesis was shaped by the work of two people I want to credit explicitly. Jessica Talisman has been arguing from the Library Science side that enterprises outsourced knowledge work and now lack the infrastructure AI needs. Kurt Cagle has been making the case from ontology engineering that agents maintaining state are making ontological commitments, and most do it accidentally. Both were saying this before it was fashionable.

What computer science actually gives us

Let me start with what CS gets right, because the critique only works if the credit is honest.

Computer science gave us B-trees, hash maps, LSM trees, vector indices, transaction isolation, query optimization. These are real contributions — load-bearing infrastructure that nobody is building agent memory without. The question is not whether CS matters. It does. The question is whether the primitives CS provides are sufficient for the problem of persistent agent knowledge. I think the answer is no.

The most interesting recent work from the CS camp borrows from operating systems. Letta (formerly MemGPT — the paper is literally titled “MemGPT: Towards LLMs as Operating Systems”) treats the context window as virtual memory with two tiers: core memory that the agent can edit during conversations, and archival memory that is searchable but out of context. The agent pages between them using tools like memory_insert, memory_replace, and archival_memory_search.

Letta’s self-editing memory is genuinely clever — it gives agents agency over their own context. The agent decides what to remember, what to update, what to search for. That is real innovation.

But here is what kept nagging me. Virtual memory manages space. It answers “what fits in the window right now?” using access recency and frequency. It does not answer “is this fact still true?” or “does this contradict something else I know?” or “where did this come from, and how much should I trust it?”

LRU eviction doesn’t know that a user-stated budget constraint is more important than an agent-inferred style preference. Both are pages. One is load-bearing. The other is speculative. An eviction policy that treats them identically will eventually evict the wrong one.
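The failure is easy to reproduce in a few lines of Python. This is a toy model, not Letta’s implementation: the page structure and the two victim-selection functions are assumptions, and the trust weights are the 1.0 and 0.6 figures this essay uses later for user-stated and agent-inferred facts.

```python
from collections import OrderedDict

# Trust weights as used later in this essay for these two provenance levels.
TRUST = {"user_declared": 1.0, "agent_inferred": 0.6}


def lru_victim(pages: OrderedDict) -> str:
    """Least recently used key: the first entry in access order."""
    return next(iter(pages))


def provenance_victim(pages: OrderedDict) -> str:
    """Lowest-trust key; ties fall back to least recently used."""
    return min(pages, key=lambda k: TRUST[pages[k]["provenance"]])


pages = OrderedDict()
pages["budget_max"] = {"value": 500, "provenance": "user_declared"}
pages["style_pref"] = {"value": "minimalist", "provenance": "agent_inferred"}
pages.move_to_end("style_pref")  # the style preference was touched most recently

print(lru_victim(pages))         # budget_max
print(provenance_victim(pages))  # style_pref
```

Recency alone evicts the user’s stated constraint. A policy that can see provenance evicts the speculative inference instead.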

There is no controlled vocabulary to normalize concepts — “dark mode” and “night mode” may live as two separate entries. There is no provenance hierarchy to distinguish user-stated facts from agent inferences. There is no appraisal system to evaluate whether a fact is worth keeping based on uniqueness, actionability, or sensitivity. If contradictory facts end up in archival memory, there is no mechanism to detect the contradiction. There is no principled strategy for what should be discarded and why.

These OS primitives are excellent low-level building blocks. They’d work better with a proper vocabulary layer, provenance hierarchy, and appraisal system feeding the agent’s decisions about what to keep in core memory. The problem is not that Letta exists. The problem is treating context window management as the entire memory architecture when it is one layer of a larger system.

Someone might reasonably object: “But databases have schemas, and schemas impose structure.” Fair point. But a schema describes the shape of data, not its meaning. A memories table with content, embedding, and timestamp tells you nothing about whether those memories are facts, preferences, constraints, or contradictions. That distinction matters, and CS doesn’t make it. Shape without semantics is a filing cabinet without labels.

The more I sat with this, the clearer it became: CS provides containers. It tells you how to store and retrieve data efficiently. It doesn’t tell you what the data means, how it relates to other data, or how someone will need to find it in a context you can’t predict at design time. These are different problems. The first is engineering. The second is something I didn’t have a name for yet.

What cognitive science offered (and where it went wrong)

The second discipline the industry reaches for is cognitive science. Endel Tulving’s 1972 taxonomy — episodic versus semantic memory — was a genuine breakthrough in understanding human cognition. The AI community borrowed it wholesale: agents need “episodic memory” for experiences, “semantic memory” for facts, “procedural memory” for skills. The taxonomy gave teams a vocabulary and the vocabulary felt like a design.

Mem0 is the most prominent example. Its documentation explicitly uses the CogSci taxonomy — “semantic (facts), episodic (interactions), and procedural (styles) memory.” Under the hood, an LLM extracts “memories” from conversations, stores them as text with vector embeddings, and retrieves by semantic similarity.

What is instructive is how Mem0 has evolved. V1 gave the LLM four operations — ADD, UPDATE, DELETE, NOOP — so it could, in theory, detect conflicts and update existing memories. In practice, this was entirely LLM-mediated: it worked when the model noticed a contradiction, and silently failed when it didn’t.

Mem0 v3, released in 2026, made a deliberate architectural choice: drop UPDATE and DELETE entirely. ADD-only. Store everything, resolve contradictions at retrieval through ranking. From their migration docs: “When information changes (e.g., a user moves from New York to San Francisco), both facts are preserved with temporal context.” The community pushed back. GitHub issue #4896 documented the failure (“my name is Alice” followed by “my name is Bob” yields two stored facts, both retrieved with similar scores). Issue #4904 proposed a concrete fix with a full TDD plan to reintroduce the UPDATE path via cosine similarity. Both were declined. The resolution pressure didn’t disappear — it migrated to the skills layer, where memory_update now handles in-place edits — but at the core extraction level, ADD-only stands.

To be fair about what v3 improved: entity linking across memories, hybrid retrieval combining semantic, keyword, and entity signals, temporal reasoning for time-aware queries, and strong benchmark results (91.6 on LoCoMo, 93.4 on LongMemEval). These are real improvements, and the engineering is solid.

But the core philosophical bet is now explicit: store everything, resolve at retrieval. I will get to why I think this is backwards shortly. For now, note the trajectory: v1 delegated conflict resolution to the LLM (probabilistic), v3 abandoned write-time resolution entirely. The direction is toward less structure at write time, not more.

I kept turning this over. The deeper problem is that the mapping from human memory to agent memory is structurally wrong because the design requirements are opposite. Human memory is reconstructive — we rebuild narratives from fragments. Agent knowledge should be authoritative — the stored fact should be the fact. Human memory is lossy by design — forgetting enables generalization. Agent knowledge should be versioned — old values archived, not lost. Human memory is subjective — the same event is remembered differently by different people. Agent knowledge should be consistent — the same query should return the same fact. Human memory tolerates contradiction. Agent knowledge must detect and resolve conflicts.

Borrowing a taxonomy designed to describe a lossy reconstructive system and using it as a blueprint for a system that needs to be precise and reliable — that is not interdisciplinary thinking. That is anthropomorphization dressed up as architecture.

The CogSci labels gave teams a way to name their modules (“let’s build the episodic memory component”) without giving them a methodology for deciding what knowledge to persist, how to structure it, how to maintain it over time, or how to handle two facts that contradict. The labels created the illusion of having a design when what they had was a metaphor. Mem0’s evolution from v1 to v3 illustrates this: each step moved further away from principled knowledge management, not toward it.

This diagnosis didn’t originate with me. Jessica Talisman has been arguing from the Library Science side that enterprises underinvested in the knowledge infrastructure AI needs to function reliably. Her core concept of intentional arrangement (deliberately deciding how knowledge should be classified, related, and retrieved) stands in direct contrast to the “embed everything and search” approach. Kurt Cagle has been making the case from Ontology Engineering that every agent maintaining state is making ontological commitments, and most do it accidentally, in JSON blobs.

The turn I didn’t expect

If CS gives you containers without content architecture, and CogSci gives you labels without methodology, where do you find both?

The answer, once I found it, felt almost embarrassingly obvious. The discipline that has been solving the problem of “how do you classify, organize, relate, store, and retrieve knowledge so that someone can find what they need in a context you can’t predict” — for over a century — is Library Science. And its adjacent fields: Information Science, Knowledge Engineering, Ontology Engineering.

But first, a reframe that changes everything.

An agentic memory system is not a brain simulator. It is a Customer Data Platform where the channels are agents and the signals include natural language. The agent doesn’t have “a memory.” The user has a profile. Agents are channels that read from and write signals to it. This replaces cognitive metaphors with data engineering patterns that have been battle-tested for decades: identity resolution, signal hierarchies, golden records, traits versus events, computed attributes.

One clarification worth making explicit: this article addresses one specific layer — the persistent knowledge profile for users. What the system knows about the user across sessions, how it’s structured, how it’s maintained, how it’s retrieved. There is a separate and genuinely interesting question about agent identity — giving agents a consistent reasoning style, evolving beliefs, and disposition parameters that shape how they interpret facts. Hindsight’s CARA component tackles this with configurable skepticism, literalism, and empathy dimensions. For multi-tenant agent systems where different agents need different reasoning personalities over the same user knowledge, that’s a real problem worth solving. But these are complementary layers. This article is about the first one.

The disciplines we should have been reading

Library Science — intentional arrangement

Talisman’s core concept. Library and Information Science organizes knowledge through intentional arrangement — deliberately deciding how knowledge should be classified, related, and retrieved. Not metadata-as-afterthought. Metadata-as-architecture.

What it contributes to agent memory:

Archival appraisal (Schellenberg, 1956) is value judgment at write time. Not “store everything and search later” — decide at ingestion whether something is worth keeping, based on uniqueness, evidential value, and actionability. A fact like “I have a severe peanut allergy” scores differently from “show me the blue one.” The system should know that at write time, not discover it during retrieval.

CREW/MUSTIE weeding provides systematic criteria for what to discard — Misleading, Ugly, Superseded, Trivial, Irrelevant, Elsewhere. Agents need to forget deliberately, not through cache eviction. LRU is not a knowledge management strategy.

Faceted classification (Ranganathan, 1933) offers multi-dimensional classification composed from independent facets, not pre-enumerated categories. Domain concepts multiplied by value types multiplied by provenance levels — composable, not combinatorial. An agent’s working vocabulary about one user is small (30 to 300 concepts per domain), not the 400K headings of the Library of Congress.

Authority control ensures concept normalization through controlled vocabulary. Without it, “dark mode,” “night mode,” and “dark theme” are three different memories instead of one concept with three surface forms. With it, they all resolve to a single canonical concept, and deduplication is exact, not probabilistic.

The Reference Interview (Taylor, 1968) models the gap between the stated question and the actual need. When an agent asks “what do I know about this user?” it needs a structured retrieval spec, not a vector similarity search. Taylor identified four levels of need — visceral, conscious, formalized, compromised — and the formalization step is exactly what a read path should perform.
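Of the items above, authority control is the easiest to show in code. A minimal Python sketch, assuming a hand-maintained synonym table: the canonical identifier ui.theme.dark, the table, and the helper names are hypothetical.

```python
from typing import Optional

# Hypothetical authority file: surface forms mapped to one canonical concept.
SURFACE_FORMS = {
    "dark mode": "ui.theme.dark",
    "night mode": "ui.theme.dark",
    "dark theme": "ui.theme.dark",
}

memory: dict = {}


def normalize(term: str) -> Optional[str]:
    """Resolve a surface form to its canonical concept, ignoring case and spacing."""
    return SURFACE_FORMS.get(" ".join(term.lower().split()))


def remember(term: str, value) -> bool:
    concept = normalize(term)
    if concept is None:
        return False  # unknown term: send to vocabulary review rather than guess
    memory[concept] = {"value": value, "surface_form": term}
    return True


for phrase in ["Dark Mode", "night mode", "dark  theme"]:
    remember(phrase, True)

print(len(memory))   # 1
print(list(memory))  # ['ui.theme.dark']
```

Three phrasings produce one record because the key is the concept, not the string, so deduplication is a dictionary lookup rather than a similarity threshold.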

Knowledge and ontology engineering

This is Cagle’s territory. Every agent that maintains state makes ontological commitments — what exists in its domain, what properties those things have, what relationships connect them. Most agents do this accidentally, in ad-hoc key-value pairs and JSON blobs. What happens when you do it intentionally: you get a vocabulary layer with hierarchical concepts, scope notes, synonym mappings, and lifecycle management.

Cagle’s persistent point: knowledge graphs are mature infrastructure, not hype. They are one of the older data structures in computing. And they are what LLMs actually need underneath — not as a replacement for the LLM, but as the structured knowledge layer the LLM reads from and writes to.

Data management — the boring brilliance

The patterns that make persistent knowledge reliable:

SCD Type 2 temporal versioning preserves full history with zero information loss. When a user’s budget changes from $300 to $500, the old value is not deleted — it gets a valid_to timestamp. Any previous state is recoverable.
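As a rough illustration of the budget example, here is an in-memory sketch of SCD Type 2 versioning. The `FactVersion` class and the `supersede` and `as_of` helpers are names I made up for this sketch; in practice this would be rows in a table with `valid_from` and `valid_to` columns.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class FactVersion:
    concept: str
    value: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None marks the current version


def supersede(history: list[FactVersion], concept: str, value: str, now: datetime) -> None:
    """Close the current version of a concept, then append the new one."""
    for version in history:
        if version.concept == concept and version.valid_to is None:
            version.valid_to = now
    history.append(FactVersion(concept, value, valid_from=now))


def as_of(history: list[FactVersion], concept: str, when: datetime) -> Optional[str]:
    """Return the value that was current at a given moment, if any."""
    for version in history:
        if version.concept != concept:
            continue
        if version.valid_from <= when and (version.valid_to is None or when < version.valid_to):
            return version.value
    return None


history: list[FactVersion] = []
supersede(history, "budget.max", "$300", datetime(2026, 1, 10))
supersede(history, "budget.max", "$500", datetime(2026, 3, 2))

print(as_of(history, "budget.max", datetime(2026, 2, 1)))  # $300
print(as_of(history, "budget.max", datetime(2026, 4, 1)))  # $500
```

Nothing is overwritten: the $300 row stays in the history with a closing timestamp, so any earlier state can be reconstructed by date.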

Cascade invalidation via foreign keys means when a parent fact changes, derived facts are marked for re-evaluation automatically. If a computed trait (“prefers minimalist style”) was derived from three rejection events and those events are reassessed, the derived trait gets flagged.
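One way to picture the cascade, sketched in memory with parent ids standing in for foreign keys. The `Fact` shape, the `needs_review` flag, and the `invalidate` function are assumptions for illustration; a database version would do the same walk over a dependency table.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    fact_id: str
    statement: str
    derived_from: list[str] = field(default_factory=list)  # parent fact ids
    needs_review: bool = False


def invalidate(store: dict[str, Fact], changed_id: str) -> list[str]:
    """Flag every fact that depends, directly or transitively, on changed_id."""
    flagged: list[str] = []
    frontier = [changed_id]
    while frontier:
        parent = frontier.pop()
        for fact in store.values():
            if parent in fact.derived_from and not fact.needs_review:
                fact.needs_review = True
                flagged.append(fact.fact_id)
                frontier.append(fact.fact_id)  # its own dependents come next
    return flagged


store = {
    "e1": Fact("e1", "rejected ornate lamp"),
    "e2": Fact("e2", "rejected patterned rug"),
    "e3": Fact("e3", "rejected carved table"),
    "t1": Fact("t1", "prefers minimalist style", derived_from=["e1", "e2", "e3"]),
    "t2": Fact("t2", "recommend plain furniture first", derived_from=["t1"]),
}

flagged = invalidate(store, "e2")
print(flagged)  # ['t1', 't2']
```

Reassessing one rejection event flags the derived trait and anything built on that trait. The events themselves are untouched.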

Provenance-weighted retrieval weights user-stated facts at 1.0 and agent-inferred facts at 0.6, so a stated fact outranks an inferred one of equal appraisal value. The signal hierarchy (user_declared, agent_observed, tool_returned, agent_inferred, computed) determines trust, not recency.

UPSERT semantics combined with controlled vocabulary make deduplication exact. No near-duplicate detection, no probabilistic matching.

Constraints and conflict detection at write time catch two contradictory facts on the same concept at the database layer, not during the agent’s mid-conversation reasoning.
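To make "at the database layer" concrete, here is a small sketch using SQLite from Python's standard library. The `traits` table, the index name, and `write_trait` are hypothetical; the only real mechanism on display is a partial unique index that allows at most one current value per user and concept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE traits (
        user_id  TEXT NOT NULL,
        concept  TEXT NOT NULL,
        value    TEXT NOT NULL,
        valid_to TEXT            -- NULL marks the current version
    );
    -- At most one current value per user and concept.
    CREATE UNIQUE INDEX one_current_value
        ON traits (user_id, concept) WHERE valid_to IS NULL;
""")


def write_trait(user_id: str, concept: str, value: str) -> str:
    """Insert a current trait; report a conflict instead of storing two."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO traits (user_id, concept, value) VALUES (?, ?, ?)",
                (user_id, concept, value),
            )
        return "written"
    except sqlite3.IntegrityError:
        return "conflict"


first = write_trait("u1", "person.name", "Alice")
second = write_trait("u1", "person.name", "Bob")
print(first, second)  # written conflict
```

The second write fails loudly instead of leaving two current names in the store. A fuller write path would catch that conflict and resolve it according to the concept's policy; the point of the sketch is only that the contradiction surfaces before the agent ever reads it.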

Agent memory systems have all the data management problems that databases solved decades ago — and ignore all the solutions because “we’re doing AI, not database work.”

The write path, read path, and maintenance path. Most steps are deterministic. The LLM classifies within a framework — it doesn’t architect freeform memories.

The three flows

How these principles translate into architecture. I want to stay at the principle level — not a specific database, but the general direction.

The write path

Three steps. First, detect candidate signals — rule-based, no LLM, cheap. Pattern matching identifies preference statements, corrections, constraints, goals. High recall, low precision — it is cheap to over-detect because the next step filters.
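The detection step can be as plain as a handful of regular expressions. The patterns below are illustrative guesses, not a tested rule set, and `detect_signals` is a name invented for this sketch. They are deliberately loose, since the next step filters.

```python
import re

# Hypothetical patterns; tuned for recall, not precision.
SIGNAL_PATTERNS = {
    "preference": re.compile(r"\bI (?:like|love|prefer|hate|dislike)\b", re.IGNORECASE),
    "correction": re.compile(r"\b(?:actually|no, I meant|that's wrong)\b", re.IGNORECASE),
    "constraint": re.compile(r"\b(?:nothing over|no more than|at most|allergic to|allergy)\b", re.IGNORECASE),
    "goal": re.compile(r"\bI(?:'m| am) (?:trying|planning|looking) to\b", re.IGNORECASE),
}


def detect_signals(utterance: str) -> list[str]:
    """Return every signal type whose pattern matches the utterance."""
    return [name for name, pattern in SIGNAL_PATTERNS.items() if pattern.search(utterance)]


print(detect_signals("Nothing over five hundred, please"))  # ['constraint']
print(detect_signals("Actually, I prefer the blue one"))    # ['preference', 'correction']
print(detect_signals("What time is it?"))                   # []
```

No model call, no network, and a false positive costs one extra candidate for the structured step to reject.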

Second, normalize, appraise, and conflict-check — one structured LLM call acting as a librarian. Normalize the input to a controlled vocabulary (authority control). Extract a canonical value while preserving the original utterance — the user said “nothing over five hundred,” the system stores “Maximum budget: $500,” and both are preserved. Appraise on five dimensions: uniqueness, replaceability, actionability, stability, sensitivity. Check for conflicts with existing facts on the same concept.

Third, deterministic write — UPSERT for traits, APPEND for events. The schema enforces structure. No LLM in the write step.

The UPSERT/APPEND distinction deserves a closer look, because it’s where the thesis becomes concrete. When a new value arrives for the same concept as an existing value, is the old value now false or now historical? “My name is Alice” followed by “my name is Bob” — the old value is false. A person has one current name. Both stored means retrieval poisoning. “I live in New York” followed by “I moved to San Francisco” — the old value is historical. Both are true, time-scoped. Both stored means correct temporal reasoning.

Mem0’s ADD-only approach doesn’t model this distinction. It appends always. It’s right by luck on the location case and wrong by luck on the name case — and they market the case where the bug looks like a feature (“both facts preserved with temporal context”). An architecture with a vocabulary layer decides on purpose, per concept, at write time: supersede-with-history (new value current, old gets valid_to, both recoverable) or append-only (every value permanently true). The LLM never makes the resolution decision. It makes a classification (which concept), and resolution is a deterministic property of where the fact was filed.

The vocabulary carries this temporal semantics per concept. At least three classes: mutate-in-place (typo corrections, scratch values), supersede-with-history (name, budget, address — SCD Type 2), and append-only event stream (purchases, interactions, rejections). The vocabulary isn’t frozen at design time either. It grows through a governed lifecycle: LLM proposes candidate concepts, a review process admits or maps them, and a baking period accumulates evidence (frequency, observed cardinality, synonym collapse) before promotion. This is how library authority files have always worked — LC’s SACO program is exactly a propose-review-admit pipeline for new headings. The governance is the part libraries spent a century building.
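A sketch of resolution as a property of the concept, with the three classes above as an enum. The vocabulary entries, the integer clock, and the `write` function are assumptions made for the example. The important part is that the caller supplies a concept and a value and never chooses how the conflict resolves.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Policy(Enum):
    MUTATE_IN_PLACE = "mutate_in_place"
    SUPERSEDE_WITH_HISTORY = "supersede_with_history"
    APPEND_ONLY = "append_only"


# Hypothetical vocabulary entries: the policy belongs to the concept.
VOCABULARY = {
    "scratch.note": Policy.MUTATE_IN_PLACE,
    "person.name": Policy.SUPERSEDE_WITH_HISTORY,
    "purchase.event": Policy.APPEND_ONLY,
}


@dataclass
class Record:
    concept: str
    value: str
    valid_from: int
    valid_to: Optional[int] = None  # None marks a current record


def write(store: list[Record], concept: str, value: str, now: int) -> None:
    """Apply the concept's policy; the caller never picks the resolution."""
    policy = VOCABULARY[concept]  # a KeyError here means out of vocabulary
    current = [r for r in store if r.concept == concept and r.valid_to is None]
    if policy is Policy.MUTATE_IN_PLACE and current:
        current[0].value = value  # overwrite, no history kept
        return
    if policy is Policy.SUPERSEDE_WITH_HISTORY:
        for record in current:
            record.valid_to = now  # old value stays recoverable
    store.append(Record(concept, value, valid_from=now))


store: list[Record] = []
write(store, "person.name", "Alice", now=1)
write(store, "person.name", "Bob", now=2)
write(store, "purchase.event", "lamp", now=3)
write(store, "purchase.event", "rug", now=4)

current_names = [r.value for r in store if r.concept == "person.name" and r.valid_to is None]
print(current_names)  # ['Bob']
```

One current name, two purchases, and the difference comes entirely from where each fact was filed.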

The LLM’s role here is classifier and cataloger, not reasoner. Classification-grade, not reasoning-grade. You don’t need a frontier model for the memory subsystem. You need reliable structured output and a good rubric.

A careful reader will notice I just criticized Mem0 for relying on LLM-mediated conflict detection — and then proposed an architecture that also relies on an LLM during ingestion. That tension deserves to be named, not hidden.

The difference, I would argue, is structural. The LLM operates as a classifier within an explicit framework — a bounded vocabulary of 50 to 200 concepts, an appraisal rubric with defined dimensions, existing facts injected as comparison context. The framework constrains the LLM’s judgment; database constraints enforce the output after. The LLM proposes; the schema enforces.

But the weakness is real. The LLM can mis-classify, mis-appraise, or miss conflicts. The difference from Mem0 isn’t “LLM versus no LLM” — it’s “uncertainty observable versus uncertainty invisible.” When the LLM-as-librarian can’t confidently classify a signal, that failure is typed: a low-confidence classification (plausible concept exists, LLM isn’t sure) routes to adjudication against the existing vocabulary. An out-of-vocabulary signal (no concept exists) routes to the promotion pipeline as evidence the vocabulary is incomplete. Both go to a dead-letter queue where they’re visible, measurable, and drainable. Mem0’s Alice/Bob contradiction doesn’t error, flag, or queue — it succeeds wrongly. ADD-only with MD5 dedup has no place to admit something didn’t classify cleanly, so it doesn’t. A DLQ that nobody drains is slow data loss, not zero data loss. The true claim is visibility, not perfection — but visibility is the precondition for fixing.
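Typed failure can be sketched in a few lines. The confidence threshold, the `Classification` shape, and the destination names are all assumptions for illustration; what matters is that both failure modes land somewhere countable.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_FLOOR = 0.7  # hypothetical threshold, a tunable setting


@dataclass
class Classification:
    utterance: str
    concept: Optional[str]  # None when no vocabulary concept fits
    confidence: float


def route(result: Classification, dead_letters: list[dict]) -> str:
    """Return the destination for a classification, queuing typed failures."""
    if result.concept is None:
        dead_letters.append({"type": "out_of_vocabulary", "item": result})
        return "promotion_pipeline"
    if result.confidence < CONFIDENCE_FLOOR:
        dead_letters.append({"type": "low_confidence", "item": result})
        return "adjudication"
    return "write"


dead_letters: list[dict] = []
print(route(Classification("nothing over five hundred", "budget.max", 0.95), dead_letters))
print(route(Classification("keep it lowkey", "style.minimalist", 0.40), dead_letters))
print(route(Classification("I only fly on Tuesdays", None, 0.0), dead_letters))
print([entry["type"] for entry in dead_letters])
```

The queue length per type is the metric: a growing `out_of_vocabulary` count says the vocabulary is behind, and a growing `low_confidence` count says the rubric or the model is.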

In Library Science, this curation was done by trained professionals who understood classification theory, authority control, and their specific domain. No modern library has a human cataloging every item from scratch. They use automated classification, vendor-supplied records, and copy cataloging, but always within a framework of authority files and classification schemes. The LLM-as-librarian is the next step in that trajectory. It is a bet, and it should be named as such.

The read path

Four steps. Query formulation translates the agent’s raw need into a structured retrieval spec by domain, provenance level, and concept type. This is Taylor’s reference interview formalized: translate “help the user pick a thing” into a precise retrieval specification. Not “embed the query and find nearest neighbors.”

Retrieval is a parameterized query against the fact store, filtered by domain, provenance, and appraisal value. No LLM.

Deterministic ranking scores by appraisal value multiplied by provenance weight. Tunable configuration, not a learned parameter.

Frame composition groups facts by provenance so the consuming agent can see trust levels: confirmed facts (user-stated, high confidence) separated from observed patterns (behavioral, moderate confidence) separated from tool-provided context. No summarization. No “the LLM condensed your memories into a paragraph.” A view, not a lossy compression. No information is lost.
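The retrieval, ranking, and frame-composition steps fit in one short sketch. The weights for `user_declared` and `agent_inferred` are the ones quoted earlier; the other three weights, the `Fact` shape, and the function names are illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass

# 1.0 and 0.6 come from the text; the middle values are assumed.
PROVENANCE_WEIGHT = {
    "user_declared": 1.0,
    "agent_observed": 0.8,
    "tool_returned": 0.7,
    "agent_inferred": 0.6,
    "computed": 0.5,
}


@dataclass
class Fact:
    domain: str
    concept: str
    value: str
    provenance: str
    appraisal: float  # 0.0 to 1.0, assigned at write time


def retrieve(facts: list[Fact], domain: str, min_appraisal: float) -> list[Fact]:
    """Filter by domain and appraisal, then rank by appraisal x provenance."""
    hits = [f for f in facts if f.domain == domain and f.appraisal >= min_appraisal]
    return sorted(hits, key=lambda f: f.appraisal * PROVENANCE_WEIGHT[f.provenance], reverse=True)


def compose_frame(ranked: list[Fact]) -> dict[str, list[str]]:
    """Group ranked facts by provenance without summarizing them."""
    frame: dict[str, list[str]] = defaultdict(list)
    for fact in ranked:
        frame[fact.provenance].append(f"{fact.concept} = {fact.value}")
    return dict(frame)


facts = [
    Fact("shopping", "budget.max", "$500", "user_declared", 0.9),
    Fact("shopping", "style", "minimalist", "agent_inferred", 0.9),
    Fact("shopping", "color.last_viewed", "blue", "agent_observed", 0.2),
    Fact("health", "allergy", "peanuts", "user_declared", 1.0),
]

frame = compose_frame(retrieve(facts, domain="shopping", min_appraisal=0.5))
print(frame)
```

The stated budget and the inferred style have the same appraisal value, and provenance alone puts the stated fact first. The low-value colour and the out-of-domain allergy never reach the frame, and nothing that does reach it has been paraphrased.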

Vector search is the fallback, not the primary path. When the vocabulary doesn’t cover a topic or the agent can’t formulate a structured query, semantic similarity helps find the nearest concept. Otherwise, structured retrieval wins because it is interpretable, auditable, and composable.

The maintenance path

Weeding is not hygiene. It is compliance.

A store-everything-forever architecture is not GDPR or CCPA compliant by construction. Right-to-erasure is not satisfiable by “we ranked it lower” — the fact must be provably gone, with an audit trail. Retention schedules require deaccessioning on a clock. A regulator does not accept “the embedding makes it unlikely to surface.” This is where the “boring” data management patterns stop being elegance and become compliance machinery: provenance tells you what to cascade-invalidate during an erasure request. SCD Type 2 valid_to timestamps enforce a retention clock. You literally cannot be compliant without these. Table stakes, not taste.

Beyond compliance, there is the correctness argument. Ranking-only conflict resolution assumes the ranker can always detect that two facts are about the same concept and in conflict. That detection is exactly the write-time step the store-everything camp deleted. “Just rank better” is circular — it smuggles back the conflict resolution it claimed to avoid, now at query time under latency pressure with less context. Mem0’s own issue tracker provides the proof: #4896 reports that “search returns both with similar scores, degrading retrieval quality.” That is the poisoning mechanism, stated by the reporter, confirmed by code.

MUSTIE criteria, applied as a background job, provide the principled alternative. Misleading facts that contradict a newer, higher-provenance fact — archive them. Ugly records that are malformed, partial, or corrupted — quarantine them. Superseded facts where a newer version exists — version them with SCD Type 2. Trivial facts with low value and zero access — remove them. Irrelevant facts where the user’s context has shifted — this one requires LLM judgment: “the user was planning a wedding; the wedding happened; wedding preferences are now irrelevant.” Elsewhere — facts redundant with an authoritative external source — replace with a pointer.
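Most of those criteria reduce to rules a background job can run without a model. The sketch below is one possible encoding; the field names, the rule order, and the 0.2 appraisal threshold are assumptions, and Irrelevant is left out on purpose because it needs judgment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    concept: str
    value: Optional[str]
    appraisal: float
    access_count: int
    contradicted_by_newer: bool = False
    superseded: bool = False
    external_source: Optional[str] = None


def weed(fact: Fact) -> str:
    """Return the action for one fact; rules run in a fixed order."""
    if fact.value is None or not fact.value.strip():
        return "quarantine"            # Ugly: malformed or partial
    if fact.contradicted_by_newer:
        return "archive"               # Misleading
    if fact.superseded:
        return "version"               # Superseded: close with valid_to
    if fact.appraisal < 0.2 and fact.access_count == 0:
        return "remove"                # Trivial
    if fact.external_source is not None:
        return "replace_with_pointer"  # Elsewhere
    return "keep"                      # Irrelevant is decided elsewhere, by an LLM


batch = [
    Fact("budget.max", "", 0.9, 4),
    Fact("style", "ornate", 0.6, 2, contradicted_by_newer=True),
    Fact("color.last_viewed", "blue", 0.1, 0),
    Fact("shipping.address", "on file", 0.8, 5, external_source="crm"),
]
print([weed(fact) for fact in batch])
```

Every action is a return value, which means every action can be logged, and a log of what was removed and why is the beginning of an audit trail.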

This is the part nobody builds. It is also the part that determines whether your memory system can be deployed on real user data in a regulated environment.

What the field is getting right, and what’s still missing

CS and CogSci ask the wrong questions. Library Science, Ontology Engineering, and Data Management ask the ones that directly address persistent agent knowledge.

The right questions produce the right systems. CS asks “how do I store and retrieve this efficiently?” — necessary but not sufficient. CogSci asks “how does a human remember this?” — interesting but misleading. Library Science asks “what is this, how does it relate to other things, and how will someone need to find it?” Ontology Engineering asks “what commitments am I making about the structure of this domain?” Data Management asks “how do I keep this consistent and reliable as it changes?” The last three directly address the problem of persistent, structured, reliable agent knowledge.

Not every existing system ignores these questions. Hindsight, from Vectorize.io (paper co-authored with Virginia Tech and The Washington Post), is the strongest existing system relative to this thesis — and it’s more than a counterpoint. It’s convergent evidence.

Hindsight organizes memory into four networks — World, Experience, Opinion, and Observation — that distinguish types of knowledge structurally. It performs entity resolution to canonicalize mentions. It runs four-way parallel retrieval — semantic, BM25, graph traversal, and temporal — fused with Reciprocal Rank Fusion and neural reranking. Its observation consolidation is functionally similar to materialized views. Its opinion evolution with confidence scores is a form of re-appraisal. The benchmark results are strong: 91.4% on LongMemEval with a frontier backbone (83.6% with the open-source 20B model), outperforming full-context GPT-4o.
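For readers who have not met Reciprocal Rank Fusion, here is the standard formula in a few lines: each ranked list contributes 1 / (k + rank) to a document's score. This is the generic technique with the conventional k = 60, not Hindsight's code, and the memory ids and four input lists are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from four retrievers over the same memory store.
semantic = ["m3", "m1", "m2"]
bm25 = ["m1", "m3", "m4"]
graph = ["m1", "m2"]
temporal = ["m4", "m1"]

fused = reciprocal_rank_fusion([semantic, bm25, graph, temporal])
print(fused)  # ['m1', 'm3', 'm4', 'm2']
```

Only ranks are used, never raw scores, which is why the method can combine retrievers whose scores live on incompatible scales.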

Here is what matters for this argument: Hindsight is write-heavy by design. Its retain() pipeline does LLM fact extraction, network classification, entity resolution to canonical entities, and four-way link construction — all at write time. Their docs state it directly: “Writes are heavier but designed for background ingestion.” The field’s best-performing system does NOT defer structure to retrieval. Mem0 defers. They diverge. The one with heavier write-time structuring posts the SOTA number.

Hindsight independently arrived at write-time structuring, epistemic separation, entity canonicalization — without citing a single library scientist. That’s the strongest possible evidence the principles are real, not borrowed. The field is rediscovering authority control, appraisal-by-type, and structured ingestion under benchmark pressure.

But the gap remains. Hindsight offers an optional controlled vocabulary (their docs describe user-defined concept sets normalized at retain time), but it’s a tuning knob, not the architectural spine. Authority control isn’t the load-bearing primitive: cardinality and temporal semantics don’t flow from it. There is no archival appraisal at write time (no Schellenberg-style value judgment on ingestion). The world/experience/opinion split is a type classification, not a trust hierarchy, so a world fact from a tool API has the same standing as one the user explicitly stated. There is no MUSTIE-style principled weeding, and none of the compliance machinery described under the maintenance path to make it lawful. Memory banks are per-agent, not per-user-across-agents, so the CDP reframing isn’t present.

What Library Science adds isn’t decoration over work that good engineers already derived. It names the things that are still missing from the best system in the field — and names them as a coherent discipline rather than a list of patches.

Could Hindsight be extended? I think so. It is already two separable components: TEMPR (the retain/recall memory infrastructure) and CARA (the reasoning/belief/personality layer). For the persistent user knowledge problem, TEMPR is the relevant foundation — you would add a vocabulary layer, appraisal, provenance, and weeding to its retain pipeline. CARA addresses the complementary problem of agent reasoning identity and could remain an optional layer for use cases where agent personality matters.

We have been here before. Every technology eventually discovers it needs information architecture. The web did — information architecture became a discipline in the early 2000s because websites built without it were unusable. Enterprise data did — master data management exists because decades of unmanaged data created expensive chaos. AI agents are next. The only question is whether we learn from those cycles or rediscover the same lessons from scratch.

Where to start

This is harder than spinning up a vector store. Ontology development takes time. Building controlled vocabularies requires domain expertise. Schema design requires upfront thought. The payoff is a knowledge layer that is interpretable, maintainable, auditable, and composable — instead of a high-dimensional prayer.

Read Jessica Talisman’s Intentional Arrangement newsletter and her Ontology Pipeline framework. Her Graph Power Hour episode on Library Science for AI systems is a direct on-ramp. Read Kurt Cagle’s The Ontologist newsletter and his work on knowledge graph architecture and ontology-driven agents. Learn the Library Science concepts that transfer directly: archival appraisal, faceted classification, authority control, MUSTIE weeding, the reference interview.

You don’t need a specialized “AI memory database.” You need structured knowledge on top of a reliable data platform — PostgreSQL, MongoDB, whatever you already run. The principles are database-agnostic.

Agents don’t need better recall. They need better librarianship. The oldest information profession has more to teach the newest than either is comfortable admitting. The tools exist. The theory exists. The practitioners exist. What’s missing is the willingness to look outside the disciplines that got us here and learn from the ones that have been organizing human knowledge since before computers existed.

References

Jessica Talisman — Semantic Engineer, Information Architect, knowledge infrastructure strategist. Intentional Arrangement (Substack) · A Library Science Approach to Enterprise AI · Graph Power Hour Ep. 9 · The Ontology Pipeline Refresh
Kurt Cagle — Ontologist, knowledge graph architect, author of The Cagle Report. The Ontologist (Substack) · The Future of Knowledge Graphs · Knowledge Graphs and AIs
Packer et al. (2023), MemGPT: Towards LLMs as Operating Systems — the paper behind Letta’s virtual memory architecture.
Chhikara et al. (2025), Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory — published at ECAI 2025.
Mem0 v3 migration guide — documents the ADD-only architectural shift.
Latimer, Boschi et al. (2025), Hindsight is 20/20: Building Agent Memory that Retains, Recalls, and Reflects — co-authored with Virginia Tech and The Washington Post. GitHub.
Tulving, E. (1972), Episodic and Semantic Memory — in Organization of Memory (pp. 381-403). Academic Press.
Schellenberg, T.R. (1956), The Appraisal of Modern Public Records — National Archives Bulletin 8.
Texas State Library, CREW: A Weeding Manual for Modern Libraries — the source of the MUSTIE framework.
Ranganathan, S.R. (1931), Five Laws of Library Science — the foundation of faceted classification.
Taylor, R.S. (1968), Question-negotiation and information-seeking in libraries — College & Research Libraries, 29(3), 178-194.
IFLA, Functional Requirements for Bibliographic Records (FRBR) — the Work / Expression / Manifestation / Item abstraction.
DCMI, Dublin Core Metadata Basics — common metadata envelope principles.

Build the Loop Once

Artur — Fri, 24 Apr 2026 12:39:27 GMT

Name a thing and you half-forget it. Engineers who can explain every layer of the TCP/IP stack talk about “the cloud” as though it were weather. People who spent years debugging distributed systems deploy “serverless” functions without asking what server they’re running on.

The abstractions are often good. The managed infrastructure saves time. But the cost doesn’t appear until something breaks, and the cost is this: you can’t debug what you never understood.

AWS Lambda landed around 2014. The pitch was clean: pay per invocation, no servers, infinite scale. By 2018 the word “serverless” had its own conference, ServerlessNYC.

Kelsey Hightower keynoted. He didn’t amplify the hype. He traced Lambda back to xinetd, the successor to inetd, the Unix superdaemon that shipped with 4.3BSD in 1986. Sits on a port. Waits for a connection. Spawns a process. Cleans up on exit. That is, in its essentials, what Lambda does. The billing model is new. The managed infrastructure matters. The architectural idea predates Lambda by nearly three decades.

His point wasn’t that Lambda was wrong. It was that when mythology outruns understanding, you lose the ability to reason about what the system is actually doing.

I think about that talk more than I probably should.

Agent hype is following the same arc. We’re roughly in the middle of it. Every week brings new frameworks, new platforms, new orchestration models. The vocabulary is elaborate: multi-agent systems, tool-calling, agentic loops, RAG pipelines, memory subsystems.

Most engineers building with this vocabulary can’t describe the mechanical loop that drives an agent. They know the behavior (give it a goal, it figures out the steps) but they haven’t seen the code. The framework became the mental model.

The problems this creates are invisible until they aren’t. Agents looping forever because the termination condition is buried in framework config. Tool calls failing silently because error handling is abstracted away. Conversation state corrupting in ways you can’t reproduce because you never saw the JSON.

Last year Thorsten Ball wrote “How to Build an Agent.” A functional coding agent: ~400 lines of Go. A loop, a few tools, an API call. Geoffrey Huntley turned this into a free workshop: build your own agentic loop before you touch a framework, because you need to see what the framework is hiding.

I ran the experiment in bash. Deliberately. Not Go, not Python, not TypeScript with a maintained SDK. bash, curl, and jq.

Why? I wanted to see the JSON. An SDK hides the HTTP request behind a method call. A framework hides conversation history behind a state object. When you write curl by hand and watch the terminal, you see exactly what crosses the wire. There is nothing to look through.

The question I wanted to answer: how much of agent complexity is intrinsic, and how much is framework?

The API call is just this:

jq -n \
  --arg model "$ANTHROPIC_MODEL" \
  --argjson messages "$(cat "$messages_file")" \
  --argjson tools "$tools_json" \
  '{model: $model, max_tokens: 2048, messages: $messages, tools: $tools}' \
| curl -sS "$ANTHROPIC_API_URL" \
    -H "x-api-key: $ANTHROPIC_API_KEY" \
    -H "anthropic-version: $ANTHROPIC_VERSION" \
    -H "content-type: application/json" \
    --data @-

jq builds the payload. curl sends it. The response is JSON you can read in a terminal. No magic.

The loop is about a dozen lines:

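# Excerpt from inside a function (hence the return); MAX_TURNS caps tool round-trips.
while (( turns < MAX_TURNS )); do
  api_call "$messages_file" > "$response_file"
  stop_reason="$(jq -r '.stop_reason' "$response_file")"

  if [[ "$stop_reason" == "tool_use" ]]; then
    # The model asked for tools: run them, feed the results back, go again.
    tool_results="$(build_tool_results "$response_file")"
    append_user_tool_results_message "$messages_file" "$tool_results"
    ((turns += 1))
    continue
  fi

  # Any other stop_reason (normally end_turn): print the reply and stop.
  extract_text_response "$response_file"
  return 0
done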

stop_reason == "tool_use" means the model wants to act. Extract tool name, run it, append result, loop. end_turn means the model is done. Print and exit. That’s the whole agent.

Tool dispatch is a case statement:

case "$tool_name" in
  read_file)  run_read_file  "$input_json" "$workspace_root" ;;
  list_files) run_list_files "$input_json" "$workspace_root" ;;
  edit_file)  run_edit_file  "$input_json" "$workspace_root" ;;
  *)          printf 'ERROR: unknown tool: %s\n' "$tool_name" ;;
esac

Three tools. Enough to do meaningful file editing. The workspace sandboxing, a function called resolve_in_workspace(), is about fifteen lines of path resolution that rejects anything escaping the working directory. Worth reading if you’d rather your agent not do unexpected things to your filesystem.

I did use jq. The alternative, parsing the Claude API response with awk and grep, was technically possible and practically inadvisable. JSON and awk have a complicated relationship. jq was the responsible choice: close enough to the metal to be educational, not so close that the exercise becomes an extended meditation on shell quoting.

The answer to my question: about 410 lines. agent.sh at 263, tools.sh at 148. Most of the complexity is framework.

Building it, the mystical feeling evaporated fast. Before I wrote it, “agent” felt like a substantial technical concept. After, it felt like a while loop. Both are true. The loop is the agent. The model is the intelligence inside the loop. The framework is scaffolding around the loop, scaffolding that hides the loop from you.

Some scaffolding is valuable. Production systems need retry logic, proper error handling, monitoring, cost controls. A shell script gives you none of that. I’m not suggesting you ship bash to production.

I’m suggesting you build the loop once, in something low-level enough to see it. After that, every framework is something you understand rather than something you trust. You know which knobs control which behavior. You know which abstractions are leaking. You know what the JSON looks like when things go wrong.

Event-driven compute goes back to the 1970s. This specific loop (call model, execute tool, feed result back) has existed since someone figured out that language models could use function call results. Neither idea requires a framework to understand.

The code is at github.com/artur-ciocanu/coding-agent-bash if you want to read it instead of writing it. Though I’d argue writing it is the point.

Build the loop once. The rest is configuration.

References

Kelsey Hightower, ServerlessNYC 2018 Keynote — traced Lambda back to xinetd. YouTube
Thorsten Ball, “How to Build an Agent” (April 2025) — a coding agent in ~400 lines of Go. ampcode.com/how-to-build-an-agent
Geoffrey Huntley, “How to Build a Coding Agent: Free Workshop” — the pedagogical case for building your own loop first. ghuntley.com/agent | GitHub repo
The Bash agent described in this post: github.com/artur-ciocanu/coding-agent-bash

The Hierarchy of Agentic Needs

Thu, 16 Apr 2026 11:41:41 GMT

Most teams building with agents right now are solving the wrong problem. Not because they’re incompetent, but because the interesting problems are at the top and the load-bearing problems are at the bottom, and human nature reliably picks interesting over load-bearing.

This is, I should point out, a well-documented pattern. Abraham Maslow proposed his hierarchy of human needs in 1943, and the core insight was almost embarrassingly simple: you have to meet physiological needs before psychological ones matter, and those before self-actualization becomes possible. The hierarchy turned out to be a flawed model of human psychology (people pursue meaning while hungry all the time), but what’s interesting is that the flaw in psychology becomes a feature in engineering. A broken foundation genuinely does guarantee a broken system. No exceptions. We have a tendency to anthropomorphize agents, which is to say, we assume they have the same flexibility as humans. They don’t. An agent isn’t a person pursuing self-actualization despite unmet needs. It’s a system where the upper layers literally cannot function correctly if the lower ones are broken. The rigidity that made Maslow’s model too simplistic for human motivation makes it an excellent model for engineering dependencies.

Monica Rogati mapped this same structural insight to data science in 2017 with her “AI Hierarchy of Needs.” Her argument was precise and, as it turned out, prophetic: the industry was obsessing over deep learning at the top of the pyramid while neglecting the data collection, storage, cleaning, and transformation layers underneath. Most organizations, she argued, should be investing in data infrastructure, not model architecture. She was right. The organizations that succeeded in the ML era weren’t the ones with the most sophisticated models. They were the ones with the cleanest data pipelines.

Adapted from Rogati’s “The AI Hierarchy of Needs” (2017). Used with attribution.

I’ve watched the same pattern play out with agents. A team invests weeks into an autonomous agent workflow (self-improving, memory-enabled, the works) and then spends a Thursday afternoon debugging why the underlying API returns inconsistent pagination tokens. The agent wasn’t broken. The foundation was. Everything above it just made the failure more expensive to diagnose. Different technology, different decade, same structural error.

So this is the claim I want to make, and I want to make it directly: there is a hierarchy of needs for agentic systems, and most teams are investing at the top while the bottom is shaky. You can’t skip layers. The industry has been told this before, in different contexts, by people who saw it more clearly and earlier. We didn’t listen then. I’m not optimistic we’ll listen now, but the argument is worth making anyway.

Here is the test. Pick the agent system you’re most proud of. Now ask yourself: if you removed every piece of autonomy, every skill, every composed workflow, if you stripped it down to raw API calls, would the foundation hold? Would the APIs return consistent data? Would auth work reliably? Would rate limits be handled gracefully rather than with retry loops and prayer? Would the data schemas be stable enough that a consumer could depend on them without defensive parsing?

If the answer is “mostly, but…” then you’ve found your actual problem. Everything above a shaky foundation isn’t a capability. It’s a liability. You’re compounding fragility, not composing intelligence. I cannot emphasize this enough: the hierarchy isn’t a taxonomy for organizing your architecture deck. It’s a diagnostic tool. You use it by looking down, not up. The question isn’t “what layer am I building at?” It’s “what layer is actually broken?”

The Foundation Layers

The hierarchy has five layers, and I’ll walk them from the bottom up, but the weight of this piece sits on the diagnostic for each layer, not the description. You probably already know what APIs and CLIs are. The interesting question is whether you’ve honestly assessed where your system actually stands.

The Hierarchy of Agentic Needs. APIs at the base, Self-Improvement at the top. Each layer depends on the ones below. Most teams invest at the top while stuck at the bottom.

At the very bottom sits raw access. REST, GraphQL, gRPC: the low-level building blocks. Getting a list of entities, CRUD on resources, auth tokens, API keys. This is the layer nobody wants to talk about at conferences, and it’s the layer that determines whether everything above it works or just appears to work. The diagnostic is simple: if the underlying API is flaky, rate-limited, or returns inconsistent data, no CLI wrapper or MCP will save you. You’re putting a nice interface on a broken foundation.

I’ve seen teams build elaborate agent toolchains on top of APIs that have undocumented rate limits, inconsistent error formats across endpoints, and pagination that silently drops records under load. The agent performs beautifully in demos. In production, it hallucinates confidence over unreliable data, which is considerably worse than failing outright.

The fix is boring. Audit your API layer. Test it under the conditions your agent will actually encounter: concurrent requests, token refresh during long operations, malformed responses from upstream services. If you find problems, fix them before you build anything else. I know this sounds obvious. I also know it’s consistently ignored. The last few agents I’ve seen in production were dealing with API inconsistencies that no model, however capable, can reason away: GET to retrieve a single item, POST with filtering criteria to list them. The same resource, two completely different interaction patterns. All of that API weirdness is a tax you pay in wasted tokens and confused reasoning, while the real fix is to go back and make the API consistent. Nobody wants to do that. Everybody wants to add another layer on top.

One layer up, you reach composed access: CLIs and MCPs. This is where you combine one or more API calls into something useful. Auth plus operate on a resource. Caching. Convenience wrappers. Both CLIs and MCPs sit at this level, since they’re both composition layers over raw APIs.

But they’re not equivalent, and this is where I want to take a principled position. The case for CLIs over MCPs isn’t tribal; it’s structural. MCPs made sense before we had better models and skills. They provided a standardized way to give agents access to external systems. Fair enough. But coding agents are very good at Bash. Really, very good. And CLIs give you something MCPs don’t: full Unix composability. Piping. The existing ecosystem of tools that have been battle-tested for decades.

Which is to say, the CLI isn’t just a different interface to the same capability. It’s an interface that brings an entire universe of composable tooling along for the ride. When your agent can pipe the output of one CLI into another, filter it with jq, redirect it to a file, and chain the result into a third command, you get combinatorial power that no MCP can match. Peter Steinberger arrived at the same conclusion building OpenClaw. So have many others. Skills plus CLI is a more powerful combination than MCP: the CLI supplies the Unix toolkit, and skills encode the domain knowledge about how to combine its pieces. Some MCPs claim to include workflows, but that is the exception rather than the rule.
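As a small illustration of that composability seen from the harness side, the sketch below chains three standard tools. It assumes `grep`, `sort`, and `wc` are on the PATH; the `pipeline` helper and the sample log are hypothetical. In a shell the same thing is just `grep 500 | sort | wc -l`, which is rather the point.

```python
import subprocess


def pipeline(*commands, input_text=""):
    """Run commands left to right, feeding each stdout into the next stdin.

    Stages run one after another rather than streaming, which is enough
    to show the composition.
    """
    data = input_text
    for argv in commands:
        result = subprocess.run(
            argv, input=data, capture_output=True, text=True, check=True
        )
        data = result.stdout
    return data


# Hypothetical access log: method, path, status code.
log = "GET /items 200\nPOST /items 500\nGET /items 200\nGET /users 500\n"

# Count the lines that mention a 500.
out = pipeline(["grep", "500"], ["sort"], ["wc", "-l"], input_text=log)
print(out.strip())
```

None of the three tools knows about the others, and none needed a schema or an adapter to be combined.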

Someone might reasonably ask: why does this composition-layer distinction matter for the hierarchy? Because the composition layer is where agents spend most of their reasoning budget. If your agent is making raw HTTP calls and managing auth flows and handling pagination, that’s reasoning capacity that isn’t going to the actual task. The composition layer exists to make the foundation disappear, to give the agent pre-composed tools so it can think about the problem, not the plumbing. The diagnostic here: giving an agent raw API access without composed tooling is like giving someone a pile of lumber instead of a toolkit. The agent can make the HTTP calls, but it’ll spend most of its reasoning budget on undifferentiated heavy lifting.
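Here is a minimal sketch of what "making the foundation disappear" can look like for one kind of plumbing, pagination. The page shape (`items`, `next_cursor`) and the in-memory `fake_fetch` are assumptions for illustration; a real wrapper would also own auth and retries.

```python
from typing import Callable, Iterator, Optional

# Assumed page shape: {"items": [...], "next_cursor": "<token>" or None}
Page = dict


def list_all(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[dict]:
    """Follow cursors until the API reports that there is no further page."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if cursor is None:
            break


# Hypothetical three-page API held in memory.
_PAGES = {
    None: {"items": [{"id": 1}, {"id": 2}], "next_cursor": "p2"},
    "p2": {"items": [{"id": 3}], "next_cursor": "p3"},
    "p3": {"items": [{"id": 4}], "next_cursor": None},
}


def fake_fetch(cursor: Optional[str]) -> Page:
    return _PAGES[cursor]


ids = [item["id"] for item in list_all(fake_fetch)]
print(ids)
```

The agent calls one tool and gets every record; the cursor handling never reaches its context window.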

Where Domain Knowledge Lives

This is the layer where things get genuinely interesting, and it’s the layer most teams skip. Not because they don’t understand it, but because encoding domain knowledge is harder than it sounds and less impressive than autonomy in a demo.

Skills are workflows, SOPs, procedures: domain knowledge made executable. They span CLIs and MCPs, or combine them. A deployment procedure. A content creation workflow. The specific sequence of steps your team follows when onboarding a new service, including the weird step where you have to manually update that one config file because nobody ever automated it.

Before I go further, I should briefly define what an agent actually is in this context, since the term has become almost uselessly overloaded. An agent, for our purposes, is a system that uses a language model to decide which tools to call, in what order, and with what parameters, to accomplish a task. It’s the decision-making layer on top of the tools. That’s it. The mystique around the word obscures what is, structurally, a fairly straightforward loop: observe context, pick an action, execute, observe the result, repeat. If this sounds familiar, it should. John Boyd formalized the same cycle as the OODA loop (Observe, Orient, Decide, Act) in the context of military decision-making. Agents aren’t doing anything conceptually new. They’re running OODA loops with language models as the orientation and decision layer.

Figure: The OODA loop (Observe, Orient, Decide, Act), the same decision cycle that agents run, formalized by John Boyd for military strategy.
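The loop is small enough to write down. This is a sketch, not any particular framework’s API: `run_agent`, the tool names, and `scripted_policy` are all hypothetical, and the scripted policy stands in for the language-model call that would normally choose the next action.

```python
def run_agent(task, choose_action, tools, max_steps=10):
    """Observe context, pick an action, execute, observe the result, repeat."""
    context = [("task", task)]
    for _ in range(max_steps):
        # Orient and decide: in a real agent this is a model call.
        action, args = choose_action(context)
        if action == "done":
            return context
        # Act, then observe by appending the result to the context.
        result = tools[action](**args)
        context.append((action, result))
    raise RuntimeError("step budget exhausted")


def scripted_policy(context):
    """Hypothetical stand-in for the model: a fixed two-step plan."""
    seen = [name for name, _ in context]
    if "check_ci" not in seen:
        return "check_ci", {"branch": "main"}
    if "deploy" not in seen:
        return "deploy", {"target": "staging"}
    return "done", {}


tools = {
    "check_ci": lambda branch: f"ci green on {branch}",
    "deploy": lambda target: f"deployed to {target}",
}

history = run_agent("ship it", scripted_policy, tools)
print(history)
```

Everything interesting about a given agent lives in `choose_action` and in the quality of the entries in `tools`, which is why the layers below this one matter so much.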

Now, why does the skills layer matter so much? Because it’s the precondition for autonomy. Without encoded domain knowledge, autonomy is just confident improvisation. The agent will chain tools together, impressively even, but without knowing your deployment procedure or your content creation workflow, it’s guessing at the “what,” not just the “how.”

This is worth sitting with for a moment. An autonomous agent without skills isn’t autonomous in any meaningful sense. It’s a very capable improviser. It can look at context clues, infer likely workflows, and produce something that looks right. But “looks right” and “follows the procedure that keeps production stable” are different things, and the gap between them is where incidents live.

Let me make this concrete. Consider a deployment skill for a typical microservice. The skill encodes a specific sequence:

1. Check the CI status on the target branch
2. Verify that the staging environment passed its smoke tests
3. Pull the latest approved image tag from the registry
4. Run the canary deployment to 5% of traffic
5. Monitor error rates for ten minutes
6. If error rates stay below threshold, proceed to 25%, then 50%, then full rollout
7. If at any point error rates spike, automatically roll back to the previous image tag and notify the on-call channel
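The steps above can be sketched as an executable procedure. The `ops` object and every method on it are hypothetical placeholders for your own tooling, the 1% error threshold is an assumed value, and `FakeOps` exists only so the sketch runs.

```python
def deploy(ops, error_threshold=0.01, stages=(5, 25, 50, 100)):
    """Canary rollout with automatic rollback.

    `ops` is a hypothetical adapter over CI, the registry, traffic
    routing, and paging; the threshold is an assumed value.
    """
    if not ops.ci_passed():
        return "aborted: CI not green"
    if not ops.staging_smoke_passed():
        return "aborted: staging smoke tests failed"
    tag = ops.latest_approved_tag()
    for percent in stages:
        ops.route_traffic(tag, percent)
        if ops.observed_error_rate(minutes=10) > error_threshold:
            ops.rollback()
            ops.notify_on_call(f"rolled back {tag} at {percent}%")
            return f"rolled back at {percent}%"
    return f"deployed {tag}"


class FakeOps:
    """In-memory stand-in that replays a scripted series of error rates."""

    def __init__(self, error_rates):
        self.error_rates = iter(error_rates)
        self.log = []

    def ci_passed(self):
        return True

    def staging_smoke_passed(self):
        return True

    def latest_approved_tag(self):
        return "v1.4.2"

    def route_traffic(self, tag, percent):
        self.log.append(("route", percent))

    def observed_error_rate(self, minutes):
        return next(self.error_rates)

    def rollback(self):
        self.log.append(("rollback",))

    def notify_on_call(self, message):
        self.log.append(("notify", message))


healthy = FakeOps([0.001, 0.002, 0.001, 0.003])
spiking = FakeOps([0.001, 0.08])
print(deploy(healthy))
print(deploy(spiking))
```

Note what the sketch does not contain: anything specific to your environment. Those steps would be extra lines in `deploy`, and they are the lines an improvising agent would never write.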

Without this skill, an agent given a “deploy to production” instruction will improvise something reasonable. It might even get most of the steps right. But it will almost certainly miss something specific to your environment: perhaps the step where you have to drain the connection pool on the legacy service before cutting over, or the check against the feature-flag service to make sure no half-rolled-out experiments get caught in the deploy. Those specifics are exactly what make deployments safe, and they’re exactly what can’t be inferred from general knowledge.

There’s also a forcing-function insight here that I think is underappreciated. The process of encoding skills for an agent forces institutional clarity. When you sit down to write a deployment skill, you discover that your “deployment process” is actually three different processes that three different engineers follow, with subtle variations that have never been reconciled. The skill-encoding process is valuable even if you never give it to an agent, because it forces you to confront the gap between what you think your procedures are and what they actually are. Teams that get this layer right share a common trait: they’ve already done the hard work of documenting their procedures for humans. The leap from “written SOP” to “executable skill” is shorter than the leap from “tribal knowledge in someone’s head” to “executable skill.” If your team can’t describe the procedure to a new hire, it can’t encode it for an agent.

Claude Code is a useful reference point here, not because it’s perfect, but because it demonstrates what becomes possible when skills sit on well-composed tooling. It operates with skills layered on top of CLI access, which means the agent can encode complex workflows (project scaffolding, refactoring sequences, test-then-commit cycles) as skills that leverage the full power of the underlying Unix environment. The skills aren’t just prompt templates; they’re structured procedures that combine tool calls, conditional logic, and domain knowledge. That combination of skills plus CLI composability is what makes it genuinely useful rather than merely impressive.

The diagnostic for this layer: could a new team member follow a written version of what your agent does? If the procedure only exists as agent behavior, not as documented workflow, your skills layer is implicit. And implicit skills are fragile skills.

Autonomy and What Comes After

Up until the skills layer, everything is human-invoked. You ask the agent to run a skill. You tell it which CLI to use. You invoke the API call. At the autonomy layer, the agent starts deciding what to invoke based on context, environment, and user preferences.

There are two distinct flavors here, and the difference matters. Reactive autonomy is what you see in chat-based agents. The agent picks the right skill without being told which one, but it still waits for you to ask. You say “deploy the staging environment” and it selects the deployment skill, checks the prerequisites, and runs it. This is valuable, but it’s still fundamentally human-initiated.

Proactive autonomy is where things get genuinely different. The agent monitors events, runs on schedules, and acts on triggers without waiting to be asked. It notices that a dependency has a security update, checks your update policy, runs the test suite, and opens a PR while you’re asleep. This is the “magic” angle, and it’s real, but only if the layers below are solid.
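The security-update example can be sketched as an event handler. All the names here are hypothetical: the `Advisory` fields, the `auto_update_severities` policy key, and the injected `run_tests` and `open_pr` callables stand in for whatever your environment provides.

```python
from dataclasses import dataclass


@dataclass
class Advisory:
    package: str
    fixed_version: str
    severity: str  # assumed values: "low", "medium", "high"


def handle_advisory(advisory, policy, run_tests, open_pr):
    """Proactive path: triggered by an event, not by a user request."""
    # Check the update policy before doing anything.
    if advisory.severity not in policy["auto_update_severities"]:
        return "skipped: policy requires a human"
    # Run the test suite against the proposed version.
    if not run_tests(advisory.package, advisory.fixed_version):
        return "stopped: tests failed"
    # Only then open the PR.
    return open_pr(f"Bump {advisory.package} to {advisory.fixed_version}")


policy = {"auto_update_severities": {"medium", "high"}}
advisory = Advisory(package="examplelib", fixed_version="1.2.3", severity="high")

result = handle_advisory(
    advisory,
    policy,
    run_tests=lambda package, version: True,
    open_pr=lambda title: f"PR opened: {title}",
)
print(result)
```

Every branch in this handler leans on a lower layer: the policy is a skill, and the test run and the PR are composed tools over APIs.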

OpenClaw is a good example here. Steinberger built on solid foundations (CLIs over MCPs, well-composed tooling) and the result is an agent that can operate proactively because the layers below it are reliable. The proactive behavior isn’t the innovation. The reliable foundation that makes proactive behavior trustworthy is the innovation.

I should note: building multi-agent systems at scale amplifies this dynamic. When you have multiple agents coordinating, a shaky foundation doesn’t just produce errors; it produces cascading errors. Agent A makes a decision based on unreliable data from the API layer, passes that decision to Agent B, which compounds the error. Foundation problems can’t be papered over with better reasoning at the top. The trap at this layer is investing in autonomy before the layers below can support it. An agent that proactively does the wrong thing is worse than an agent that waits for you to ask.

At the very top sits self-improvement: knowledge extraction at the end of each session. Extracted knowledge becomes memory. Memory compounds over time. The agent internalizes user workflows and company procedures, and learns to anticipate what should be done or which skill to invoke.

This is the top of the pyramid, and the honest diagnostic is simple: almost nobody is here yet, and that’s fine. If your layers below are solid, you’ll get here. If they’re not, self-improvement just means your agent gets progressively better at doing the wrong things. Memory without judgment is just a growing pile of notes. Self-improvement compounds only when the agent already knows how to act well on its own.

What does compounding look like when it actually works, though? It’s worth being concrete. In session one, the agent learns that your team’s PR template requires a specific “Test Plan” section. In session five, it has internalized that your test plans follow a pattern: unit tests for logic changes, integration tests for API changes, manual QA steps for UI changes. By session twenty, it drafts the test plan section automatically, matched to the type of change, before you ask. Each piece of knowledge makes the next piece more useful. The PR template knowledge is mildly helpful on its own; combined with the testing pattern knowledge, it becomes genuinely predictive. That’s what compounding means in practice. It’s not just accumulation; it’s the interaction between accumulated pieces that creates something more useful than their sum.
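A toy sketch of that interaction, assuming a plain dictionary as the memory store. The key names and the `draft_test_plan` helper are hypothetical; the only thing the sketch is meant to show is that the second fact changes what the first one is worth.

```python
memory = {}


def remember(key, value):
    """Store one extracted piece of knowledge."""
    memory[key] = value


def draft_test_plan(change_type):
    """Return a test-plan section, or None if memory does not call for one."""
    if not memory.get("pr_template.requires_test_plan"):
        return None
    patterns = memory.get("test_plan.patterns", {})
    if change_type not in patterns:
        # The agent knows the section is required but not what goes in it.
        return "## Test Plan\nTODO"
    return f"## Test Plan\n{patterns[change_type]}"


# Session one: the agent learns that the template needs a section.
remember("pr_template.requires_test_plan", True)
after_session_one = draft_test_plan("api")

# Session five: it learns which kind of testing goes with which change.
remember(
    "test_plan.patterns",
    {
        "logic": "unit tests",
        "api": "integration tests",
        "ui": "manual QA steps",
    },
)
after_session_five = draft_test_plan("api")

print(after_session_one)
print(after_session_five)
```

Neither fact alone produces the useful draft. The value appears where they meet.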

Or consider a different axis: the agent learns your code review preferences. First, it learns that you prefer early returns over nested conditionals. Then it notices that you always flag functions longer than thirty lines. Eventually it connects these preferences to your broader principle (which you never explicitly stated) that readability matters more than cleverness, and starts applying that principle to novel situations you haven’t reviewed together. That kind of emergent understanding is what separates genuine self-improvement from a lookup table of past corrections.

The Honest Self-Assessment

Here’s a practical exercise. For each layer, answer honestly, and I mean with the kind of honesty that’s uncomfortable rather than the kind that confirms what you already believe.

APIs: Can your agent’s API calls run for 8 hours without hitting an unhandled error? Not a theoretical 8 hours; actually run it. If you hit an unhandled error at hour 3, that’s your answer. If you can’t run the test because “the environment isn’t set up for that,” that’s also your answer. A reasonable threshold: fewer than one unhandled error per thousand API calls under production-realistic load.
CLIs/MCPs: Does your agent spend more than 20% of its reasoning on plumbing (auth, pagination, error handling) versus the actual task? If you’re not sure, that’s already an answer. Examine your agent’s trace logs and count the reasoning tokens spent on infrastructure versus domain logic. If the ratio surprises you, it will surprise you in the wrong direction.
Skills: Could a new team member follow a written version of what your agent does? If the procedure only exists as agent behavior, not as documented workflow, your skills layer is implicit. Pick your agent’s three most common workflows. Write them out as step-by-step instructions. If you can do this in an afternoon, you’re close. If it takes a week of interviewing different engineers to piece together what actually happens, your skills gap is larger than you thought.
Autonomy: When your agent acts without being asked, do you trust the result enough to not review it? If you review every proactive action, you have the overhead of autonomy without the benefit. Track the percentage of proactive agent actions that require human correction over a month. If it’s above 15%, the autonomy is costing more in oversight than it’s saving in automation.
Self-Improvement: Is your agent measurably better at its job than it was a month ago? Not in a “the model improved” sense, but in a “it has internalized our specific workflows and preferences” sense.
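Three of these diagnostics come with numbers, so they can be checked mechanically. The sketch below assumes you can already pull the six counts from your logs; the metric names and the sample values are made up for illustration. The skills and self-improvement questions stay qualitative and are left out.

```python
def assess(metrics):
    """Apply the three numeric thresholds from the checklist above."""
    error_rate = metrics["unhandled_errors"] / metrics["api_calls"]
    plumbing_share = metrics["plumbing_tokens"] / metrics["reasoning_tokens"]
    correction_rate = metrics["corrected_actions"] / metrics["proactive_actions"]
    return {
        # Fewer than one unhandled error per thousand API calls.
        "apis": error_rate < 1 / 1000,
        # No more than 20% of reasoning spent on plumbing.
        "cli_mcp": plumbing_share <= 0.20,
        # No more than 15% of proactive actions needing human correction.
        "autonomy": correction_rate <= 0.15,
    }


# Hypothetical month of telemetry.
sample = {
    "unhandled_errors": 3,
    "api_calls": 12_000,
    "plumbing_tokens": 41_000,
    "reasoning_tokens": 120_000,
    "corrected_actions": 9,
    "proactive_actions": 80,
}

verdict = assess(sample)
print(verdict)
```

In this made-up sample the APIs and the autonomy numbers pass while the composition layer fails: roughly a third of the reasoning is going to plumbing.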

Most teams I talk to are genuinely solid at Layer 1 for the APIs they control, shaky at Layer 1 for third-party APIs, inconsistent at Layer 2, absent at Layer 3, and ambitious at Layer 4. The gap between Layer 2 and Layer 4, the missing skills layer, is where most agent projects go sideways.

Where the Weight Falls

If you’ve been reading carefully, you’ve noticed that the layer descriptions were in service of the diagnostics, not the other way around. That’s deliberate. The pyramid isn’t a taxonomy for organizing your thinking about agents; it’s a tool for finding the gap between where you think you are and where you actually are. And the gap is almost always further down than expected.

I think this happens for three predictable reasons.

First, the top layers are more fun. Autonomy is a better demo than API reliability. Self-improvement is a better conference talk than consistent pagination. The incentive structure of the industry (funding, attention, conference slots) rewards the top layers. Nobody gets venture capital for fixing their API layer.

Second, the bottom layers feel solved. APIs have been around forever. CLIs are old technology. The assumption is that these are commodity problems, already handled. But “commodity” and “reliable” aren’t synonyms. Your team’s specific API surface, with its specific quirks and rate limits and auth flows, is not a commodity. It’s a custom integration that needs the same engineering attention as anything else.

Third, the failures are invisible from the top. When an autonomous agent makes a bad decision, the natural instinct is to debug the autonomy logic. Was the prompt wrong? Did the agent misinterpret the context? Should we add more memory? But often the agent’s reasoning was fine; it was working with bad data from a broken API, or it was improvising because no skill encoded the correct procedure. The failure looks like an autonomy problem but is actually a foundation or skills problem.

This is the structural pattern worth naming: failures propagate upward, but attention flows to the layer where the symptom appears, not the layer where the cause lives. And since symptoms always appear at the top (that’s where the user-facing behavior is), attention gets systematically misallocated.

We’re making the same mistake Rogati identified in 2017, just with different technology. The pyramid I’ve described here (APIs, CLIs and MCPs, Skills, Autonomy, Self-Improvement) maps the same insight to agentic systems. And the diagnostic is the same: look down, not up. Find the lowest broken layer, fix it, and the layers above will improve without being touched.
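"Find the lowest broken layer" is a one-loop procedure once you have honest answers per layer. A minimal sketch, with a hypothetical `lowest_broken_layer` helper; treating an unassessed layer as broken is my assumption, in the spirit of "if you’re not sure, that’s already an answer."

```python
# Layers in order, bottom of the pyramid first.
LAYERS = ["APIs", "CLIs and MCPs", "Skills", "Autonomy", "Self-Improvement"]


def lowest_broken_layer(healthy):
    """Return the first layer, bottom up, that is not known to be healthy."""
    for layer in LAYERS:
        # A missing answer counts as broken.
        if not healthy.get(layer, False):
            return layer
    return None


answers = {
    "APIs": True,
    "CLIs and MCPs": True,
    "Skills": False,
    "Autonomy": False,
}

print(lowest_broken_layer(answers))
```

The function ignores everything above the first failure, which is the whole discipline: until that layer is fixed, the state of the layers above it is not information.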

The question isn’t whether this pattern will repeat itself. It already is repeating. The question is whether you’ll be one of the teams that recognizes it early enough to invest correctly. Most won’t. The incentives are too strong, the demos are too compelling, and the bottom layers are too boring. But the teams that get the foundation right are the ones whose autonomous agents actually work in production, not just in demos. And they’re the ones who, two years from now, will look like they had some unfair advantage, when really, they just didn’t skip layers.

That’s the entire argument. Look down, not up. Fix the foundation. The rest follows, or it doesn’t follow at all.

References

Maslow, A. H. (1943). “A Theory of Human Motivation.” Psychological Review, 50(4), 370-396. Full text available from York University’s Classics in the History of Psychology archive.
Rogati, M. (2017). “The AI Hierarchy of Needs.” HackerNoon. https://medium.com/hackernoon/the-ai-hierarchy-of-needs-18f111fcc007
Boyd, J. R. (1996). “The Essence of Winning and Losing.” Unpublished briefing. A summary of the OODA loop concept is available at https://en.wikipedia.org/wiki/OODA_loop