Behavioral DNA as Agent Configuration

March 2026

Tell an AI agent "be concise" and watch what happens. Claude will give you three paragraphs. GPT will give you five. Switch to a different conversation and the same model forgets what "concise" meant the first time. Ask it to be "bold but diplomatic" and you've introduced a contradiction with no resolution mechanism.

Static system prompts fail because they're ambiguous, unmeasurable, and non-portable. We built a system that replaces prose-based personality with measurable behavioral dimensions. Here's what we learned.

The three failures of static prompts

Ambiguity. "Be concise" means different things to different models, different prompt contexts, and different conversation lengths. There's no shared definition. When you set a system prompt, you're writing a wish, not a specification.

No feedback loop. You set the prompt and hope. There's no measurement of whether the model actually behaved as configured. Did your "bold" agent take a risk in that architecture decision, or did it hedge? You'd have to read every response to know.

No portability. A personality tuned for Claude doesn't transfer to GPT or Gemini. You're rewriting prompts per provider, and there's no way to verify that your rewrites produce equivalent behavior.

Dimensions over prose

Praxis replaces prose descriptions with measurable behavioral dimensions, each scored 0-100 with named poles on either end. Instead of "be bold," you configure a specific point on a spectrum between two clearly defined behaviors.

The key design decision is a dual-percentage format. Every dimension renders as two poles summing to 100%. The model doesn't have to figure out scale direction or what a raw number means on an unlabeled axis. Both poles are always visible, always explicit.

This format appears everywhere in the system: system prompts, UI sliders, exported configs, MCP responses, radar charts. One representation, zero translation between layers. That consistency turned out to matter more than we expected — it means any integration point can render a personality without knowing the internals.

The schema problem

We store behavioral configuration as a flexible document rather than fixed columns. Adding a new dimension doesn't require a migration. The dimension definitions live in code, the values live in the database, and they meet at runtime.

Each dimension also carries banded guidance — behavioral anchors for score ranges that translate abstract numbers into concrete behavioral descriptions. Without these, a score of 73 is meaningless. With them, it maps to a specific instruction the model can act on. The bands solve the interpretation gap between "what the user configured" and "what the model understands."

Archetypes: making high-dimensional config legible

Nobody thinks in high-dimensional coordinate spaces. So we built a taxonomy on top of the dimensional system. A subset of the dimensions drive archetype classification — deterministic, pure function, no ML. The remaining dimensions provide individuality within an archetype. Two agents of the same archetype can feel completely different.

The result: users think in terms like "I want this personality for code review and that one for brainstorming." The system thinks in dimensional vectors. The archetype layer translates between the two.

Behavioral fingerprints

Each dimensional configuration compresses down to a short alphanumeric code — a compact, shareable behavioral fingerprint. Two configurations with the same code behave identically at the archetype level and similarly across the fine-grained dimensions. It's content-addressable identity.

Calibration: measuring what you configured

Here's where it gets interesting. Configuration is not behavior. A user configures their agent to be bold, but does the agent actually take calculated risks? Everyone else in this space sets prompts and hopes. We measure.

We run standardized scenarios through the configured agent — situations designed to exercise the behavioral dimensions. A separate evaluation step scores the agent's actual behavior against the configured values, producing a configured-vs-measured comparison for each dimension.

The delta between configured and measured is the most valuable signal in the system. It tells you where the model's base personality is fighting your configuration. We've found that certain dimensions consistently drift in predictable directions — most models are trained toward caution, most models are trained to defer. These are systematic biases you can only see if you measure.

We deliberately use a lightweight model for calibration. Cheap enough to run frequently, good enough to detect drift. We care about trend lines more than precision on any single run.

Portability through MCP

The Praxis MCP server exposes sleeve configuration as both resources and tools. A single user can have different personalities for different tasks — switching via tool call, not prompt rewriting. The configuration travels with the user across clients. Claude Code, Cursor, any MCP-capable client gets the same behavioral configuration from the same source of truth.

The identity problem

Over 395 commits and 5 months, the hardest lesson had nothing to do with data models or prompt generation. It was about framing.

Describing behavioral configuration to an agent as a list of settings produces compliance theater. The agent acknowledges the settings and then ignores them. Framing the same configuration as identity — as a self-description the agent internalizes rather than a settings panel it references — produces natural adoption.

The dimensional format is simultaneously configuration (for the system) and self-description (for the agent). That dual nature isn't an accident. It's the core design insight. If the agent reads its own personality and it reads like settings, it treats it like settings — optional, advisory, ignorable. If it reads like identity, it treats it like identity.

What we'd do differently

The behavioral guidance anchors should have been user-editable from the start. Our interpretations of what each score range means aren't universal. A legal team's definition of "direct" is different from an engineering team's. Defaults with per-dimension overrides would serve more use cases.

We'd also invest in model-specific calibration baselines earlier. Different models have different base personalities, and knowing those baselines lets you compensate for systematic drift instead of discovering it after the fact.

The takeaway

Static system prompts are configuration without measurement. Behavioral dimensions are configuration with a feedback loop. If you can't measure whether your agent behaved as configured, you don't have configuration. You have a wish.