March 2026
Testing the Prompt
Lectr’s recommendation feature passes an anonymised reading profile to Claude’s API and returns book suggestions. The previous post described how the anonymisation works. This post is about the part that came next: figuring out what to actually say in the prompt, and how I tested whether it mattered.
I ran seven iterations of experiments against one real library (my own, 521 books) and roughly 600 API calls. The tests used paired trials: the same profile sent with different prompts or different levels of anonymisation, then compared by title overlap and semantic similarity. The evaluation model was all-MiniLM-L6-v2, independent of the Apple NLEmbedding used for the proxy mapping.
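The two comparison metrics are simple enough to sketch. Here is a minimal version, assuming Jaccard-style title overlap and cosine similarity over whatever embedding vectors you use (the actual runs embedded recommendation text with all-MiniLM-L6-v2; the function names here are my own):

```python
def title_overlap(a: list[str], b: list[str]) -> float:
    """Jaccard overlap between two recommendation lists: |A ∩ B| / |A ∪ B|."""
    sa, sb = {t.lower() for t in a}, {t.lower() for t in b}
    return len(sa & sb) / len(sa | sb)

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors (e.g. pooled list embeddings)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sum(x * x for x in u) ** 0.5
    norm_v = sum(x * x for x in v) ** 0.5
    return dot / (norm_u * norm_v)
```

In a paired trial, both functions run on the outputs of two conditions fed the same profile; overlap catches exact title agreement, cosine catches thematic agreement even when the titles differ.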
The Starting Point
The v1 prompt was straightforward. It listed the profile data (themes, tag clusters, sentiment, engagement scores) and asked Claude to recommend ten books as a JSON array. It included a note explaining that the terms were “semantic proxy tokens mapped from a fixed literary codebook” so Claude would interpret them as thematic descriptors rather than literal labels.
The raw profile (actual tag names like “Self Help” and “Buddhism”) worked well. It shifted Claude’s recommendations from literary fiction into psychology and philosophy, which matched my reading. The anonymised profile was the problem. After proxy mapping, the recommendations drifted back toward Claude’s defaults. The privacy layer was degrading the signal.
Measuring the Gap
I measured this by comparing each condition’s output to a null baseline (recommendations generated with no profile at all). The further from null, the more the profile is steering.
| Iteration | Proxied vs null similarity | Raw vs null similarity |
|---|---|---|
| v1 (baseline prompt) | 0.573 | 0.470 |
Lower means further from null, which means more profile influence. The proxied recommendations (0.573) were noticeably closer to null than the raw ones (0.470). The anonymisation was costing about a third of the steering effect.
Iterations 2 through 4: Sharpening the Data
My first instinct was that the proxy mapping was too loose. The v1 mapper expanded each tag into three proxy terms using a mean merge, which roughly doubled the term count. More terms meant a flatter, more generic signal.
Over three iterations I tightened the mapping (fewer proxy terms per tag, max instead of mean for merging) and added richer signals from the app: co-occurrence pairs (tags that appear together on the same books), themes extracted from 3,084 saved quotes, and engagement-weighted aggregation.
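The merge change is easiest to see in code. Below is a minimal sketch of the aggregation step, with hypothetical tag names and scores (the real mapper scores proxy candidates against the codebook with NLEmbedding; here the scores are simply given):

```python
from statistics import mean

def merge_proxy_scores(tag_to_proxies: dict[str, dict[str, float]],
                       count: int = 1, merge: str = "max") -> dict[str, float]:
    """Collapse per-tag proxy candidates into one profile-level score per proxy term.

    count: top proxy terms kept per tag (v1 used 3; v2 tightened to 1-2).
    merge: 'mean' averages scores when tags share a proxy, flattening the
           signal (v1); 'max' keeps the sharpest match (v2 onward).
    """
    collected: dict[str, list[float]] = {}
    for proxies in tag_to_proxies.values():
        top = sorted(proxies.items(), key=lambda kv: kv[1], reverse=True)[:count]
        for term, score in top:
            collected.setdefault(term, []).append(score)
    agg = max if merge == "max" else mean
    return {term: agg(scores) for term, scores in collected.items()}
```

With `count=3, merge="mean"` a 20-tag profile can balloon to dozens of softened terms; with `count=1, merge="max"` it stays close to one sharp term per tag, which is the flattening-vs-sharpening trade-off described above.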
| Iteration | Proxied vs null similarity | What changed |
|---|---|---|
| v1 | 0.573 | Baseline (count=3, mean merge) |
| v2 | 0.544 | Sharper proxies (count=1/2, max merge) |
| v3+v4 | 0.519 | Co-occurrence, quote themes, engagement weighting |
Each iteration moved the proxied output further from null. But the improvements were getting smaller. And the raw profile wasn’t moving at all, because it was already working. I was closing the gap from the wrong side.
Iteration 5: Changing the Prompt
The v5 experiment changed the prompt instead of the data. Instead of listing profile signals and asking for recommendations, the new prompt told Claude to do two things in sequence: first infer three to five “reading appetites” from the profile, then use those appetites to select books.
It also introduced a signal hierarchy (engagement depth is the strongest signal, broad themes are the weakest), a split between anchor picks and exploratory picks, and a requirement to reference at least two profile signals per recommendation.
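A hedged reconstruction of that structure as a prompt template follows; the wording, and the ordering of the middle signals in the hierarchy, are my paraphrase rather than the production prompt:

```python
def build_v5_prompt(profile_block: str) -> str:
    """Hypothetical reconstruction of the v5 two-step prompt structure."""
    return f"""You will receive an anonymised reading profile. Its terms are
semantic proxy tokens mapped from a fixed literary codebook; treat them as
thematic descriptors, not literal labels.

Signal hierarchy, strongest to weakest: engagement depth, tag co-occurrence,
quote themes, broad themes.

Step 1: Infer 3-5 "reading appetites" from the profile.
Step 2: Using those appetites, recommend 10 books as a JSON array, split
between anchor picks close to the appetites and exploratory picks adjacent
to them. Each recommendation's reason must reference at least two profile
signals.

Profile:
{profile_block}"""
```

The key structural move is that Step 2 cannot be answered without committing to Step 1 first, which is what forces the synthesis discussed below.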
| Metric | v1 prompt | v5 prompt |
|---|---|---|
| Raw vs proxied title overlap | 0.037 | 0.126 |
| Proxied vs null similarity | 0.519 | 0.486 |
Title overlap between raw and proxied recommendations more than tripled. The anonymised profile was now producing books much closer to what the raw profile produced. And the proxied-null similarity (0.486) was nearly identical to the raw-null similarity (~0.48): the anonymised profile was steering just as far from Claude’s defaults as the raw one. The gap had closed.
The inferred appetites were also readable. From the anonymised profile, Claude produced things like “practical wisdom for personal transformation” and “psychology of human nature and behavior.” These matched my actual reading, despite being derived from proxy terms like “contemplative” and “resilience” rather than my real tags.
Why Appetites Helped
My theory is that the intermediate reasoning step forced Claude to synthesise the proxy terms into coherent interests before selecting books. Without it, Claude was pattern-matching on individual terms. A proxy term like “healing” in isolation could pull in any direction. But once Claude committed to an appetite like “psychology of human nature,” its book selection became more focused.
This is consistent with how chain-of-thought prompting works in other contexts. Asking a model to state its reasoning before answering tends to produce more consistent outputs. The appetite step served the same function for recommendations.
The Mistake I Made First
Before any of this, I almost concluded the profile wasn’t working at all.
My initial analysis compared raw-null similarity (0.48) to raw-proxied similarity (0.48) and read them as equivalent. It looked like the profile and null produced the same kind of output. I called this the “Madam Marie” hypothesis: the model was doing cold reading, using the profile only as material for plausible-sounding reasons while selecting the same books it would have selected anyway.
The error was that I hadn’t computed a baseline for what “same” looks like. When I measured null-null similarity (how similar two no-profile runs are to each other), it came back at 0.745. Relative to that, 0.48 represents a 36% divergence. The profile was producing categorically different books. I just hadn’t measured correctly.
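The corrected calculation is a one-liner: normalise each cross-condition similarity by the same-condition (null-null) baseline before reading it as “same” or “different”:

```python
def divergence_from_baseline(similarity: float, baseline: float) -> float:
    """Fraction by which a cross-condition similarity falls short of the
    same-condition baseline. 0.0 means indistinguishable from rerunning
    the same condition; higher means genuinely different output."""
    return 1.0 - similarity / baseline

# Two no-profile runs agree at 0.745; a profiled run vs a null run sits at 0.48.
divergence = divergence_from_baseline(0.48, 0.745)  # ~0.36, i.e. ~36% divergence
```

An absolute 0.48 looked like “the same books”; relative to a 0.745 baseline it is anything but.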
This was caught by external review, not by me.
The Signal Ceiling
One other finding came out of the experiments. I ran a minimal-profile ablation: just the top five tags by engagement depth plus three tag clusters. No themes, no sentiment, no quote data. This tiny profile steered genre just as far from null as the full profile.
The full profile did influence which specific books got picked within a genre. And it produced more specific explanation text (about 5 profile terms per reason vs 2.4 for the minimal profile). But for the coarse question of “does Claude recommend psychology or fiction,” five tags were enough.
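The ablated profile is trivial to express, which is part of the point. A sketch, with hypothetical field names:

```python
def minimal_profile(tag_engagement: dict[str, float],
                    clusters: list[list[str]],
                    n_tags: int = 5, n_clusters: int = 3) -> dict:
    """Ablated profile: top tags by engagement depth plus a few tag clusters.
    Deliberately omits themes, sentiment, and quote data."""
    top = sorted(tag_engagement.items(), key=lambda kv: kv[1], reverse=True)
    return {"tags": [tag for tag, _ in top[:n_tags]],
            "clusters": clusters[:n_clusters]}
```

Eight values of payload matched the full profile on coarse genre steering; only the within-genre picks and the explanation text needed more.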
I kept the full profile in production anyway. The within-genre influence and richer explanations seem worth the extra payload, though I haven’t run a blinded evaluation to confirm that.
What Shipped
The v5 “infer appetites first” prompt is what runs in production. The data improvements from iterations 2 through 4 also shipped. Together, they closed the proxy gap almost completely on the metrics I measured.
The methodology has known limitations: one profile, one model, no inferential statistics. Testing Lectr’s Recommendations covers what I can and can’t claim from this data.