TitrateLab

Methodology

Last updated 2026-04-23

This page documents exactly how the numbers on TitrateLab are computed. It exists so reviewers, researchers, and readers can check our work before deciding to trust it. If a figure in one of our articles doesn't match what you'd derive from the pipeline described here, that's a bug, and we want to hear about it.

What this page is

TitrateLab publishes research about the peptide grey market: Certificate-of-Analysis coverage, purity and dose-accuracy distributions, vendor-closure timelines, community sentiment, pricelist drift. Every published figure comes out of one of two corpora (a COA database and a Discord/forum message database) processed through the pipelines described below. We are documenting those pipelines in public, with their known limitations, so that no one has to take a TitrateLab number on faith.

We will not publish a vendor score, a vendor leaderboard, or any per-vendor claim until the open methodology issues flagged at the bottom of this page are resolved. That is an editorial rule, not a future goal. The numbers we do publish today are population-level findings across the full corpus.

Data is current as of 2026-04-23. Counts move daily; the shapes don't.

The COA corpus

The COA database is our ground-truth layer. Every record is a third-party assay of a specific peptide batch from a specific manufacturer, tied, where possible, to a public verification URL that the originating lab will confirm.

Sources, in descending order of volume.

OCR pipeline. Janoshik's public portal exposes verification pages whose structured text is rendered inside an image rather than as HTML. We originally passed these PNGs through a rate-limited third-party OCR service. After hitting cost and throughput ceilings, we migrated the inner loop to Claude Haiku 4.5 vision, validated against a held-out Gemini 2.5 Flash ground-truth set on a random ~5% sample. The two models agree on peptide name, purity percentage, quantity, and manufacturer well above the threshold we need for aggregate analysis. Where they disagree on a specific field, the record is flagged and excluded from aggregates pending manual review.
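
A minimal sketch of that agreement check, assuming each model's output has already been parsed into a flat record; the field names and the numeric tolerance are illustrative, not the production schema:

```python
# Illustrative cross-model agreement check. Field names and the 2% numeric
# tolerance are assumptions, not the production values.
COMPARED_FIELDS = ["peptide_name", "purity_pct", "quantity_mg", "manufacturer"]

def disagreeing_fields(haiku: dict, gemini: dict, rel_tol: float = 0.02) -> list[str]:
    """Return the fields on which the two vision extractions disagree."""
    out = []
    for field in COMPARED_FIELDS:
        a, b = haiku.get(field), gemini.get(field)
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            # Numeric fields: allow a small relative tolerance for rounding noise.
            if abs(a - b) > rel_tol * max(abs(a), abs(b), 1e-9):
                out.append(field)
        elif str(a or "").strip().lower() != str(b or "").strip().lower():
            # String fields: case- and whitespace-insensitive exact match.
            out.append(field)
    return out

# A record with any disagreeing field is flagged and excluded from aggregates
# pending manual review.
```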

Local image caching. The Janoshik public portal purges PNG images from older test records unpredictably. When we crawled historical data, roughly 71% of PNGs older than a few weeks had already been purged: the structured test metadata persists in Janoshik's system, but the image file backing it 404s. We now cache every PNG locally at ingest time, so records OCR'd into structured fields remain auditable even after the upstream image disappears. Anything we didn't capture at first touch is functionally lost.
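
A sketch of the cache-at-ingest step, under the assumption that each record carries a resolvable PNG URL; the cache location and content-hash naming are illustrative:

```python
import hashlib
import pathlib
import urllib.error
import urllib.request

CACHE_DIR = pathlib.Path("data/coa_png_cache")  # illustrative location

def cache_png(image_url: str) -> pathlib.Path | None:
    """Fetch a verification PNG at ingest time and store it locally, keyed by
    a content hash so repeat ingests of the same image are idempotent."""
    try:
        with urllib.request.urlopen(image_url, timeout=30) as resp:
            payload = resp.read()
    except urllib.error.URLError:
        return None  # already purged upstream: nothing left to capture
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / (hashlib.sha256(payload).hexdigest() + ".png")
    path.write_bytes(payload)
    return path
```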

Numbers, as of 2026-04-23.

Outlier filtering. 214 records with absolute quantity deviation greater than 50% are excluded from aggregate deviation statistics as almost certainly OCR errors or vial-label misreads (e.g. a cagrilintide record whose label OCR'd as 1 mg against "11.75 mg tested", when the real label was 10 mg). The excluded records represent less than 3% of the purity-populated corpus. Earlier versions of our analysis that included them produced implausibly high Janoshik aggregates; the filtered version is what we publish today.
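
The filter reduces to a single predicate per record; a sketch, assuming deviation is stored as a signed percentage:

```python
MAX_ABS_QTY_DEVIATION_PCT = 50.0  # beyond this we treat the record as a misread

def usable_for_deviation_aggregates(record: dict) -> bool:
    """Keep a record in quantity-deviation aggregates only if its deviation
    is plausible; records with no deviation value are excluded too."""
    dev = record.get("quantity_deviation_pct")
    return dev is not None and abs(dev) <= MAX_ABS_QTY_DEVIATION_PCT
```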

The Discord corpus

The Discord database is our behavioral layer. It is how we measure what buyers are actually discussing, asking about, complaining about, and recommending in real time across the peptide and biohacking underground. It is also how we find vendor-closure signals (exit-scam language patterns cluster 30 to 60 days after first FDA enforcement news reaches a community).

Overwatch fleet. Our listener infrastructure, codename Overwatch, reaches 1,000+ Discord servers spanning the major peptide, biohacker, bodybuilding, and GLP-1 communities. Bots are invited members of the servers they watch: no scraping, no API rate-limit games, no terms-of-service violations on the Discord side. The scanner captures every message in every channel it reaches.

Numbers, as of 2026-04-23.

Two-stage filter. Stage one is a regex match against the peptide-campaign keyword set (tirzepatide, retatrutide, BPC-157, TB-500, and so on, plus short codes like "t5" or "reta"). Stage two is a Haiku 4.5 LLM classification that decides whether the stage-one match is actually peptide-relevant. The classifier rejects approximately 21% of stage-one hits as false positives: "roids" used in sports-banter contexts, "sust" as a video-game character name, "HGH" in a rap lyric. Every classification is stored with the model's reasoning and a confidence score, so the methodology is auditable row by row.
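
A sketch of the two stages, with a trimmed keyword set and the Haiku call stubbed out; the real classifier returns a verdict plus the reasoning and confidence score we persist:

```python
import re

# Stage one: cheap regex gate. This keyword set is a small excerpt of the
# campaign list, not the full set.
STAGE_ONE = re.compile(
    r"\b(tirzepatide|retatrutide|bpc[- ]?157|tb[- ]?500|reta|t5)\b",
    re.IGNORECASE,
)

def classify_with_llm(message: str) -> dict:
    """Placeholder for the Haiku 4.5 relevance call; assumed to return
    {"relevant": bool, "reasoning": str, "confidence": float}."""
    raise NotImplementedError

def is_peptide_relevant(message: str) -> bool:
    if not STAGE_ONE.search(message):
        return False                       # stage one: no keyword hit
    verdict = classify_with_llm(message)   # stage two: LLM adjudication
    # Verdict, reasoning, and confidence are stored for row-by-row audit.
    return verdict["relevant"]
```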

Privacy. We do not publish per-user Discord data in any article. All community-sentiment numbers are aggregated across the campaign-tagged corpus with no individual identification. Usernames, user IDs, and guild names are not exposed in any published figure.

Scoring

Vendor rankings (as of 2026-04-24) use a two-axis Bayesian trust score that replaces the naive purity average we started with. The methodology below is the rubric that the chat bot and any vendor leaderboard reference. Source-of-truth code is in scripts/chat/kb.py::score_vendor_trust; the 22-case regression suite in scripts/chat/test_vendor_trust.py is what gates changes.

The two numbers

quality_score (range 0–1). Shrunken posterior-mean grade across the vendor's tested batches, with continuous recency decay and hard penalties for dose, endotoxin, and catastrophic failures.

data_sufficiency (range 0–1). How much fresh-equivalent evidence that quality rests on. A vendor with ten 2-year-old tests has low sufficiency; a vendor with ten tests in the last 60 days has high sufficiency. Untested = 0.

A composite confidence_score = quality_lb × √data_sufficiency is used when a single number is required for ranking, where quality_lb is the lower bound defined below. It falls to zero for untested vendors by design: we do not default unknown vendors to "average", because unknown is its own signal.
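
In code the composite is a single expression; a minimal sketch:

```python
import math

def confidence_score(quality_lb: float, data_sufficiency: float) -> float:
    """Quality lower bound discounted by how much evidence it rests on.
    Untested vendors have sufficiency 0, so they score exactly 0."""
    return quality_lb * math.sqrt(data_sufficiency)
```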

Per-batch grade

Each individual COA is first mapped to a grade in [0, 1]:

  1. If a Finnrick test_score (0–10 composite: identity + dose + purity + endotoxin) is present, use test_score / 10.
  2. Else, if a purity percent is present, use max(0, min(1, (purity − 90) / 10)) — so 90% → 0.0, 100% → 1.0.
  3. Else the batch has no usable quality signal and is excluded from quantity and quality math.
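
A sketch of that mapping; the production version lives in scripts/chat/kb.py, so read this as an illustration of the three rules, not the source:

```python
def batch_grade(test_score: float | None, purity_pct: float | None) -> float | None:
    """Map one COA to a pre-cap grade in [0, 1]; None means no usable signal."""
    if test_score is not None:
        return test_score / 10.0  # rule 1: Finnrick 0-10 composite
    if purity_pct is not None:
        # Rule 2: 90% purity maps to 0.0, 100% maps to 1.0, clamped.
        return max(0.0, min(1.0, (purity_pct - 90.0) / 10.0))
    return None  # rule 3: excluded from the quality math
```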

The grade is then capped:

These caps exist because a vial can be 99% pure and still be wrong: wrong compound identified, half the label dose, or contaminated. Purity alone is a beauty contest when the other signals are failing.

Recency weighting

Each batch contributes with weight w = 0.5^(age_days / 180). A 6-month half-life, continuous. A test from yesterday contributes w ≈ 1.0; a test from 2 years ago contributes w ≈ 0.06. No cliffs, no "day 90 good / day 91 bad" step functions.
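
In code the decay is one line; a sketch:

```python
def recency_weight(age_days: float, half_life_days: float = 180.0) -> float:
    """Continuous exponential decay with a 6-month half-life."""
    return 0.5 ** (age_days / half_life_days)

# recency_weight(1)   ~= 1.0    (yesterday)
# recency_weight(730) ~= 0.06   (two years ago)
```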

Bayesian shrinkage

The raw weighted mean is a noisy estimator for small samples — two lucky tests at 99% look identical to 200 consistent tests at 99%. We shrink toward a corpus prior using a Beta-Binomial posterior:

μ_post = (Σw·g_batch + α) / (Σw + α + β) with α = 5.6, β = 2.4 (equivalent to 8 imaginary batches at the corpus mean grade of 0.70).

A vendor with 2 perfect tests lands near 0.76 (pulled down from 1.0 by the prior). A vendor with 100 perfect tests lands near 0.98 (prior's influence washes out).
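
A sketch of the shrinkage step over (recency weight, grade) pairs, using the α and β above:

```python
ALPHA, BETA = 5.6, 2.4  # 8 pseudo-batches at the corpus mean grade of 0.70

def shrunken_mean(weighted_grades: list[tuple[float, float]]) -> float:
    """Beta-Binomial posterior mean over (weight, grade) pairs."""
    w_sum = sum(w for w, _ in weighted_grades)
    wg_sum = sum(w * g for w, g in weighted_grades)
    return (wg_sum + ALPHA) / (w_sum + ALPHA + BETA)

# shrunken_mean([(1.0, 1.0)] * 2)   ~= 0.76  (2 fresh perfect tests)
# shrunken_mean([(1.0, 1.0)] * 100) ~= 0.98  (100 fresh perfect tests)
```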

Wilson-style lower bound

The posterior mean is a point estimate; we also compute an 80% one-sided lower bound: lb = max(0, μ_post − 0.84 · √(μ_post·(1−μ_post)/(Σw+α+β))). This is what rankings actually sort on. A wide confidence interval (thin data) pulls the lower bound down, so 2-for-2 vendors rank below 50-for-50 vendors with the same point estimate. Same principle as ranking Reddit comments or UCB bandits.
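
A sketch of the lower bound, with the prior strength α + β folded into the effective sample size:

```python
import math

Z_80_ONE_SIDED = 0.84       # z-value for an 80% one-sided interval
PRIOR_STRENGTH = 5.6 + 2.4  # alpha + beta from the shrinkage step

def quality_lower_bound(mu_post: float, w_sum: float) -> float:
    """80% one-sided lower bound on the posterior mean; rankings sort on this."""
    n_eff = w_sum + PRIOR_STRENGTH
    return max(0.0, mu_post - Z_80_ONE_SIDED * math.sqrt(mu_post * (1.0 - mu_post) / n_eff))

# quality_lower_bound(0.76, w_sum=2)   ~= 0.65  (thin data pulls the bound down)
# quality_lower_bound(0.98, w_sum=100) ~= 0.97  (deep data barely moves it)
```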

Critical-failure multipliers

On top of the shrunken grade, we apply two rate-based multipliers:

Dose failures are already folded into the per-batch grade cap (0.4) — no separate multiplier.

Data sufficiency

data_sufficiency = 1 − exp(−W / 8) where W is the weighted sum of batches (fresh-equivalent count). A vendor with 1 fresh test scores ~0.12; with 5 fresh tests ~0.46; with 20+ fresh tests ~0.92. This is the "how much do we know" axis — published alongside the quality score so readers can distinguish "well-characterized as bad" from "well-characterized as good" from "not enough data to say."
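
A sketch of the sufficiency curve, with the scale constant of 8 from the formula above:

```python
import math

def data_sufficiency(weighted_batches: float, scale: float = 8.0) -> float:
    """Saturating evidence measure over the weighted (fresh-equivalent) batch count."""
    return 1.0 - math.exp(-weighted_batches / scale)

# data_sufficiency(1)  ~= 0.12
# data_sufficiency(5)  ~= 0.46
# data_sufficiency(20) ~= 0.92
```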

Why it's different from the Finnrick letter grade

Finnrick publishes a single letter (A–E) per vendor per peptide, based on their internal composite across their own tests. Ours is different because:

When our ranking contradicts Finnrick's, it's almost always because we're including evidence they don't see (Janoshik, community, blend components) or because their letter grade is a point estimate and ours is a confidence-interval lower bound. When Finnrick and we agree, that's a strong signal. When we disagree, the raw per-batch evidence is queryable via the chat bot — we don't ask you to trust either grade, just to check the batches.

Regression protection

Changes to priors, penalty constants, or grade-cap thresholds are gated by a 22-case edge-case suite covering sparsity (untested, 1 test, 100 tests), failure-mode isolation (dose-only, endotoxin-only, all-three), time decay (stale, fresh, mixed), and missing-data handling (null test_score fallback to purity, malformed dates). The suite runs nightly via systemd and alerts Discord on any failure. See scripts/chat/test_vendor_trust.py.

What this supersedes

The earlier disclosure on this page — "we don't publish vendor scores because our internal test_score is bimodal" — described a methodology stuck in development. That description no longer applies. The Bayesian composite above is bounded to [0, 1], non-bimodal, and regression-gated. However, the original caveat about "Janoshik versus Finnrick quantity-deviation disagreement" is still real and is noted under Known issues below; the composite weights both but flags their divergence on individual batches.

Temporal coverage

Most of our data is from the last few months. Keep that in mind before drawing longitudinal conclusions from it.

COA corpus temporal bias. 61% of our lab data is from Q4 2025 and Q1 2026. Manufacturer behavior in early 2024 was different; the regulatory environment was different; the vendor population was different. Drawing conclusions about "2024 peptide quality" from this corpus is inappropriate. Where our articles compare year-over-year quality (Feb 2026 versus Feb 2025, for example), we footnote the sample sizes explicitly.

Discord/forum corpus temporal bias. The Overwatch fleet reached its current scale in late 2025 and has been real-time ingesting at that scale since. Earlier periods are covered by select targeted backfills: specific high-value forum threads, vendor-review archives, the MESO-Rx analytical-lab subforum crawl, Janoshik public-portal expansion. Those backfills are not comprehensive. Longitudinal claims about community sentiment or discussion volume that span the Q3/Q4 2025 boundary should be read as directional, not statistically complete.

What this means in practice. When an article says "HGH sentiment is question-dominant," that finding rests on 267 HGH-tagged messages out of 3,959 peptide-relevant messages out of 5,005 classified so far. The shape has held stable across earlier checkpoints (n=1,500, n=3,000, n=5,005), but it is not a claim about what peptide-buyer sentiment looked like in 2023. We don't have 2023.

What we don't have

The gaps matter more than the coverage, because the gaps tell you where our conclusions can't reach.

Known issues

These are open methodology bugs. Fixing them is work we are doing; publishing them is an editorial choice.

1. Finnrick versus Janoshik quantity-deviation disagreement (unresolved). On identical peptides, Janoshik reports quantity deviation roughly 5× higher than Finnrick's aggregator-lab panels. In Finding 3 of our flagship article, across 12 distinct peptides, Janoshik reads 6 to 9 percentage points heavier than Finnrick every time. That is not noise. Three explanations compete:

We have not resolved the disagreement. Until we do, any quantity-deviation figure we publish is labeled with its source (Finnrick, Janoshik, or combined). The combined aggregate is published with an explicit caveat; we do not treat the two lab pipelines as interchangeable.

2. Manufacturer-string resolution. Our 296 distinct "manufacturer" strings almost certainly reduce to roughly 60 to 100 actual underlying OEMs once the Western-storefront-to-OEM graph is properly resolved. "JKL Peptides," "JKL," and "JKL Biotech" may all refer to the same supply chain. We are rebuilding the mapping and will publish it when the graph is clean. Until then, per-manufacturer aggregates should be read as per-string, with the understanding that some strings collapse into each other.

3. test_score bimodality. Documented under "Scoring" above. Open.

4. HGH dimer-content re-extraction. Our current HGH dimer-content figure (5 of 23 batches with measurable dimerized somatropin) comes from an earlier manually-read 23-batch set. The expanded 60-batch HGH corpus has not yet been re-run for dimer analysis. That's a methodology pass scheduled for the next HGH update, not a published claim about the larger corpus.

How to reach us with corrections

For factual corrections, methodology disputes, missing sources, or anything in our published research that you think misrepresents the record: legal@titratelab.com. We read every message and respond to the substantive ones. Corrections that land in a published article are footnoted with the date of correction and, where the correspondent prefers attribution, a short acknowledgment.

Vendor and manufacturer names are used descriptively in our articles to identify parties in the documentary record. Inclusion is not endorsement. Exclusion is not condemnation. If you represent a vendor or lab and a passage misrepresents your operation, the same address applies.

Every aggregate in a TitrateLab article is reproducible from the corpora described above. If you want to reproduce a specific figure and cannot, email us: we will either point you at the query, correct the article, or correct the pipeline. All three outcomes have happened in the past. This page will be revised as methodology evolves; material changes are logged in the site's git history and surfaced in article footnotes when they affect already-published numbers.