Transparency is core to what we do. This page explains how we grade research confidence, how we report effect size separately, and what the grades mean.
Grades describe how confident the research is. They do not tell you whether the intervention fits your life.
Our pipeline ingests from PubMed (the world's largest biomedical research database, maintained by the U.S. National Library of Medicine) and from bioRxiv and medRxiv (leading preprint servers where new research appears before peer review). Domain-specific queries filter for RCTs, meta-analyses, and systematic reviews using MeSH terms and publication type tags. Papers are deduplicated by DOI and PMID before insertion.
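As an illustration of the deduplication step, here is a minimal sketch in Python. The `Paper` record, the `normalize_doi` helper, and the keep-the-first-record policy are assumptions made for this example, not a description of the production code.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Paper:
    """Hypothetical record shape; the real schema is not documented here."""
    title: str
    doi: Optional[str] = None
    pmid: Optional[str] = None


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip a leading resolver URL or 'doi:' prefix."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def deduplicate(papers: Iterable[Paper]) -> list[Paper]:
    """Keep the first record seen for each DOI or PMID."""
    seen_dois: set[str] = set()
    seen_pmids: set[str] = set()
    unique: list[Paper] = []
    for paper in papers:
        doi = normalize_doi(paper.doi) if paper.doi else None
        pmid = paper.pmid.strip() if paper.pmid else None
        if (doi and doi in seen_dois) or (pmid and pmid in seen_pmids):
            continue  # already ingested under one of its identifiers
        if doi:
            seen_dois.add(doi)
        if pmid:
            seen_pmids.add(pmid)
        unique.append(paper)
    return unique
```

Checking both identifiers matters because a preprint usually carries only a DOI, while the same work indexed in PubMed carries a PMID and sometimes a DOI as well. Records with neither identifier are kept as-is in this sketch.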
Every paper is analyzed by our AI system, using one of the most capable large language models available. The AI reads each abstract, extracts specific findings, and links them to the relevant supplements and habits in our database. This means thousands of papers can be analyzed together in ways that would take a human research team months.
From this analysis, we derive everything on the site: confidence grades, effect-size summaries, benefit descriptions, safety notes, and protocol recommendations. Confidence is computed by a 7-factor weighted algorithm informed by GRADE (detailed below) with percentile normalization. Effect size is computed independently and shown on each outcome page. Both recalculate automatically whenever new research is ingested.
The same underlying corpus powers features like semantic search (matching queries by meaning, not just keywords) and the Evidence Assistant, which uses retrieval-augmented generation (RAG) to pull the most relevant research before answering your question. Both are built on Qwen embeddings (4000-dimensional vectors) stored in a PostgreSQL vector database.
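To show what "matching by meaning" looks like in practice, the sketch below ranks documents by cosine similarity between embedding vectors. It uses toy 3-dimensional vectors and plain Python; the function names and the choice of cosine similarity are assumptions for illustration, and a real deployment would typically run the ranking inside the database.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], corpus: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in corpus.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy corpus: each document id maps to a made-up embedding.
corpus = {
    "sleep_study": [1.0, 0.0, 0.0],
    "power_study": [0.0, 1.0, 0.0],
    "mixed_study": [0.7, 0.7, 0.0],
}
nearest = top_k([1.0, 0.1, 0.0], corpus, k=2)
```

In a RAG setup, the documents returned by a step like `top_k` are what gets passed to the language model as context before it answers.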
Every supplement, habit, and food receives a confidence grade based on how much research backs it and how consistent that research is. The grade answers one question: how sure are we that the effect is real? How big the effect is on a specific outcome is reported separately.
Confidence is scored by a weighted algorithm across 7 factors. Scores are percentile-normalized within each category. Our factors are informed by the GRADE framework (Guyatt et al., BMJ 2008), the Cochrane Handbook for Systematic Reviews, and the Oxford Centre for Evidence-Based Medicine levels of evidence.
Study design, risk of bias, replication, sample size, and consistency map directly to GRADE's core assessment criteria. Precision captures statistical rigor. Recency reflects the Cochrane practice of flagging outdated evidence. Effect size is intentionally NOT bundled into this confidence number. It is reported separately because "how big is the effect" and "how sure are we" are orthogonal questions.
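The sketch below shows one way a 7-factor weighted score with percentile normalization could be computed. The weights, the 0-to-1 factor scale, and the percentile definition (share of scores in the category at or below a given score) are all assumptions chosen for illustration; the actual values are not listed on this page.

```python
# Hypothetical weights for the seven factors; they sum to 1.
FACTOR_WEIGHTS = {
    "study_design": 0.25,
    "risk_of_bias": 0.20,
    "replication": 0.15,
    "sample_size": 0.15,
    "consistency": 0.10,
    "precision": 0.10,
    "recency": 0.05,
}


def raw_confidence(factors: dict[str, float]) -> float:
    """Weighted sum of factor scores, each assumed to lie in [0, 1]."""
    return sum(weight * factors[name] for name, weight in FACTOR_WEIGHTS.items())


def percentile_normalize(raw_scores: dict[str, float]) -> dict[str, float]:
    """Map each raw score to the percentage of scores in the category at or below it."""
    values = list(raw_scores.values())
    total = len(values)
    return {
        name: 100.0 * sum(v <= score for v in values) / total
        for name, score in raw_scores.items()
    }


# Made-up raw scores for three entities in one category.
category = {"entity_a": 0.2, "entity_b": 0.5, "entity_c": 0.9}
percentiles = percentile_normalize(category)
```

Note that effect size does not appear among the factors, which matches the separation described above.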
For simplicity, we sometimes present these as 5 factors by combining related pairs:
Confidence answers 'is the effect real'. Effect size answers 'how big is the effect on this specific outcome'. We treat them as independent because they are independent questions. Creatine is a good example: confidence that it has a real effect on power is high, yet the size of that effect depends on the outcome, large for short-sprint power and minimal for VO2 max.
Magnitude bands shown on every outcome page:
Minimal
Small
Medium
Large
Effect size uses the units that matter for each outcome. Examples of what we call 'Large':
| Outcome | Units | Large at |
|---|---|---|
| Sleep latency | minutes to fall asleep | 30+ min faster |
| Blood pressure | mmHg systolic/diastolic | 10+ mmHg lower |
| Muscle mass | kg lean tissue | 2+ kg added |
| VO2 max | ml/kg/min | 6+ ml/kg/min higher |
| Inflammation | % hs-CRP reduction | 50%+ lower |
| Recovery | % faster return to baseline | 50%+ faster |
Effect size is computed per (entity, outcome) from the studies we've ingested. It never contributes to the confidence letter, because rolling them together hides exactly the information you need to decide whether to use something. The bands and units above are the same ones we apply across the site, so a 'Large' on one outcome means the same thing as 'Large' on another within its own unit scale.
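As a small illustration, the 'Large' thresholds from the table can be expressed as a per-outcome lookup. The outcome keys and the `is_large` function are hypothetical names for this sketch, and it covers only the 'Large' band because the cutoffs for Minimal, Small, and Medium are not listed here. Improvements are assumed to be given as positive magnitudes in the beneficial direction.

```python
# Thresholds taken from the table: (minimum improvement, unit description).
LARGE_THRESHOLDS = {
    "sleep_latency": (30.0, "minutes faster to fall asleep"),
    "blood_pressure": (10.0, "mmHg lower"),
    "muscle_mass": (2.0, "kg lean tissue added"),
    "vo2_max": (6.0, "ml/kg/min"),
    "inflammation": (50.0, "% hs-CRP reduction"),
    "recovery": (50.0, "% faster return to baseline"),
}


def is_large(outcome: str, improvement: float) -> bool:
    """True if the improvement meets the 'Large' threshold for this outcome's own units."""
    threshold, _unit = LARGE_THRESHOLDS[outcome]
    return improvement >= threshold


# Effects are keyed per (entity, outcome); the values here are made up.
effects = {
    ("example_supplement", "sleep_latency"): 35.0,
    ("example_supplement", "vo2_max"): 3.0,
}
large_flags = {key: is_large(key[1], value) for key, value in effects.items()}
```

The same entity can be 'Large' on one outcome and not on another, which is why effect size is tracked per pair and not per entity.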
A grade is not a recommendation. A supplement with an S grade has strong research behind it, but that doesn't mean you specifically need it. Your diet, health status, medications, and goals all matter.
A low grade doesn't mean "bad." A B grade often just means "not enough research yet." Many promising supplements start at B and move up as more studies are published. The grade reflects the evidence, not the potential.
Grades change. When new research publishes, grades update. A supplement at A today could move to S or drop to B depending on what future studies find. That's the point. We follow the evidence, not the hype.
Individual evidence claims are aggregated per supplement and habit, producing an overall evidence grade. Protocols are then assembled by combining interventions that target specific goals, with each step showing the evidence grade of its underlying research. Protocol-level grades represent the average strength of evidence across all steps.
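A protocol-level grade computed as an average can be sketched as follows. This assumes each step carries a numeric evidence score on a 0-to-100 scale and that the average is an unweighted mean; both the scale and the function name are assumptions for this example.

```python
from statistics import mean


def protocol_score(step_scores: list[float]) -> float:
    """Unweighted mean of the evidence scores of a protocol's steps."""
    if not step_scores:
        raise ValueError("a protocol needs at least one step")
    return mean(step_scores)


# Made-up scores for a three-step protocol.
example = protocol_score([90.0, 80.0, 40.0])
```

Because this is an average, a single weakly supported step lowers the protocol score without being visible in it, which is why each step also shows its own grade.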
Our pipeline is fully automated from ingestion to scoring. Evidence grades are algorithmically derived with no manual overrides or editorial bias. We perform regular spot-checks across supplements, habits, and protocols as quality control to verify that AI-extracted claims accurately reflect the underlying research. If you spot an issue, contact us at hello@protocolengine.io.
Affiliate commissions never influence evidence grades. We recommend what the science supports, then find the best product for that recommendation. Evidence grades are derived from published research and are never adjusted based on commercial partnerships.