Methodology

Transparency is core to what we do. This page explains how we grade research confidence, how we report effect size separately, and what the grades mean.

Grades describe how confident the research is. They do not tell you whether the intervention fits your life.

How to read our grades

The research pipeline

Our pipeline ingests from PubMed (the biomedical literature database maintained by the U.S. National Library of Medicine) and from bioRxiv and medRxiv (preprint servers where new research appears before peer review). Domain-specific queries filter for RCTs, meta-analyses, and systematic reviews using MeSH terms and publication type tags. Papers are deduplicated by DOI and PMID before insertion.
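As a rough illustration of the deduplication step, here is a minimal sketch. The Paper class, the normalize_doi helper, and the "first record wins" rule are assumptions made for this example, not a description of our production code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Paper:
    # Hypothetical record shape; real ingested records carry more fields.
    title: str
    doi: Optional[str] = None   # e.g. "10.1136/bmj.n71"
    pmid: Optional[str] = None  # PubMed ID; preprints usually have none


def normalize_doi(doi: str) -> str:
    """Lowercase and strip common prefixes so equal DOIs compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def deduplicate(papers: Iterable[Paper]) -> List[Paper]:
    """Keep the first record seen for each DOI or PMID (assumed policy)."""
    seen_dois: Set[str] = set()
    seen_pmids: Set[str] = set()
    unique: List[Paper] = []
    for paper in papers:
        doi = normalize_doi(paper.doi) if paper.doi else None
        pmid = paper.pmid.strip() if paper.pmid else None
        if (doi and doi in seen_dois) or (pmid and pmid in seen_pmids):
            continue  # already ingested under one of its identifiers
        if doi:
            seen_dois.add(doi)
        if pmid:
            seen_pmids.add(pmid)
        unique.append(paper)
    return unique
```

Checking both identifiers matters because the same paper can arrive once from a preprint server with only a DOI and again from PubMed with a PMID.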

Every paper is analyzed by our AI system, using one of the most capable large language models available. The AI reads each abstract, extracts specific findings, and links them to the relevant supplements and habits in our database. This means thousands of papers can be analyzed together in ways that would take a human research team months.

From this analysis, we derive everything on the site: confidence grades, effect-size summaries, benefit descriptions, safety notes, and protocol recommendations. Confidence is computed by a 7-factor weighted algorithm informed by GRADE (detailed below) with percentile normalization. Effect size is computed independently and shown on each outcome page. Both recalculate automatically whenever new research is ingested.
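To make "percentile normalization" concrete, here is a minimal sketch of one common way to do it. The function name, the "at or below" definition of a percentile rank, and the example entities are assumptions for illustration; our exact formula may differ.

```python
from typing import Dict


def percentile_ranks(raw_scores: Dict[str, float]) -> Dict[str, float]:
    """Map each raw score to the percentage of peers scoring at or below it.

    Assumed definition: ties share the same rank, and the top score is 100.
    """
    values = list(raw_scores.values())
    n = len(values)
    ranks: Dict[str, float] = {}
    for name, score in raw_scores.items():
        at_or_below = sum(1 for v in values if v <= score)
        ranks[name] = 100.0 * at_or_below / n
    return ranks


# Hypothetical raw scores for four entities in one category.
category = {"entity_a": 0.2, "entity_b": 0.5, "entity_c": 0.9, "entity_d": 0.5}
print(percentile_ranks(category))
```

Ranking within a category means a score is read relative to comparable entities, not against the whole database.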

The same underlying corpus powers features like semantic search (matching queries by meaning, not just keywords) and the Evidence Assistant, which uses retrieval-augmented generation (RAG) to pull the most relevant research before answering your question. Both are built on Qwen embeddings (4000-dimensional vectors) stored in a PostgreSQL vector database.
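The retrieval step behind semantic search and RAG can be sketched as a nearest-neighbor lookup over embedding vectors. The sketch below uses cosine similarity in plain Python with 3-dimensional toy vectors; the similarity metric, the function names, and the document IDs are assumptions for illustration, and in practice this kind of ranking would run inside the vector database instead of in application code.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    corpus: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (doc_id, similarity) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in corpus]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy corpus: hypothetical IDs with tiny vectors standing in for embeddings.
corpus = [
    ("paper_sleep", [1.0, 0.0, 0.0]),
    ("paper_strength", [0.0, 1.0, 0.0]),
    ("paper_sleep_and_strength", [1.0, 1.0, 0.0]),
]
print(top_k([1.0, 0.0, 0.0], corpus, k=2))
```

In a RAG setup, the abstracts behind the top-ranked IDs would then be passed to the language model as context for its answer.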

Confidence grades: S to E

Every supplement, habit, and food receives a confidence grade based on how much research backs it and how consistent that research is. The grade answers one question: how sure are we that the effect is real? How big the effect is on a specific outcome is reported separately.

S: High confidence

Move with confidence.

Multiple well-designed human studies with consistent, reproducible results
At least one meta-analysis or systematic review with large combined sample size
Clear dose-response relationship established
Mechanism of action well-understood
Strong safety profile with long-term usage data
Benefits observed across diverse populations

A: Solid confidence

Trusted recommendation.

At least one well-designed human study with meaningful effect size
Supporting evidence from additional human trials
Plausible and partially understood mechanism
Generally favorable safety profile
Consistent direction of effect, even if magnitude varies

B: Building confidence

Worth a test run.

Some positive human studies, but often small sample sizes or short duration
Results may be limited to specific populations or conditions
Mechanism is plausible but not fully established
Safety is generally acceptable but long-term data may be limited
More research needed to confirm initial findings

C: Limited confidence

Mixed results. Decide carefully.

Human studies that exist are very small, poorly designed, or conflicting
Mechanism is theoretical or based on extrapolation from related compounds
May have safety concerns or insufficient safety data
Commercially hyped beyond what the evidence supports

D: Preliminary

Lab and animal stage. Wait for human studies.

Supported by animal or in-vitro studies only, with no human trials published yet
Pre-clinical results are promising enough to warrant attention
Mechanism is plausible and demonstrated in model organisms
Safety in humans is unknown or extrapolated from animal data
May become a higher tier as human research emerges

E: Insufficient evidence

Not enough studies yet.

Human studies show no meaningful benefit for the claimed use
Evidence of harm, serious side effects, or unfavorable risk-benefit ratio
Regulatory warnings, bans, or restrictions in multiple countries
Marketing claims significantly exceed the evidence
Better-studied alternatives exist for the same goal

Confidence factors

Confidence is scored by a weighted algorithm across 7 factors. Scores are percentile-normalized within each category. Our factors are informed by the GRADE framework (Guyatt et al., BMJ 2008), the Cochrane Handbook for Systematic Reviews, and the Oxford Centre for Evidence-Based Medicine levels of evidence.

Factor	Weight
Study design	21%
Risk of bias	16%
Replication	21%
Sample size	16%
Consistency	11%
Precision	5%
Recency	10%

Study design, risk of bias, replication, sample size, and consistency map directly to GRADE's core assessment criteria. Precision captures statistical rigor. Recency reflects the Cochrane practice of flagging outdated evidence. Effect size is intentionally not bundled into this confidence number. It is reported separately because 'how big is the effect' and 'how sure are we' are orthogonal questions.
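The weighting itself is a plain weighted sum. Here is a minimal sketch using the seven published weights; the assumption that each factor is first scored on a 0 to 1 scale, along with the function and key names, is ours for illustration only.

```python
from typing import Dict

# Published weights for the seven confidence factors (they sum to 1.0).
WEIGHTS: Dict[str, float] = {
    "study_design": 0.21,
    "risk_of_bias": 0.16,
    "replication": 0.21,
    "sample_size": 0.16,
    "consistency": 0.11,
    "precision": 0.05,
    "recency": 0.10,
}


def confidence_score(factor_scores: Dict[str, float]) -> float:
    """Weighted sum of per-factor scores, each assumed to lie in [0, 1]."""
    missing = set(WEIGHTS) - set(factor_scores)
    if missing:
        raise ValueError(f"missing factor scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= factor_scores[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return sum(WEIGHTS[name] * factor_scores[name] for name in WEIGHTS)


# Hypothetical factor scores for one entity.
example = {
    "study_design": 0.9,
    "risk_of_bias": 0.7,
    "replication": 0.8,
    "sample_size": 0.6,
    "consistency": 0.8,
    "precision": 0.5,
    "recency": 0.4,
}
print(round(confidence_score(example), 3))
```

Note that effect size is deliberately absent from the inputs: nothing about magnitude can move this number.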

For simplicity, we sometimes present these as 5 factors by combining related pairs:

Factor	Weight	Combines
Study quality	37%	Study design + Risk of bias
Replication	21%	
Sample size	16%	
Consistency	11%	
Recency	15%	Recency + Precision
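The 5-factor view is just the 7-factor weights with two pairs added together. A minimal sketch of that arithmetic follows; the dictionary and variable names are illustrative.

```python
# Seven-factor weights in percent, as published above.
SEVEN_FACTOR = {
    "Study design": 21,
    "Risk of bias": 16,
    "Replication": 21,
    "Sample size": 16,
    "Consistency": 11,
    "Precision": 5,
    "Recency": 10,
}

# Which seven-factor entries feed each of the five presented factors.
GROUPS = {
    "Study quality": ("Study design", "Risk of bias"),
    "Replication": ("Replication",),
    "Sample size": ("Sample size",),
    "Consistency": ("Consistency",),
    "Recency": ("Recency", "Precision"),
}

five_factor = {
    group: sum(SEVEN_FACTOR[member] for member in members)
    for group, members in GROUPS.items()
}
print(five_factor)
print(sum(five_factor.values()))
```

Both views total 100%, so the simplified presentation changes only the labels, not the score.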

Effect size, reported separately

Confidence answers 'is the effect real'. Effect size answers 'how big is the effect on this specific outcome'. We treat them as independent because they're independent questions. A creatine effect is large for short-sprint power and minimal for VO2 max, even though the confidence in 'creatine has a real effect on power' is high.

Magnitude bands shown on every outcome page:

Minimal

Small

Medium

Large

Effect size uses the units that matter for each outcome. Examples of what we call 'Large':

Outcome	Units	Large at
Sleep latency	minutes to fall asleep	30+ min faster
Blood pressure	mmHg systolic/diastolic	10+ mmHg lower
Muscle mass	kg lean tissue	2+ kg added
VO2 max	ml/kg/min	6+ ml/kg/min
Inflammation	% hs-CRP reduction	50%+ lower
Recovery	% faster return to baseline	50%+ faster

Effect size is computed per (entity, outcome) from the studies we've ingested. It never contributes to the confidence letter, because rolling them together hides exactly the information you need to decide whether to use something. The bands and units above are the same ones we apply across the site, so a 'Large' on one outcome means the same thing as 'Large' on another within its own unit scale.
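The 'Large' thresholds in the table can be read as a simple lookup. The sketch below checks only whether an improvement reaches 'Large', since the cut-offs for Minimal, Small, and Medium are not listed on this page. The outcome keys and the function name are assumptions for illustration, and each value is expressed as an improvement in the outcome's own units.

```python
from typing import Dict, Tuple

# (threshold, unit) pairs taken from the 'Large at' column above.
LARGE_THRESHOLDS: Dict[str, Tuple[float, str]] = {
    "sleep_latency": (30.0, "min faster"),
    "blood_pressure": (10.0, "mmHg lower"),
    "muscle_mass": (2.0, "kg added"),
    "vo2_max": (6.0, "ml/kg/min"),
    "inflammation": (50.0, "% hs-CRP reduction"),
    "recovery": (50.0, "% faster return to baseline"),
}


def is_large(outcome: str, improvement: float) -> bool:
    """True if the improvement meets the 'Large' threshold for this outcome.

    Raises KeyError for outcomes that are not in the table.
    """
    threshold, _unit = LARGE_THRESHOLDS[outcome]
    return improvement >= threshold


print(is_large("blood_pressure", 12.0))  # 12 mmHg lower
print(is_large("vo2_max", 3.5))          # 3.5 ml/kg/min gained
```

Keeping one threshold per outcome, in that outcome's units, is what lets the same band label be compared across pages.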

What grades don't mean

A grade is not a recommendation. A supplement with an S grade has strong research behind it, but that doesn't mean you specifically need it. Your diet, health status, medications, and goals all matter.

A low grade doesn't mean "bad." A B grade often just means "not enough research yet." Many promising supplements start at B and move up as more studies are published. The grade reflects the evidence, not the potential.

Grades change. When new research publishes, grades update. A supplement at A today could move to S or drop to B depending on what future studies find. That's the point. We follow the evidence, not the hype.

From research to protocol

Individual evidence claims are aggregated per supplement and habit, producing an overall evidence grade. Protocols are then assembled by combining interventions that target specific goals, with each step showing the evidence grade of its underlying research. Protocol-level grades represent the average strength of evidence across all steps.
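Averaging letter grades requires putting them on a numeric scale first. The sketch below shows one way that could work; the point values (S = 5 down to E = 0), the half-up rounding, and the function name are all assumptions for illustration and are not taken from our scoring code.

```python
import math
from typing import Dict, List

# Hypothetical numeric scale for the six letter grades.
GRADE_POINTS: Dict[str, int] = {"S": 5, "A": 4, "B": 3, "C": 2, "D": 1, "E": 0}
POINTS_TO_GRADE: Dict[int, str] = {v: k for k, v in GRADE_POINTS.items()}


def protocol_grade(step_grades: List[str]) -> str:
    """Average the step grades and map the mean back to the nearest letter.

    A mean exactly halfway between two letters rounds to the stronger one
    (an assumed tie-breaking rule).
    """
    if not step_grades:
        raise ValueError("a protocol needs at least one step")
    mean = sum(GRADE_POINTS[g] for g in step_grades) / len(step_grades)
    return POINTS_TO_GRADE[math.floor(mean + 0.5)]


# Hypothetical three-step protocol.
print(protocol_grade(["S", "A", "B"]))
```

Because this is an average, one weakly supported step lowers the protocol grade without erasing the strength of the others, which is why each step also shows its own grade.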

Editorial review

Our pipeline is fully automated from ingestion to scoring. Evidence grades are algorithmically derived with no manual overrides or editorial bias. We perform regular spot-checks across supplements, habits, and protocols as quality control to verify that AI-extracted claims accurately reflect the underlying research. If you spot an issue, contact us at hello@protocolengine.io.

Affiliate policy

Affiliate commissions never influence evidence grades. We recommend what the science supports, then find the best product for that recommendation. Evidence grades are derived from published research and are never adjusted based on commercial partnerships.

References

Guyatt GH, et al. “GRADE: an emerging consensus on rating quality of evidence and strength of recommendations.” BMJ. 2008;336(7650):924-926. doi:10.1136/bmj.39489.470347.AD
Higgins JPT, et al., editors. Cochrane Handbook for Systematic Reviews of Interventions, version 6.4. Cochrane, 2023. training.cochrane.org/handbook
OCEBM Levels of Evidence Working Group. “The Oxford Levels of Evidence 2.” Oxford Centre for Evidence-Based Medicine, 2011. cebm.ox.ac.uk/resources/levels-of-evidence
Page MJ, et al. “The PRISMA 2020 statement: an updated guideline for reporting systematic reviews.” BMJ. 2021;372:n71. doi:10.1136/bmj.n71