Reliability Guide

AI4Meta treats reliability as a shared quality-check layer across screening, extraction, and model comparisons. The platform follows Feng (2015) by matching the reliability index to the structure of the variable rather than forcing a single metric everywhere.

Need the wider analysis overview? Visit the Analysis Guide.

Reliability modes in AI4Meta

  • Screening reliability: agreement on the same include/exclude decisions.
  • Extraction reliability: variable-level agreement for extracted study data.
  • Model reliability: human-AI or model-vs-model agreement on the same task.

Feng (2015) decision tree

What kind of variable are you checking? (A code sketch of this decision logic follows the tree.)

1. Binary or nominal category?
   -> Use a categorical chance-corrected agreement index.

2. Ordered category?
   -> Use an ordinal reliability index.

3. Continuous or interval-scale value?
   -> Use an ICC-style reliability index.

4. Mixed extraction table?
   -> Choose the recommended index per variable,
      not one global metric for every field.
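
The tree above amounts to a small dispatch on the variable's level of measurement. Below is a minimal Python sketch of that logic; the function name pick_index and the level labels are illustrative, not an AI4Meta API.

  def pick_index(level: str) -> str:
      """Map a variable's level of measurement to the Feng-aligned index family."""
      recommended = {
          "binary": "categorical chance-corrected agreement (e.g. Cohen's kappa)",
          "nominal": "categorical chance-corrected agreement (e.g. Cohen's kappa)",
          "ordinal": "ordinal reliability index (e.g. weighted kappa)",
          "continuous": "ICC-style reliability index",
      }
      try:
          return recommended[level]
      except KeyError:
          raise ValueError(f"unknown level of measurement: {level!r}")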

Workflow guidance

Screening

  • Typical data shape: binary include / exclude decisions.
  • Best use: confirm criteria are being applied consistently before scaling up screening.
  • Inspect both the summary score and the disagreement queue (sketched below).
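
A minimal sketch of a screening check, assuming two coders' include/exclude decisions on the same overlap sample are already in hand as parallel lists. It uses scikit-learn's cohen_kappa_score as one common categorical chance-corrected index; the data and variable names are hypothetical.

  from sklearn.metrics import cohen_kappa_score

  # Hypothetical overlap sample: 1 = include, 0 = exclude.
  coder_a = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
  coder_b = [1, 0, 1, 1, 1, 0, 1, 0, 0, 0]

  print(f"screening kappa: {cohen_kappa_score(coder_a, coder_b):.2f}")

  # The disagreement queue matters as much as the summary score.
  queue = [i for i, (a, b) in enumerate(zip(coder_a, coder_b)) if a != b]
  print(f"items to reconcile: {queue}")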

Extraction

  • Typical data shape: many variables at different levels of measurement.
  • Feng-aligned rule: report one recommended metric per variable (see the loop sketched below).
  • Inspect variable, level, recommended index, value, interpretation, and disagreement count.
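
One way to express the per-variable rule is a loop that picks the index from each variable's level and emits one row per variable. Everything below (variable names, levels, data) is a hypothetical stand-in for a double-coded extraction table; a continuous variable would be scored with an ICC instead, as in the sketch under Model reliability.

  from sklearn.metrics import cohen_kappa_score

  # Hypothetical double-coded extraction table: level, coder A, coder B.
  variables = {
      "study_design": ("nominal", ["RCT", "cohort", "RCT"], ["RCT", "RCT", "RCT"]),
      "rob_rating": ("ordinal", [1, 2, 3, 2], [1, 3, 3, 2]),
  }

  for name, (level, a, b) in variables.items():
      # Weighted kappa for ordered categories, plain kappa for nominal ones.
      weights = "quadratic" if level == "ordinal" else None
      value = cohen_kappa_score(a, b, weights=weights)
      n_disagree = sum(x != y for x, y in zip(a, b))
      print(f"{name}: level={level}, value={value:.2f}, disagreements={n_disagree}")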

Model reliability

  • Treat the model as another coder on the same items.
  • Keep the same Feng logic: categorical → categorical metric; ordinal → ordinal metric; numeric → ICC-style metric (sketched below).
  • Use agreement as a calibration signal, not as proof that the output is correct.
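
For the numeric case, the model's values can be stacked next to the human's as an items-by-raters matrix and scored with an ICC. The sketch below computes ICC(2,1) (two-way random effects, absolute agreement, single rater) from the standard ANOVA decomposition; the mean-age data are hypothetical, and a package such as pingouin (intraclass_corr) can be used instead of hand-rolling the formula.

  import numpy as np

  def icc2_1(ratings: np.ndarray) -> float:
      """ICC(2,1) for an (n_items, n_raters) matrix with no missing cells."""
      n, k = ratings.shape
      grand = ratings.mean()
      # Two-way ANOVA sums of squares without replication.
      ss_total = ((ratings - grand) ** 2).sum()
      ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()  # items
      ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()  # raters
      ms_rows = ss_rows / (n - 1)
      ms_cols = ss_cols / (k - 1)
      ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
      return (ms_rows - ms_error) / (
          ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
      )

  # Hypothetical mean-age extractions: human coder vs. model, same studies.
  human = [52.1, 47.3, 60.0, 35.5, 41.2]
  model = [52.0, 48.1, 59.5, 36.0, 40.8]
  print(f"ICC(2,1): {icc2_1(np.column_stack([human, model])):.2f}")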

Worked examples

  • Title/abstract screening: binary nominal decisions → categorical agreement metric.
  • Study design extraction: nominal categories like RCT/cohort → categorical agreement metric.
  • Risk of bias rating: ordered categories → ordinal reliability metric.
  • Mean age extraction: continuous numeric value → ICC-style reliability metric.
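
Run through pick_index from the decision-tree sketch above, the four examples reduce to the following (level labels are the same illustrative ones used there):

  for variable, level in [
      ("title/abstract decision", "binary"),
      ("study_design", "nominal"),
      ("rob_rating", "ordinal"),
      ("mean_age", "continuous"),
  ]:
      print(f"{variable} -> {pick_index(level)}")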

What to report

  • Which workflow was checked.
  • How the overlap sample was defined.
  • How many coders/models and items were involved.
  • The variable-specific metric choice following Feng (2015).
  • The observed reliability values and how disagreements were resolved.
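
Captured as data, one report per check might look like the record below; the field names and values are a hypothetical template mirroring this checklist, not an AI4Meta export format.

  report = {
      "workflow": "extraction",
      "overlap_sample": "random double-coded subset of included studies",
      "coders": ["human_1", "model_a"],
      "n_items": 48,
      "metrics": [  # one Feng-aligned entry per variable (values illustrative)
          {"variable": "study_design", "level": "nominal",
           "index": "Cohen's kappa", "value": 0.81},
          {"variable": "mean_age", "level": "continuous",
           "index": "ICC(2,1)", "value": 0.93},
      ],
      "resolution": "disagreements reconciled by a third reviewer",
  }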

Reference

Feng, G. C. (2015). Mistakes and how to avoid mistakes in using intercoder reliability indices. Methodology, 11(1), 13–22.