# Pipeline

Four-phase workflow from paper collection to statistical analysis.
The pipeline is designed to be reproducible: a fixed random seed is used at the sampling stage, all API calls are logged, and manual coding is interruptible and resumable.
## Phase 0 — Paper Collection
Goal: Assemble a stratified random sample of ~300–400 papers drawn across venues that represent fields with different publication velocities.
Steps:
- Build paper lists per venue (open-access APIs where available; manual collection for paywalled venues)
- Draw a reproducible stratified random sample — `sampling_tracker.py`
- Download PDFs — `batch_download.py` handles open-access sources; a manual download guide is generated for paywalled content
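The seeded sampling step above can be sketched as a per-venue draw (a minimal illustration, not the actual `sampling_tracker.py` logic; the function name and per-venue stratum size are hypothetical):

```python
import random

def stratified_sample(papers_by_venue, per_venue=25, seed=42):
    """Draw a fixed-size random sample from each venue using a fixed seed,
    so repeated runs select exactly the same papers (reproducibility)."""
    rng = random.Random(seed)  # fixed seed -> identical draw every run
    sample = {}
    # sort venues and papers so iteration order never affects the draw
    for venue, papers in sorted(papers_by_venue.items()):
        k = min(per_venue, len(papers))
        sample[venue] = rng.sample(sorted(papers), k)
    return sample

# Hypothetical usage: map venue name -> list of candidate paper IDs
papers = {"VenueA": [f"A{i}" for i in range(100)],
          "VenueB": [f"B{i}" for i in range(40)]}
picked = stratified_sample(papers, per_venue=25, seed=42)
```

Sorting before sampling matters: dict and filesystem orderings are not guaranteed stable across machines, and an unstable input order would defeat the fixed seed.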
Outputs:
- `data/output/sample_tracking.csv` — master file with download status for every sampled paper
- `data/pdfs/<venue>/` — PDFs organised by venue
Quick start:
```bash
python phase_0_collection/QUICKSTART_COLLECTION.py
```

## Phase 1 — Automated Processing
Goal: Extract citations from every PDF, verify them, and produce a pre-populated coding template.
For each PDF:
- Extract plain text (PyPDF2)
- Locate and parse the reference list
- Query each citation against CrossRef — low relevance scores flag potential hallucinations
- Score surrounding context with GPTZero — high AI-probability reinforces the flag
- Write results to output CSVs
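The two automated signals can be combined into a single review flag. A minimal sketch with illustrative thresholds (the function name and cutoff values are assumptions, not the project's calibrated settings):

```python
def flag_citation(crossref_score, ai_probability,
                  score_threshold=60.0, ai_threshold=0.7):
    """Combine the two automated signals into a triage label.

    crossref_score: relevance score of the best CrossRef match
                    (low = the cited work may not exist).
    ai_probability: GPTZero AI-probability of the surrounding context
                    (high = the passage reads as machine-generated).
    Threshold values are illustrative only.
    """
    weak_match = crossref_score < score_threshold
    ai_context = ai_probability >= ai_threshold
    if weak_match and ai_context:
        return "flagged"   # both signals agree: likely hallucination
    if weak_match or ai_context:
        return "review"    # one signal fired: inspect manually
    return "clear"
```

Keeping the two signals separate until this final step makes it easy to re-tune either threshold without re-running the API queries.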
Usage:
```python
from pathlib import Path

from phase_1_automated.phase1_automated_processing import process_batch

process_batch(
    pdf_directory=Path("data/pdfs/venue_name"),
    domain="FieldName",
    venue="Venue Year",
    year=2024,
)
```

Outputs:
- `data/output/paper_metadata.csv`
- `data/output/citations_extracted.csv`
- `data/output/hallucination_coding.csv` ← pre-populated template for Phase 2
Token budget: approximately 2,500 GPTZero tokens per paper.
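Since the budget scales linearly with sample size, the total for the Phase 0 sample is straightforward to estimate (a back-of-the-envelope sketch using the figures above; the helper name is made up):

```python
TOKENS_PER_PAPER = 2_500  # approximate GPTZero usage per paper (from above)

def gptzero_budget(n_papers):
    """Total GPTZero tokens needed for a given sample size."""
    return n_papers * TOKENS_PER_PAPER

# For the ~300-400 paper sample: 750,000 to 1,000,000 tokens
low, high = gptzero_budget(300), gptzero_budget(400)
```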
## Phase 2 — Manual Coding
Goal: Apply the expert coding scheme to flagged citations and to the key authors of each affected paper.
Two coding tasks run in parallel:
### 2B — Citation Characteristics (~5 min / citation)
For each flagged citation:
- Classify Citation_Domain — standardised domain list (see Codebook)
- Assign Citation_Role — Background / Methods / Related Work / Empirical / Theoretical
- Distance_from_Paper and Recency_Category are auto-calculated
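The auto-calculated `Recency_Category` might look like the following (a sketch only: the bucket names and boundaries here are hypothetical, and the Codebook defines the real categories):

```python
def recency_category(paper_year, citation_year):
    """Bucket a citation by its age relative to the citing paper.
    Bucket boundaries are illustrative, not the Codebook's definitions."""
    if citation_year is None or citation_year > paper_year:
        return "unknown"        # missing or implausible citation year
    age = paper_year - citation_year
    if age <= 2:
        return "recent"
    if age <= 10:
        return "established"
    return "classic"
```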
Interactive interface:
```python
from phase_2_coding.phase2_manual_coding import interactive_coding_session

interactive_coding_session()
```

Estimated total effort: ~48–58 hours assuming a ~30% hallucination rate and ~1.5 flagged citations per affected paper.
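The interruptible-and-resumable property mentioned in the introduction can be achieved by skipping rows that already carry a code. A minimal sketch (the function and column names are hypothetical placeholders, not the actual interface):

```python
import csv

def pending_rows(coding_csv, code_column="Citation_Role"):
    """Yield rows from the coding template that have not been coded yet,
    so an interrupted session resumes where it left off.
    Column names here are illustrative placeholders."""
    with open(coding_csv, newline="") as fh:
        for row in csv.DictReader(fh):
            if not row.get(code_column, "").strip():
                yield row  # blank code -> still to be coded
```

Because each completed code is written back to the CSV, the file itself is the checkpoint; no separate session state is needed.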
## Phase 3 — Statistical Analysis
Goal: Test the three pre-registered research questions and generate visualisations.
| RQ | Test | Variables |
|---|---|---|
| RQ1 (Expertise) | Logistic regression / chi-square | Expertise_Match × Is_Hallucinated |
| RQ2 (Velocity) | ANOVA / Kruskal-Wallis | Domain bucket × hallucination rate |
| RQ3 (Location) | Chi-square | Section_Location × Is_Hallucinated |
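For intuition, the RQ3 chi-square reduces to comparing observed cell counts in a Section_Location × Is_Hallucinated table against independence. A stdlib-only sketch of the statistic (the real analysis presumably uses a stats library, which also supplies the p-value):

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table
    (rows: Section_Location, columns: Is_Hallucinated yes/no).
    Returns only the statistic; df = (rows - 1) * (cols - 1)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # expected count under independence of row and column factors
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat
```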
Usage:
```python
from phase_3_analysis.phase3_analysis import HallucinationAnalyzer

HallucinationAnalyzer().run_full_analysis()
```

Outputs: test statistics printed to console; figures saved to `data/output/figures/`:

- `hallucinations_by_domain.png`
- `expertise_distance.png`
- `recency_distribution.png`
- `section_location.png`
## Reliability
After completing the full coding pass, re-code a random 10% subsample (blind to original codes) to assess intra-rater reliability. Report Cohen’s κ for Expertise_Match and Citation_Role in the final write-up.
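Cohen's κ for the re-code can be computed directly from the two code sequences. A self-contained sketch (the function name is made up; a stats library would serve equally well):

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa between two coding passes over the same items
    (here: original codes vs the blind 10% re-code)."""
    assert len(codes_a) == len(codes_b) and codes_a
    n = len(codes_a)
    # observed agreement: fraction of items coded identically
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # expected agreement under chance, from each pass's label frequencies
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[lbl] * freq_b[lbl]
                   for lbl in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # single label used throughout, perfect agreement
    return (observed - expected) / (1 - expected)
```

Compute it once per variable (`Expertise_Match`, `Citation_Role`) rather than pooling, since the two variables have different label sets and chance-agreement baselines.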