rtransparency automatically identifies and extracts indicators of research transparency from the full text of biomedical articles, in both PubMed Central (PMC) JATS XML and plain-text (PDF-derived) form. Every prediction comes with the exact statement that triggered it, so results are auditable rather than a black box. Detection is rule-based (curated regular expressions over the relevant article sections), self-contained (no GitHub-only or AGPL dependencies), and ships with reproducible accuracy benchmarks.

The ten indicators

| Indicator | Detects | XML function | Text function |
|---|---|---|---|
| Conflicts of interest | A COI disclosure is present (including “no competing interests”) | rt_coi_pmc | rt_coi |
| Funding | A statement that funding was received | rt_fund_pmc | rt_fund |
| Protocol registration | A trial/protocol registration identifier or statement (NCT, ISRCTN, PROSPERO, OSF, CHiCTR, DRKS, ANZCTR, IRCT, UMIN, …) | rt_register_pmc | rt_register |
| Novelty | The article claims its own work is novel or first | rt_novelty_pmc | rt_novelty |
| Replication | A replication or external/independent validation was performed | rt_replication_pmc | rt_replication |
| Data sharing | The authors’ own data are made available (repository, accession, or in-article) | rt_data_code_pmc | rt_data_code |
| Code sharing | The authors’ own analysis code is shared | rt_data_code_pmc | rt_data_code |
| AI disclosure | A statement discloses generative-AI use in manuscript preparation (2023+) | rt_ai_pmc | rt_ai |
| Open-access license | The article is openly licensed, and which license (CC-BY, CC-BY-NC-ND, CC0, …) | rt_oa_pmc | rt_oa |
| Reporting guideline | The authors followed a reporting guideline, and which (CONSORT, PRISMA, STROBE, ARRIVE, …) | rt_reporting_pmc | rt_reporting |

Conflicts of interest and AI disclosure are disclosure-based: a statement on the topic counts whether the disclosure is positive or negative. Conflict-of-interest and funding statements are detected not only in English but also in Spanish, Portuguese, French, German and Italian.
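To make "disclosure-based" concrete, here is a toy base-R sketch of a rule-based detector that returns both a prediction and the statement that triggered it. The function name detect_coi_toy and its single pattern are hypothetical and far simpler than the package's curated expressions; only the output names is_coi_pred and coi_text mirror columns shown elsewhere in this README.

```r
# Toy illustration only: NOT the package's detector or its patterns.
detect_coi_toy <- function(sentences) {
  # Matches e.g. "conflict of interest", "conflicts of interest",
  # "competing interests" (case-insensitive).
  pattern <- "(conflicts?|competing) (of )?interests?"
  hit <- grepl(pattern, sentences, ignore.case = TRUE)
  list(
    is_coi_pred = any(hit),       # TRUE if any sentence is on the topic
    coi_text    = sentences[hit]  # the statement(s) that triggered it
  )
}

negative <- detect_coi_toy(c(
  "We thank the participants.",
  "The authors declare no competing interests."
))
positive <- detect_coi_toy(
  "Conflict of interest: JS received consulting fees from a sponsor."
)

negative$is_coi_pred  # a negative disclosure still counts as a disclosure
positive$is_coi_pred
negative$coi_text
```

Both calls flag a disclosure, because the toy rule keys on the topic of the statement rather than on whether a conflict is reported.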

Installation

# From CRAN (when available)
install.packages("rtransparency")

# Development version from GitHub
# install.packages("remotes")
remotes::install_github("choxos/rtransparency", build_vignettes = TRUE)

No GitHub-only or AGPL dependencies are required; data and code detection is native (it no longer wraps oddpub). rt_read_pdf() (PDF to text) additionally needs the poppler pdftotext utility on your system. The optional furrr and future packages enable parallel corpus processing; ggplot2 enables plotting.

Quick start: all ten indicators in one call

library(rtransparency)

xml <- system.file("extdata", "PMID32171256-PMC7071725.xml", package = "rtransparency")

res <- rt_all_pmc(xml, remove_ns = TRUE)

# The predictions, one column per indicator:
res[, c("is_coi_pred", "is_fund_pred", "is_register_pred", "is_novelty_pred",
        "is_replication_pred", "is_open_data", "is_open_code", "is_ai_pred",
        "is_open_access", "is_reporting_pred")]

# Each prediction is paired with the text/value that triggered it, e.g.:
res$coi_text
res$open_data_statements
res$oa_license            # e.g. "CC-BY-4.0"
res$reporting_guideline   # e.g. "PRISMA"

rt_all_pmc() returns one row with the ten predictions, the extracted statement for each, article identifiers and metadata, the year, and is_success. is_ai_pred is NA for articles published before 2023.

Per-indicator functions

Each indicator can be run on its own, for a PMC XML file or a plain-text file:

rt_coi_pmc(xml, remove_ns = TRUE)        # conflicts of interest
rt_fund_pmc(xml, remove_ns = TRUE)       # funding
rt_register_pmc(xml, remove_ns = TRUE)   # protocol registration
rt_novelty_pmc(xml, remove_ns = TRUE)    # novelty claims
rt_replication_pmc(xml, remove_ns = TRUE) # replication / external validation
rt_data_code_pmc(xml, remove_ns = TRUE)  # data AND code sharing (+ extracted links)
rt_ai_pmc(xml, remove_ns = TRUE)         # generative-AI-use disclosure (2023+)
rt_oa_pmc(xml, remove_ns = TRUE)         # open-access status + license
rt_reporting_pmc(xml, remove_ns = TRUE)  # reporting-guideline use + which one
rt_meta_pmc(xml, remove_ns = TRUE)       # article metadata

Corpus-scale processing

rt_all_pmc_dir() runs all ten indicators over an entire directory (or a vector of paths). It is built for large corpora:

res <- rt_all_pmc_dir(
  "path/to/xml",          # a directory, or a character vector of file paths
  remove_ns = TRUE,
  output    = "results.csv",  # resumable: re-running skips files already recorded
  parallel  = TRUE,           # via furrr + an active future::plan()
  progress  = TRUE
)
  • Resumable: with output, results are written to a CSV in chunks; a re-run skips files already recorded and appends only the new ones.
  • Failure-isolated: a malformed file yields an is_success = FALSE row instead of aborting the run.
  • Parallel: set future::plan("multisession") and parallel = TRUE.
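The resumable and failure-isolated behaviour described above can be sketched in base R. This is an illustration of the general pattern, not the package's implementation: run_resumable and process_one are hypothetical names, and the sketch writes one row per file rather than in chunks.

```r
# Sketch of a resumable, failure-isolated batch loop (hypothetical helper).
run_resumable <- function(paths, output, process_one) {
  # Files already recorded in the CSV are skipped on a re-run.
  done <- if (file.exists(output)) {
    read.csv(output, stringsAsFactors = FALSE)$file
  } else {
    character(0)
  }
  for (path in setdiff(paths, done)) {
    # A failing file becomes an is_success = FALSE row instead of an error.
    row <- tryCatch(
      data.frame(file = path, value = process_one(path), is_success = TRUE),
      error = function(e) data.frame(file = path, value = NA, is_success = FALSE)
    )
    first_write <- !file.exists(output)
    # Write the header only once, then append.
    write.table(row, output, sep = ",", row.names = FALSE,
                col.names = first_write, append = !first_write,
                qmethod = "double")
  }
  read.csv(output, stringsAsFactors = FALSE)
}

# Usage with a stand-in worker that fails on one made-up file name.
out <- tempfile(fileext = ".csv")
calls <- 0
worker <- function(p) {
  calls <<- calls + 1
  if (p == "bad.xml") stop("malformed file")
  nchar(p)
}
first  <- run_resumable(c("a.xml", "bad.xml"), out, worker)
second <- run_resumable(c("a.xml", "bad.xml", "c.xml"), out, worker)
```

In this sketch a failed file is also recorded, so a re-run does not retry it; delete its row from the CSV to force a retry.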

Plain-text input

The same detectors run on plain-text (PDF-derived) articles. rt_read_pdf() returns the extracted text as a character string; write it to a .txt file, then point the text detectors (which share the PMC detection logic) at that file:

article_txt <- rt_read_pdf("article.pdf")   # needs poppler's pdftotext; returns text
writeLines(article_txt, "article.txt")      # the detectors take a file path

rt_all("article.txt")                       # COI, funding, registration, novelty, replication
rt_coi("article.txt")                       # or one indicator at a time
rt_ai("article.txt")                        # generative-AI-use disclosure

rt_ai() is the plain-text counterpart of rt_ai_pmc(). A text file carries no reliable publication date, so rt_ai() applies no 2023 year gate (it returns TRUE/FALSE, never NA). It also cannot confine the scan to back-matter sections. Restrict its use to articles published in 2023 or later, and expect a slightly higher false-positive rate on papers that use AI as a research method.

Summarizing a corpus

Once you have one row per article, summarize the corpus:

data(rt_demo)            # a small simulated example shipped with the package

rt_summary(rt_demo)      # per-indicator prevalence with a Wilson confidence
                         # interval and a sensitivity/specificity-corrected
                         # (Rogan-Gladen) prevalence

rt_summary(rt_demo, by = "year")   # subgroup summaries

rt_score(rt_demo)        # add a per-article count of openness practices met

rt_plot(rt_demo)                                  # prevalence bar chart
rt_plot(rt_demo, type = "trend", year = "year")   # prevalence over time
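For readers unfamiliar with the Wilson interval named in the rt_summary() comment, the following base-R sketch computes it for a single proportion. The helper name wilson_ci is hypothetical, and this is the textbook formula rather than code taken from the package.

```r
# Wilson score interval for x successes out of n (hypothetical helper).
wilson_ci <- function(x, n, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  p <- x / n
  denom  <- 1 + z^2 / n
  centre <- (p + z^2 / (2 * n)) / denom
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
  c(lower = centre - half, upper = centre + half)
}

# Made-up example: 30 of 100 articles flagged for an indicator.
ci <- wilson_ci(30, 100)
ci

# Cross-check against base R: prop.test() without continuity correction
# returns the same Wilson score interval.
ref <- as.numeric(prop.test(30, 100, correct = FALSE)$conf.int)
```

Unlike the simple normal-approximation interval, the Wilson interval stays within [0, 1] and behaves sensibly for rare indicators.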

The accuracy correction uses the bundled rt_accuracy table (detector sensitivity and specificity for eight indicators; open-access licensing and AI-use disclosure are reported uncorrected). Supply your own estimates:

rt_accuracy                              # the bundled estimates
my_acc <- data.frame(variable = "is_open_data", sensitivity = 0.84, specificity = 0.97)
rt_summary(rt_demo, accuracy = my_acc)   # correct with your own values
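The Rogan-Gladen correction itself is a one-line formula: corrected = (apparent + specificity - 1) / (sensitivity + specificity - 1). The sketch below is a minimal base-R version under stated assumptions: rogan_gladen is a hypothetical helper, and truncating the result to [0, 1] is a common convention assumed here, not something this README states about rt_summary().

```r
# Rogan-Gladen corrected prevalence (hypothetical helper, base R only).
rogan_gladen <- function(apparent, sensitivity, specificity) {
  denom <- sensitivity + specificity - 1
  if (denom <= 0) {
    stop("sensitivity + specificity must exceed 1 for the correction to apply")
  }
  corrected <- (apparent + specificity - 1) / denom
  # Assumption: truncate to the valid range for a proportion.
  min(max(corrected, 0), 1)
}

# Made-up apparent prevalence of 20%, with the example accuracy values
# used for is_open_data above (sensitivity 0.84, specificity 0.97).
rg <- rogan_gladen(0.20, sensitivity = 0.84, specificity = 0.97)
rg

# An apparent prevalence below the false-positive rate truncates to 0.
rg_low <- rogan_gladen(0.01, sensitivity = 0.84, specificity = 0.97)
```

The correction is sensitive to the accuracy estimates when sensitivity + specificity is close to 1, which is one reason to supply values measured on a sample similar to your own corpus.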

Linking to FAIR assessment

The data- and code-availability links the detector extracts (open_data_links, open_code_links) can be passed to FAIR-assessment tooling such as rfair to score the findability and accessibility of the shared resources.

Validation

The detectors are benchmarked against the human-labeled XML benchmark of Serghiou et al. (2021). The benchmark is reproducible under data-raw/benchmark/, and the results are in inst/benchmark/:

| Indicator | Sensitivity | Specificity |
|---|---|---|
| Conflicts of interest | 94.0% | 100% |
| Funding | 100% | 95.7% |
| Protocol registration | 99.2% | 96.9% |
| Data sharing | 76.5% | 99.0% |
| Code sharing | 88.1% | 99.5% |

Registration and code in the Serghiou benchmark table above are labeled independently of the detector; COI, funding and data labels in the 1000-article 2023 sample were reconciled against detector-extracted statements (detector-adjudicated), so their agreement is not a fully independent estimate. Data sharing is deliberately precision-favoring: its 76.5% sensitivity trades recall for 99.0% specificity (the original oddpub algorithm scores about 84%/97% on this set).
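As a reminder of how the two columns are defined, this base-R sketch computes sensitivity and specificity from detector predictions and human labels. The helper name sens_spec and the toy vectors are made up for illustration; the actual benchmark scripts live under data-raw/benchmark/ and may be organised differently.

```r
# Sensitivity and specificity from logical predictions and gold labels
# (hypothetical helper; toy data, not benchmark results).
sens_spec <- function(pred, truth) {
  stopifnot(length(pred) == length(truth))
  tp <- sum(pred & truth)    # detector fired, label positive
  fn <- sum(!pred & truth)   # detector missed a labeled positive
  tn <- sum(!pred & !truth)  # detector silent, label negative
  fp <- sum(pred & !truth)   # detector fired on a labeled negative
  c(sensitivity = tp / (tp + fn), specificity = tn / (tn + fp))
}

truth <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
pred  <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)
acc <- sens_spec(pred, truth)
acc
```

A precision-favoring detector, as described for data sharing, accepts more false negatives (lower sensitivity) in exchange for fewer false positives (higher specificity).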

The newer indicators are validated against maintainer-built, hand-labeled benchmarks in inst/benchmark/:

| Indicator | Sensitivity | Specificity | Basis |
|---|---|---|---|
| Novelty | 83.8% | 95.2% | hand-labeled novelty/replication gold set |
| Replication | 92.8% | 98.5% | replication-enriched sample (111 positives); correction is approximate |
| AI-use disclosure | not accuracy-corrected | | experimental; only 9 positives in the 2023 sample |
| Open-access license | 100% | not estimable | structured <license> extraction; license-type exact match 99.8%; specificity rests on 1 negative in the OA subset, so it is reported uncorrected |
| Reporting guideline | 93.8% | 99.0% | 1000-article 2023 sample hand-labeled (65 positives) |

Replication’s correction mixes designs (sensitivity from the enriched sample, specificity from the representative 2023 sample), so it is less clean than the single-design corrections above. AI-use disclosure is reported uncorrected and is excluded from rt_accuracy until a larger labeled post-2022 sample exists. Two further benchmarks live in inst/benchmark/: a five-language sample for multilingual COI and funding, and a TXT-parity benchmark comparing the text and XML detectors.

See vignette("rtransparency") for the methodology and vignette("scope-and-limitations") for what each indicator does and does not capture.

Documentation

Lineage and citation

This package is an enhanced, renamed fork of the original rtransparent tool by Stylianos (Stelios) Serghiou. The fork is maintained by Ahmad Sofi-Mahmudi (ORCID 0000-0001-6829-0823, GitHub @choxos). It adds four indicators (novelty, replication, AI disclosure, and a natively re-implemented data/code detector), multilingual COI and funding detection, plain-text parity, and corpus-scale batch processing. Serghiou is credited as an author.

The foundational paper: Serghiou et al., Assessment of transparency indicators across the biomedical literature: How open is open? PLOS Biology, 2021, doi:10.1371/journal.pbio.3001107. Run citation("rtransparency") for both references.

Use of AI

Parts of this package were developed with the assistance of generative AI (Anthropic’s Claude, via Claude Code), including code, tests, documentation, and benchmark tooling. All AI-assisted output was reviewed, run, and validated by the maintainer, who is responsible for the final content. This mirrors the kind of disclosure the package itself is built to detect.

Getting help

Please file bugs or questions as issues at https://github.com/choxos/rtransparency/issues with a minimal reproducible example.