rtransparency automatically identifies and extracts indicators of research transparency from the full text of biomedical articles, in both PubMed Central (PMC) JATS XML and plain-text (PDF-derived) form. Every prediction comes with the exact statement that triggered it, so results are auditable rather than a black box. Detection is rule-based (curated regular expressions over the relevant article sections), self-contained (no GitHub-only or AGPL dependencies), and ships with reproducible accuracy benchmarks.
The ten indicators
| Indicator | Detects | XML function | Text function |
|---|---|---|---|
| Conflicts of interest | A COI disclosure is present (including “no competing interests”) | rt_coi_pmc | rt_coi |
| Funding | A statement that funding was received | rt_fund_pmc | rt_fund |
| Protocol registration | A trial/protocol registration identifier or statement (NCT, ISRCTN, PROSPERO, OSF, CHiCTR, DRKS, ANZCTR, IRCT, UMIN, …) | rt_register_pmc | rt_register |
| Novelty | The article claims its own work is novel or first | rt_novelty_pmc | rt_novelty |
| Replication | A replication or external/independent validation was performed | rt_replication_pmc | rt_replication |
| Data sharing | The authors’ own data are made available (repository, accession, or in-article) | rt_data_code_pmc | rt_data_code |
| Code sharing | The authors’ own analysis code is shared | rt_data_code_pmc | rt_data_code |
| AI disclosure | A statement discloses generative-AI use in manuscript preparation (2023+) | rt_ai_pmc | rt_ai |
| Open-access license | The article is openly licensed, and which license (CC-BY, CC-BY-NC-ND, CC0, …) | rt_oa_pmc | rt_oa |
| Reporting guideline | The authors followed a reporting guideline, and which (CONSORT, PRISMA, STROBE, ARRIVE, …) | rt_reporting_pmc | rt_reporting |
Conflicts of interest and AI disclosure are disclosure-based: a statement on the topic counts whether the disclosure is positive or negative. Conflict-of-interest and funding statements are detected not only in English but also in Spanish, Portuguese, French, German and Italian.
Installation
# From CRAN (when available)
install.packages("rtransparency")
# Development version from GitHub
# install.packages("remotes")
remotes::install_github("choxos/rtransparency", build_vignettes = TRUE)
No GitHub-only or AGPL dependencies are required; data and code detection is native (it no longer wraps oddpub). rt_read_pdf() (PDF to text) additionally needs the poppler pdftotext utility on your system. The optional furrr and future packages enable parallel corpus processing; ggplot2 enables plotting.
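Before converting PDFs or enabling parallel runs, it can help to check which optional pieces are present. The following is a minimal base-R sketch; has_pdftotext, optional_pkgs and installed are illustrative names, not part of the package.

```r
# Is poppler's pdftotext on the PATH? Sys.which() returns "" when not found.
has_pdftotext <- unname(nzchar(Sys.which("pdftotext")))

# Which optional packages are installed? requireNamespace() returns TRUE/FALSE
# without attaching the package to the search path.
optional_pkgs <- c("furrr", "future", "ggplot2")
installed <- vapply(optional_pkgs, requireNamespace, logical(1), quietly = TRUE)

print(has_pdftotext)
print(installed)
```

If has_pdftotext is FALSE, rt_read_pdf() will not be usable until poppler is installed; the XML and plain-text detectors do not depend on it.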
Quick start: all ten indicators in one call
library(rtransparency)
xml <- system.file("extdata", "PMID32171256-PMC7071725.xml", package = "rtransparency")
res <- rt_all_pmc(xml, remove_ns = TRUE)
# The predictions, one column per indicator:
res[, c("is_coi_pred", "is_fund_pred", "is_register_pred", "is_novelty_pred",
"is_replication_pred", "is_open_data", "is_open_code", "is_ai_pred",
"is_open_access", "is_reporting_pred")]
# Each prediction is paired with the text/value that triggered it, e.g.:
res$coi_text
res$open_data_statements
res$oa_license # e.g. "CC-BY-4.0"
res$reporting_guideline # e.g. "PRISMA"
rt_all_pmc() returns one row with the ten predictions, the extracted statement for each, article identifiers and metadata, the year, and is_success. is_ai_pred is NA for articles published before 2023.
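Because is_ai_pred is NA before 2023, a prevalence computed over a mixed-year set should use only the eligible articles as its denominator. The sketch below shows this in base R. The data frame is hand-built with made-up values standing in for several stacked rt_all_pmc() rows, and the year column name is an assumption.

```r
# Hypothetical stand-in for several stacked rt_all_pmc() rows; values are made up.
res_many <- data.frame(
  year       = c(2021, 2022, 2023, 2024, 2024),
  is_ai_pred = c(NA, NA, FALSE, TRUE, FALSE)
)

# Articles the AI indicator applies to (prediction is not NA).
eligible <- !is.na(res_many$is_ai_pred)

n_eligible <- sum(eligible)                           # denominator: eligible articles only
n_positive <- sum(res_many$is_ai_pred, na.rm = TRUE)  # NA rows are ignored
prevalence <- n_positive / n_eligible

cat(n_positive, "of", n_eligible, "eligible articles disclose AI use\n")
```

Dividing by nrow(res_many) instead would understate the prevalence by counting articles the indicator never evaluated.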
Per-indicator functions
Each indicator can be run on its own, for a PMC XML file or a plain-text file:
rt_coi_pmc(xml, remove_ns = TRUE) # conflicts of interest
rt_fund_pmc(xml, remove_ns = TRUE) # funding
rt_register_pmc(xml, remove_ns = TRUE) # protocol registration
rt_novelty_pmc(xml, remove_ns = TRUE) # novelty claims
rt_replication_pmc(xml, remove_ns = TRUE) # replication / external validation
rt_data_code_pmc(xml, remove_ns = TRUE) # data AND code sharing (+ extracted links)
rt_ai_pmc(xml, remove_ns = TRUE) # generative-AI-use disclosure (2023+)
rt_oa_pmc(xml, remove_ns = TRUE) # open-access status + license
rt_reporting_pmc(xml, remove_ns = TRUE) # reporting-guideline use + which one
rt_meta_pmc(xml, remove_ns = TRUE) # article metadata
Corpus-scale processing
rt_all_pmc_dir() runs all ten indicators over an entire directory (or a vector of paths). It is built for large corpora:
res <- rt_all_pmc_dir(
"path/to/xml", # a directory, or a character vector of file paths
remove_ns = TRUE,
output = "results.csv", # resumable: re-running skips files already recorded
parallel = TRUE, # via furrr + an active future::plan()
progress = TRUE
)
- Resumable: with output, results are written to a CSV in chunks; a re-run skips files already recorded and appends only the new ones.
- Failure-isolated: a malformed file yields an is_success = FALSE row instead of aborting the run.
- Parallel: set future::plan("multisession") and parallel = TRUE.
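The resumable and failure-isolated behaviour can be pictured with a small base-R sketch. This illustrates the general pattern only, not the package's implementation: process_one(), run_resumable() and the file and n_lines columns are hypothetical, and the sketch appends row by row rather than in chunks.

```r
# Hypothetical per-file worker: always returns one row, never throws.
process_one <- function(path) {
  tryCatch(
    data.frame(file = path,
               n_lines = length(suppressWarnings(readLines(path))),
               is_success = TRUE),
    error = function(e) {
      data.frame(file = path, n_lines = NA_integer_, is_success = FALSE)
    }
  )
}

# Skip files already recorded in `output`; append one row for each new file.
run_resumable <- function(paths, output) {
  done <- if (file.exists(output)) as.character(read.csv(output)$file) else character(0)
  todo <- setdiff(paths, done)
  for (p in todo) {
    row <- process_one(p)
    first <- !file.exists(output)   # write the header only once
    write.table(row, output, sep = ",", row.names = FALSE,
                col.names = first, append = !first)
  }
  invisible(length(todo))           # number of files processed in this run
}
```

Calling run_resumable() twice with the same paths processes every file on the first call and none on the second, and an unreadable file produces an is_success = FALSE row instead of stopping the loop.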
Plain-text input
The same detectors run on plain-text (PDF-derived) articles. rt_read_pdf() returns the extracted text as a character string; write it to a .txt file, then point the text detectors (which share the PMC detection logic) at that file:
article_txt <- rt_read_pdf("article.pdf") # needs poppler's pdftotext; returns text
writeLines(article_txt, "article.txt") # the detectors take a file path
rt_all("article.txt") # COI, funding, registration, novelty, replication
rt_coi("article.txt") # or one indicator at a time
rt_ai("article.txt") # generative-AI-use disclosure
rt_ai() is the plain-text counterpart of rt_ai_pmc(). Because a text file carries no reliable publication date, it applies no 2023 year gate (it returns TRUE/FALSE, never NA) and cannot confine the scan to back-matter sections, so restrict its use to 2023-or-later articles and expect a slightly higher false-positive rate on papers that use AI as a research method.
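One way to respect that restriction is to apply the year gate yourself, using publication years from an external source. In the sketch below, txt_res and its values are made up to stand in for rt_ai() output joined to your own metadata, and apply_year_gate() is a hypothetical helper. Treating an unknown year as NA is a choice made here, not documented package behaviour.

```r
# Hypothetical results: one row per text file, with a publication year taken
# from your own metadata (the text detector cannot supply it).
txt_res <- data.frame(
  file       = c("a.txt", "b.txt", "c.txt"),
  year       = c(2019, 2023, NA),
  is_ai_pred = c(TRUE, TRUE, FALSE)
)

# Keep the prediction only when the year is known and at or after the cutoff;
# otherwise set it to NA.
apply_year_gate <- function(pred, year, cutoff = 2023) {
  ifelse(!is.na(year) & year >= cutoff, pred, NA)
}

txt_res$is_ai_pred <- apply_year_gate(txt_res$is_ai_pred, txt_res$year)
print(txt_res)
```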
Summarizing a corpus
Once you have one row per article, summarize the corpus:
data(rt_demo) # a small simulated example shipped with the package
rt_summary(rt_demo) # per-indicator prevalence with a Wilson confidence
# interval and a sensitivity/specificity-corrected
# (Rogan-Gladen) prevalence
rt_summary(rt_demo, by = "year") # subgroup summaries
rt_score(rt_demo) # add a per-article count of openness practices met
rt_plot(rt_demo) # prevalence bar chart
rt_plot(rt_demo, type = "trend", year = "year") # prevalence over time
The accuracy correction uses the bundled rt_accuracy table (detector sensitivity and specificity for eight indicators; open-access licensing and AI-use disclosure are reported uncorrected). Supply your own estimates:
rt_accuracy # the bundled estimates
my_acc <- data.frame(variable = "is_open_data", sensitivity = 0.84, specificity = 0.97)
rt_summary(rt_demo, accuracy = my_acc) # correct with your own values
Linking to FAIR assessment
The data- and code-availability links the detector extracts (open_data_links, open_code_links) can be passed to FAIR-assessment tooling such as rfair to score the findability and accessibility of the shared resources.
Validation
Benchmarked against the human-labeled XML benchmark of Serghiou et al. (2021), reproducible under data-raw/benchmark/, with results in inst/benchmark/:
| Indicator | Sensitivity | Specificity |
|---|---|---|
| Conflicts of interest | 94.0% | 100% |
| Funding | 100% | 95.7% |
| Protocol registration | 99.2% | 96.9% |
| Data sharing | 76.5% | 99.0% |
| Code sharing | 88.1% | 99.5% |
Registration and code in the Serghiou benchmark table above are labeled independently of the detector; COI, funding and data labels in the 1000-article 2023 sample were reconciled against detector-extracted statements (detector-adjudicated), so their agreement is not a fully independent estimate. Data sharing is deliberately precision-favoring: its 76.5% sensitivity trades recall for 99.0% specificity (the original oddpub algorithm scores about 84%/97% on this set).
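To see what these figures imply for a corrected prevalence, the sketch below applies the standard Rogan-Gladen estimator, (observed + specificity - 1) / (sensitivity + specificity - 1), alongside a Wilson score interval for the raw proportion. It is a base-R illustration rather than the code behind rt_summary(). rogan_gladen(), wilson_ci() and the counts are hypothetical, and clamping the estimate to [0, 1] is an assumption made here.

```r
# Rogan-Gladen: correct an observed prevalence for imperfect sensitivity and
# specificity. The raw estimator can fall outside [0, 1], so it is clamped.
rogan_gladen <- function(p_obs, sens, spec) {
  p <- (p_obs + spec - 1) / (sens + spec - 1)
  min(max(p, 0), 1)
}

# Wilson score interval for x positives out of n (95% by default).
wilson_ci <- function(x, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  p <- x / n
  centre <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(lower = centre - half, upper = centre + half)
}

# Made-up corpus: 200 of 1000 articles flagged for data sharing.
p_obs <- 200 / 1000
print(rogan_gladen(p_obs, sens = 0.765, spec = 0.990))  # 0.19 / 0.755, about 0.25
print(wilson_ci(200, 1000))
```

With a precision-favoring detector the corrected prevalence is higher than the observed one, because missed positives outnumber false alarms.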
The newer indicators are validated against maintainer-built, hand-labeled benchmarks in inst/benchmark/:
| Indicator | Sensitivity | Specificity | Basis |
|---|---|---|---|
| Novelty | 83.8% | 95.2% | hand-labeled novelty/replication gold set |
| Replication | 92.8% | 98.5% | replication-enriched sample (111 positives); correction is approximate |
| AI-use disclosure | not accuracy-corrected | — | experimental; only 9 positives in the 2023 sample |
| Open-access license | 100% | not estimable | structured <license> extraction; license-type exact match 99.8%; specificity rests on 1 negative in the OA subset, so it is reported uncorrected |
| Reporting guideline | 93.8% | 99.0% | 1000-article 2023 sample hand-labeled (65 positives) |
Replication’s correction mixes designs (sensitivity from the enriched sample, specificity from the representative 2023 sample), so it is less clean than the single-design corrections above. AI-use disclosure is reported uncorrected and is excluded from rt_accuracy until a larger labeled post-2022 sample exists. Two further benchmarks live in inst/benchmark/: a five-language sample for multilingual COI and funding, and a TXT-parity benchmark comparing the text and XML detectors.
See vignette("rtransparency") for the methodology and vignette("scope-and-limitations") for what each indicator does and does not capture.
Documentation
- vignette("rtransparency"): introduction and methodology
- vignette("transparency-summary"): corpus prevalence, scoring and plotting
- vignette("ai-disclosure"): the AI-use disclosure indicator in depth
- vignette("scope-and-limitations"): indicator semantics, limitations, output schema
- Package website: https://choxos.github.io/rtransparency/
Lineage and citation
This package is an enhanced, renamed fork of the original rtransparent tool by Stylianos (Stelios) Serghiou, maintained by Ahmad Sofi-Mahmudi (ORCID 0000-0001-6829-0823, GitHub @choxos). It adds four indicators (novelty, replication, AI disclosure, and a natively re-implemented data/code detector), multilingual COI and funding detection, plain-text parity, and corpus-scale batch processing. Serghiou is credited as an author.
The foundational paper: Serghiou et al., Assessment of transparency indicators across the biomedical literature: How open is open? PLOS Biology, 2021, doi:10.1371/journal.pbio.3001107. Run citation("rtransparency") for both references.
Use of AI
Parts of this package were developed with the assistance of generative AI (Anthropic’s Claude, via Claude Code), including code, tests, documentation, and benchmark tooling. All AI-assisted output was reviewed, run, and validated by the maintainer, who is responsible for the final content. This mirrors the kind of disclosure the package itself is built to detect.
Getting help
Please file bugs or questions as issues at https://github.com/choxos/rtransparency/issues with a minimal reproducible example.