This vignette describes what rfair measures and how, in enough detail
to interpret and reproduce its scores. For a quick tour see
vignette("rfair"); for the reuse/sensitivity extensions see
vignette("beyond-fuji").
1. Background: FAIR, the FAIRsFAIR metrics, and F-UJI
The FAIR principles (Wilkinson et al. 2016) state that research data should be Findable, Accessible, Interoperable, and Reusable. They are aspirational; to assess a real data object you need measurable indicators.
The FAIRsFAIR project turned the principles into a concrete, testable metric set, and the F-UJI tool (Devaraju & Huber, PANGAEA) implemented an automated assessment service for them. F-UJI is a Python web service: you send it a persistent identifier (PID) and it returns per-metric scores.
rfair is a native R reimplementation of
the F-UJI metrics (version 0.8). It performs the whole assessment in R,
with no external server, so assessments are scriptable, reproducible,
and embeddable in R pipelines. The original rfair package
(v1) was only an HTTP client for an F-UJI server; this version (v2) is
the engine itself.
2. The assessment pipeline
A single call to assess_fair() runs this pipeline:
identifier
   │  id_parse()               scheme detection + normalization + resolver URL
   ▼
resolution                     content-negotiated GET, follow redirects -> landing page
   │  resolve_landing_page()
   ▼
harvesting                     a sequence of collectors, in priority order:
   │  collect_html_meta()      embedded JSON-LD (schema.org), Dublin Core,
   │                           OpenGraph, Highwire meta tags
   │  collect_signposting()    HTTP Link header + <link rel> typed links
   │  collect_datacite()       DataCite JSON via content negotiation
   │  collect_xml()            DataCite XML, Dublin Core, MODS, EML, ISO19139
   │  collect_rdf()            JSON-LD (native) and Turtle/RDF-XML (via rdflib)
   │  collect_github()         GitHub repository + codemeta.json + CITATION.cff
   │  harvest_data()           HEAD on data links for MIME type and size
   ▼
mapping + merging              each source is mapped to one reference schema and
   │  merge_metadata()         merged (first-non-empty for scalars; union for
   │                           lists; longer-but-similar replacement)
   ▼
evaluation                     one evaluator per metric inspects the merged metadata
   │  run_evaluators()         and the resolved identifier, scoring each test
   ▼
scoring                        per-test scores -> per-metric -> F/A/I/R -> overall
   │  get_assessment_summary()
   ▼
fair_assessment                tidy S3 object (print / summary / as.data.frame /
                               as_fuji_json / as_rdf)
Identifier handling
id_parse() recognizes DOIs, Handles, ARKs, URNs, UUIDs,
identifiers.org PIDs, w3id, and plain URLs, normalizes
them, and constructs a resolver URL. Persistence is inferred from the
scheme.
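The scheme-detection step described above can be sketched in base R. This is a simplified illustration, not the code behind id_parse(): the helper name detect_scheme(), the regular expressions, and the rule that only DOIs count as persistent here are all assumptions made for the example, and it covers just DOIs and plain URLs.

```r
# Hypothetical helper (not rfair's id_parse()): classify an identifier,
# normalize it, and build a resolver URL where one is conventional.
detect_scheme <- function(id) {
  id <- trimws(id)
  # Drop a doi.org resolver prefix or a "doi:" label so all DOI spellings match.
  bare <- sub("^(https?://(dx\\.)?doi\\.org/|doi:)", "", id, ignore.case = TRUE)
  if (grepl("^10\\.[0-9]{4,9}/[^[:space:]]+$", bare)) {
    doi <- tolower(bare)  # DOI names are case-insensitive
    return(list(scheme = "doi", normalized = doi,
                resolver = paste0("https://doi.org/", doi),
                persistent = TRUE))
  }
  if (grepl("^https?://", id, ignore.case = TRUE)) {
    # Assumption for this sketch: a plain URL is not treated as persistent.
    return(list(scheme = "url", normalized = id,
                resolver = id, persistent = FALSE))
  }
  list(scheme = "unknown", normalized = id,
       resolver = NA_character_, persistent = FALSE)
}

parsed <- detect_scheme("https://doi.org/10.5281/ZENODO.8347772")
parsed$scheme
parsed$resolver
```

A real parser would add the remaining schemes (Handles, ARKs, URNs, UUIDs, identifiers.org, w3id) as further branches ordered from most to least specific.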
Harvesting and content negotiation
Different repositories expose metadata in different ways. rfair asks
for several representations of the same object via HTTP content
negotiation (the Accept header) and scrapes the
landing page, then merges everything into a single
reference schema (~30 elements: creator,
title, publisher,
publication_date, license,
access_level, object_content_identifier,
related_resources, …). When two sources disagree, scalars
keep the first non-empty value (replaced only by a longer,
sufficiently-similar string), and list-valued elements are unioned.
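The merge rules above can be illustrated with a small base R sketch. It is not rfair's merge_metadata(): the helper names, the similarity measure (edit distance from utils::adist() scaled by the longer string), the 0.5 threshold, and the choice of which fields are list-valued are all assumptions made for the example.

```r
# Hypothetical sketch of the merge rules; not rfair's merge_metadata().
is_empty <- function(x) {
  is.null(x) || length(x) == 0 || all(is.na(x) | !nzchar(as.character(x)))
}

# Similarity in [0, 1]: 1 minus edit distance over the longer string's length.
string_similarity <- function(a, b) {
  d <- utils::adist(tolower(a), tolower(b))[1, 1]
  1 - d / max(nchar(a), nchar(b))
}

# Scalars: keep the first non-empty value; replace it only with a longer
# value that is similar enough (the threshold is an assumed parameter).
merge_scalar <- function(current, incoming, min_similarity = 0.5) {
  if (is_empty(incoming)) return(current)
  if (is_empty(current)) return(incoming)
  longer <- nchar(incoming) > nchar(current)
  similar <- string_similarity(current, incoming) >= min_similarity
  if (longer && similar) incoming else current
}

# Lists: union of both sources, preserving first-seen order.
merge_list <- function(current, incoming) unique(c(current, incoming))

merge_records <- function(a, b, list_fields = c("creator", "related_resources")) {
  out <- a
  for (field in union(names(a), names(b))) {
    out[[field]] <- if (field %in% list_fields) {
      merge_list(a[[field]], b[[field]])
    } else {
      merge_scalar(a[[field]], b[[field]])
    }
  }
  out
}

# Made-up records standing in for two harvested sources.
from_html <- list(title = "Ocean temperature", creator = "Ada", publisher = "")
from_datacite <- list(title = "Ocean temperature profiles",
                      creator = c("Ada", "Grace"),
                      publisher = "Zenodo", license = "CC-BY-4.0")
merged <- merge_records(from_html, from_datacite)
str(merged)
```

In this example the longer title replaces the shorter one because the two are similar, the empty publisher is filled from the second source, and the creator lists are unioned.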
The metric model
Metrics are data-driven: their definitions, tests, scores, and maturity levels come from the bundled FAIRsFAIR YAML, not from hard-coded R logic.
rfair_metric_versions() # bundled metric versions
#> [1] "0.8" "0.5" "0.5ssv2"
#> [4] "0.5ss" "0.5env" "0.7_software"
#> [7] "0.7_software_cessda" "0.6a2a" "0.4"
#> [10] "0.3" "0.2"
# v0.8 has 17 metrics across F/A/I/R (one row each):
nrow(as.data.frame(assess_fair("https://doi.org/10.5281/zenodo.8347772", resolve = FALSE)))
#> [1] 17
Each metric has one or more tests. A test contributes a score and a maturity level (a CMMI level 0–3: incomplete, initial, moderate, advanced) when it passes. Metrics use one of two scoring mechanisms:
- cumulative — passed tests’ scores add up;
- alternative — tests are alternative routes to the same points (the earned score is capped at the metric total).
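One way to read the two mechanisms above is sketched below. The function score_metric() and its arguments are hypothetical, and the test scores are made up; the sketch only shows the difference between adding passed scores and capping them at the metric total.

```r
# Hypothetical sketch: earned score for one metric under either mechanism.
score_metric <- function(test_scores, passed, total,
                         mechanism = c("cumulative", "alternative")) {
  mechanism <- match.arg(mechanism)
  earned <- sum(test_scores[passed])  # points from the tests that passed
  # Alternative tests are different routes to the same points, so passing
  # several of them cannot earn more than the metric total.
  if (mechanism == "alternative") earned <- min(earned, total)
  earned
}

# Cumulative: two tests worth 1 point each, one passed.
score_metric(c(1, 1), c(TRUE, FALSE), total = 2, mechanism = "cumulative")

# Alternative: three routes worth 1 point each, two passed, total of 1.
score_metric(c(1, 1, 1), c(TRUE, TRUE, FALSE), total = 1, mechanism = "alternative")
```

Both calls earn 1 point: the first because only one of two additive tests passed, the second because the cap keeps two passing routes from counting twice.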
The criterium engine (criterium_engine.R) builds each
metric’s result from the YAML and lets evaluators mark tests passed;
as_fuji_json() then emits a payload matching the upstream
F-UJI FAIRResults schema.
3. What each FAIR category measures (v0.8)
| category | metric | what rfair checks |
|---|---|---|
| F | F1-01MD | identifier follows a unique scheme (URI/URN/UUID/HASH/PID) |
| F | F1-02MD | identifier is persistent and registered (resolves) |
| F | F2-01M | core descriptive metadata present (creator, title, id, date, publisher, type, summary, keywords) |
| F | F3-01M | metadata links to the downloadable data content |
| F | F4-01M | metadata offered in a search-engine-ingestible way (embedded JSON-LD / meta tags) |
| A | A1-01M | access level / rights are stated in metadata |
| A | A1-02MD | metadata and data are retrievable via their identifiers |
| A | A1.1-01MD | identifiers use a standardized communication protocol (http/https/ftp) |
| A | A1.2-01MD | the protocol supports authentication where needed |
| I | I1-01M | metadata uses a formal, machine-readable representation (JSON-LD/RDF/XML) |
| I | I2-01M | metadata uses terms from registered semantic vocabularies |
| I | I3-01M | qualified references to related entities (with relation types) |
| R | R1-01M | metadata describes the data content (type, format/size) |
| R | R1.1-01M | a machine-readable license is present and SPDX/CC-recognized |
| R | R1.2-01M | provenance information (creators, dates, contributors) |
| R | R1.3-01M | a community-/discipline-endorsed metadata standard is used |
| R | R1.3-02D | data is in a recommended (scientific/open/long-term) file format |
A category's score is the points earned across its metrics divided by the total points those metrics offer; the overall FAIR score is the same ratio taken across all 17 metrics, and the overall maturity is the (clamped) mean of the per-category maturities.
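The aggregation can be worked through on a toy example. All numbers below are made up, the data frame layout is hypothetical, and clamping the mean maturity to the CMMI range 0 to 3 is an assumption about what "clamped" means here.

```r
# Made-up per-metric results (only five metrics, for brevity).
results <- data.frame(
  category = c("F", "F", "A", "I", "R"),
  earned   = c(1, 2, 1, 0, 3),
  total    = c(1, 2, 2, 1, 4)
)

# Per-category score: earned points over available points.
earned_by_cat <- tapply(results$earned, results$category, sum)
total_by_cat  <- tapply(results$total, results$category, sum)
category_score <- earned_by_cat / total_by_cat

# Overall score: the same ratio across every metric.
overall_score <- sum(results$earned) / sum(results$total)

# Overall maturity: mean of per-category maturities (given here as inputs),
# clamped to the assumed 0-3 range.
category_maturity <- c(F = 3, A = 1, I = 1, R = 2)
overall_maturity <- min(3, max(0, mean(category_maturity)))

category_score
overall_score
overall_maturity
```

Here the F category earns 3 of 3 points, A earns 1 of 2, I earns 0 of 1, and R earns 3 of 4, giving an overall score of 7 of 10.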
# the canonical principle definitions these metrics map to
fair_principles("I")[, c("id", "definition")]
#> id
#> 1 I1
#> 2 I2
#> 3 I3
#> definition
#> 1 (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
#> 2 (meta)data use vocabularies that follow FAIR principles
#> 3 (meta)data include qualified references to other (meta)data
4. Software FAIR (FRSM)
For software objects, rfair also bundles the FRSM (FAIR for Research
Software) metric set; select it with
metric_version = "0.7_software". The GitHub harvester
inspects the repository file tree for signals (a license file, tests, CI
workflows, dependency manifests, a registry DOI, a release version,
contributors) and the 17 FRSM evaluators score from them. FRSM scoring
is heuristic and not yet validated against an upstream software-FAIR
reference.
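To make the idea of file-tree signals concrete, the sketch below scans a vector of repository paths for a few of the signals listed above. The function detect_repo_signals(), the path patterns, and the example file listing are illustrative guesses; they are not the rules the GitHub harvester applies, and signals that need the GitHub API (registry DOI, release version, contributors) are left out.

```r
# Hypothetical sketch: derive boolean signals from a repository file listing.
detect_repo_signals <- function(paths) {
  has <- function(pattern) any(grepl(pattern, paths, ignore.case = TRUE))
  c(
    license  = has("^(LICEN[CS]E|COPYING)(\\.[a-z]+)?$"),   # top-level license file
    tests    = has("^tests?/"),                              # a tests directory
    ci       = has("^\\.github/workflows/.+\\.ya?ml$"),      # GitHub Actions workflow
    manifest = has("^(DESCRIPTION|requirements\\.txt|pyproject\\.toml|package\\.json)$")
  )
}

# Made-up file listing for an R package repository.
paths <- c("LICENSE.md", "DESCRIPTION", "R/assess.R",
           "tests/testthat/test-assess.R",
           ".github/workflows/R-CMD-check.yaml")
detect_repo_signals(paths)
```

Because the signals are pattern matches on file names, they indicate that an artifact exists, not that it is valid or complete, which is one reason to treat scores built on them as heuristic.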
5. Fidelity to F-UJI
Because rfair reimplements an existing scoring engine, it includes a
non-CRAN conformance harness. tests/conformance/run.R runs
identifiers through both rfair and a locally run, version-matched F-UJI
server and compares per-metric earned scores. A manual run on 2026-06-16
against F-UJI 4.0.0 (metrics v0.8) measured 94.1% exact per-metric agreement on a Zenodo
DOI (16/17 metrics) and 85.3% across
PANGAEA and Dryad; the consistent divergence was the data file-format
metric (F-UJI uses Tika content detection where rfair uses an HTTP
HEAD request). This reference-server comparison is not yet reproduced in CI. A
separate harness (tests/conformance/parity.R) compares the
R engine with the browser TypeScript engine on registry-derivable
metrics after the webapp branch is checked out alongside
the package.
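The agreement figure quoted above is the share of metrics whose earned scores match exactly. A minimal sketch of that calculation follows; exact_agreement() is a hypothetical helper rather than code from tests/conformance/run.R, and the score vectors are made up so that one metric of 17 differs.

```r
# Hypothetical helper: share of shared metrics with identical earned scores.
exact_agreement <- function(a, b) {
  shared <- intersect(names(a), names(b))
  mean(a[shared] == b[shared])
}

# Made-up earned scores: 17 metrics, differing only on the file-format metric.
metrics <- c(paste0("metric_", 1:16), "R1.3-02D")
rfair_scores <- setNames(rep(1, 17), metrics)
fuji_scores <- rfair_scores
fuji_scores["R1.3-02D"] <- 0

round(100 * exact_agreement(rfair_scores, fuji_scores), 1)
#> [1] 94.1
```

Comparing by metric name rather than by position keeps the result correct when the two engines return metrics in different orders.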
6. Beyond F-UJI
rfair adds checks that automated FAIR tools usually miss, motivated
by peer review of a COVID-19 FAIR study: license reusability
(not just presence) with the (Re)usable Data Project taxonomy,
controlled-access/sensitive-data flagging, identifier hygiene, and the
FAIR-TLC (Traceable, Licensed, Connected) extension.
See vignette("beyond-fuji").
7. Limitations
- The browser app is registry-only (CORS): it cannot harvest landing pages, so some metrics score lower than the R engine.
- I2-01M (semantic vocabularies) scores 0 for objects whose metadata uses only default namespaces (dc/schema.org/DataCite) — this matches F-UJI.
- RDF Turtle/RDF-XML harvesting and as_rdf() Turtle output need the optional rdflib package (system librdf); without it those paths are skipped.
- Live scores depend on the object’s current metadata and on third-party services (DataCite, Crossref, GitHub) being reachable.
References
- Wilkinson et al. (2016). The FAIR Guiding Principles. Sci Data.
- Devaraju & Huber. F-UJI. https://github.com/pangaea-data-publisher/fuji
- FAIRsFAIR metrics.
- Carbon et al. (2019). (Re)usable data licensing. PLOS ONE.