DataCite Public Data File
DataCite Public Data Files contain metadata for all publicly available DataCite DOIs.
We release an updated version of the public data file annually. Each annual public data file contains all DOIs that were registered up to the end of the year.
All metadata contained in the file is freely available under a CC0 waiver. For more information, see the DataCite Data File Use Policy.
Access the Public Data Files
All available public data files are accessible through https://datafiles.datacite.org/.
Contents of the Public Data Files
DataCite Public Data File 2024
The DataCite Public Data File 2024 contains all DataCite DOIs in Findable state that were registered up to the end of 2024. The file was generated on January 6, 2025.
Updates to the Public Data File structure
This section describes the structure of the latest DataCite Public Data File. For this release, we updated the data file structure to make it easier to access, parse, and interpret.
- The data file’s internal directory structure is now organized by the date the DOI was last updated.
- The top-level file is now packaged into a TAR archive, and individual files inside are compressed with GZIP. This enables quicker and more efficient extraction: because each file is compressed on its own rather than the archive as a whole, 1) a specific selection of files can be extracted from the archive without decompressing the entire data file, and 2) the files can be decompressed in parallel.
- A CSV file is included with a full list of DOIs, their latest state (Findable or Registered), the associated Repository Account ID, and the timestamp of their latest update.
- We have added a Readme file to the data file.
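The selective extraction described in the list above can be sketched with Python's standard tarfile module. This is a minimal sketch, not an official tool: the function name extract_month and the archive path are hypothetical, and it assumes member paths follow the dois/updated_YYYY-MM/ layout described in this document.

```python
import tarfile


def extract_month(archive_path, month, dest="."):
    """Extract only the members under dois/updated_<month>/ from the archive.

    `archive_path` and `dest` are placeholder paths; `month` is "YYYY-MM".
    Assumes the archive uses the dois/updated_YYYY-MM/ layout.
    """
    prefix = f"dois/updated_{month}/"
    extracted = []
    with tarfile.open(archive_path, "r") as tar:
        for member in tar:
            # Tolerate archives whose member names start with "./".
            name = member.name[2:] if member.name.startswith("./") else member.name
            if member.isfile() and name.startswith(prefix):
                tar.extract(member, path=dest)
                extracted.append(name)
    return extracted
```

A call such as extract_month("datacite.tar", "2024-08", dest="out") would write only the August 2024 files to out/, leaving every other month untouched in the archive. The extracted files are still gzip-compressed individually.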
Directory structure
The public data file contains a top-level dois folder. The dois folder contains a series of folders named with an updated_YYYY-MM convention. Each of these folders contains:
- All Findable DataCite DOI records updated during the corresponding month (at the time the public data file was generated).
- A tabular report for the corresponding month.
The directory structure looks like this:
dois
├── updated_2024-07
│   ├── 2024-07.csv.gz
│   ├── part_0000.jsonl.gz
│   ├── part_0001.jsonl.gz
│   ├── part_0002.jsonl.gz
│   ...
├── updated_2024-08
│   ├── 2024-08.csv.gz
│   ├── part_0000.jsonl.gz
│   ├── part_0001.jsonl.gz
│   ├── part_0002.jsonl.gz
│   ├── part_0003.jsonl.gz
│   ...
...
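Once the archive has been extracted, the directory layout can be walked programmatically. The following is a minimal sketch, assuming the archive was extracted into a local folder (the root argument and the function name index_data_file are hypothetical) and that folder and file names follow the conventions described here.

```python
import re
from pathlib import Path

# Naming conventions taken from the directory structure in this document.
MONTH_DIR = re.compile(r"^updated_(\d{4}-\d{2})$")
PART_FILE = re.compile(r"^part_\d{4}\.jsonl\.gz$")


def index_data_file(root):
    """Map each month ("YYYY-MM") to its sorted list of part files under root/dois."""
    index = {}
    for folder in sorted(Path(root, "dois").iterdir()):
        match = MONTH_DIR.match(folder.name)
        if folder.is_dir() and match:
            parts = sorted(p for p in folder.iterdir() if PART_FILE.match(p.name))
            index[match.group(1)] = parts
    return index
```

Because the month keys are in YYYY-MM form, sorting them as strings also sorts them chronologically, which makes it straightforward to process only the months after a given cutoff.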
DataCite DOI records
DataCite DOI records are compressed in a series of gzip files within each updated_YYYY-MM folder. Each gzip file contains up to 10,000 DataCite DOI records in JSON Lines format. The .gz files are named with a part_[four-digit part number].jsonl.gz convention.
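Since each part file is gzip-compressed JSON Lines, it can be streamed one record at a time without loading the whole file into memory. A minimal sketch using only the Python standard library follows; the function name iter_records is hypothetical, and no assumption is made about the fields inside each record.

```python
import gzip
import json


def iter_records(part_path):
    """Yield one parsed JSON object per non-empty line of a part_NNNN.jsonl.gz file."""
    with gzip.open(part_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)
```

For example, sum(1 for _ in iter_records(path)) counts the records in a single part file, which should be at most 10,000 according to the description above.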
What DataCite DOI records are included?
The public data file contains a record of every Findable state DOI. Each record is represented once in a folder corresponding to the month the DOI was last updated.
What is the format of individual records?
The format of individual records in the file will reflect a DOI singleton response from the DataCite REST API with the following adjustments:
- The records contain additional affiliation and publisher metadata returned when the following REST API URL parameters are set: affiliation=true and publisher=true. See "Can I see more detailed affiliation and publisher information in the REST API?" for more details.
- The records do not contain the xml value.
Tabular reports
Each updated_YYYY-MM folder contains a tabular report in CSV format, compressed with gzip, that contains:
- A list of all registered DataCite DOIs (in Findable or Registered state) updated during the corresponding month (at the time the public data file was generated)
- The date they were updated
- Their current state (findable or registered)
- Their current Repository ID.
Rows in the CSV file are structured like the following:
doi | state | client_id | updated |
---|---|---|---|
10.0166/fk2.stagefigshare.9753211 | findable | figshare.dud | 2024-08-01T00:20:45Z |
10.15493/dea.mims.20210317 | registered | nrf.saeon | 2024-08-01T00:32:57 |
Registered state DOIs are included to identify DOIs that are no longer publicly available. Repository IDs aid in identifying DOIs that have moved between repositories.
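A monthly report can be read with Python's csv and gzip modules, for example to list the DOIs that are now in Registered state. This is a sketch under one assumption: that the CSV begins with a header row using the column names shown in the sample above (doi, state, client_id, updated). The function names are hypothetical.

```python
import csv
import gzip
from collections import Counter


def read_report(csv_path):
    """Yield each row of a YYYY-MM.csv.gz report as a dict keyed by header name."""
    with gzip.open(csv_path, "rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)


def count_states(csv_path):
    """Count rows per state (expected values: findable, registered)."""
    return Counter(row["state"] for row in read_report(csv_path))


def registered_dois(csv_path):
    """Return the DOIs whose current state is registered."""
    return [row["doi"] for row in read_report(csv_path) if row["state"] == "registered"]
```

The list returned by registered_dois could then be used to remove records from a local copy that were harvested earlier while those DOIs were still Findable.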
DataCite Public Data File 2023
The DataCite Public Data File 2023 contains all DataCite DOIs in Findable state that were registered up to the end of 2023.
The metadata is available within a gzipped tar file. Inside this file, individual records are grouped into folders by DOI prefix. Each prefix folder contains a set of JSON files, comprising up to 1,000 records each.
Each record is in a JSON format that is based on the DataCite Metadata Schema. For more information, see the DataCite XML to JSON Mapping.
Using the Public Data Files
As each public data file contains metadata for all Findable DataCite DOIs (through the given year), it can be used to seed an initial harvest of DataCite DOI metadata.
For newly registered DOIs and updates to existing DOI metadata, the DataCite REST API can be used to retrieve a list of DOIs. The results can be sorted and filtered by the last updated date. For example:
- Sort by date updated (newest to oldest): https://api.datacite.org/dois?disable-facets=true&sort=-updated&publisher=true&affiliation=true
- Filter by date updated and sort oldest to newest: https://api.datacite.org/dois?disable-facets=true&sort=updated&publisher=true&affiliation=true&query=updated:[2025-01-01%20TO%202025-01-31]
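The date-filtered query above can also be assembled programmatically, which avoids hand-encoding the spaces and brackets. The sketch below only builds the URL and makes no network request; the function name updated_query_url is hypothetical, and the parameters are the ones shown in the examples above.

```python
from urllib.parse import quote, urlencode

API_URL = "https://api.datacite.org/dois"


def updated_query_url(start, end):
    """Build a REST API URL for DOIs updated between two dates, oldest first.

    `start` and `end` are date strings such as "2025-01-01".
    """
    params = {
        "disable-facets": "true",
        "sort": "updated",
        "publisher": "true",
        "affiliation": "true",
        "query": f"updated:[{start} TO {end}]",
    }
    # quote_via=quote encodes spaces as %20 rather than "+".
    return f"{API_URL}?{urlencode(params, quote_via=quote)}"
```

Calling updated_query_url("2025-01-01", "2025-01-31") yields a URL equivalent to the second example above, with the colon and square brackets percent-encoded as well.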