Data Citation Corpus

About the data file for the Data Citation Corpus

The Data Citation Corpus is a project by DataCite and Make Data Count, funded by the Wellcome Trust, that focuses on developing a comprehensive, centralized, and publicly available resource of data citations from a variety of sources.

The data file for the first release of the Data Citation Corpus includes 10,006,058 data citation records. The data citations in the file originate from DataCite Event Data and from a project by the Chan Zuckerberg Initiative (CZI) to identify mentions of datasets in the full text of articles.

Each data citation record comprises:

  1. A pair of identifiers: an identifier for the dataset (a DOI or an accession number) and the DOI of the publication object (journal article or preprint) in which the dataset is cited.
  2. Various metadata for the dataset and for the citing object.
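
Because the dataset identifier can be either a DOI or an accession number, it can help to tell the two apart when processing records. The following is a minimal sketch, not part of the corpus tooling: the function name `identifier_kind` is hypothetical, the pattern is a common heuristic for DOI syntax, and it assumes the DOI is given in bare form (without a resolver URL prefix). The sample identifiers are made up for illustration.

```python
import re

# Heuristic for DOI syntax: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This is an assumption for illustration, not a rule stated by the corpus.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def identifier_kind(identifier: str) -> str:
    """Classify a dataset identifier as 'doi' or 'accession' (heuristic only)."""
    return "doi" if DOI_PATTERN.match(identifier.strip()) else "accession"


# Made-up sample identifiers
print(identifier_kind("10.1234/example-dataset"))
print(identifier_kind("ABC12345"))
```

Anything that does not look like a DOI is treated as an accession number here, so malformed DOIs would be misclassified; a stricter check would need repository-specific accession patterns.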

More information about the first release of the Data Citation Corpus, including the methodology for data citations contributed by CZI and planned enhancements for the corpus, can be found on the Make Data Count website.

Feedback on the data file can be submitted via GitHub. For general questions, email [email protected].

Accessing the data file for the Data Citation Corpus

The data file is available on Zenodo.

Version 1.0 of the corpus data file was released on January 30, 2024. Release v1.1 is an optimized version of v1.0 designed to make the original citation records more usable.

Contents of the data file for the Data Citation Corpus

The data file for the Data Citation Corpus includes 10,006,058 data citation records.

Data sources

The file includes two sources for data citations:

| Source | Dataset-article relationship | Documentation |
|---|---|---|
| DataCite Event Data | Citations are determined based on the resource type and the relation type designated in the metadata for the dataset or the article: ResourceType=Dataset with relationType=IsReferencedBy/IsCitedBy/IsSupplementTo, or ResourceType=Text with relationType=References/Cites/IsSupplementedBy | DataCite documentation on contributing citations; DataCite event data model |
| Chan Zuckerberg Initiative (CZI) Science Knowledge Graph | Mention of a dataset identifier (accession number or DOI) identified in the text of an article by a named entity recognition (NER) model (SciBERT) | More information about the open-source algorithm is forthcoming |
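
The DataCite Event Data rule described above can be sketched as a small lookup. This is a minimal illustration under stated assumptions, not DataCite's implementation: the function name `is_data_citation` and the use of plain strings for the resource type and relation type are hypothetical.

```python
# Relation types that mark a dataset-article citation, keyed by the resource
# type of the record whose metadata carries the relation (as listed above).
CITATION_RELATIONS = {
    "Dataset": {"IsReferencedBy", "IsCitedBy", "IsSupplementTo"},
    "Text": {"References", "Cites", "IsSupplementedBy"},
}


def is_data_citation(resource_type: str, relation_type: str) -> bool:
    """Return True if the (resource type, relation type) pair counts as a data citation.

    Hypothetical helper: exact, case-sensitive string matching is an assumption.
    """
    return relation_type in CITATION_RELATIONS.get(resource_type, set())


print(is_data_citation("Dataset", "IsCitedBy"))  # True
print(is_data_citation("Text", "IsCitedBy"))     # False
```

The direction of the relation depends on which side holds the metadata: a dataset "is cited by" an article, while an article "cites" a dataset, so each resource type has its own set of accepted relation types.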

The data file covers dataset-article pairs only: the citing object is a journal article or a preprint, and the cited object is a dataset.

Data fields included

In addition to the identifiers for the dataset and the citing object, each record includes metadata fields for the journal, publisher, and publication date of the citing object (from Crossref metadata) and the repository where the dataset is hosted (via DataCite or EMBL-EBI). Where additional metadata is available (e.g. affiliation or subject), it is included in the data citation record.

Each data citation includes the following fields:

| Field | Description | Required? |
|---|---|---|
| id | Internal identifier for the citation | Yes |
| created | Date of item's incorporation into the corpus | Yes |
| updated | Date of item's most recent update in corpus | Yes |
| repository | Repository where cited data is stored | No |
| publisher | Publisher for the article citing the data | No |
| journal | Journal for the article citing the data | No |
| title | Title of cited data | No |
| objId | DOI of article where data is cited | Yes |
| subjId | DOI or accession number of cited data | Yes |
| publishedDate | Date when citing article was published | No |
| accessionNumber | Accession number of cited data | No |
| doi | DOI of cited data | No |
| relationTypeId | Relation type in metadata between citation object and subject | No |
| sourceId | Source where citation was harvested | Yes |
| subjects | Subject information for dataset | No |
| affiliations | Affiliation information for creator of cited data | No |
| funders | Funding information for cited data | No |
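
To illustrate the split between required and optional fields listed above, the sketch below checks a single record held as a Python dict. It is a hedged example only: the helper name `missing_required_fields` is hypothetical, the sample values (including the `sourceId` value) are made up, and nothing is assumed about the on-disk format of the data file.

```python
# Fields marked "Yes" in the Required? column above.
REQUIRED_FIELDS = ("id", "created", "updated", "objId", "subjId", "sourceId")


def missing_required_fields(record: dict) -> list[str]:
    """List required fields that are absent or empty in a citation record."""
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


# Made-up sample record for illustration only; optional fields are omitted.
record = {
    "id": "example-id",
    "created": "2024-01-30",
    "updated": "2024-01-30",
    "objId": "10.1234/example-article",
    "subjId": "10.5678/example-dataset",
    "sourceId": "example-source",
}

print(missing_required_fields(record))                  # []
print(missing_required_fields({"id": "example-id"}))    # the five other required fields
```

Optional fields such as `repository` or `funders` may be absent from a record, so code that reads them should tolerate missing values.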