Data Citation Corpus
About the data file for the Data Citation Corpus
The Data Citation Corpus is a project by DataCite and Make Data Count, funded by the Wellcome Trust. Its focus is the development of a comprehensive, centralized and publicly available resource of data citations from a variety of sources.
The data citations in the file originate from DataCite Event Data and from a project by the Chan Zuckerberg Initiative (CZI) to identify mentions of datasets in the full text of articles.
Each data citation record consists of:
- A pair of identifiers: an identifier for the dataset (a DOI or an accession number) and the DOI of the publication object (journal article or preprint) in which the dataset is cited.
- Various metadata for the dataset and for the citing object.
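Because the dataset identifier can be either a DOI or an accession number, a consumer of the file may want to tell the two apart. The following is a minimal sketch, not part of the corpus tooling: the function names are hypothetical, the DOI pattern is a common heuristic rather than an official rule, and it assumes that anything that does not look like a DOI is an accession number.

```python
import re

# Heuristic DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This is an assumption for illustration, not a rule stated by the corpus.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Resolver prefixes that may or may not appear in the data file (assumption).
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def strip_doi_prefix(value: str) -> str:
    """Remove a leading resolver URL or 'doi:' prefix, if present."""
    value = value.strip()
    lowered = value.lower()
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            return value[len(prefix):]
    return value


def identifier_kind(value: str) -> str:
    """Return 'doi' if the value looks like a DOI, otherwise 'accession'."""
    return "doi" if DOI_PATTERN.match(strip_doi_prefix(value)) else "accession"
```

For example, `identifier_kind("GSE12345")` yields `"accession"`, while a value such as `"10.5061/dryad.abc123"` (a made-up DOI) yields `"doi"`.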
More information about the first release of the Data Citation Corpus, including the methodology for data citations contributed by CZI and planned enhancements for the corpus, can be found on the Make Data Count website.
Feedback on the data file can be submitted via GitHub. For general questions, email [email protected].
Accessing the data file for the Data Citation Corpus
The data file is available on Zenodo.
Contents of the data file for the Data Citation Corpus
Data sources
The file includes two sources for data citations:
Source | Dataset-article relationship | Documentation |
---|---|---|
DataCite Event Data | Citations are determined from the resource type and the relation type designated in the metadata for the dataset or the article: (1) ResourceType=Dataset with relationType=IsReferencedBy, IsCitedBy or IsSupplementTo; (2) ResourceType=Text with relationType=References, Cites or IsSupplementedBy | DataCite documentation on contributing citations; DataCite event data model |
Chan Zuckerberg (CZI) Science Knowledge Graph | Mention of a dataset identifier (accession number or DOI) found in the text of an article by a named-entity recognition (NER) model (SciBERT) | More information about the open-source algorithm is forthcoming |
The scope of the data file covers dataset-article pairs, i.e. it includes pairs where the citing object is a journal article or a preprint and the cited object is a dataset.
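The DataCite Event Data rule in the table above can be sketched as a small lookup. This is an illustration only: the function names and the flat `(subject_id, resource_type, relation_type, object_id)` input shape are hypothetical and do not reflect the actual DataCite event data model.

```python
# Relation types that count as a data citation, keyed by the resource type of
# the record whose metadata carries the relation (values taken from the table).
CITATION_RELATIONS = {
    "Dataset": {"IsReferencedBy", "IsCitedBy", "IsSupplementTo"},
    "Text": {"References", "Cites", "IsSupplementedBy"},
}


def is_data_citation(resource_type: str, relation_type: str) -> bool:
    """Return True if the resource type / relation type pair counts as a citation."""
    return relation_type in CITATION_RELATIONS.get(resource_type, set())


def dataset_article_pair(subject_id, resource_type, relation_type, object_id):
    """Return (dataset, publication) for a qualifying relation, else None.

    Assumption: when the relation is declared in the dataset's metadata the
    subject is the dataset; when it is declared in the article's metadata the
    subject is the article.
    """
    if not is_data_citation(resource_type, relation_type):
        return None
    if resource_type == "Dataset":
        return (subject_id, object_id)
    return (object_id, subject_id)
```

Either direction of the relation resolves to the same ordered pair, so a dataset that declares `IsCitedBy` and an article that declares `Cites` produce one consistent dataset-article record.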
Data fields included
In addition to the identifiers for the dataset and the citing object, each record includes metadata fields for the journal, publisher and publication date of the citing object (from Crossref metadata) and for the repository where the dataset is hosted (via DataCite or EMBL-EBI). Where additional metadata is available (e.g. affiliation or subject), it is included in the data citation record.
Each data citation includes the following fields:
Field | Description | Required? |
---|---|---|
id | Internal identifier for the citation | Yes |
created | Date of item's incorporation into the corpus | Yes |
updated | Date of item's most recent update in the corpus | Yes |
repository | Repository where cited data is stored | No |
publisher | Publisher for the article citing the data | No |
journal | Journal for the article citing the data | No |
title | Title of cited data | No |
publication | DOI of article where data is cited | Yes |
dataset | DOI or accession number of cited data | Yes |
publishedDate | Date when citing article was published | No |
source | Source where citation was harvested | Yes |
subjects | Subject information for dataset | No |
affiliations | Affiliation information for creator of cited data | No |
funders | Funding information for cited data | No |
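The required and optional fields listed above lend themselves to a simple record check. The sketch below assumes each record has been loaded as a Python dictionary keyed by the field names in the table; the helper names are hypothetical and the on-disk format of the data file is not assumed.

```python
# Field names copied from the table above.
REQUIRED_FIELDS = ("id", "created", "updated", "publication", "dataset", "source")
OPTIONAL_FIELDS = (
    "repository", "publisher", "journal", "title",
    "publishedDate", "subjects", "affiliations", "funders",
)


def missing_required(record: dict) -> list:
    """Return the required fields that are absent, None, or an empty string."""
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


def unknown_fields(record: dict) -> list:
    """Return keys in the record that are not listed in the field table."""
    known = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
    return sorted(set(record) - known)
```

A record passes the check when `missing_required(record)` returns an empty list; optional fields may be absent without affecting the result.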