The Data Citation Corpus: Collaboratively advancing the evaluation of the impact of open data
- Iratxe Puebla
Data citations provide insights into the use of open data, but tracking connections between journal articles and datasets remains challenging and time-consuming, because many citations are missing from structured metadata or remain locked in closed systems. This limits our understanding of the impact of open data and hinders the inclusion of open data in research assessment.
To address this, Make Data Count is developing the Data Citation Corpus, a large, open collection of data citations identified through multiple methods. The Corpus goes beyond citations collected via article references and dataset metadata to also incorporate mentions of data identified by full-text mining of articles using machine-learning methodologies. This approach scales up the number of data citations without imposing an additional burden on researchers, repositories, publishers, or evaluators.
The Corpus includes 5 million citations from DataCite Event Data, the Chan Zuckerberg Initiative, and Aligning Science Across Parkinson’s, covering datasets identified by DOIs and by accession numbers. The Corpus is available under a CC0 license and can be explored via an interactive dashboard: https://corpus.datacite.org/dashboard. Groups and organizations have used the Corpus to explore the use of datasets, including the State of Open Data report, Northwestern University, and University of Colorado Boulder.
We will report on our progress in expanding the Data Citation Corpus to include citations from additional sources, on the latest developments in machine-learning methodologies for identifying data mentions, and on examples of how the Corpus has been applied to gain insights into the impact of open data.