Can Common Crawl reliably track persistent identifier (PID) use over time?

Abstract

We report on the results of two studies using two and four monthly web crawls respectively from the Common Crawl (CC) initiative between 2014 and 2017. The initial goal of these studies was to provide empirical evidence for the changing patterns of use of so-called persistent identifiers. This paper focusses on the tooling needed for dealing with CC data, and the problems we found with it. The first study is based on over 10^12 URIs from over 5 × 10^9 pages crawled in April 2014 and April 2017; the second study adds a further 3 × 10^9 pages from the April 2015 and April 2016 crawls. We conclude with suggestions on specific actions needed to enable studies based on CC to give reliable longitudinal information.
