Link Checker
The DataCite link checker service is a custom-built web crawler that periodically checks a random sampling of DOIs to verify that they still resolve to a valid URL and to gather other useful information about the metadata for DOIs registered with DataCite.
The link checker service does not check every DOI. We are exploring how to expand its functionality, and you can share your feedback about this on the DataCite roadmap.
How do I use the link checker?
The link checker service runs automatically in the background; you do not need to enable it to benefit from it. The results are visible when you are logged in with a member, consortium, or consortium organization account.
The results of the link checker are displayed in DOI Fabrica at the bottom of each DOI record that has been checked.
You can filter the list of DOIs on the DOIs tab by HTTP status code to see which DOIs have link checker results of each status.
Link checker results are also available to DataCite Members via the REST API.
How does the link checker work?
The link checker checks one random DOI per Repository per day. It attempts to follow the URL listed in the URL field of the DOI's metadata and returns results about whether it was successful and what it found at the other end.
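The sampling rule described above (one random DOI per Repository per day) can be sketched roughly as follows. This is an illustration only, not PidCheck's actual code: the function name, the input mapping, and the example DOIs are all hypothetical.

```python
import random


def pick_daily_sample(dois_by_repository, rng=None):
    """Pick one random DOI for each repository that has at least one DOI.

    dois_by_repository is a hypothetical mapping of repository ID to a
    list of DOI names; the real service reads this from DataCite's systems.
    """
    rng = rng or random.Random()
    return {
        repository: rng.choice(dois)
        for repository, dois in dois_by_repository.items()
        if dois  # repositories without DOIs are skipped
    }


if __name__ == "__main__":
    # Made-up repositories and DOIs, for illustration only.
    example = {
        "repo-a": ["10.1234/a1", "10.1234/a2"],
        "repo-b": ["10.5678/b1"],
    }
    print(pick_daily_sample(example))
```

Because each day's pick is random, a given DOI may go a long time between checks, which is why a result can lag behind the current state of a landing page.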
The crawler that powers the link checker was built by DataCite as open-source software and is named PidCheck. It is built on top of existing crawler technology, namely the Scrapy project, with various DataCite-specific customizations.
What does the link checker look for?
The link checker checks whether the URL functions properly and looks for the elements that make up a well-formed DOI landing page. (See Best Practices for DOI Landing Pages.)
HTTP status code
The link checker will attempt to follow the URL listed in the URL field of the DOI's metadata. If that URL resolves successfully, the link checker will return HTTP status code 200. Otherwise, one of several standard HTTP error codes will be returned, such as 404 (page not found).
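If you want to see roughly what such a check involves, here is a minimal sketch using only the Python standard library. It is not how PidCheck is implemented (PidCheck is built on Scrapy); the function names and the shape of the returned dictionary are assumptions made for illustration.

```python
from http import HTTPStatus
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, following redirects.

    Connection-level failures (DNS errors, timeouts) raise URLError and
    are deliberately not handled in this sketch.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # 4xx and 5xx responses arrive as exceptions that carry the code.
        return error.code


def describe_status(code):
    """Attach the standard reason phrase and a simple success flag."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # not a status code Python knows about
    return {"status": code, "phrase": phrase, "resolved": code == 200}
```

For example, describe_status(404) returns {"status": 404, "phrase": "Not Found", "resolved": False}.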
Number and URL of any redirects
If the link checker is redirected while attempting to follow a URL, the results will display a list of all URLs the link checker was redirected through, ending in the final URL that was ultimately resolved.
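The redirect list can be pictured with the sketch below, which records every URL visited until a non-redirect response is reached. It is an assumption-laden illustration: the fetch callable, the hop limit, and the result keys are hypothetical and do not reflect PidCheck's internals.

```python
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


def follow_redirects(start_url, fetch, max_hops=10):
    """Follow redirects from start_url and record every URL visited.

    fetch is a hypothetical callable returning (status, location) for a
    URL, where location is the Location header value or None.
    """
    urls = [start_url]
    current = start_url
    for _ in range(max_hops):
        status, location = fetch(current)
        if status not in REDIRECT_CODES or not location:
            return {
                "status": status,
                "redirect_count": len(urls) - 1,
                "urls": urls,  # the last entry is the final URL
            }
        # Location may be relative, so resolve it against the current URL.
        current = urljoin(current, location)
        urls.append(current)
    raise RuntimeError("too many redirects")


if __name__ == "__main__":
    # Simulated responses for made-up URLs; no network access is needed.
    responses = {
        "https://example.org/a": (301, "/b"),
        "https://example.org/b": (302, "https://repo.example.org/record/1"),
        "https://repo.example.org/record/1": (200, None),
    }
    print(follow_redirects("https://example.org/a", lambda u: responses[u]))
```

Passing the fetch step in as a parameter keeps the chain-following logic separate from the HTTP client, so it can be tried against simulated responses.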
Landing page
The link checker will return the HTTP content type of the content found at the URL to which it ultimately resolves. Ideally, this will be the content type text/html, indicating that an HTML landing page was found on the other end. If another content type is found, the link checker will indicate which content type was found.
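As a rough illustration of the content type check, the sketch below reduces a Content-Type header to its media type and compares it with text/html. The function name and result keys are hypothetical; the real check may differ in detail.

```python
def classify_content_type(header_value):
    """Reduce a Content-Type header to its media type and flag HTML pages.

    Parameters such as "charset=utf-8" are dropped, and the comparison is
    case-insensitive because media types are not case-sensitive.
    """
    media_type = header_value.split(";", 1)[0].strip().lower()
    return {
        "content_type": media_type,
        "is_html_landing_page": media_type == "text/html",
    }
```

A DOI whose URL points straight at a file would typically yield something like application/pdf here instead of text/html.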
The landing page includes the DOI
The link checker will indicate whether a DOI was found on the landing page.
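A simple way to approximate this check yourself is a case-insensitive search of the page source for the DOI name. The sketch below assumes that a plain text match is sufficient; the exact matching rules the link checker applies are not described here, and the function name is hypothetical.

```python
import html


def page_mentions_doi(page_html, doi):
    """Return True if the DOI name appears anywhere in the page source.

    HTML entities are decoded first, and the comparison ignores case
    because DOI names are case-insensitive.
    """
    text = html.unescape(page_html).lower()
    return doi.lower() in text


if __name__ == "__main__":
    # Made-up page and DOI, for illustration only.
    page = '<p>Cite as: <a href="https://doi.org/10.1234/ABC">10.1234/ABC</a></p>'
    print(page_mentions_doi(page, "10.1234/abc"))
```

This matches both a bare DOI name and a DOI written as a https://doi.org/ link, since the name is contained in the link.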
Schema.org metadata
The link checker looks for schema.org metadata on the landing page, if a landing page is found. It specifically looks for embedded JSON-LD with @context https://schema.org.
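To test a landing page for embedded JSON-LD before the link checker reaches it, something like the following sketch may help. It uses only the Python standard library and is not PidCheck's implementation. The class and function names are hypothetical, and accepting the http:// form or a trailing slash on the @context value is our assumption, not documented behavior.

```python
import json
from html.parser import HTMLParser

SCHEMA_ORG_CONTEXTS = ("https://schema.org", "http://schema.org")


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._inside = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self.blocks.append("".join(self._buffer))
            self._inside = False


def has_schema_org(page_html):
    """Return True if any JSON-LD block declares a schema.org @context.

    Only string-valued @context entries are considered in this sketch.
    """
    collector = JsonLdCollector()
    collector.feed(page_html)
    collector.close()
    for raw in collector.blocks:
        try:
            document = json.loads(raw)
        except json.JSONDecodeError:
            continue  # ignore blocks that are not valid JSON
        items = document if isinstance(document, list) else [document]
        for item in items:
            context = item.get("@context") if isinstance(item, dict) else None
            if isinstance(context, str) and context.rstrip("/") in SCHEMA_ORG_CONTEXTS:
                return True
    return False
```

Calling has_schema_org on the HTML source of a landing page returns False when the JSON-LD block is missing, malformed, or uses a different @context.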
What do I do if the link checker isn't getting the results I think it should get?
Certain HTTP errors can be temporary. If one of your DOIs shows that it returned an error code on its last check, try to resolve the URL yourself to see if there is still a problem. If the URL resolves normally, there is no need to contact us. The HTTP status code will be updated the next time the link checker checks that DOI.
If the URL resolved successfully (status code 200), but the results of the metadata checks don't align with what you were expecting, make sure that your DOI's landing page conforms to our recommendations for Best Practices for DOI Landing Pages. If your landing page does conform, but the link checker is not picking up the appropriate results, please contact us at [email protected] and we will investigate the issue.