{"id":20669,"date":"2019-04-12T17:18:34","date_gmt":"2019-04-12T16:18:34","guid":{"rendered":"https:\/\/www.ch.imperial.ac.uk\/rzepa\/blog\/?p=20669"},"modified":"2019-04-12T17:18:34","modified_gmt":"2019-04-12T16:18:34","slug":"a-search-of-some-major-chemistry-publishers-for-fair-data-records","status":"publish","type":"post","link":"https:\/\/www.rzepa.net\/blog\/?p=20669","title":{"rendered":"A search of some major chemistry publishers for FAIR data records."},"content":{"rendered":"<div class=\"kcite-section\" kcite-section-id=\"20669\">\n<p>In recent years, findable data has become ever more important (the <strong><span style=\"color: #ff0000;\">F<\/span><\/strong> in FAIR). Here I test that <span style=\"color: #ff0000;\"><strong>F<\/strong><\/span> using the DataCite search service.<\/p>\n<p>Firstly an introduction to this service. This is a metadata database about datasets and other research objects. One of the properties is\u00a0<tt>relatedIdentifier<\/tt> which records other identifiers associated with the dataset, being say the DOI of any published article associated with the data, but it could also be pointers to related datasets.<\/p>\n<p>One can query thus:<\/p>\n<ol>\n<li><small><tt><a href=\"https:\/\/search.datacite.org\/works?query=relatedIdentifiers.relatedIdentifier:*\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/search.datacite.org\/works?query=relatedIdentifiers.relatedIdentifier:*<\/a><\/tt><\/small><br \/>\nwhich retrieves the very healthy looking <strong>6,179,287<\/strong>\u00a0works.<\/li>\n<li>One can restrict this to a specific publisher by the DOI prefix assigned to that publisher:<br \/>\n<small><tt><a href=\"https:\/\/search.datacite.org\/works?query=relatedIdentifiers.relatedIdentifier:10.1021*\" target=\"_blank\" rel=\"noopener noreferrer\">?query=relatedIdentifiers.relatedIdentifier:10.1021*<\/a><\/tt><\/small><br \/>\nwhich returns a respectable <strong>210,240<\/strong>\u00a0works.<\/li>\n<li>It turns out that the 
major contributors to FAIR currently are crystal structures from the CCDC. One can remove them from the search to see what is left over:<br \/>\n<small><tt><a href=\"https:\/\/search.datacite.org\/works?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+NOT+(identifier:*10.5517*)\" target=\"_blank\" rel=\"noopener noreferrer\">?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+NOT+(identifier:*10.5517*)<\/a><\/tt><\/small>\u00a0<br \/>\nand one is down to <strong>14,213<\/strong>\u00a0works, many of which nevertheless still appear to be crystal structures. These may be links to other crystal datasets.<\/li>\n<\/ol>\n<p>I have performed searches <strong>2<\/strong> and <strong>3<\/strong> for some popular publishers of chemistry (the same set that was <a href=\"https:\/\/www.ch.imperial.ac.uk\/rzepa\/blog\/?p=20468\">analysed here<\/a>).<\/p>\n<table border=\"1\">\n<tbody>\n<tr>\n<th style=\"width: 77px;\">Publisher<\/th>\n<th style=\"width: 73px;\">Search 2<\/th>\n<th style=\"width: 73px;\">Search 3<\/th>\n<\/tr>\n<tr>\n<td>ACS<\/td>\n<td><a href=\"https:\/\/search.datacite.org\/works?query=relatedIdentifiers.relatedIdentifier:10.1021*\">210,240<\/a><\/td>\n<td><a href=\"https:\/\/search.datacite.org\/works?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+NOT+(identifier:*10.5517*)\">14,213<\/a><\/td>\n<\/tr>\n<tr>\n<td>RSC<\/td>\n<td><a href=\"https:\/\/search.datacite.org\/works?query=relatedIdentifiers.relatedIdentifier:10.1039*\">138,147<\/a><\/td>\n<td><a href=\"https:\/\/search.datacite.org\/works?query=(relatedIdentifiers.relatedIdentifier:10.1039*)+NOT+(identifier:*10.5517*)\">1,279<\/a><\/td>\n<\/tr>\n<tr>\n<td>Elsevier<\/td>\n<td><a href=\"https:\/\/search.datacite.org\/works?query=relatedIdentifiers.relatedIdentifier:10.1016*\">185,351<\/a><\/td>\n<td><a href=\"https:\/\/search.datacite.org\/works?query=(relatedIdentifiers.relatedIdentifier:10.1016*)+NOT+(identifier:*10.5517*)\">56,373<\/a><\/td>\n<\/tr>\n<tr>\n<td>Nature<\/td>\n<td><a 
href=\"https:\/\/search.datacite.org\/works?query=relatedIdentifiers.relatedIdentifier:10.1038*\">12,316<\/a><\/td>\n<td><a href=\"https:\/\/search.datacite.org\/works?query=(relatedIdentifiers.relatedIdentifier:10.1038*)+NOT+(identifier:*10.5517*)\">8,104<\/a><\/td>\n<\/tr>\n<tr>\n<td>Wiley<\/td>\n<td><a href=\"https:\/\/search.datacite.org\/works?query=relatedIdentifiers.relatedIdentifier:10.1002*\">135,874<\/a><\/td>\n<td><a href=\"https:\/\/search.datacite.org\/works?query=(relatedIdentifiers.relatedIdentifier:10.1002*)+NOT+(identifier:*10.5517*)\">9,283<\/a><\/td>\n<\/tr>\n<tr>\n<td>Science<\/td>\n<td><a href=\"https:\/\/search.datacite.org\/works?query=relatedIdentifiers.relatedIdentifier:10.1126*\">3,384<\/a><\/td>\n<td><a href=\"https:\/\/search.datacite.org\/works?query=(relatedIdentifiers.relatedIdentifier:10.1126*)+NOT+(identifier:*10.5517*)\">2,343<\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>These publishers all have significant numbers of datasets which at least accord with the <strong>F<\/strong> of FAIR. Many datasets may not have metadata that points back to a published article, since this is something that can be added <a href=\"https:\/\/www.ch.imperial.ac.uk\/rzepa\/blog\/?p=20634\">only when the DOI of that article appears<\/a>, in other words AFTER the publication of the dataset. So these numbers are probably underestimates.<\/p>\n<p>How about the other way around? Rather than datasets that have a journal article as a related identifier, we could search for articles that have a dataset as a related identifier.<\/p>\n<ol start=\"4\">\n<li><small><tt><a href=\"https:\/\/search.datacite.org\/works?query=(identifier:*10.1039*)+AND+(relatedIdentifiers.relatedIdentifier:*)\" target=\"_blank\" rel=\"noopener noreferrer\">?query=(identifier:*10.1039*)+AND+(relatedIdentifiers.relatedIdentifier:*)<\/a><\/tt><\/small><br \/>\nreturns a rather mysterious <strong>nothing found<\/strong>. 
It may be that there is no mapping of this search between the CrossRef and DataCite metadata schemas.<\/li>\n<li>And just to show the searches are behaving as expected:<br \/>\n<small><tt><a href=\"https:\/\/search.datacite.org\/works?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+AND+(identifier:*10.5517*)\" target=\"_blank\" rel=\"noopener noreferrer\">?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+AND+(identifier:*10.5517*)<\/a><\/tt><\/small><br \/>\nreturns 196,027 works, which together with the 14,213 from search 3 accounts for all 210,240 works from search 2.<\/li>\n<\/ol>\n<p>It will also be of interest to see how these numbers change over time. Is there an exponential increase? We shall see.<\/p>\n<p>Finally, we have not really explored adherence to e.g. the <strong>AIR<\/strong> of <strong>FAIR<\/strong>. That is for another post.<\/p>\n<!-- kcite active, but no citations found -->\n<\/div> <!-- kcite-section 20669 -->","protected":false},"excerpt":{"rendered":"<p>In recent years, findable data has become ever more important (the F in FAIR). Here I test that F using the DataCite search service. Firstly an introduction to this service. This is a metadata database about datasets and other research objects. 
One of the properties is\u00a0relatedIdentifier which records other identifiers associated with the dataset, being [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[3],"tags":[1397,1783,811,2556,2563,2440,1721,2318,2444,2319,1473,1474,2623,2470],"class_list":["post-20669","post","type-post","status-publish","format-standard","hentry","category-chemical-it","tag-academic-publishing","tag-datacite","tag-digital-object-identifier","tag-digital-technology","tag-elsevier","tag-findability","tag-identifiers","tag-information","tag-information-architecture","tag-information-science","tag-knowledge","tag-knowledge-representation","tag-search-service","tag-web-design"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p1gPyz-5nn","jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=\/wp\/v2\/posts\/20669","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.rzepa.net\/bl
og\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=20669"}],"version-history":[{"count":0,"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=\/wp\/v2\/posts\/20669\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=20669"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=20669"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=20669"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}