{"id":18427,"date":"2017-06-08T17:32:21","date_gmt":"2017-06-08T16:32:21","guid":{"rendered":"http:\/\/www.ch.imperial.ac.uk\/rzepa\/blog\/?p=18344"},"modified":"2017-06-08T17:32:21","modified_gmt":"2017-06-08T16:32:21","slug":"how-to-search-data-repositories-for-fair-chemical-content-and-data-subjectscheme","status":"publish","type":"post","link":"https:\/\/www.rzepa.net\/blog\/?p=18427","title":{"rendered":"How to search data repositories for FAIR chemical content and data: SubjectScheme"},"content":{"rendered":"<div class=\"kcite-section\" kcite-section-id=\"18427\">\n<p>As data repositories start to flourish, it is reasonable to ask questions such as <em>what sort of chemistry can be found there and how can I find it?<\/em> Here I give an updated<span id=\"cite_ITEM-18427-0\" name=\"citation\"><a href=\"#ITEM-18427-0\">[1]<\/a><\/span> worked example of a digital repository search for chemical content and also pose an important issue for the chemistry domain.<\/p>\n<p>Firstly, I should say this search is restricted just to those data repositories that submit indexing terms (metadata) to DataCite, which is the agency that will be used to conduct the searches. Each type of metadata is defined by a prefix or operator field (much in the same way that an advanced Google search can be <a href=\"http:\/\/www.googleguide.com\/advanced_operators_reference.html\">prefixed<\/a> with an operator, e.g. <strong>author:<\/strong><sup>\u2665<\/sup>). I will use just two such DataCite field prefixes<sup>\u2020<\/sup> here as exemplars (there are many more).<\/p>\n<ol>\n<li><strong>media:<\/strong> This specifies the media type for the data being searched. 
To restrict a search to chemistry, one takes advantage of the <strong>chemical\/x-<\/strong> media type, as described previously.<span id=\"cite_ITEM-18427-1\" name=\"citation\"><a href=\"#ITEM-18427-1\">[2]<\/a><\/span><\/li>\n<li><strong>SubjectScheme:<\/strong> This is a new declaration, as specified in the DataCite V4 metadata schema.<span id=\"cite_ITEM-18427-2\" name=\"citation\"><a href=\"#ITEM-18427-2\">[3]<\/a><\/span> The subject scheme in effect declares the vocabulary from which a subject-specific term is drawn, and is designed to be used by domains such as chemistry.<\/li>\n<\/ol>\n<p>This latter is best illustrated by one specific example of a search which I will dissect here:<br \/>\n <a href=\"https:\/\/search.datacite.org\/works?query=media:chemical\\\/x\\-gaussian*+subjectScheme:inchikey+subject:XZYDALXOGPZGNV-UHFFFAOYSA-M+media:chemical\\\/x\\-mnpub*\"> <span style=\"background-color: cornflowerblue;\">https:\/\/search.datacite.org\/works?query=<\/span><span style=\"background-color: lightpink;\">media:chemical\\\/x\\-gaussian*<\/span><b style=\"background-color: plum;\">+<\/b><span style=\"background-color: darkturquoise;\">subjectScheme:inchikey<\/span><b style=\"background-color: plum;\">+<\/b><span style=\"background-color: deepskyblue;\">subject:XZYDALXOGPZGNV-UHFFFAOYSA-M<\/span><b style=\"background-color: plum;\">+<\/b><span style=\"background-color: sandybrown;\">media:chemical\\\/x\\-mnpub*<\/span><\/a><sup>\u2021<\/sup><\/p>\n<ol>\n<li><span style=\"background-color: cornflowerblue;\">https:\/\/search.datacite.org\/works?query=<\/span> <a href=\"https:\/\/search.datacite.org\/help.html\">queries the DataCite MDS<\/a><sup>\u2020<\/sup> (metadata store).<\/li>\n<li><span style=\"background-color: lightpink;\">media:chemical\\\/x\\-gaussian*<\/span> defines a media type which contains the string <span style=\"background-color: lightpink;\">chemical\/x-gaussian<\/span>, with the <span style=\"background-color: lightpink;\">*<\/span> being a wild-card which allows any characters 
to follow this string. This restricts the search to data repositories where <strong>Gaussian<\/strong> files have been deposited and assigned this media type.<\/li>\n<li><b style=\"background-color: plum;\">+<\/b> represents a Boolean <b style=\"background-color: plum;\">AND<\/b> operator.<\/li>\n<li><span style=\"background-color: darkturquoise;\">subjectScheme:inchikey<\/span> restricts a subject search to a <span style=\"background-color: darkturquoise;\">subjectScheme<\/span> having the value <span style=\"background-color: darkturquoise;\">inchikey<\/span>, whilst<\/li>\n<li><span style=\"background-color: deepskyblue;\">subject:XZYDALXOGPZGNV-UHFFFAOYSA-M<\/span> defines the value of the subject itself.<\/li>\n<li><span style=\"background-color: sandybrown;\">media:chemical\\\/x\\-mnpub*<\/span> completes the search definition; it requires the additional presence of an <strong>Mpublish<\/strong><span id=\"cite_ITEM-18427-3\" name=\"citation\"><a href=\"#ITEM-18427-3\">[4]<\/a><\/span> file indicating (spectroscopic, probably NMR) data readable by the MestReNova program.<\/li>\n<\/ol>\n<p>One hit with these restrictions has the DOI <a href=\"http:\/\/doi.org\/10.14469\/HPC\/2635\">10.14469\/HPC\/2635<\/a>. Clicking the button labelled <strong>metadata<\/strong> on the landing page for this object resolves to <em>e.g.<\/em><br \/>\n <a href=\"https:\/\/data.datacite.org\/application\/vnd.datacite.datacite+xml\/10.14469\/hpc\/2635\"><small>https:\/\/data.datacite.org\/application\/vnd.datacite.datacite+xml\/10.14469\/hpc\/2635<\/small><\/a>,<br \/>\n and downloads the metadata record for this object. 
Part of this record looks a bit like:<\/p>\n<p><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"18503\" data-permalink=\"https:\/\/www.rzepa.net\/blog\/?attachment_id=18503\" data-orig-file=\"https:\/\/i0.wp.com\/www.rzepa.net\/blog\/wp-content\/uploads\/2017\/06\/171.jpg?fit=1252%2C240&amp;ssl=1\" data-orig-size=\"1252,240\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"171\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/www.rzepa.net\/blog\/wp-content\/uploads\/2017\/06\/171.jpg?fit=300%2C58&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/www.rzepa.net\/blog\/wp-content\/uploads\/2017\/06\/171.jpg?fit=450%2C86&amp;ssl=1\" class=\"aligncenter size-large wp-image-18503\" src=\"https:\/\/i0.wp.com\/www.rzepa.net\/blog\/wp-content\/uploads\/2017\/06\/171.jpg?resize=450%2C86&#038;ssl=1\" alt=\"\" width=\"450\" height=\"86\" srcset=\"https:\/\/i0.wp.com\/www.rzepa.net\/blog\/wp-content\/uploads\/2017\/06\/171.jpg?resize=1024%2C196&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.rzepa.net\/blog\/wp-content\/uploads\/2017\/06\/171.jpg?resize=300%2C58&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.rzepa.net\/blog\/wp-content\/uploads\/2017\/06\/171.jpg?resize=768%2C147&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.rzepa.net\/blog\/wp-content\/uploads\/2017\/06\/171.jpg?w=1252&amp;ssl=1 1252w, https:\/\/i0.wp.com\/www.rzepa.net\/blog\/wp-content\/uploads\/2017\/06\/171.jpg?w=900&amp;ssl=1 900w\" sizes=\"auto, (max-width: 450px) 100vw, 450px\" \/><\/p>\n<p>This brings me to the important issue for the 
chemistry domain, which is to agree upon a core set of <strong>SubjectSchemes<\/strong> for implementation in\u00a0data repositories with domain-specific chemical content. The two subjects above, the InChI and the InChIKey, seem obvious candidates for inclusion. But how the list is extended and how the SubjectScheme\u00a0is specified are now matters for the community to discuss. Perhaps the <a href=\"http:\/\/goldbook.iupac.org\/\">IUPAC GoldBook<\/a> is one starting point for the SubjectScheme URIs. Watch this space.<\/p>\n<hr \/>\n<p><sup>\u2021<\/sup>The \\ syntax indicates an <a href=\"http:\/\/lucene.apache.org\/core\/4_0_0\/queryparser\/org\/apache\/lucene\/queryparser\/classic\/package-summary.html#Escaping_Special_Characters\">&#8220;escaped&#8221; character<\/a>. Thus in chemical\\\/x\\-gaussian a \\ ensures that the following \/ is treated as part of the search string, and not as part of the search syntax. Likewise <b>\\-<\/b> ensures the minus character is part of the string and not a syntactic negation. The current list of characters requiring escaping is <tt style=\"background-color: lightyellow;\">+ - &amp; | ! ( ) { } [ ] ^ \" ~ * ? : \\ \/<\/tt><\/p>\n<p><sup>\u2020<\/sup> The documentation lists common fields, but there are far more specified in V4 of their schema. The ones you see used here are not (yet?) documented at <a href=\"https:\/\/search.datacite.org\/help.html\">https:\/\/search.datacite.org\/help.html<\/a><\/p>\n<p><sup>\u2665<\/sup> This <a href=\"http:\/\/www.googleguide.com\/advanced_operators_reference.html\">Google page<\/a> has a plethora of powerful searches, which I suggest almost no-one knows about!<\/p>\n<hr \/>\n<h2>References<\/h2>\n    <ol class=\"kcite-bibliography csl-bib-body\"><li id=\"ITEM-18427-0\">H.S. Rzepa, A. McLean, and M.J. Harvey, \"InChI As a Research Data Management Tool\", <i>Chemistry International<\/i>, vol. 38, pp. 24-26, 2016. 
<a href=\"https:\/\/doi.org\/10.1515\/ci-2016-3-408\">https:\/\/doi.org\/10.1515\/ci-2016-3-408<\/a>\n\n<\/li>\n<li id=\"ITEM-18427-1\">H.S. Rzepa, P. Murray-Rust, and B.J. Whitaker, \"The Application of Chemical Multipurpose Internet Mail Extensions (Chemical MIME) Internet Standards to Electronic Mail and World Wide Web Information Exchange\", <i>Journal of Chemical Information and Computer Sciences<\/i>, vol. 38, pp. 976-982, 1998. <a href=\"https:\/\/doi.org\/10.1021\/ci9803233\">https:\/\/doi.org\/10.1021\/ci9803233<\/a>\n\n<\/li>\n<li id=\"ITEM-18427-2\">DataCite Metadata Working Group., \"DataCite Metadata Schema Documentation for the Publication and Citation of Research Data v4.0\", <i>DataCite e.V.<\/i>, 2016. <a href=\"https:\/\/doi.org\/10.5438\/0012\">https:\/\/doi.org\/10.5438\/0012<\/a>\n\n<\/li>\n<li id=\"ITEM-18427-3\">M.J. Harvey, A. McLean, and H.S. Rzepa, \"A metadata-driven approach to data repository design\", <i>Journal of Cheminformatics<\/i>, vol. 9, 2017. <a href=\"https:\/\/doi.org\/10.1186\/s13321-017-0190-6\">https:\/\/doi.org\/10.1186\/s13321-017-0190-6<\/a>\n\n<\/li>\n<\/ol>\n\n<\/div> <!-- kcite-section 18427 -->","protected":false},"excerpt":{"rendered":"<p>As data repositories start to flourish, it is reasonable to ask questions such as what sort of chemistry can be found there and how can I find it? Here I give an updated worked example of a digital repository search for chemical content and also pose an important issue for the chemistry domain. 
Firstly, I [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[3],"tags":[511,2204,2205,2206,2207,2208,2209,1877,2211,2210,2212,1009,330,2213,2214,2215],"class_list":["post-18427","post","type-post","status-publish","format-standard","hentry","category-chemical-it","tag-chemical-content","tag-chemicalx-media-type","tag-chemicalx-gaussian","tag-company-datacite","tag-company-google","tag-digital-repository-search","tag-domain-specific-chemical-content","tag-media-type","tag-mediachemicalx-gaussian","tag-mediachemicalx-mnpub","tag-question","tag-search-definition","tag-search-engines","tag-search-string","tag-search-syntax","tag-subject-search"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p1gPyz-4Nd","jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=\/wp\/v2\/posts\/18427","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_
route=%2Fwp%2Fv2%2Fcomments&post=18427"}],"version-history":[{"count":0,"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=\/wp\/v2\/posts\/18427\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=18427"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=18427"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=18427"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}