{"id":13826,"date":"2015-04-08T17:54:54","date_gmt":"2015-04-08T16:54:54","guid":{"rendered":"http:\/\/www.ch.imperial.ac.uk\/rzepa\/blog\/?p=13826"},"modified":"2015-04-08T17:54:54","modified_gmt":"2015-04-08T16:54:54","slug":"goldilocks-data","status":"publish","type":"post","link":"https:\/\/www.rzepa.net\/blog\/?p=13826","title":{"rendered":"Goldilocks Data."},"content":{"rendered":"<div class=\"kcite-section\" kcite-section-id=\"13826\">\n<p>Last August, I <a title=\"Data galore!  134 kilomolecules.\" href=\"http:\/\/www.ch.imperial.ac.uk\/rzepa\/blog\/?p=12803\" target=\"_blank\">wrote about<\/a> <em>data galore<\/em>, the archival\u00a0of data for 133,885 (134 kilo) molecules into a repository, together with an associated data descriptor<span id=\"cite_ITEM-13826-0\" name=\"citation\"><a href=\"#ITEM-13826-0\">[1]<\/a><\/span> published in the new journal <em>Scientific Data<\/em>. Since six months is a long time in the rapidly evolving field of RDM, or research data management, I offer an update in the form of some new observations.<\/p>\n<p>Firstly, 131 kilo molecules are now offered in a new and different form at\u00a0<a href=\"http:\/\/gdb.koitz.info\/gdbrowse\/\">http:\/\/gdb.koitz.info\/gdbrowse\/<\/a>,\u00a0and it is worth comparing how the two sets of otherwise identical data are presented.<\/p>\n<ol>\n<li>The original<strong><span style=\"color: #ff00ff;\"> archive<\/span><\/strong>\u00a0had a single assigned DOI<span id=\"cite_ITEM-13826-1\" name=\"citation\"><a href=\"#ITEM-13826-1\">[2]<\/a><\/span> from which you could download a ZIP file to be unpacked and navigated on your own computer. The exposed metadata for the deposition (by which I mean in this case, metadata registered with <a href=\"http:\/\/search.datacite.org\/\" target=\"_blank\">DataCite<\/a>, the registration authority used by Figshare) was limited to general information about the 133,885 molecules such as the authorship and license. 
The granularity is coarse, not extending to descriptions of individual molecules.<\/li>\n<li>The new version forgoes the ZIP archive, replacing it with a proper <strong><span style=\"color: #ff00ff;\">database<\/span><\/strong> (based on <a href=\"http:\/\/www.mongodb.org\/\" target=\"_blank\">MongoDB<\/a>) containing information about 130,832 molecules. This allows one to search the data\u00a0at the individual\u00a0molecule level (formula, InChI descriptor, mass, <em>etc.<\/em>) using the tools provided. To the end-user, this is much more useful; the data is both\u00a0<strong>discoverable<\/strong> and\u00a0<strong>re-usable<\/strong>.<\/li>\n<\/ol>\n<p>There is no overlap between these two presentations of the data. There also appears to be no API (application programming interface) which might allow one to write code to construct one&#8217;s own searches. The apparent absence of an API also means that really only a human navigating the set menus can discover and re-use that\u00a0data; the data might not be mineable by a machine, for example. The absence of an API is not that unusual; only some of the best-known molecular databases offer one, the\u00a0<a href=\"http:\/\/www.programmableweb.com\/api\/rcsb-protein-data-bank\" target=\"_blank\">RCSB Protein Data Bank<\/a> being a good example. More significantly, each instance of such a molecule-based database is likely to have its own way of accessing the data and, even if a documented API were available, one would still have to write specific code for each such resource.<\/p>\n<p>So the first bowl contains what I suggest is cold porridge and the second is perhaps\u00a0equivalent to a\u00a0<a href=\"http:\/\/en.wikipedia.org\/wiki\/Table_d%27h\u00f4te\" target=\"_blank\">table d&#8217;h\u00f4te menu<\/a>. Does Goldilocks have a third option? 
I would argue yes, she could have:<\/p>\n<ol start=\"3\">\n<li>We recently published data for 158 kilo molecules<span id=\"cite_ITEM-13826-2\" name=\"citation\"><a href=\"#ITEM-13826-2\">[3]<\/a><\/span> for which each molecule carries its own metadata. That metadata can be queried using any search engine that supports the basic metadata standards:<br \/>\n<small><a href=\"http:\/\/search.datacite.org\/ui?q=has_media:true&amp;fq=prefix:10.14469\" target=\"demo\">http:\/\/search.datacite.org\/ui?q=has_media:true&amp;fq=prefix:10.14469<\/a><\/small><br \/>\nis an example. Or, armed with the metadata schema, one could also write one&#8217;s own search engine\u00a0and, in theory at least, that code should serve to query ANY repository that supports these standards.<\/li>\n<\/ol>\n<p>You could argue that all that has happened is that one has simply replaced a specific database API (if it exists) with a specific metadata schema. But these metadata schemas are controlled standards, the components of which should be self-describing (and one can see the schema components by invoking the link above).<\/p>\n<p>As the archival of data (RDM) becomes increasingly important, communities will have to start making decisions about which flavour of data-porridge to offer Goldilocks. For molecular data at least, I suggest the third option is highly desirable and perhaps likely to be the most persistent. Parochial databases very much depend on a specialised team of people to maintain them in perpetuity, which I gather now means 20 years. At the very least, we should start to have a debate about how the future will evolve. Let us not leave this debate merely in the hands of a small number of large organisations that are likely to make decisions based on their own business models. After all, it\u00a0starts off at least as our data, not theirs! 
Arguably, we as authors have now largely lost control over how our stories (journal articles) are managed; let us not cede the same for data.<\/p>\n<h2>References<\/h2>\n    <ol class=\"kcite-bibliography csl-bib-body\"><li id=\"ITEM-13826-0\">R. Ramakrishnan, P.O. Dral, M. Rupp, and O.A. von Lilienfeld, \"Quantum chemistry structures and properties of 134 kilo molecules\", <i>Scientific Data<\/i>, vol. 1, 2014. <a href=\"https:\/\/doi.org\/10.1038\/sdata.2014.22\">https:\/\/doi.org\/10.1038\/sdata.2014.22<\/a>\n\n<\/li>\n<li id=\"ITEM-13826-1\">R. Ramakrishnan, P.O. Dral, M. Rupp, and O.A. von Lilienfeld, \"Quantum chemistry structures and properties of 134 kilo molecules\", 2014. <a href=\"https:\/\/doi.org\/10.6084\/m9.figshare.978904\">https:\/\/doi.org\/10.6084\/m9.figshare.978904<\/a>\n\n<\/li>\n<li id=\"ITEM-13826-2\">Y. Zhang, H.S. Rzepa, J.J.P. Stewart, P. Murray-Rust, M.J. Harvey, N. Mason, A. McLean, and Imperial College High Performance Computing Service, \"Revised Cambridge NCI database\", 2014. <a href=\"https:\/\/doi.org\/10.14469\/ch\/2\">https:\/\/doi.org\/10.14469\/ch\/2<\/a>\n\n<\/li>\n<\/ol>\n\n<\/div> <!-- kcite-section 13826 -->","protected":false},"excerpt":{"rendered":"<p>Last August, I wrote about data galore, the archival\u00a0of data for 133,885 (134 kilo) molecules into a repository, together with an associated data descriptor published in the new journal Scientific Data. 
Since six months is a long time in the rapidly evolving field of RDM, or research data management, I offer an update in the [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[3],"tags":[801,1371,1289],"class_list":["post-13826","post","type-post","status-publish","format-standard","hentry","category-chemical-it","tag-api","tag-rcsb-protein-data-bank","tag-search-engine"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p1gPyz-3B0","jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=\/wp\/v2\/posts\/13826","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=13826"}],"version-history":[{"count":0,"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=\/wp\/v2\/posts\/13826\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1382
6"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=13826"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.rzepa.net\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=13826"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}