Posts Tagged ‘search engine’

How FAIR are the data associated with the 2017 Molecules-of-the-Year?

Wednesday, March 7th, 2018

C&EN has again run a vote, this time for the 2017 Molecules of the Year. Here I take a look not just at these molecules, but at how FAIR (Findable, Accessible, Interoperable and Reusable) the data associated with these molecules actually are.

I went about finding out as follows:

  1. The article DOIs for all seven candidates were linked from the C&EN site.
  2. From there I manually tracked down the supporting information (SI).
  3. Some of this SI gave a CCDC deposition number for the crystal structure data of the molecule in question. The easiest way of going directly to the data was to use the search.datacite.org search engine, entering the keywords "CCDC" plus the deposition number. This yields a DOI for the data; examples are included in the table below.
  4. In other cases, I used the CSD ConQuest search program and entered the names of two or three of the articles' authors. This also worked well.
  5. Most of the SI files, downloaded as PDFs, also included static images of NMR spectra. This is not active data, and hence does not fulfil the F and I of FAIR, and probably not the A either. None of it is FAIR as defined in my post here, although it is actually very easy to make it so. One of the examples had ~116 spectra "unFAIRed" in this manner.
  6. In another example there was also computational data, included simply as a set of XYZ coordinates, again contained in the PDF file. This too is not really FAIR, since one has to know how to extract it from this container and repurpose it. It also represents a tiny subset of the data potentially available.
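The DataCite lookup in step 3 can be scripted rather than typed by hand. The sketch below merely builds the search URL for a given deposition number; it assumes the search.datacite.org query interface described above and does not call the live service.

```python
from urllib.parse import urlencode

def datacite_query_url(deposition_number: str) -> str:
    """Build a search.datacite.org query URL for a CCDC deposition number.

    Mirrors step 3 above: the keyword "CCDC" plus the deposition number
    are handed to the DataCite search engine, which resolves them to a
    DOI for the deposited crystal structure data.
    """
    params = urlencode({"q": f"CCDC {deposition_number}"})
    return f"https://search.datacite.org/ui?{params}"

# One of the deposition numbers cited in the table below:
print(datacite_query_url("1457983"))
# https://search.datacite.org/ui?q=CCDC+1457983
```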
How FAIR are the data associated with the 2017 Molecules-of-the-Year?

| # | Title | Article DOI | Data DOI |
|---|-------|-------------|----------|
| 1 | Persulfurated Coronene: A New Generation of "Sunflower" | 10.1021/jacs.6b12630 | Data available only as PDF, hosted by Figshare. The SI also has its own DOI: 10.1021/jacs.6b12630.s001 |
| 2 | A Truncated Molecular Star | 10.1021/jacs.6b12630 | Crystal structure data: 10.5517/ccdc.csd.cc1nb303 |
| 3 | Synthesis of trinorbornane | 10.1039/c7cc06273g | Crystal structure data: 10.5517/ccdc.csd.cc1p7806 |
| 4 | Braiding a molecular knot with eight crossings | 10.1126/science.aal1619 | Crystal structure data: 10.5517/ccdc.csd.cc1m85y0 |
| 5 | Unique physicochemical and catalytic properties dictated by the B3NO2 ring system | 10.1038/nchem.2708 | Crystal structure data: 10.5517/ccdc.csd.cc1lkff0 |
| 6 | Total synthesis of mycobacterial arabinogalactan containing 92 monosaccharide units | 10.1038/ncomms148510 | 116 NMR spectra available only as PDF. No crystal structure |
| 7 | Nitrogen Lewis Acids | 10.1021/jacs.6b12360 | NMR spectra available only as PDF. Computed coordinates available only as PDF. Crystal structure data: CCDC 1457983-1457987, 1458000-1458001, e.g. 10.5517/ccdc.csd.cc1ky4qc, 10.5517/ccdc.csd.cc1ky4rd |

The FAIRness of the data for these molecules of the year is largely rescued by the crystal structure data deposited with the CCDC in their CSD database, rendered findable (the F of FAIR) by persistent identifiers such as the (parochial) deposition numbers or the more general DOIs. If the NMR and computational data were also covered in this way, we would be making great progress. There are of course many other types of data included with these examples, and procedures for making such data FAIR too have to be worked out by the community.

In order to construct the table above, I had to put about two hours of effort into tracking down the items (and this only because I have done this sort of search before). Perhaps next year I might persuade C&EN to include such a table in their own article!

A nice example of open data (in London).

Sunday, March 5th, 2017

Living in London, I find that travelling by public transport is often the best way to get around. Before setting out on a journey, one checks the status of the network. Doing so today, I came across this page: our open data from Transport for London.

  1. I learnt that since TfL travel data was made openly available, some 11,000 developers (sic!) have registered for access, from which some 600 travel apps have emerged.
  2. The data is in XML, which makes it readily interoperable.[1]
  3. This encourages crowd-sourced innovation.
  4. They have taken the trouble to produce an API (application programming interface) which allows rich access to the data and to information about e.g. AccidentStats, AirQuality, BikePoint, Journey, Line, Mode, Occupancy, Place, Road, Search, StopPoint, Vehicle.
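The network-status check described above can be scripted against TfL's Line Status endpoint (https://api.tfl.gov.uk/Line/Mode/tube/Status). Rather than calling the live API, the sketch below parses a trimmed sample response; the field names are my assumption based on the public documentation.

```python
import json

# A trimmed sample of the kind of JSON the TfL Line Status endpoint
# returns; field names are assumed from the public documentation.
sample = json.loads("""
[
  {"name": "Victoria",
   "lineStatuses": [{"statusSeverityDescription": "Good Service"}]},
  {"name": "Central",
   "lineStatuses": [{"statusSeverityDescription": "Minor Delays"}]}
]
""")

def summarise(lines):
    """Map each line name to its first reported status."""
    return {line["name"]: line["lineStatuses"][0]["statusSeverityDescription"]
            for line in lines}

print(summarise(sample))
```

This is exactly the kind of three-line re-use that an open, documented API makes possible, and that the 600 travel apps are presumably built on.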

Chemists could learn some lessons here! Of course, there are quite a few chemical databases with APIs that are examples of open data, but the "ESI" (electronic supporting information) sources which almost all published articles rely upon to disseminate data are clearly struggling to cope. Take for example this recent article[2], where much of the data has been dropped into the inevitable PDF "coffin", which is a breathtaking 907 pages long. To give the authors their due, they also provide 20 CIF files, which ARE good sources of data. Rarely commented on, but clearly missing from the information associated with this article (indeed most articles), is the metadata about the data. Thus the metadata for these CIF files amounts to just a compound number, e.g. 229. To find out the context, one has to scour the article (or the 907 pages of the ESI) to identify compound 229 (I strongly suspect it's a molecule because of the implied semantics of the term, not because it has been explicitly declared). You will not find the metadata at e.g. data.datacite.org, which is one open aggregator and global search engine based on deposited metadata.

I have commented elsewhere on this blog that other types of data could also be enhanced in the manner that CIF crystallographic files represent. For example, the Mpublish NMR project, examples of which are shown here, produces typical data AND its metadata, as can be seen at DOI: 10.14469/hpc/1053. I fancy that had this method been adopted,[2] those 907 pages might have shrunk somewhat, although of course not entirely. But my hope is that the innovative chemistry community will gradually find ways of exhuming more and more data from the PDF coffin, in the process reducing the paginated lengths of PDF-based ESI further, perchance eventually even to zero?

If you are yourself preparing an article and sweating over the ESI at this very moment, do please take a look at the Mpublish method and how perhaps it can help make your NMR data at least more useful to others.


I understand an article describing this project is in preparation. If you cannot wait, this recent application of the Mpublish project has some details.[3]

References

  1. P. Murray-Rust, and H.S. Rzepa, "Chemical Markup, XML, and the Worldwide Web. 1. Basic Principles", Journal of Chemical Information and Computer Sciences, vol. 39, pp. 928-942, 1999. https://doi.org/10.1021/ci990052b
  2. J.M. Lopchuk, K. Fjelbye, Y. Kawamata, L.R. Malins, C. Pan, R. Gianatassio, J. Wang, L. Prieto, J. Bradow, T.A. Brandt, M.R. Collins, J. Elleraas, J. Ewanicki, W. Farrell, O.O. Fadeyi, G.M. Gallego, J.J. Mousseau, R. Oliver, N.W. Sach, J.K. Smith, J.E. Spangler, H. Zhu, J. Zhu, and P.S. Baran, "Strain-Release Heteroatom Functionalization: Development, Scope, and Stereospecificity", Journal of the American Chemical Society, vol. 139, pp. 3209-3226, 2017. https://doi.org/10.1021/jacs.6b13229
  3. M.J. Harvey, A. McLean, and H.S. Rzepa, "A metadata-driven approach to data repository design", Journal of Cheminformatics, vol. 9, 2017. https://doi.org/10.1186/s13321-017-0190-6

Goldilocks Data.

Wednesday, April 8th, 2015

Last August, I wrote about data galore, the archival of data for 133,885 (134 kilo) molecules into a repository, together with an associated data descriptor[1] published in the new journal Scientific Data. Since six months is a long time in the rapidly evolving field of RDM, or research data management, I offer an update in the form of some new observations.

Firstly, 131 kilo molecules are now offered in a new, different form: http://gdb.koitz.info/gdbrowse/, and it is worth comparing the presentation of the two sets of otherwise identical data.

  1. The original archive had a single assigned DOI[2] from where you could download a ZIP file to be unpacked and navigated on your own computer. The exposed metadata for the deposition (by which I mean in this case, metadata registered with DataCite, the registration authority used by Figshare) was limited to general information about the 133,885 molecules such as the authorship and license. The granularity is coarse, not extending to descriptions of individual molecules.
  2. The new version forgoes the ZIP archive, replacing it with a proper database (based on MongoDB) containing information about 130,832 molecules.  This allows one to search the data at the individual molecule level (formula, InChI descriptor, mass, etc) using the tools provided. To the end-user, this is much more useful; the data is both discoverable and re-usable.
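The molecule-level search that the new database enables can be sketched in a few lines. The records below are illustrative stand-ins (the InChIKeys are those of benzene, methane and water), not entries from the actual GDB deposition, and the field names are my own.

```python
# Hypothetical molecule records mimicking the searchable fields the
# new database exposes (formula, InChIKey descriptor, mass).
molecules = [
    {"formula": "C6H6", "inchikey": "UHOVQNZJYSORNB-UHFFFAOYSA-N", "mass": 78.11},
    {"formula": "CH4",  "inchikey": "VNWKTOKETHGBQD-UHFFFAOYSA-N", "mass": 16.04},
    {"formula": "H2O",  "inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N", "mass": 18.02},
]

def find_by_formula(records, formula):
    """Return all records whose molecular formula matches exactly."""
    return [r for r in records if r["formula"] == formula]

for hit in find_by_formula(molecules, "C6H6"):
    print(hit["inchikey"])
```

The point is the granularity: each molecule, not just the deposition as a whole, is individually discoverable.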

There is no overlap between these two presentations of the data. There also appears to be no API (application programming interface) which might allow one to write code to construct one's own searches. The apparent absence of an API means that really only a human navigating the set menus can discover and re-use the data; it might not be mineable by a machine, for example. The absence of an API is not that unusual; only some of the best-known molecular databases offer one (the RCSB Protein Data Bank is a good example). More significantly, each instance of such a molecule-based database is likely to have its own way of accessing the data, and even if a documented API were available, one would still have to write specific code for each such resource.

So the first bowl contains what I suggest is cold porridge, and the second is perhaps equivalent to a table d'hôte menu. Does Goldilocks have a third option? I would argue yes:

  1. We recently published data for 158 kilo molecules,[3] each of which carries its own metadata. That metadata can be queried using any search engine that supports the basic metadata standards:
    http://search.datacite.org/ui?q=has_media:true&fq=prefix:10.14469
    is an example. Or, armed with the metadata schema, one could write one's own search engine; in theory at least, that code should serve to query ANY repository that supports these standards.

You could argue that all that has happened is that a specific database API (if it exists) has simply been replaced by a specific metadata schema. But these metadata schemas are controlled standards, whose components should be self-describing (one can see the schema components by invoking the link above).
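Armed with such a schema, writing one's own search over standard metadata records is straightforward. A minimal sketch, mirroring the has_media/prefix query above; the DOIs are ones cited on this blog, but the records themselves (titles, media flags) are hypothetical and only loosely follow the DataCite schema:

```python
# Minimal DataCite-style metadata records; the records are hypothetical,
# constructed only to illustrate the query, not fetched from DataCite.
records = [
    {"doi": "10.14469/hpc/1053", "title": "NMR data with metadata", "hasMedia": True},
    {"doi": "10.5517/ccdc.csd.cc1nb303", "title": "Crystal structure", "hasMedia": True},
    {"doi": "10.14469/ch/2", "title": "Revised Cambridge NCI database", "hasMedia": False},
]

def search(recs, prefix=None, has_media=None):
    """Filter records by DOI prefix and/or media flag, in the spirit of
    ?q=has_media:true&fq=prefix:10.14469 on search.datacite.org."""
    hits = recs
    if prefix is not None:
        hits = [r for r in hits if r["doi"].startswith(prefix + "/")]
    if has_media is not None:
        hits = [r for r in hits if r["hasMedia"] == has_media]
    return hits

print([r["doi"] for r in search(records, prefix="10.14469", has_media=True)])
```

Because the schema is a controlled standard, the same few lines would in principle apply to any repository exposing metadata in that form; that is the portability a parochial per-database API cannot offer.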

As the archival of data (RDM) becomes increasingly important, communities will have to start making decisions about which flavour of data-porridge to offer Goldilocks. For molecular data at least, I suggest the third option is highly desirable and perhaps likely to be the most persistent. Parochial databases very much depend on a specialised team of people to maintain them in perpetuity, which I gather now means 20 years. At the very least, we should start a debate about how the future will evolve. Let us not leave this debate merely in the hands of a small number of large organisations that are likely to make decisions based on their own business models. After all, it starts off at least as our data, not theirs! Arguably, we as authors have now largely lost control over how our stories (journal articles) are managed; let us not cede the same for data.

References

  1. R. Ramakrishnan, P.O. Dral, M. Rupp, and O.A. von Lilienfeld, "Quantum chemistry structures and properties of 134 kilo molecules", Scientific Data, vol. 1, 2014. https://doi.org/10.1038/sdata.2014.22
  2. R. Ramakrishnan, P.O. Dral, M. Rupp, and O.A. von Lilienfeld, "Quantum chemistry structures and properties of 134 kilo molecules", 2014. https://doi.org/10.6084/m9.figshare.978904
  3. Y. Zhang, H.S. Rzepa, J.J.P. Stewart, P. Murray-Rust, M.J. Harvey, N. Mason, A. McLean, and Imperial College High Performance Computing Service, "Revised Cambridge NCI database", 2014. https://doi.org/10.14469/ch/2

A newcomer in the game of how we find and use data.

Saturday, May 17th, 2014

I remember a time when tracking down a particular property of a specified molecule was an all-day effort, spent in the central library (or further afield). Then came the likes of STN Online (~1980) and later Beilstein, but only if your institution had a subscription. Let me cut to the chase: consider this URL: http://search.datacite.org/ui?q=InChIKey%3DLQPOSWKBQVCBKS-PGMHMLKASA-N. The site is DataCite, which collects metadata about cited data! Most of that data is open, in the sense that it can be retrieved without a subscription (but see here that it is not always made easy to do so). So the above is a search for cited data which contains the InChIKey LQPOSWKBQVCBKS-PGMHMLKASA-N. This produces the result:
[Screenshot: DataCite search result for the InChIKey query]
This tells you who published the data (but oddly, the date is given merely to the nearest year? It is beta software, after all). The advanced equivalent of this search looks like this:

[Screenshot: DataCite advanced search form]

where the subject of the search is now the InChIKey. If you are familiar with the various molecular search engines, you will appreciate that this generic data search is still fairly primitive. But SEO (search engine optimisation), achieved by improving the quality of the metadata, would help improve that experience.
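As an aside, before handing an InChIKey to any of these search engines it is easy to sanity-check its fixed layout: a 14-letter block, a 10-letter block and a single letter, hyphen-separated. A minimal sketch covering standard InChIKeys:

```python
import re

# Layout of a standard InChIKey: 14-letter connectivity hash, 10-letter
# block (stereochemistry/isotope hash plus flag characters), and a
# single protonation character, all uppercase and hyphen-separated.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")

def looks_like_inchikey(s: str) -> bool:
    """Cheap syntactic check before sending the key to a search engine."""
    return bool(INCHIKEY_RE.fullmatch(s))

print(looks_like_inchikey("LQPOSWKBQVCBKS-PGMHMLKASA-N"))  # True
print(looks_like_inchikey("not-a-key"))                    # False
```

A failed check catches copy-and-paste truncations, which otherwise simply return zero search results with no explanation.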

The important thing about DataCite is that it searches only the metacontent of digital repositories, wherein one may expect to find properly curated data, and in particular the possibility of finding not merely highly processed data but also the original (instrumental or computational) datafile from which the metadata was abstracted. Rather than a visual graph, one might expect to also find the original data (to however many decimal places). Rather than just molecular coordinates, one might also find a full wavefunction describing the electron density distribution, or a full spectral analysis. In the original form as deposited by the researchers, and not in a processed form as supplied by an "added value" resource. Don't get me wrong; validated data is wonderful, but validation has to be done according to a schema, and such schemas change, improve and evolve over time.

The other important point which the above introduces, I think, is the concept that DataCite (and similar organisations) might act as a portal through which software agents could validate and aggregate data. The utopian world would be one in which every organisation that produces data captures it in a form that DataCite and others can find. Unless, of course, the data is in itself also their business model and they wish to exert a monopoly over it. One might appreciate monopolies if the alternative is not having access to the data at all, but perhaps at the expense of innovation? I cannot help but feel that once data citation as shown above becomes generally accepted best practice amongst scientists, entirely new ways of adding value to it will emerge in abundance. It would be interesting to see whether the current, more monopolistic models survive this transition by upping their own game.