Posts Tagged ‘Information’

A search of some major chemistry publishers for FAIR data records.

Friday, April 12th, 2019

In recent years, findable data has become ever more important (the F in FAIR). Here I test that F using the DataCite search service.

First, an introduction to this service. It is a metadata database describing datasets and other research objects. One of its properties is relatedIdentifier, which records other identifiers associated with the dataset, such as the DOI of any published article associated with the data, but it could also be a pointer to related datasets.

One can query thus:

  1. https://search.datacite.org/works?query=relatedIdentifiers.relatedIdentifier:*
    which retrieves the very healthy looking 6,179,287 works.
  2. One can restrict this to a specific publisher by the DOI prefix assigned to that publisher:
    ?query=relatedIdentifiers.relatedIdentifier:10.1021*
    which returns a respectable 210,240 works.
  3. It turns out that the major contributors to FAIR currently are crystal structures from the CCDC. One can remove them from the search to see what is left over:
    ?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+NOT+(identifier:*10.5517*) 
    and one is down to 14,213 works, of which many nevertheless still appear to be crystal structures. These may be links to other crystal datasets.
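The three searches above differ only in the query string, so they can be generated for any publisher prefix. The following is a minimal sketch, assuming the endpoint is exactly as quoted in the list; the function name `related_to_prefix` is hypothetical, and no request is actually sent.

```python
from typing import Optional

# Endpoint exactly as quoted in the searches above.
BASE = "https://search.datacite.org/works?query="


def related_to_prefix(prefix: str, exclude_prefix: Optional[str] = None) -> str:
    """Build a search URL for works whose related identifier starts with `prefix`.

    If `exclude_prefix` is given, works whose own identifier contains that
    prefix (e.g. 10.5517 for the CCDC) are removed, as in search 3.
    """
    query = f"relatedIdentifiers.relatedIdentifier:{prefix}*"
    if exclude_prefix is not None:
        query = f"({query})+NOT+(identifier:*{exclude_prefix}*)"
    return BASE + query


if __name__ == "__main__":
    print(related_to_prefix("10.1021"))             # search 2
    print(related_to_prefix("10.1021", "10.5517"))  # search 3
```

Swapping in another prefix (10.1039 appears later in this post) gives the equivalent pair of searches for a different publisher.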

I have performed searches 2 and 3 for some popular publishers of chemistry (the same set that were analysed here).

Publisher | Search 2 | Search 3
ACS | 210,240 | 14,213
RSC | 138,147 | 1,279
Elsevier | 185,351 | 56,373
Nature | 12,316 | 8,104
Wiley | 135,874 | 9,283
Science | 3,384 | 2,343
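To make the two columns easier to compare, one can express search 3 as a share of search 2, i.e. the fraction of linked works that remain once CCDC entries are removed. This is a small sketch using only the counts in the table; the name `non_ccdc_share` is hypothetical.

```python
# (search 2, search 3) counts copied from the table above.
counts = {
    "ACS": (210_240, 14_213),
    "RSC": (138_147, 1_279),
    "Elsevier": (185_351, 56_373),
    "Nature": (12_316, 8_104),
    "Wiley": (135_874, 9_283),
    "Science": (3_384, 2_343),
}


def non_ccdc_share(search2: int, search3: int) -> float:
    """Fraction of a publisher's linked works left after excluding CCDC entries."""
    return search3 / search2


if __name__ == "__main__":
    for publisher, (s2, s3) in counts.items():
        print(f"{publisher:<9}{100 * non_ccdc_share(s2, s3):5.1f}%")
```

The shares differ widely: below 10% for ACS, RSC and Wiley, but around two thirds for Nature and Science.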

These publishers all have significant numbers of datasets which at least accord with the F of FAIR. Many datasets may lack metadata pointing back to a published article, since that link can only be added once the DOI of the article appears, in other words AFTER the publication of the dataset. So these numbers are probably underestimates rather than overestimates.

How about the other way around? Rather than datasets that have a journal article as a related identifier, we could search for articles that have a dataset as a related identifier?

  1. ?query=(identifier:*10.1039*)+AND+(relatedIdentifiers.relatedIdentifier:*)
    returns a rather mysterious "nothing found". It might be that there is no mapping of this search between the CrossRef and DataCite metadata schemas.
  2. And just to show the searches are behaving as expected:
    ?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+AND+(identifier:*10.5517*)
    returns 196,027 works.
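The sanity check can be made explicit: for the 10.1021 prefix, the works with a CCDC identifier and those without should together account for every work returned by search 2. A short sketch using only the counts quoted in this post:

```python
# Counts quoted in this post for the 10.1021 (ACS) prefix.
acs_all = 210_240           # related identifier matches 10.1021*
acs_without_ccdc = 14_213   # ... NOT identifier:*10.5517*
acs_ccdc_only = 196_027     # ... AND identifier:*10.5517*


def is_partition(total: int, *parts: int) -> bool:
    """True when the parts add up exactly to the total."""
    return sum(parts) == total


if __name__ == "__main__":
    print(is_partition(acs_all, acs_without_ccdc, acs_ccdc_only))
```

The two subsets do sum exactly to 210,240, which is what one expects if the NOT and AND forms of the query are being interpreted as complements.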

It will also be of interest to show how these numbers change over time. Is there an exponential increase? We shall see.

Finally, we have not really explored adherence to e.g. the AIR of FAIR. That is for another post.

Re-inventing the anatomy of a research article.

Saturday, December 29th, 2018

The traditional structure of the research article has been honed and perfected for over 350 years by its custodians, the publishers of scientific journals. Nowadays, for some journals at least, it might be viewed as much as a profit centre as the perfected mechanism for scientific communication. Here I take a look at the components of such articles to try to envisage its future, with the focus on molecules and chemistry.

The formula which is mostly adopted by authors when they sit down to describe their chemical discoveries is more or less as follows:

  1. An introduction, setting the scene for the unfolding narrative
  2. Results. This is where much of the data from which the narrative is derived is introduced. Such data can be presented in the form of:
    • Tables
    • Figures and schemes
    • Numerical and logical data embedded in narrative text
  3. Discussion, where the models constructed from the data are illustrated and new inferences presented. Very often categories 2 and 3 are conflated into one single narrative.
  4. Conclusions, where everything is brought together to describe the essential aspects of the new science.
  5. Bibliography, where previous articles pertinent to the narrative are listed.

In the last decade or so, the management of research data has developed as a field of its own, with three phases:

  6. Setting out a data management plan at the start of the project, often a set of aspirations together with putative actions,
  7. the day-to-day management of the data as it emerges in the form of an electronic laboratory notebook (ELN),
  8. the publication of selected data from the ELN into a repository, together with the registration of metadata describing the properties of the data.

In the latter category, item 8 can be said to be a game-changer, a true disruptive influence on the entire process. The key aspect is that it constitutes independent publication of data to sit alongside the object constructed from 1-5. More disruption emerges from the open citations project, whereby category 5 above can be released by publishers to adopt its own separate existence. So now we see that of the five essential anatomic components of a research article, two are already starting to achieve their own independence. Clearly the re-invention of the anatomy of the research article is well under way already.

Next I take a look at what sorts of object might be found in category 8, drawing very much on our own experience of implementing 7 and 8 over the last twelve years or so. I start by observing that in 2 above, figures are perhaps the object most in need of disruptive re-invention. In the 1980s, authors were much taken by the introduction of colour as a means of conveying information within a figure more clearly, although the significant costs then had to be borne directly by these authors (and with a few journals this persists to this day). By the early 1990s, the introduction of the Web[1] offered new opportunities not only of colour but of an extra dimension (or at least the illusion of one) by means of introducing interactivity for three-dimensional models. Some examples resulting from combining figures from category 2 with 8 above are listed in the table below.

Examples of re-invented data objects from category 2
Example | Object title | Object DOI | Article DOI
1 | Figure 9. Catalytic cycle involving one amine …etc. | 10.14469/hpc/1854 | 10.1039/C7SC03595K
2 | FAIR Data Figure. Mechanistic insights into boron-catalysed direct amidation reactions | 10.14469/hpc/4919 | 10.1039/C7SC03595K
3 | FAIR Data table. Computed relative reaction free energies (kcal/mol-1) of Obtusallene derived oxonium and chloronium cations | 10.14469/hpc/1248 | 10.1021/acs.joc.6b02008
4 | (raw) NMR data for Epimeric Face-Selective Oxidations … | 10.14469/hpc/1267 | 10.1021/acs.joc.6b02008
5 | Bibliography | 10.14469/hpc/1116 | 10.1021/acs.joc.6b02008

Example 1 illustrates how a figure from category 2 above can be augmented with active hyperlinks specifying the DOI of the data in category 8 from which the figure is derived, thus creating a direct and contextual connection between the research article and the research data it is based upon. These links are embedded only in the Acrobat (PDF) version of the article as part of the production process undertaken by the journal publisher. Download Figure 9 from the link here and try it for yourself or try the entire article from the journal, where more figures are so enhanced.

Example 2 takes this one stage further. The hyperlinks in the published figure in example 1 were embedded in software capable of resolving them, namely a PDF viewer. But that is all that this software allows. By relocating the hyperlink into a Web browser instead, one can add further functionality in the form of JavaScript, perhaps better described as workflows (supported by browsers but not by Acrobat). There are three such workflows in example 2.

  • The first uses an image map to associate a region of the figure with a data object defined by a DOI.
  • The second interrogates the metadata specifically associated with the DOI (the same DOIs that are seen in the figure itself) to see if there is any so-called ORE metadata available (ORE = Object Re-use and Exchange). If there is, it uses this information to retrieve the data itself and pass it through to
  • the third workflow represented by a set of JavaScripts known as JSmol. These interpret the data received and construct an interactive visual 3D molecular model representing the retrieved data.

All this additional workflowed activity is implemented in a data repository. It is not impossible that it could also be implemented at the journal publisher end of things, but it is an action that would have to be supported by multiple publishers. Arguably this sort of enhancement is far better suited and more easily implemented by a specialised data publisher, i.e. a data repository.

Example 3 does the same thing for a table.

Example 4 enhances in a different manner. Conventionally NMR data is added to the supporting information file associated with a journal article, but such data is already heavily processed and interpreted. The raw instrumental data is never submitted to the journal and is almost always available only by direct request from the original researchers (at least if the request is made whilst the original researchers are still contactable!). The data repository provides a new mechanism for making such raw instrumental (and indeed computational) data an integral part of the scientific process.

Example 5 shows how a bibliography can be linked to a secondary bibliography (citations 35 and 36 in this example in the narrative article) and perhaps in the future to Open Citations semantic searches for further cross references.

So by deconstructing the components of the standard scientific article, re-assembling some of them in a better-suited environment and then linking the two sets of components to each other, one can start to re-invent the genre and hopefully add more tools for researchers to use to benefit their basic research processes. The scope for innovation seems considerable. The issue of course is (a) whether publishers see this as a viable business model or instead wish to protect their current model of the research article and (b) whether authors are willing to climb the learning curve and make the additional effort to go in this direction. As I have noted before, the current model is deficient in various ways; I do not think it can continue without significant reinvention for much longer. And I have to ask: if reinvention does emerge, will science be the prime beneficiary?

References

  1. H.S. Rzepa, B.J. Whitaker, and M.J. Winter, "Chemical applications of the World-Wide-Web system", Journal of the Chemical Society, Chemical Communications, pp. 1907, 1994. https://doi.org/10.1039/c39940001907

Harnessing FAIR data: A suggested useful persistent identifier (PID) for quantum chemical calculations.

Tuesday, August 7th, 2018

Harnessing FAIR data is an event being held in London on September 3rd; no doubt most speakers will espouse its virtues and speculate about how to realize its potential. Admirable aspirations indeed, but capturing hearts and minds also needs lots of real life applications! Whilst assembling a forthcoming post on this blog, I realized I might have one nice application which also pushes the envelope a bit further, in a manner that I describe below.

The post I refer to above is about using quantum chemical calculations to chart possible mechanistic pathways for the reaction between a carboxylic acid and an amine to form an amide. The FAIR data for the entire project is collected at DOI: 10.14469/hpc/4598. Part of what makes it FAIR is the metadata not only collected about this data but also formally registered with the DataCite agency. Registration in turn enables Finding; it is this aspect I want to demonstrate here.

The metadata for the above DOI includes information such as;

  1. The ORCID persistent identifier (PID) for the creator of the data (in this instance myself)
  2. Date stamps for the original creation date and subsequent modifications.
  3. A rights declaration, in this case the CC0 license which describes how the data can be re-used.
  4. Related identifiers, in this case describing members of this collection.

The data itself is held in the members of the collection, each of which is described by a more specific set of metadata in addition to the more general types in the above list (e.g. 10.14469/hpc/4606).

  1. One important additional metadata descriptor is the ORE locator (Object Re-use and Exchange, itself almost a synonym for FAIR). This allows a machine to deduce a direct path to the data file itself, and hence to retrieve it automatically if desired. It is important to note that the DOI itself (i.e. 10.14469/hpc/4606) points only to the “landing page” for the dataset, and does not necessarily describe the direct path to any specific file in the dataset. The ORE path can be used with e.g. software such as JSmol to directly load a molecule based only on its DOI. You can see an example of this here.
  2. Each molecule-based dataset contains additional specific metadata relating to the molecule itself. For example this is how the InChIKey, an identifier specific to that molecule, is expressed in metadata;
    <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">PVXKWVPAMVWJSQ-UHFFFAOYSA-N</subject>
    The advantage of expressing the metadata in this way is that a general search of the type:
    https://search.datacite.org/works?query=subjectScheme:inchikey+subject:CZABGBRSHXZJCF-UHFFFAOYSA-N
    can be used to track down any molecule with metadata corresponding to a given InChIKey.
  3. Here is more metadata, introduced in this blog. It relates to the (computed) value of the Gibbs energy (the energy unit is in Hartree), as returned by the Gaussian program;
    <subject subjectScheme="Gibbs_Energy" schemeURI="https://goldbook.iupac.org/html/G/G02629.html" valueURI="http://gaussian.com/thermo/">-649.732417</subject>
    I argue here that it represents a unique identifier for a molecule calculation using the quantum mechanical procedures implemented in e.g. Gaussian. This identifier differs from the InChIKey in that it can be truncated to provide different levels of information.
    • At the coarsest level, a search of the type
      https://search.datacite.org/works?query=subjectScheme:Gibbs_energy+subject:-649.*
      should reveal all molecules with the same number of atoms and electrons whose Gibbs energy has been calculated, but not necessarily with the same InChI (i.e. they may be isomers, or transition states, etc). This level might be useful for revealing most (not necessarily all) molecules involved in say a reaction mechanism. It should also be insensitive to the program system used, since most quantum codes will return a value for the Gibbs energy if the same procedures have been used (i.e. DFT method, basis set, solvation model and dispersion correction) accurate to probably 0.01 Hartree.
    • The top level of precision however is high enough to almost certainly relate to a specific molecule and probably using a specific program;
      https://search.datacite.org/works?query=subjectScheme:Gibbs_energy+subject:-649.732417
    • The searcher can experiment with different levels of precision to narrow or broaden the search.
    • I would also address the issue (before someone asks) of why I have used the Gibbs energy rather than the Total energy. Put simply, the Gibbs energy is far more useful in a chemical context. It can be used to relate the relative Gibbs energies of different isomers of the same molecule to e.g. the equilibrium constant that might be measured. Or the difference in Gibbs energies between a reactant and a transition state can be used to derive the free energy activation barrier for a reaction. The total energy is not so useful in such contexts, although of course it too could be added as a subject in the metadata above if a real use for it is found.
  4. The searcher can also use Boolean combinations of metadata, such as specifying both the InChIKey and the Gibbs Energy, along with say the ORCID of the person who may have published the data;
    https://search.datacite.org/works?query=
    subjectScheme:Gibbs_energy+subject:-649.*+
    subjectScheme:inchikey+subject:CZABGBRSHXZJCF-UHFFFAOYSA-N+
    ORCID:0000-0002-8635-8390
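The searches in the list above can be assembled programmatically. What follows is a minimal sketch, assuming the query syntax is exactly as quoted in this post; the helper names (`parse_subject`, `truncate`, `subject_query`, `combined_query`) are hypothetical and nothing is sent over the network. Note that the metadata example spells the scheme Gibbs_Energy whereas the quoted searches use Gibbs_energy; the sketch passes the scheme explicitly rather than guessing which form the index expects.

```python
import xml.etree.ElementTree as ET

# Endpoint as quoted in the searches above.
BASE = "https://search.datacite.org/works?query="

# The Gibbs energy subject element shown in item 3.
SUBJECT_XML = (
    '<subject subjectScheme="Gibbs_Energy" '
    'schemeURI="https://goldbook.iupac.org/html/G/G02629.html" '
    'valueURI="http://gaussian.com/thermo/">-649.732417</subject>'
)


def parse_subject(xml_text: str) -> tuple:
    """Return (subjectScheme, value) from a single <subject> element."""
    element = ET.fromstring(xml_text)
    return element.attrib["subjectScheme"], element.text.strip()


def truncate(value: str, decimals: int) -> str:
    """Keep `decimals` digits after the point; add a * wildcard if digits were dropped."""
    whole, _, fraction = value.partition(".")
    if decimals >= len(fraction):
        return value
    return f"{whole}.{fraction[:decimals]}*"


def subject_query(scheme: str, value: str) -> str:
    """One scheme/value pair in the syntax used by the searches above."""
    return f"subjectScheme:{scheme}+subject:{value}"


def combined_query(*terms: str) -> str:
    """Join several terms with + as in the Boolean example of item 4."""
    return BASE + "+".join(terms)


if __name__ == "__main__":
    scheme, energy = parse_subject(SUBJECT_XML)
    for decimals in (0, 2, 6):  # coarse, intermediate and full precision
        print(combined_query(subject_query("Gibbs_energy", truncate(energy, decimals))))
    print(combined_query(
        subject_query("Gibbs_energy", truncate(energy, 0)),
        subject_query("inchikey", "CZABGBRSHXZJCF-UHFFFAOYSA-N"),
        "ORCID:0000-0002-8635-8390",
    ))
```

Varying `decimals` is the programmatic equivalent of the searcher experimenting with different levels of precision.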

I have tried to show above how FAIR data implies some form of rich (registered) metadata. And how the metadata can be used to Find (the F in FAIR) data with very specific properties, thus Harnessing FAIR data.


It is a current limitation of the V4.1 DataCite schema that there appears to be no way to specify the data type of the subject, including any units. In theory, a range query of the type:
https://search.datacite.org/works?query=
subjectScheme:Gibbs_energy+subject:[-649.1 TO -649.8]

should be more specific, but I have not yet gotten it to work, probably because the lack of data-typing means it is not recognised as a range of numeric values. Implicit in this search is the grouping
https://search.datacite.org/works?query=(subjectScheme:Gibbs_energy+subject:-649.*)
+
(subjectScheme:inchikey+subject:CZABGBRSHXZJCF-UHFFFAOYSA-N)
+ORCID:0000-0002-8635-8390

Currently however DataCite do not correctly honour this form of grouping.
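Until a numeric range search works on the server, one possible workaround is to run the coarse wildcard search and apply the range test locally to the returned subject values. This is only a sketch under stated assumptions: the records below are invented placeholders (only the value -649.732417 comes from this post), and `in_energy_range` is a hypothetical helper. The bounds are sorted before comparing, so they can be given in either order.

```python
def in_energy_range(subject_value: str, bound_a: float, bound_b: float) -> bool:
    """True if the subject string parses as a number lying between the two bounds."""
    try:
        energy = float(subject_value)
    except ValueError:
        # The schema does not type the subject, so non-numeric text is possible.
        return False
    low, high = sorted((bound_a, bound_b))
    return low <= energy <= high


# Invented placeholder records standing in for the results of a coarse search.
records = [
    {"doi": "placeholder-1", "gibbs": "-649.732417"},
    {"doi": "placeholder-2", "gibbs": "-649.05"},
    {"doi": "placeholder-3", "gibbs": "-650.2"},
    {"doi": "placeholder-4", "gibbs": "n/a"},
]

if __name__ == "__main__":
    hits = [r["doi"] for r in records if in_energy_range(r["gibbs"], -649.1, -649.8)]
    print(hits)
```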

Examples please of FAIR (data); good and bad.

Sunday, May 6th, 2018

The site fairsharing.org is a repository of information about FAIR (Findable, Accessible, Interoperable and Reusable) objects such as research data.

A project to inject chemical components, rather sparse at the moment at the above site, is being promoted by workshops under the auspices of e.g. IUPAC and CODATA and the GO-FAIR initiative. One aspect of this activity is to help identify examples of both good (FAIR) and indeed less good (unFAIR) research data as associated with contemporary scientific journal publications.

Here is one example I came across in 2017.[1] The data associated with this article is certainly copious, 907 pages of it, not including data for 21 crystal structures! The latter is a good example of FAIR, being offered in a standard format (CIF) well-adapted for the type of data contained therein and for which there are numerous programs capable of visualising and inter-operating (i.e. re-using) it. The former is in PDF, not a format originally developed for data and one could argue is closer to the unFAIR end of the spectrum. More so when you consider this one 907-page paginated document contains diverse information including spectra on around 60 molecules. Thus the spectra are all purely visual; they are obviously data but in a form largely designed for human consumption and not re-use by software. The text-based content of this PDF does have numerous patterns, which lend themselves to pattern recognition software such as OSCAR, but patterns are easily broken by errors or inexperience and so we cannot be certain what proportion of this can be recovered. The metadata associated with such a collection, if there is any at all, must be general and cannot be easily related to specific molecules in the collection. So I would argue that 907 pages of data as wrapped in PDF is not a good example of FAIR. But it is how almost all of the data currently being reported in chemistry journals is expressed. Indeed many a journal data editor (a relatively new introduction to the editorial teams) exerts a rigorous oversight over the data presented as part of article submissions to ensure it adheres to this monolithic PDF format.

You can also visit this article in Chemistry World (rsc.li/2HG7lTk) for an alternative view of what could be regarded as rather more FAIR data. The article has citations to the FAIR components, which are not published as part of the article or indeed by the journal itself but are held separately in a research data repository. You will find them at doi: 10.14469/hpc/3657 where examples of computational, crystallographic and spectroscopic data are available.

The workshop I allude to above will be held in July. Can I ask anyone reading this blog who has a favourite FAIR or indeed unFAIR example of data they have come across to share these here? We also need to identify areas simply crying out for FAIRer data to be made available as part of the publishing process beyond the types noted above. I hope to report back on both such feedback and the events at this workshop in due course.

References

  1. J.M. Lopchuk, K. Fjelbye, Y. Kawamata, L.R. Malins, C. Pan, R. Gianatassio, J. Wang, L. Prieto, J. Bradow, T.A. Brandt, M.R. Collins, J. Elleraas, J. Ewanicki, W. Farrell, O.O. Fadeyi, G.M. Gallego, J.J. Mousseau, R. Oliver, N.W. Sach, J.K. Smith, J.E. Spangler, H. Zhu, J. Zhu, and P.S. Baran, "Strain-Release Heteroatom Functionalization: Development, Scope, and Stereospecificity", Journal of the American Chemical Society, vol. 139, pp. 3209-3226, 2017. https://doi.org/10.1021/jacs.6b13229

FAIR data ⇌ Raw data.

Thursday, December 7th, 2017

FAIR data is increasingly accepted as a description of what research data should aspire to; Findable, Accessible, Inter-operable and Re-usable, with Context added by rich metadata (and also that it should be Open). But there are two sides to data, one of which is the raw data emerging from say an instrument or software simulations and the other in which some kind of model is applied to produce semi- or even fully processed/interpreted data. Here I illustrate a new example of how both kinds of data can be made to co-exist.

I will start with a recent publication[1] with the title Crystallographic Snapshot of an Arrested Intermediate in the Biomimetic Activation of CO2. The nature of this intermediate caught the eye of another research group, who responded with their own critique[2] along with the comment “However, since we have no access to the original crystallographic data …” They might have been referring to the semi-processed data (containing the so-called hkl structure factors) but they may also have been alluding to the raw image data captured directly from the diffractometer cameras. That traditionally has not been available via the CSD (Cambridge Structural Database), but would be required for a complete re-analysis of the crystal structure. Now the first example of how both FAIR (processed) data and raw data can co-exist has appeared.

The latest version of the CSD database shows an entry resulting from the following publication[3] and the deposited data has its own DOI there (10.5517/ccdc.csd.cc1n9ppb). That entry in turn has a DOI pointer to the Raw data (10.14469/hpc/2300) held in a different location and the pointer is reciprocated (⇌) with the latter pointing back to the former. Both datasets point to the original article, thus completing a holy triangle.

There is more. The Raw dataset (10.14469/hpc/2300) declares it is a member of a superset, called Crystal structure data for Synthesis and Reactions of Benzannulated Spiroaminals; Tetrahydrospirobiquinolines (10.14469/hpc/2297) where you can find information about six other related structures. That collection is in turn a member of a superset called Synthesis and Reactions of Benzannulated Spiroaminals; Tetrahydrospirobiquinolines (10.14469/hpc/2099) where DOIs to other types of data associated with this project can be found, such as Computational data (10.14469/hpc/2098) and NMR data (10.14469/hpc/2294). Although a human can with some determination follow these associations up, down and across, the system is designed to also be followed by automated algorithms that could traverse this web quickly and efficiently.
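As an illustration of what such an automated traversal might look like, here is a minimal sketch of a breadth-first walk over the links described in this post. The links are hard-coded in a dictionary purely for illustration; a real crawler would instead read the related identifiers from the registered metadata of each DOI, a step that is assumed here and not shown. The name `reachable` is hypothetical.

```python
from collections import deque

# Links as described in this post (identifier -> identifiers it points to).
LINKS = {
    "10.5517/ccdc.csd.cc1n9ppb": ["10.14469/hpc/2300", "10.1021/acsomega.7b00482"],
    "10.14469/hpc/2300": [
        "10.5517/ccdc.csd.cc1n9ppb",   # reciprocal pointer back to the CSD entry
        "10.14469/hpc/2297",           # crystal structure collection
        "10.1021/acsomega.7b00482",    # the original article
    ],
    "10.14469/hpc/2297": ["10.14469/hpc/2099"],         # project collection
    "10.14469/hpc/2099": ["10.14469/hpc/2098", "10.14469/hpc/2294"],
    "10.1021/acsomega.7b00482": ["10.14469/hpc/2099"],  # cited in the article
}


def reachable(start: str, links: dict) -> list:
    """Breadth-first list of every identifier reachable from `start`."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        doi = queue.popleft()
        order.append(doi)
        for neighbour in links.get(doi, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order


if __name__ == "__main__":
    for doi in reachable("10.5517/ccdc.csd.cc1n9ppb", LINKS):
        print(doi)
```

Starting from the CSD entry alone, the walk reaches the raw data, both collections, the computational and NMR datasets and the article itself.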

So you can now see that a crystal structure held in the CSD could be the starting point for a journey of FAIR data discovery, in a manner that has not hitherto been possible. How quickly the CSD will become populated by links to Raw (and other) data remains to be seen. I have not yet discovered any mechanism for specifying a CSD query which stipulates that Raw data must be available, but no doubt this will come.

To end, back to the Biomimetic Activation of CO2 referred to at the start. With no access to the original data, recourse was made to computational modelling.[2] Which is where I came in, since I wanted access to the original (computational) data. Sadly it did not appear to be available with the article,[2] in much the same manner as the original complaint. Perhaps, when FAIR data becomes fully accepted as part of how science is done nowadays, such complaints will become ever rarer!


In fact the original authors did respond[4] with an acknowledgement that their original conclusions were not correct.

Almost. The article [3] cites DOI: 10.14469/hpc/2099 (Ref 28), but it does not cite DOI: 10.5517/ccdc.csd.cc1n9ppb because the latter had not been minted yet at the time the final proofs were corrected, and there is no mechanism to add it at a later stage.

References

  1. S.L. Ackermann, D.J. Wolstenholme, C. Frazee, G. Deslongchamps, S.H.M. Riley, A. Decken, and G.S. McGrady, "Crystallographic Snapshot of an Arrested Intermediate in the Biomimetic Activation of CO2", Angewandte Chemie International Edition, vol. 54, pp. 164-168, 2014. https://doi.org/10.1002/anie.201407165
  2. J. Hurmalainen, M.A. Land, K.N. Robertson, C.J. Roberts, I.S. Morgan, H.M. Tuononen, and J.A.C. Clyburne, "Comment on “Crystallographic Snapshot of an Arrested Intermediate in the Biomimetic Activation of CO2”", Angewandte Chemie International Edition, vol. 54, pp. 7484-7487, 2015. https://doi.org/10.1002/anie.201411654
  3. J. Almond-Thynne, A.J.P. White, A. Polyzos, H.S. Rzepa, P.J. Parsons, and A.G.M. Barrett, "Synthesis and Reactions of Benzannulated Spiroaminals: Tetrahydrospirobiquinolines", ACS Omega, vol. 2, pp. 3241-3249, 2017. https://doi.org/10.1021/acsomega.7b00482
  4. S.L. Ackermann, D.J. Wolstenholme, C. Frazee, G. Deslongchamps, S.H.M. Riley, A. Decken, and G.S. McGrady, "Corrigendum: Crystallographic Snapshot of an Arrested Intermediate in the Biomimetic Activation of CO2", Angewandte Chemie International Edition, vol. 54, pp. 7470-7470, 2015. https://doi.org/10.1002/anie.201504197