Archive for the ‘Chemical IT’ Category

Metadata. Why?

Tuesday, July 2nd, 2019

I have had some interesting discussions recently regarding metadata. What emerges is that it can be quite a broadly defined concept and it is clear that a variety of answers might be obtained when asking the simple question “what is it useful for?” Here I set out some of my answers to that question.

  1. Metadata vs Data. Where on the continuum between data and metadata does the boundary lie, and is the metadata fine-grained or more broadly grained?
  2. What is its ultimate destination? Should metadata reside inside a complete package or container of data, serving the purpose of succinctly describing what to expect in that package? Or should it reside entirely separately from the data package in some sort of metadata store (MDS)?
  3. Are there issues of trust or provenance? Thus, how was the metadata created, by a person or a process, and when? Has it been changed since it was created? If so, what are the revisions? Does the metadata adhere to a specified structure and has it been validated against that structure?

Some context needs to be applied before answering such questions (context is perhaps a synonym for metadata!)

  1. Firstly, I am going to use metadata here in the context of describing data itself (i.e. rather than other research objects such as journal articles). This would include answers to questions such as:
    1. who created both the data and its metadata.
    2. when were both created and perhaps modified.
    3. where the data is stored
    4. what are its defined internal structures (sometimes also called  MEDIA types).
    5. who its “publisher” is (the organisation where the data was produced or is curated).
    6. what are the access and re-use rights associated with the data.

    These are broad-grained provenance if you like.

  2. Next, metadata describing the specific context of the data, e.g. in my case the chemistry associated with it.
    1. Is it about a molecule?
    2. if so what is the nature of the molecule?
    3. Is it computational data about a molecule.
    4. If so, what software was used for the computations and its parameters, inputs and outputs.
    5. Might it be instrumental data recorded for a molecule?
    6. If the latter, does it record the instrument and its settings?

    We are now moving into fine-grained metadata, and perhaps even crossing the boundary into data itself, since the parameters for either software or instruments can be large and complex and are often so heavily mixed into the data itself that their extrication may be a challenge.

  3. Finally, what is the purpose of creating and storing such metadata.
    1. Here the context is of “discoverability” (of the data itself) and perhaps also
    2. “Reusability” and/or “Interoperability” (of the data itself).
    3. These attributes are nicely summarised by the acronym FAIR, where discoverability is specified by both Findability and Accessibility.

Before introducing examples based on metadata with the focus on discoverability, I want to distinguish between locally packaged metadata and separated metadata (Qu. 2 above). The examples below relate purely to the latter, which has been created as a separate entity by registration with an agency such as DataCite. Such registration also addresses Qu. 3 above about trust. This external agency adds trust by recording the identity of the person (or a process or workflow initiated by a person) registering the metadata together with the registration date (the Datestamp). It also monitors any changes to the metadata (which are allowed) by keeping its version history. Interestingly, there seems to be no mechanism to record any processes or workflows used to create metadata, so as to learn how the metadata itself was assembled. Nor have I seen much discussion of this aspect; one for the future I fancy.

I now introduce some examples of discoverability. The descriptions are quite short and are meant to be used in conjunction with a “reverse-engineering” of the (somewhat) human readable search query. These queries are also deposited as “data” at DOI: 10.14469/hpc/5920

| Entry | Description | Elasticsearch query |
| --- | --- | --- |
| 1 | Media (MIME) type | https://search.datacite.org/works?query=media.media_type:chemical/x-mnpub* |
| 2 | Combining Media with the DataCite Subject | https://search.datacite.org/works?query=media.media_type:chemical/x-mnpub*+AND+subjects.subjectScheme:inchikey+AND+subjects.subject:XZYDALXOGPZGNV-UHFFFAOYSA-M+AND+media.media_type:chemical/x-gaussian* |
| 3 | Combining ORCID with Media | https://search.datacite.org/works?query=contributors.nameIdentifiers.nameIdentifier:*0000-0002-8635-8390+AND+media.media_type:chemical/x-mnpub* |
| 4 | Exploiting Subject | https://search.datacite.org/works?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:"-39.946176" |
| 5 | Exploiting Subject with range query | https://search.datacite.org/works?query=subjects.subjectScheme:Gibbs_energy+AND+subjects.subject:[\-649.1 TO \-649.8] |
| 6 | Nested search with two Subjects | https://search.datacite.org/works?query=(subjects.subjectScheme:inchikey+AND+subjects.subject:"-1082.980914")+AND+(subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:KTOSDSJYNBIDCN-UHFFFAOYSA-N) |
|  | Nested search with two Subjects transposed | https://search.datacite.org/works?query=(subjects.subjectScheme:inchikey+AND+subjects.subject:KTOSDSJYNBIDCN-UHFFFAOYSA-N)+AND+(subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:"-1082.980914") |
| 7 | Two different Media types | https://search.datacite.org/works?query=media.media_type:chemical/x-gaussian*+AND+media.media_type:chemical/x-mnpub* |
| 8 | License type | https://search.datacite.org/works?query=rightsList.rights:"Creative Commons Public Domain Dedication (CC0 1.0)" |
| 9 | Exploiting subjectscheme | https://search.datacite.org/works?query=media.media_type:chemical/x-mnpub*+AND+subjects.subjectScheme:NMR_Nucleus+AND+subjects.subject:1H |
| 10 | Exploiting subjectscheme | https://search.datacite.org/works?query=media.media_type:chemical/x-mnpub*+AND+subjects.subjectScheme:NMR_Pulse+AND+subjects.subject:1D |
| 11 | Simple PID query | https://search.datacite.org/works?query=identifier:*10.14469/hpc* |
| 12 | Combining ORCID with PID query | https://search.datacite.org/works?query=(contributors.nameIdentifiers.nameIdentifier:*0000-0002-8635-8390)+AND+(identifier:*10.14469/hpc*) |
| 13 | Combining researcher name with PID query | https://search.datacite.org/works?query=(identifier:*10.14469/hpc*)+AND+(contributors.contributor.contributorName:Henry+Rzepa) |
| 14 | Entries in specific repository (Imperial) referencing specific Journal | https://search.datacite.org/works?query=(relatedIdentifiers.relatedIdentifier:10.1021/acs.orglett*)+AND+(identifier:*10.14469/hpc*) |
| 15 | Entries in specific repository (Cambridge) referencing specific Journal | https://search.datacite.org/works?query=(relatedIdentifiers.relatedIdentifier:10.1021/acs.orglett*)+AND+(identifier:*10.17863/cam*) |
| 18 | Entries in specific repository (Cambridge) referencing all publisher journals | https://search.datacite.org/works?query=(relatedIdentifiers.relatedIdentifier:10.1021/acs*)+AND+(identifier:*10.17863/cam*) |
| 16 | Entries in all repositories except one referencing specific Journal | https://search.datacite.org/works?query=(relatedIdentifiers.relatedIdentifier:10.1021/acs.orglett*)+NOT+(identifier:*10.5517*) |
| 17 | Entries in specific repository referencing one publisher | https://search.datacite.org/works?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+AND+(identifier:*10.5517*) |
| 19 | Entries in all publisher journals, excluding one data repository | https://search.datacite.org/works?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+NOT+(identifier:*10.5517*) |
| 20 | Entries in Institutional repository referencing datasets | https://search.datacite.org/works?query=(relatedIdentifiers.relatedIdentifier:*10.14469/spiral*)+AND+(identifier:*)+AND+(types.resourceTypeGeneral:Dataset) |

The examples above reveal a not entirely human-friendly syntax; with each of them some effort at “de-bugging” was needed to make them work. I gather from the PIDForum that a more friendly GUI to achieve this is on their radar. As I develop or discover more examples of such searches I will add them to the list above at DOI: 10.14469/hpc/5920. Meanwhile, if you want to use any of the above as a template for your own searches, do please explore.
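For anyone wanting to template such searches, here is a minimal Python sketch of how the query strings in the list might be assembled from individual field:value clauses. The helper name `build_query_url` is my own invention for illustration; only the endpoint and the field names come from the examples above, and the sketch merely constructs the URL without contacting the service.

```python
from urllib.parse import urlencode

BASE_URL = "https://search.datacite.org/works"  # endpoint used in the examples above


def build_query_url(clauses, operator="AND"):
    """Join field:value clauses with a boolean operator and return a search URL.

    Hypothetical helper for illustration only; it is not part of any DataCite client.
    """
    query = f" {operator} ".join(clauses)
    # Keep ':', '*', '/' and parentheses literal so the URL stays readable;
    # spaces are encoded as '+', matching the style of the queries listed above.
    encoded = urlencode({"query": query}, safe=":*/()")
    return f"{BASE_URL}?{encoded}"


# Entry 3: combine an ORCID with a media type
url = build_query_url([
    "contributors.nameIdentifiers.nameIdentifier:*0000-0002-8635-8390",
    "media.media_type:chemical/x-mnpub*",
])
print(url)
```

Clauses containing quotes, brackets or backslashes (entries 4, 5 and 8) would be percent-encoded by this sketch, so they will look different from the hand-written forms even though they should decode to the same query.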

A search of some major chemistry publishers for FAIR data records.

Friday, April 12th, 2019

In recent years, findable data has become ever more important (the F in FAIR). Here I test that F using the DataCite search service.

Firstly, an introduction to this service. It is a metadata database about datasets and other research objects. One of the properties is relatedIdentifier, which records other identifiers associated with the dataset, such as the DOI of any published article associated with the data; it could also be a pointer to related datasets.

One can query thus:

  1. https://search.datacite.org/works?query=relatedIdentifiers.relatedIdentifier:*
    which retrieves the very healthy looking 6,179,287 works.
  2. One can restrict this to a specific publisher by the DOI prefix assigned to that publisher:
    ?query=relatedIdentifiers.relatedIdentifier:10.1021*
    which returns a respectable 210,240 works.
  3. It turns out that the major contributors to FAIR are currently crystal structures from the CCDC. One can remove them from the search to see what is left over:
    ?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+NOT+(identifier:*10.5517*) 
    and one is down to 14,213 works, of which many nevertheless still appear to be crystal structures. These may be links to other crystal datasets.

I have performed searches 2 and 3 for some popular publishers of chemistry (the same set that were analysed here).

| Publisher | Search 2 | Search 3 |
| --- | --- | --- |
| ACS | 210,240 | 14,213 |
| RSC | 138,147 | 1,279 |
| Elsevier | 185,351 | 56,373 |
| Nature | 12,316 | 8,104 |
| Wiley | 135,874 | 9,283 |
| Science | 3,384 | 2,343 |

These publishers all have significant numbers of datasets which at least accord with the F of FAIR. A lot of data sets may not have metadata which in fact points back to a published article, since this can be something that has to be done only when the DOI of that article appears, in other words AFTER the publication of the dataset. So these numbers are probably low rather than high.

How about the other way around? Rather than datasets that have a journal article as a related identifier, we could search for articles that have a dataset as a related identifier?

  1. ?query=(identifier:*10.1039*)+AND+(relatedIdentifiers.relatedIdentifier:*)
    returns a rather mysterious “nothing found”. It might be that there is no mapping of this search between the CrossRef and DataCite metadata schemas.
  2. And just to show the searches are behaving as expected:
    ?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+AND+(identifier:*10.5517*)
    returns 196,027 works.
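As a quick piece of arithmetic on the counts reported above, the following Python sketch works out what fraction of each publisher's records survives once the CCDC prefix (10.5517) is excluded. It also checks that the AND and NOT searches for ACS add up to search 2. The numbers are simply those quoted in this post; the variable and function names are my own.

```python
# (search 2, search 3) counts per publisher, copied from the table in this post
counts = {
    "ACS": (210_240, 14_213),
    "RSC": (138_147, 1_279),
    "Elsevier": (185_351, 56_373),
    "Nature": (12_316, 8_104),
    "Wiley": (135_874, 9_283),
    "Science": (3_384, 2_343),
}


def non_ccdc_fraction(all_works, works_without_ccdc):
    """Fraction of works left after excluding identifiers matching 10.5517."""
    return works_without_ccdc / all_works


for publisher, (search2, search3) in counts.items():
    percent = 100 * non_ccdc_fraction(search2, search3)
    print(f"{publisher:9s} {percent:5.1f}% of works remain without CCDC entries")

# Consistency check for ACS: the AND (CCDC only) and NOT (CCDC excluded)
# searches should together account for everything found by search 2.
acs_ccdc_only = 196_027
acs_all, acs_without_ccdc = counts["ACS"]
print(acs_ccdc_only + acs_without_ccdc == acs_all)
```

For ACS the two partial searches do sum exactly to the 210,240 works of search 2, which is reassuring.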

It will also be of interest to show how these numbers change over time. Is there an exponential increase? We shall see.

Finally, we have not really explored adherence to e.g. the AIR of FAIR. That is for another post.

Impossible molecules.

Monday, April 1st, 2019

Members of the chemical FAIR data community have just met in Orlando (with help from the NSF, the American National Science Foundation) to discuss how such data is progressing in chemistry. There are a lot of themes converging at the moment. Thus this article[1] extols the virtues of having raw NMR data available in natural product research, to which we added that such raw data should also be made FAIR (Findable, Accessible, Interoperable and Reusable) by virtue of adding rich metadata and then properly registering it so that it can be searched. These themes are combined in another article which made a recent appearance.[2]

One of the speakers made a very persuasive case based in part on e.g. the following three molecules which are discussed in the first article[1] (the compound numbers are taken from there). The question was posed at our meeting: why did the referees not query these structures? The answer, in part, is to provide referees with access to the full/primary/raw NMR data (which almost invariably they currently do not have) to help them check on the peaks, the purity and indeed the assignments. I am sure that tools which perform such checks automatically and routinely from supplied data already exist in industry (and this is something FAIR is designed to enable). Perhaps there are open source versions available?

| Compound | 17 | 18 | 19 |
| --- | --- | --- | --- |
| Steric energy (kJ/mol) | 328[3] | 348 | 713 |

Here I suggest a particularly simple and rapid “reality check” which I occasionally use myself. This is to compute the steric energy of the molecule using molecular mechanics. The mechanics method is basically a summation of simple terms: one each for bond lengths, bond angles and torsion angles, plus terms which model non-bonded repulsions, dispersion attractions and electrostatic contributions. The first three are close to zero for an unstrained molecule (by definition). The last three terms can be negative or positive, but unless the molecule is protein sized, they also do not depart far from zero. A suitable free tool that packages all this up is Avogadro.

The procedure is as follows

  1. Start from the Chemdraw representation of the molecule. If the publishing authors have been FAIR, you might be able to acquire that from their deposited data. Otherwise, redraw it yourself and save as e.g. a molfile or Chemdraw .cdxml file.
  2. Drop into Avogadro, which will build a 3D model for you using stereochemical information present in the Chemdraw or Molfile.
  3. In the E tool (at the top on the left of the Avogadro menu) select e.g. the MMFF94 force field. This is a good one to use for “organic” molecules, for which the total steric energy for “normal” molecules is likely to be < 200 kJ/mol. Calculate that for your system; this normally takes less than one minute to complete. The values obtained for the three above are shown in the table. All three are well over 200 kJ/mol, which should set alarm bells ringing.
  4. A “more reasonable” structure for 17 is shown below. This has a steric energy of 152 kJ/mol, some 176 kJ/mol lower than the original structure. This does not of itself “prove” this alternative, but it is a starting point for showing it might be correct.

Of course, mis-assigned but otherwise reasonable structures are unlikely to be revealed by the steric energy test. But impossible ones will probably always be flagged as such using this procedure.

Postscript: Hot on the heels of writing this, the molecule Populusone came to my attention.[4] At first sight, it seems to have some of the attributes of an “impossible molecule”.

However, it has been fully characterised by X-ray analysis! The steric energy using the method above comes out at 384 kJ/mol, which is in the region of impossibility! This can be decomposed into the following components: bond stretch 30, bend 51, torsion 32, van der Waals (including repulsions) 177, electrostatics 87 (+ some minor cross terms). These are fairly evenly distributed, with internal steric repulsions clearly the largest contributor. The C=C double bond is hardly distorted however, which is in its favour. Clearly a natural product can indeed load up the unfavourable interactions, and this one must be close to the record for the most intrinsically unstable natural product known!
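The bookkeeping behind this reality check is easily sketched in Python. The snippet below applies the 200 kJ/mol rule of thumb to the steric energies quoted in this post and sums the listed components for Populusone. The function and variable names are purely illustrative, and the energies themselves are those reported above rather than recomputed here.

```python
THRESHOLD_KJ_MOL = 200.0  # rule-of-thumb ceiling quoted above for "normal" organic molecules


def looks_suspicious(steric_energy_kj_mol, threshold=THRESHOLD_KJ_MOL):
    """Flag a structure whose MMFF94 steric energy exceeds the threshold."""
    return steric_energy_kj_mol > threshold


# Steric energies (kJ/mol) as reported in this post
reported = {
    "17": 328,
    "18": 348,
    "19": 713,
    "17 (revised)": 152,
    "Populusone": 384,
}
flags = {name: looks_suspicious(energy) for name, energy in reported.items()}

# Components listed for Populusone (kJ/mol); minor cross terms are not itemised
populusone_terms = {
    "bond stretch": 30,
    "bend": 51,
    "torsion": 32,
    "van der Waals": 177,
    "electrostatics": 87,
}
listed_total = sum(populusone_terms.values())
cross_terms = reported["Populusone"] - listed_total  # remainder left for cross terms
largest = max(populusone_terms, key=populusone_terms.get)

print(flags)
print(listed_total, cross_terms, largest)
```

Note that Populusone is flagged by this test despite being a real, crystallographically characterised molecule, so a flag is a prompt for closer inspection rather than a verdict.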

References

  1. J.B. McAlpine, S. Chen, A. Kutateladze, J.B. MacMillan, G. Appendino, A. Barison, M.A. Beniddir, M.W. Biavatti, S. Bluml, A. Boufridi, M.S. Butler, R.J. Capon, Y.H. Choi, D. Coppage, P. Crews, M.T. Crimmins, M. Csete, P. Dewapriya, J.M. Egan, M.J. Garson, G. Genta-Jouve, W.H. Gerwick, H. Gross, M.K. Harper, P. Hermanto, J.M. Hook, L. Hunter, D. Jeannerat, N. Ji, T.A. Johnson, D.G.I. Kingston, H. Koshino, H. Lee, G. Lewin, J. Li, R.G. Linington, M. Liu, K.L. McPhail, T.F. Molinski, B.S. Moore, J. Nam, R.P. Neupane, M. Niemitz, J. Nuzillard, N.H. Oberlies, F.M.M. Ocampos, G. Pan, R.J. Quinn, D.S. Reddy, J. Renault, J. Rivera-Chávez, W. Robien, C.M. Saunders, T.J. Schmidt, C. Seger, B. Shen, C. Steinbeck, H. Stuppner, S. Sturm, O. Taglialatela-Scafati, D.J. Tantillo, R. Verpoorte, B. Wang, C.M. Williams, P.G. Williams, J. Wist, J. Yue, C. Zhang, Z. Xu, C. Simmler, D.C. Lankin, J. Bisson, and G.F. Pauli, "The value of universally available raw NMR data for transparency, reproducibility, and integrity in natural product research", Natural Product Reports, vol. 36, pp. 35-107, 2019. https://doi.org/10.1039/c7np00064b
  2. A. Barba, S. Dominguez, C. Cobas, D.P. Martinsen, C. Romain, H.S. Rzepa, and F. Seoane, "Workflows Allowing Creation of Journal Article Supporting Information and Findable, Accessible, Interoperable, and Reusable (FAIR)-Enabled Publication of Spectroscopic Data", ACS Omega, vol. 4, pp. 3280-3286, 2019. https://doi.org/10.1021/acsomega.8b03005
  3. A.I. Savchenko, and C.M. Williams, "The Anti‐Bredt Red Flag! Reassignment of Neoveratrenone", European Journal of Organic Chemistry, vol. 2013, pp. 7263-7265, 2013. https://doi.org/10.1002/ejoc.201301308
  4. K. Liu, Y. Zhu, Y. Yan, Y. Zeng, Y. Jiao, F. Qin, J. Liu, Y. Zhang, and Y. Cheng, "Discovery of Populusone, a Skeletal Stimulator of Umbilical Cord Mesenchymal Stem Cells from Populus euphratica Exudates", Organic Letters, vol. 21, pp. 1837-1840, 2019. https://doi.org/10.1021/acs.orglett.9b00423

Free energy relationships and their linearity: a test example.

Sunday, January 13th, 2019

Linear free energy relationships (LFER) are associated with the dawn of physical organic chemistry in the late 1930s and its objectives in understanding chemical reactivity as measured by reaction rates and equilibria.

The Hammett equation is the best known of the LFERs, albeit derived “intuitively”. It is normally applied to the kinetics of aromatic electrophilic substitution reactions and is expressed as:

$\log(K_R/K_0) = \sigma_R\rho$ (for equilibria) and extended to $\log(k_R/k_0) = \sigma_R\rho$ for rates.

The equilibrium constants are normally derived from the ionisation of substituted benzoic acids, with $K_0$ being that for benzoic acid itself and $K_R$ that of a substituted benzoic acid, with $\sigma_R$ being known as the substituent constant and $\rho$ the reaction constant. The concept involved obtaining the substituent constants by measuring the ionisation equilibria. The value of $\sigma_R$ is then assumed to be transferable to the rates of reaction, where the values can be used to obtain reaction constants for a given reaction. The latter would then be assumed to give insight into the electronic nature of the transition state for that reaction.

The term $\log(k_R/k_0)$ (the ratio of rates of reaction) can be related to $\Delta\Delta G = -RT \ln(k_R/k_0)$ and this latter quantity can be readily obtained from quantum calculations, where $\Delta\Delta G$ is the difference in computed reaction activation free energies for two substituents (of which one might be R=H). The most interesting such Hammett plots are the ones where a discontinuity becomes apparent. The plot comprises two separate linear relationships, but with different slopes. This is normally taken to indicate a change of mechanism, on the assumption that the two mechanisms will have different responses to substituents.

A test of this is available via the calculated activation energies for acid catalyzed cyclocondensation to give furanochromanes[1], which is a two-step reaction involving two transition states TS1 and TS2, either of which could be rate determining. A change from one to the other would constitute a change in mechanism. In this example, TS1 involves creation of a carbocationic centre which can be stabilized by the substituent on the Ar group; TS2 involves the quenching of the carbocation by a nucleophilic oxygen and hence might be expected to respond differently to the substituents on Ar. As it happens, the reaction coordinate for TS2 is not entirely trivial, since it also includes an accompanying proton transfer which might perturb the mechanism.

Fortunately for this reaction we have available full FAIR data (DOI: 10.14469/hpc/3943), which includes not only the computed free energies for both sets of transition states but also the entropy-free enthalpies for comparison. This allows the table below to be generated. For each substituent, the highest energy point (listed in the first ΔH/ΔG column) indicates the rate limiting step. The span of substituents corresponds to a range of rate constants of almost $10^{10}$, which in fact is rarely if ever achievable experimentally.

Highest free energy overall route for HCl catalysed mechanism, trans stereochemistry

| Sub | ΔH/ΔG | Reactant | ΔH/ΔG, TS1 | ΔH/ΔG, TS2 | RDS |
| --- | --- | --- | --- | --- | --- |
| p-NH2 | 0.2/6.36 | 0.0/0.0 | 0.15/4.0 | 0.2/6.4 | TS2/TS2 |
| p-OMe | 2.7/8.48 | 0.0/0.0 | 2.7/8.45 | 2.1/8.48 | TS1/TS2 |
| p-Me | 5.5/10.00 | 0.0/0.0 | 5.5/9.9 | 3.9/10.00 | TS1/TS2 |
| p-Cl | 7.7/12.28 | 0.0/0.0 | 7.7/12.28 | 5.9/11.84 | TS1/TS1 |
| p-H | 7.6/13.01 | 0.0/0.0 | 7.6/13.01 | 5.5/11.51 | TS1/TS1 |
| p-CN | 10.6/18.02 | 0.0/0.0 | 10.6/17.61 | 10.5/18.02 | TS1/TS2 |
| p-NO2 | 12.4/19.85 | 0.0/0.0 | 12.4/18.24 | 12.0/19.85 | TS1/TS2 |

For the free energies, you can see that TS2 is the rate limiting step for the first three electron donating substituents, and the last two electron withdrawing ones, whilst TS1 represents the rate limiting step for the middle substituents. This represents two changes of rate limiting step over the entire range of substituents. A different picture emerges if only the enthalpies are used. Now TS1 is rate limiting for essentially all the substituents. The difference of course arises because of significant changes to the entropy of the transition states. The Hammett equation, and its use of σ constants to try to infer the electronic response of a reaction mechanism, does not really factor in entropic responses. Nor is it often, if at all, applied using a really wide range of substituents. So any linearity or indeed non-linearity in Hammett plots may correspond only very loosely to the underlying mechanisms involved.
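The rate limiting step for each substituent, and the span of rate constants implied by the free energies, can be recovered from the tabulated values with a few lines of Python. This is only a sketch under two assumptions of mine: that the energies are in kcal/mol (which is consistent with the quoted span of almost ten orders of magnitude) and that the temperature is 298.15 K. The names used are illustrative.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)
TEMPERATURE = 298.15  # K; an assumed value, not stated in the post

# (ΔH, ΔG) for each transition state, copied from the table; kcal/mol is assumed
barriers = {
    "p-NH2": {"TS1": (0.15, 4.0), "TS2": (0.2, 6.4)},
    "p-OMe": {"TS1": (2.7, 8.45), "TS2": (2.1, 8.48)},
    "p-Me": {"TS1": (5.5, 9.9), "TS2": (3.9, 10.00)},
    "p-Cl": {"TS1": (7.7, 12.28), "TS2": (5.9, 11.84)},
    "p-H": {"TS1": (7.6, 13.01), "TS2": (5.5, 11.51)},
    "p-CN": {"TS1": (10.6, 17.61), "TS2": (10.5, 18.02)},
    "p-NO2": {"TS1": (12.4, 18.24), "TS2": (12.0, 19.85)},
}


def rate_limiting(ts_energies, index):
    """Return the higher-lying transition state; index 0 selects ΔH, 1 selects ΔG."""
    return max(ts_energies, key=lambda name: ts_energies[name][index])


def log10_rate_ratio(ddg):
    """log10(k_R/k_0) from ΔΔG = -RT ln(k_R/k_0)."""
    return -ddg / (R_KCAL * TEMPERATURE * math.log(10))


rds_enthalpy = {sub: rate_limiting(ts, 0) for sub, ts in barriers.items()}
rds_free_energy = {sub: rate_limiting(ts, 1) for sub, ts in barriers.items()}

# Highest free energy barrier per substituent, then the span from p-NO2 to p-NH2
highest_g = {sub: max(e[1] for e in ts.values()) for sub, ts in barriers.items()}
span = log10_rate_ratio(highest_g["p-NH2"] - highest_g["p-NO2"])

print(rds_enthalpy)
print(rds_free_energy)
print(f"log10 of the rate constant span: {span:.1f}")
```

Comparing the two dictionaries makes the role of entropy explicit: the assignments differ for every substituent where the RDS column of the table shows two different labels.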

Starting in the 1940s and lasting perhaps 40-50 years, thousands of different reaction mechanisms were subjected to the Hammett treatment during the golden era of physical organic chemistry, but very few have been followed up by exploring the computed free energies, as set out above. One wonders how many of the original interpretations will fully withstand such new scrutiny and in general how influential the role of entropy is.

References

  1. C.D. Nielsen, W.J. Mooij, D. Sale, H.S. Rzepa, J. Burés, and A.C. Spivey, "Reversibility and reactivity in an acid catalyzed cyclocondensation to give furanochromanes – a reaction at the ‘oxonium-Prins’ vs. ‘ortho-quinone methide cycloaddition’ mechanistic nexus", Chemical Science, vol. 10, pp. 406-412, 2019. https://doi.org/10.1039/c8sc04302g

Re-inventing the anatomy of a research article.

Saturday, December 29th, 2018

The traditional structure of the research article has been honed and perfected for over 350 years by its custodians, the publishers of scientific journals. Nowadays, for some journals at least, it might be viewed as much as a profit centre as the perfected mechanism for scientific communication. Here I take a look at the components of such articles to try to envisage their future, with the focus on molecules and chemistry.

The formula which is mostly adopted by authors when they sit down to describe their chemical discoveries is more or less as follows:

  1. An introduction, setting the scene for the unfolding narrative
  2. Results. This is where much of the data from which the narrative is derived is introduced. Such data can be presented in the form of:
    • Tables
    • Figures and schemes
    • Numerical and logical data embedded in narrative text
  3. Discussion, where the models constructed from the data are illustrated and new inferences presented. Very often categories 2 and 3 are conflated into one single narrative.
  4. Conclusions, where everything is brought together to describe the essential aspects of the new science.
  5. Bibliography, where previous articles pertinent to the narrative are listed.

In the last decade or so, the management of research data has developed as a field of its own, with three phases:

  6. Setting out a data management plan at the start of the project, often a set of aspirations together with putative actions,
  7. the day-to-day management of the data as it emerges in the form of an electronic laboratory notebook (ELN),
  8. the publication of selected data from the ELN into a repository, together with the registration of metadata describing the properties of the data.

In the latter category, item 8 can be said to be a game-changer, a true disruptive influence on the entire process. The key aspect is that it constitutes independent publication of data to sit alongside the object constructed from 1-5. More disruption emerges from the open citations project, whereby category 5 above can be released by publishers to adopt its own separate existence. So now we see that of the five essential anatomic components of a research article, two are already starting to achieve their own independence. Clearly the re-invention of the anatomy of the research article is well under way already.

Next I take a look at what sorts of object might be found in category 8, drawing very much on our own experience of implementing 7 and 8 over the last twelve years or so. I start by observing that in 2 above, figures are perhaps the object most in need of disruptive re-invention. In the 1980s, authors were much taken by the introduction of colour as a means of conveying information within a figure more clearly; although the significant costs then had to be borne directly by these authors (and with a few journals this persists to this day). By the early 1990s, the introduction of the Web[1] offered new opportunities not only of colour but of an extra dimension (or at least the illusion of one) by means of introducing interactivity for three-dimensional models. Some examples resulting from combining figures from category 2 with 8 above are listed in the table below.

Examples of re-invented data objects from category 2

| Example | Object title | Object DOI | Article DOI |
| --- | --- | --- | --- |
| 1 | Figure 9. Catalytic cycle involving one amine …etc. | 10.14469/hpc/1854 | 10.1039/C7SC03595K |
| 2 | FAIR Data Figure. Mechanistic insights into boron-catalysed direct amidation reactions | 10.14469/hpc/4919 | 10.1039/C7SC03595K |
| 3 | FAIR Data table. Computed relative reaction free energies (kcal/mol-1) of Obtusallene derived oxonium and chloronium cations | 10.14469/hpc/1248 | 10.1021/acs.joc.6b02008 |
| 4 | (raw) NMR data for Epimeric Face-Selective Oxidations … | 10.14469/hpc/1267 | 10.1021/acs.joc.6b02008 |
| 5 | Bibliography | 10.14469/hpc/1116 | 10.1021/acs.joc.6b02008 |

Example 1 illustrates how a figure from category 2 above can be augmented with active hyperlinks specifying the DOI of the data in category 8 from which the figure is derived, thus creating a direct and contextual connection between the research article and the research data it is based upon. These links are embedded only in the Acrobat (PDF) version of the article as part of the production process undertaken by the journal publisher. Download Figure 9 from the link here and try it for yourself or try the entire article from the journal, where more figures are so enhanced.

Example 2 takes this one stage further. The hyperlinks in the published figure in example 1 were embedded in software capable of resolving them, namely a PDF viewer. But that is all that this software allows. By relocating the hyperlink into a Web browser instead, one can add further functionality in the form of JavaScripts, perhaps better described as workflows (supported by browsers but not by Acrobat). There are three such workflows in example 2.

  • The first uses an image map to associate a region of the figure with a data object defined by a DOI.
  • The second interrogates the metadata specifically associated with the DOI (the same DOIs that are seen in the figure itself) to see if there is any so-called ORE metadata available (ORE= Object Re-use and Exchange). If there is, it uses this information to retrieve the data itself and pass it through to
  • the third workflow represented by a set of JavaScripts known as JSmol. These interpret the data received and construct an interactive visual 3D molecular model representing the retrieved data.

All this additional workflowed activity is implemented in a data repository. It is not impossible that it could also be implemented at the journal publisher end of things, but it is an action that would have to be supported by multiple publishers. Arguably this sort of enhancement is far better suited and more easily implemented by a specialised data publisher, i.e. a data repository.

Example 3 does the same thing for a table.

Example 4 enhances in a different manner. Conventionally NMR data is added to the supporting information file associated with a journal article, but such data is already heavily processed and interpreted. The raw instrumental data is never submitted to the journal and is almost always available only by direct request from the original researchers (at least if the request is made whilst the original researchers are still contactable!). The data repository provides a new mechanism for making such raw instrumental (and indeed computational) data an integral part of the scientific process.

Example 5 shows how a bibliography can be linked to a secondary bibliography (citations 35 and 36 in this example in the narrative article) and perhaps in the future to Open Citations semantic searches for further cross references.

So by deconstructing the components of the standard scientific article, re-assembling some of them in a better-suited environment and then linking the two sets of components to each other, one can start to re-invent the genre and hopefully add more tools for researchers to use to benefit their basic research processes. The scope for innovation seems considerable. The issue of course is (a) whether publishers see this as a viable business model or instead wish to protect their current model of the research article, and (b) whether authors wish to undertake the learning curve and additional effort to go in this direction. As I have noted before, the current model is deficient in various ways; I do not think it can continue without significant reinvention for much longer. And I have to ask: if reinvention does emerge, will science be the prime beneficiary?

References

  1. H.S. Rzepa, B.J. Whitaker, and M.J. Winter, "Chemical applications of the World-Wide-Web system", Journal of the Chemical Society, Chemical Communications, pp. 1907, 1994. https://doi.org/10.1039/c39940001907

Open Access journal publishing debates – the elephant in the room?

Sunday, November 4th, 2018

For perhaps ten years now, the future of scientific publishing has been hotly debated. The traditional models are often thought to be badly broken, although convergence to a consensus of what a better model should be is not apparently close. But to my mind, much of this debate seems to miss one important point: how to publish data.

Thus, at one extreme is cOAlition S, a model which promotes the key principle that “after 1 January 2020 scientific publications on the results from research funded by public grants provided by national and European research councils and funding bodies, must be published in compliant Open Access Journals or on compliant Open Access Platforms.” This includes ten principles, one of which, “The ‘hybrid’ model of publishing is not compliant with the above principles”, has revealed some strong dissent, as seen at forbetterscience.com/2018/09/11/response-to-plan-s-from-academic-researchers-unethical-too-risky. I should explain that hybrid journals are those where the business model includes both institutional closed-access to the journal via a subscription charge paid by the library, coupled with the option for individual authors to purchase an Open Access release of an article so that it sits outside the subscription. The dissenters argue that non-OA and hybrid journals include many traditional ones, which especially in chemistry are regarded as those with the best impact factors and very much as the journals to publish in to maximise both the readership, hence the impact of the research, and thus researchers’ career prospects. Thus many (not all) of the American Chemical Society (ACS) and Royal Society of Chemistry (RSC) journals currently fall into this category, as well as commercial publishers of journals such as Nature, Nature Chemistry, Science, Angew. Chemie, etc.

So the debate is whether funded top-ranking research in chemistry should in future always appear in non-hybrid OA journals (where the cost of publication is borne by article processing charges, or APCs) or in traditional subscription journals, where the costs are borne by those institutions that can afford the subscription charges, which of course also limits the access. One measure of how important and topical the debate has become is that there is now even a movie devoted to the topic, which makes the point of how profitable commercial scientific publishing now is and hence how much resource is being diverted into these profit margins at the expense of funding basic science.

None of these debates, however, really takes a close look at the nature of the modern research paper. In chemistry at least, the evolution of such articles in the last 20 years (~ corresponding to the online era) has meant that whilst the size of the average article has remained static at around 10 “pages” (in quotes because of course the “page” is one of those legacy concepts related to print), another much newer component known as “Supporting information” or SI has ballooned to absurd sizes. It can reach 1000 pages[1] and there are rumours of even larger SIs. The content of SI is of course mostly data. The size is often because the data is present in visual form (think spectra). As visual information, it is not easily “inter-operable” or “accessible”. Nor is it “findable” until commercial abstracting agencies choose to index it. Searches of such indexed data are most certainly “closed” (again depending on institutional purchases of access) and not “open access”. You may recognise these attributes as those of FAIR (Findable, Accessible, Interoperable and Reusable). So even if an article in chemistry is published in pure OA form, in order to get FAIR access to the data associated with the article, you will probably have to go to a non-OA resource run by a commercial organisation for profit. Thus a 10 page article might itself be OA, but the full potential of its 1000+ page data (an elephant if ever there was one) ends up being very much not OA.

You might argue that the 1000+ pages of data do not require the services of an abstracting agency to be useful. Surely a human can get all the information they want from inspecting a visual spectrum? Here I raise the future prospects of AI (artificial intelligence). The ~1000 page SI I noted above[1] includes e.g. NMR spectra for around 70 compounds (I tried to count them all visually, but could not be certain I found them all). A machine, trained to identify spectra from associated metadata (a feature of FAIR), could extract vastly more information in a given time from FAIR raw data than a human could (a spectrum is already processed data, with implied information/data loss). And it could do so for many articles, not just one. Thus FAIR data is very much targeted not only at humans but at the AI-trained machines of the future.

So I repeat my assertion that focussing on whether an article is OA or not, and whether publishing in hybrid journals is to be allowed or not by funders, is missing that 100-fold bigger elephant in the room. For me, a publishing model that is fit for the future should include as a top priority a declaration of whether the data associated with an article is FAIR. Yet in the ten Plan-S principles, FAIR is not mentioned at all. Only when FAIR-enabled data becomes part of the debates can we truly say that the article and its data are on their way to being properly open access.


The FAIR concept did not originally differentiate between processed data (i.e. spectra) and the underlying primary or raw data on which the processed data is based. Our own implementation of FAIR data includes both types of data: raw for machine reprocessing if required, and processed for human interpretation, along with a rich set of metadata, itself often created using carefully designed workflows conducted by machines.

The proportion of articles relating to chemistry which do not include some form of SI is probably low. These would include articles which simply provide a new model or interpretation of previously published data, reporting no new data of their own. A famous historical example is Michael Dewar’s re-interpretation of the structure of stipitatic acid[2] which founded the new area of non-benzenoid aromaticity.

References

  1. J.M. Lopchuk, K. Fjelbye, Y. Kawamata, L.R. Malins, C. Pan, R. Gianatassio, J. Wang, L. Prieto, J. Bradow, T.A. Brandt, M.R. Collins, J. Elleraas, J. Ewanicki, W. Farrell, O.O. Fadeyi, G.M. Gallego, J.J. Mousseau, R. Oliver, N.W. Sach, J.K. Smith, J.E. Spangler, H. Zhu, J. Zhu, and P.S. Baran, "Strain-Release Heteroatom Functionalization: Development, Scope, and Stereospecificity", Journal of the American Chemical Society, vol. 139, pp. 3209-3226, 2017. https://doi.org/10.1021/jacs.6b13229
  2. M.J.S. Dewar, "Structure of Stipitatic Acid", Nature, vol. 155, pp. 50-51, 1945. https://doi.org/10.1038/155050b0

How FAIR are the data associated with the 2017 Molecules-of-the-Year?

Wednesday, March 7th, 2018

C&EN has again run a vote for the 2017 Molecules of the Year. Here I take a look not just at these molecules, but at how FAIR (Findable, Accessible, Interoperable and Reusable) the data associated with these molecules actually are.

I went about finding out as follows:

  1. The article DOIs for all seven candidates were linked from the C&EN site.
  2. From there I manually tracked down the Supporting information.
  3. Some of this SI gave a CCDC deposition number for crystal structure data for the molecule in question. The easiest way of going directly to the data was to use the search.datacite.org search engine and to enter the keywords CCDC + deposition number. This gives a DOI for the data, examples of which are included in the table below.
  4. In other examples, I used the CSD Conquest search program and entered the names of 2-3 of the authors of the articles. This also worked well.
  5. Most of the SI files, downloaded as PDF files, also had static images of NMR spectra included. This is not active data, and hence does not fulfil the F and I of FAIR, and probably not the A either. None of it is FAIR as defined by my post here, although it is actually really easy to make it so. One of the examples had ~116 spectra left unFAIR in this way.
  6. In another example there was also computational data, included simply as a set of XYZ coordinates and again contained in the PDF file. This too is not really FAIR, since one has to know how to extract it from this container and repurpose it. It also represents a tiny subset of the data potentially available.
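Step 3 above can also be done programmatically. What follows is only a minimal sketch, not the procedure I actually followed (which was to use the search.datacite.org web page by hand). It assumes the public DataCite REST API at api.datacite.org accepts the same free-text keywords through its `query` parameter and returns JSON in which each record's `id` field holds the DOI; the function names are my own invention for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public DataCite REST API endpoint for DOI records.
DATACITE_API = "https://api.datacite.org/dois"


def build_search_url(deposition_number: str) -> str:
    """Build a query for the keywords 'CCDC <deposition number>'."""
    query = f"CCDC {deposition_number}"
    return f"{DATACITE_API}?{urllib.parse.urlencode({'query': query})}"


def extract_dois(response: dict) -> list[str]:
    """Collect DOI strings, assuming each record carries its DOI in 'id'."""
    return [record["id"] for record in response.get("data", [])]


def find_data_dois(deposition_number: str) -> list[str]:
    """Query DataCite over the network and return any matching DOIs."""
    url = build_search_url(deposition_number)
    with urllib.request.urlopen(url, timeout=30) as reply:
        return extract_dois(json.load(reply))
```

Calling `find_data_dois("1457983")` would then send the keywords for the first deposition number quoted for entry 7 in the table below. Whether the hits are as clean as those from the web interface would need checking by eye.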
How FAIR are the data associated with the 2017 Molecules-of-the-Year?

| # | Title | Article DOI | Data DOI |
|---|-------|-------------|----------|
| 1 | Persulfurated Coronene: A New Generation of “Sunflower” | 10.1021/jacs.6b12630 | Data available only as PDF, hosted by Figshare. The SI also has its own DOI: 10.1021/jacs.6b12630.s001 |
| 2 | A Truncated Molecular Star | 10.1021/jacs.6b12630 | Crystal structure data: 10.5517/ccdc.csd.cc1nb303 |
| 3 | Synthesis of trinorbornane | 10.1039/c7cc06273g | Crystal structure data: 10.5517/ccdc.csd.cc1p7806 |
| 4 | Braiding a molecular knot with eight crossings | 10.1126/science.aal1619 | Crystal structure data: 10.5517/ccdc.csd.cc1m85y0 |
| 5 | Unique physicochemical and catalytic properties dictated by the B3NO2 ring system | 10.1038/nchem.2708 | Crystal structure data: 10.5517/ccdc.csd.cc1lkff0 |
| 6 | Total synthesis of mycobacterial arabinogalactan containing 92 monosaccharide units | 10.1038/ncomms148510 | 116 NMR spectra available only as PDF. No crystal structure |
| 7 | Nitrogen Lewis Acids | 10.1021/jacs.6b12360 | NMR spectra available only as PDF. Computed coordinates available only as PDF. Crystal structure data: CCDC 1457983-1457987, 1458000-1458001, e.g. 10.5517/ccdc.csd.cc1ky4qc, 10.5517/ccdc.csd.cc1ky4rd |

The FAIRness of the data for these molecules of the year is largely rescued by the crystal structure data deposited with the CCDC in their CSD database and rendered findable (the F of FAIR) by persistent identifiers such as the (parochial) deposition numbers or the more general DOI. Now if the NMR and computational data were also covered in this way, we would be making great progress. There are of course many other types of data included with these examples, and procedures for making such data also FAIR have to be worked out by the community.
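To illustrate why a DOI is the more general of the two identifiers, here is a small sketch of how a machine might act on one of the data DOIs in the table. It assumes the usual https://doi.org/ resolver and that the registration agency honours a request for DataCite-style JSON metadata via the `Accept` header; the helper names are hypothetical, and the sketch only prepares the request rather than asserting what comes back.

```python
import urllib.parse
import urllib.request

# Assumption: the standard DOI resolver and a DataCite JSON media type
# requested through HTTP content negotiation.
DOI_RESOLVER = "https://doi.org/"
DATACITE_JSON = "application/vnd.datacite.datacite+json"


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI into a resolvable URL."""
    return DOI_RESOLVER + urllib.parse.quote(doi.strip())


def metadata_request(doi: str) -> urllib.request.Request:
    """Prepare a request asking for metadata rather than the landing page."""
    return urllib.request.Request(
        doi_to_url(doi), headers={"Accept": DATACITE_JSON}
    )


if __name__ == "__main__":
    # One of the crystal structure DOIs from the table above.
    request = metadata_request("10.5517/ccdc.csd.cc1nb303")
    print(request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would fetch the metadata record. A deposition number, by contrast, means something only to a search tool that already knows about the CCDC.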

In order to construct the table above, I had to put about two hours of effort into tracking down the items (and this only because I have done this sort of search before). Perhaps next year I might persuade C&EN to include such a table in their own article!