A databank of molecular dynamics reaction trajectories (DDT) focused on undergraduate teaching.

April 22nd, 2020

In a previous post, I talked about a library of reaction pathway intrinsic reaction coordinates (IRCs) containing 115 examples of organic and organometallic reactions. Now (thanks Dean!) I have been alerted to a brand new databank of dynamics trajectories (DDT), with the focus on those reactions taught in undergraduate organic chemistry courses, some of which are shown below.

Each example takes the form of two movie animations, one showing the classical IRC path and the other the “major trajectory” (DT) resulting from a molecular dynamics calculation. The latter representation incorporates molecular vibrations into the picture, showing how they evolve from reactants into a reaction product over a time period. Dynamics are a more realistic picture of how a molecule actually reacts. Any given trajectory can follow its own path, but the most common path for it to take is indeed that defined by the IRC and here you can compare the two approaches. What is interesting of course are those examples where the IRC and DT differ, with perhaps the latter not necessarily following the minimum energy path charted by the former. Even more fascinating are those non-classical reactions where a given IRC path, as defined by a single unique transition state at the top of the energy barrier, can nonetheless result in two or more different reaction outcomes. [1] See my analysis.

This databank is still young, with eight reactions at the time this blog was written. Suggestions for new reactions are invited, and I do hope it grows rapidly. Also I look forward to a section describing the technicalities of how the DT are computed and what sorts of resources are required to do this routinely (an article describing the databank is being prepared). In particular, how do the computer time resources needed for IRC and DT compare? It is good indeed to see this dynamic picture entering into the undergraduate taught curriculum.


For an example of a differing outcome, see here.

References

  1. X.S. Bogle, and D.A. Singleton, "Dynamic Origin of the Stereoselectivity of a Nucleophilic Substitution Reaction", Organic Letters, vol. 14, pp. 2528-2531, 2012. https://doi.org/10.1021/ol300817a

A molecular sponge for hydrogen storage - the future for road transport?

April 19th, 2020

In the news this week is a report of a molecule whose crystal lattice is capable of both storing and releasing large amounts of hydrogen gas at modest pressures and temperatures. Thus “NU-1501-Al” can absorb 14 weight% of hydrogen. To power a low-polluting car with a 500 km range, about 4-5 kg of hydrogen gas would need to be stored and released safely. The molecule is of interest since it opens a systematic strategy of synthetically driven optimisation towards a viable ultra-porous storage material,[1] much like a lead drug compound can be optimised.
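To get a rough feel for these numbers, here is a minimal sketch estimating how much framework would be needed to hold 4-5 kg of hydrogen at 14 weight%. It assumes that weight% means the mass of H2 divided by the combined mass of H2 plus framework (one common convention; the report may define it differently) and it ignores the tank, packing and delivery losses. The function name is purely illustrative.

```python
def sorbent_mass_kg(h2_mass_kg: float, weight_percent: float) -> float:
    """Mass of sorbent needed to hold h2_mass_kg of hydrogen.

    Assumption: weight% = 100 * m_H2 / (m_H2 + m_sorbent).
    """
    if not 0 < weight_percent < 100:
        raise ValueError("weight_percent must lie between 0 and 100")
    fraction = weight_percent / 100.0
    return h2_mass_kg * (1.0 - fraction) / fraction


for h2 in (4.0, 5.0):
    mass = sorbent_mass_kg(h2, 14.0)
    print(f"{h2:.0f} kg H2 at 14 wt% needs about {mass:.1f} kg of framework")
```

Under that assumption the answer comes out at a few tens of kilograms of framework, before any engineering overheads are considered.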

I thought it would be informative to show a 3D interactive model of the crystal lattice here and so I went in search of coordinates. These are indeed available online. This is an example of scientific data Interoperability and Reuse, part of the FAIR data acronym. Before showing the model, I thought it worth briefly describing the procedure for starting with deposited data and converting (interoperating) it to the model here.

  1. The molecule is a so-called MOF, or Metal-Organic-Framework. The core organic framework in this case is composed of linked triptycene derivatives. Shown below is the 3D structure of this linker, oriented here to show the three-fold symmetry (actually D3) of the molecule, rather than any attempt to reveal all the atoms without any hidden ones. To see the latter, you are encouraged to click on the diagram and view the molecule as a rotatable model instead. The coordinates below are optimised using molecular mechanics to reveal the role of the linker units.

Click for a rotatable 3D model.

  2. The data comes in the form of a CIF (crystallographic information) file and needs to be loaded into software that can manipulate such a format. In this case a program called Mercury (from CCDC) is available. Doing so reveals two minor oddities, circled in red below. The phenomenon arises from disorder, or two or more structures each with what is called partial occupancy. In this case, the disorder is largely limited to a p-substituted phenyl spacer linkage, which can adopt one of two rotational positions in the structure. The projection below is now selected to reveal the disorder rather than the symmetry.
  3. I want to “inter-operate” these coordinates into something that can be modelled and for this, the structure has to be edited to reduce it to a single unambiguous model. My very simple expedient here was simply to remove extraneous disordered atoms entirely; since they are acting as a spacing unit, this is unlikely to change the overall picture. Again, the projection below is selected to show the symmetry present and in particular the hexagonal-like channels that appear in the crystal lattice. To achieve this lattice, the unit cell has to be grown in all three directions using the calculate packing option in the Mercury program.
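The editing above was done by hand in Mercury, but the idea of reducing a disordered structure to a single model can be sketched in a few lines. The example below is only a toy: the atom labels and occupancies are invented (they are not taken from the NU-1501-Al CIF), the parser handles only a simple whitespace-delimited loop, and it keeps one disorder component rather than reproducing my manual edit exactly. A real CIF should be read with a proper crystallographic library.

```python
# Hypothetical atom_site loop using standard CIF core data names.
SAMPLE_LOOP = """\
loop_
_atom_site_label
_atom_site_occupancy
_atom_site_disorder_group
Al1 1.0 .
C1 1.0 .
C2A 0.5 1
C2B 0.5 2
"""


def parse_atom_loop(text):
    """Parse one simple whitespace-delimited CIF loop into a list of dicts."""
    tags, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "loop_":
            continue
        if line.startswith("_"):
            tags.append(line)          # a data name (column header)
        else:
            rows.append(dict(zip(tags, line.split())))  # one atom per row
    return rows


def keep_single_component(atoms, component="1"):
    """Keep ordered atoms (group '.') plus one chosen disorder component."""
    return [a for a in atoms
            if a["_atom_site_disorder_group"] in (".", component)]


atoms = parse_atom_loop(SAMPLE_LOOP)
model = keep_single_component(atoms)
print([a["_atom_site_label"] for a in model])
```

The partially occupied alternative (C2B here) is dropped, leaving one unambiguous set of atoms that can then be packed or modelled.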

Click for 3D rotatable model

Clearly, the hexagonal cavities formed can accommodate a large number of hydrogen molecules. As to why, it is no doubt complex, but I cannot help but notice that the surface of the cavity is lined with multiple C-H units from the aryl spacer units pointing inwards. Given that hydrogen is a very good inducer of dispersion attractions, it would be interesting indeed to see whether the very large number of H…H2 dispersion attractions possible inside the cavity of this species might at least in part be responsible for the ability of this framework to accommodate hydrogen (or methane) gas.[2] It would be good to have an estimate of the dispersion energy term for NU-1501-Al and related species and the contribution of this term to the overall thermodynamics of the system. By the same token, replacing the four aryl C-H units with C-F units (a weaker dispersion attractor, think non-stick teflon) should reduce the ability to absorb hydrogen if dispersion is indeed important.


On the other hand, if the orientation of the aryl C-H groups is important in terms of dispersion attractions, perhaps these groups are actually critical to the effect.

References

  1. Z. Chen, P. Li, R. Anderson, X. Wang, X. Zhang, L. Robison, L.R. Redfern, S. Moribe, T. Islamoglu, D.A. Gómez-Gualdrón, T. Yildirim, J.F. Stoddart, and O.K. Farha, "Balancing volumetric and gravimetric uptake in highly porous materials for clean energy", Science, vol. 368, pp. 297-303, 2020. https://doi.org/10.1126/science.aaz8881
  2. S. Rösel, C. Balestrieri, and P.R. Schreiner, "Sizing the role of London dispersion in the dissociation of all-meta tert-butyl hexaphenylethane", Chemical Science, vol. 8, pp. 405-410, 2017. https://doi.org/10.1039/c6sc02727j

Choreographing a chemical ballet: a story of the mechanism of 1,4-Michael addition.

April 13th, 2020

A reaction can be thought of as molecular dancers performing moves. A choreographer is needed to organise the performance into the ballet that is a reaction mechanism. Here I explore another facet of the Michael addition of a nucleophile to a conjugated carbonyl compound. The performers this time are p-toluene thiol playing the role of nucleophile, adding to but-2-enal (green) acting as the electrophile and with either water or ammonia serving the role of a catalytic base to help things along.

The scheme above is deliberately set out as an eight-membered ring so that if the three dancers wish to do so, they can all act in concert. Oh, there is also a bit-actor (water) forming a hydrogen bond to X, the role of which will become clearer as the ballet proceeds. The curly arrows indicate what the electrons in the bonds or the lone pairs are expected to do. The three black arrows can be accompanied by either two blue arrows to give five in all, or just four if the two blue arrows are replaced by a single red one.

The choreographer in our performance is actually going to be a density functional quantum mechanical calculation (ωB97XD/Def2-TZVPP/SCRF=water, data at DOI: 10.14469/hpc/7027 since you ask), which has the single minded intention of ensuring that the cast is at the lowest possible energy at each stage of the ballet. The performance is shown below with X=O in the cast (water). Water is a poor base; its ability to grab a proton is weak. 

We can also show the entire dance using an Intrinsic Reaction Coordinate or IRC, this being the lowest energy pathway that the cast can achieve along this particular route to the end. Watch the animation above to see the performance! The catalyst (X=O remember) firstly gets into the best position to grab a proton from the S-H group, using its lone pair located on the oxygen (the base). It is helped by the bit-playing second water molecule, which forms an assisting support to the (let's call her) ballerina via a strong hydrogen bond. Having grabbed the proton from the ballerino, the catalyst transforms (temporarily) into a hydronium cation, paired now with a thiolate anion as an ion-pair. Temporarily, because this sort of arrangement is called a “hidden intermediate” in that this ion-pair is hidden, never actually forming. The water needs considerable help to become protonated (remember, it is a weak base), with the assisting water bit-player helping to stabilize the hydronium cation by a strong hydrogen bond it has formed.

The transition state for the reaction. Click to view 3D model. The vibration is that of the “transition state normal mode” as the molecule goes over the top of the barrier.

We now introduce the (relative) energy of the entire collection of molecules and have reached the stage of IRC=-1 on the X-axis. One final push is now needed, in which two things happen. Firstly, a S-C bond is formed (IRC = 0.0, the transition state) but as soon as it starts forming so does the rather unhappy hydronium cation relieve itself of the unwelcome proton it just acquired, by off-loading it onto the oxygen of the acrolein. You can see the structure of this transition state above (click on the image to turn it into a rotatable 3D model)

The catalyst is back to where it started (along with its bit-playing partner) and we now have a completed reaction and it all happened as a single act ballet (we call this a concerted performance). The products are lower in energy than the starting point, which is always good! Molecules tend to be lazy and do not much like becoming higher in energy (ATP, or adenosine triphosphate is a famously unlazy molecule which is very good at acquiring lots of energy and redistributing it about our bodies to feed our muscles).

We can look at another property which tells us a bit more about the curly arrows, which represent rearrangement of electrons within the molecule. If they get separated, their charges also become separated and this is reflected in the dipole moment along the reaction coordinate. In the early stages, blue arrow 1 starts to form a hydrogen bond from the lone pair of the water to the hydrogen on the S. As it does this, the dipole moment decreases. At the point that the proton finally decides to hop from the sulfur to the approaching water oxygen, the charge separation shoots up, reaching its maximum at IRC = -1 (IRC = 0 by the way represents the energy high point for the process, called the transition state).

I want now to address the vital point of why I drew two different arrangements of curly arrows, one with two blue arrows (1 and 5) and the other with just one red arrow (6). If we had instead used just the latter, then we would have been obliged to transfer both protons at exactly the same time. So blue arrow 1 is a better representation of what is actually going on. Only now do the black arrows 2-4 get into the performance, forming the S-C bond (2), reducing the first double bond in the acrolein to single, whilst reforming it adjacently (3) and transforming the second C=O double bond into C-O and O-H bonds (4). This encourages the second blue arrow (5) to, concurrently with the black arrows, transfer a proton and reform the lone pair onto the original oxygen of the water catalyst.

Let us now change the cast, replacing the original water catalyst with an ammonia (X=NH). Because N has a smaller nuclear charge than oxygen, it is happier at sharing its lone pair with a proton; it is said to be more basic. This means that an ammonium cation is a more willing performer than the hydronium cation. The ballet now occurs in two acts rather than one. The first act involves that now basic nitrogen removing the proton from the SH (arrow 1+2), but with arrow 2 ending up residing entirely on the S (as a sulfur lone pair) rather than immediately going on to form a S-C bond.

Act 1: Proton transfer from S to N.

There is then an intermission when the newly formed ion-pair takes a break, followed by the second act starting with a slightly different arrow 2 (it starts not at the S-H bond, but has put on a new costume during the break to start as a new lone pair formed on the S) creating the new S-C bond. There is another difference compared to the water catalyst; the ammonium cation is now slightly reluctant to relinquish that proton and this only happens right at the end.

Act 2: Carbon-sulfur bond formation/Proton transfer. Click to view 3D model.

The energy high point is again S-C bond formation (IRC = 0.0), and the barrier the molecules needed to overcome to reach the energy high point is much lower than before. The nitrogen hangs on to its newly acquired proton until IRC = -2 and the reaction does look complete by IRC = -10. But in a final flourish (let’s call it an encore) something happens between IRC = -10 and -15. Miffed at having to part with a hydrogen it had become fond of, the nitrogen lone pair instead now makes friends with a C-H bond (as part of a hydrogen bond; it is not basic enough to entirely remove a hydrogen from a carbon).

The language has been slightly anthropomorphic, but we have covered a lot of chemistry with this reaction and learnt a lot about the sequence in which bonds form and how curly arrows can be used to relate to this process.


The encore: We can check to see if this last part comes purely from the fevered imagination of the density functional calculation or whether there is a basis in reality for this new friendship. The plot below comes from a search of all known crystal structures for organic molecules (which recently passed one million). Of these, 21 exhibit a CH…N distance < 2.45Å and the “hotspot” (in red) indicates that the strongest of these is ~2.15Å and that the C-H…N angle is approximately linear. So the effect is real!
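The two geometric quantities used in such a search, the H…N distance and the C-H…N angle, are easy to compute from Cartesian coordinates. Here is a minimal sketch; the coordinates are an idealised, hypothetical linear contact (not taken from any crystal structure) and the 2.45Å cutoff is simply the one quoted above.

```python
import math


def angle_deg(a, centre, b):
    """Angle a-centre-b in degrees."""
    v1 = [p - q for p, q in zip(a, centre)]
    v2 = [p - q for p, q in zip(b, centre)]
    dot = sum(x * y for x, y in zip(v1, v2))
    cos_theta = dot / (math.hypot(*v1) * math.hypot(*v2))
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    return math.degrees(math.acos(cos_theta))


# Hypothetical, idealised linear C-H...N arrangement (coordinates in Å).
c = (0.0, 0.0, 0.0)
h = (1.09, 0.0, 0.0)
n = (3.24, 0.0, 0.0)

d_hn = math.dist(h, n)          # H...N distance
theta = angle_deg(c, h, n)      # C-H...N angle, measured at H
print(f"H...N = {d_hn:.2f} Å, C-H...N = {theta:.0f}°, short contact: {d_hn < 2.45}")
```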


See also this post for the non-catalysed version of this reaction.


This post has DOI: http://doi.org/dr96

A cascading tutorial in finding rich NMR data using the Datacite datasearch engine.

April 11th, 2020

In the previous post, I introduced three of a new generation of search engines specialising in the discovery of data. Data has some special features which make its properties slightly different from the conceptual (or natural language) searches we are used to performing for general information and so a search engine specifically for data is invariably going to reflect this. At the simplest level, the data search can retain much of the generic simplicity of a regular search, but to exploit the unique features of data, one really does have to move on to an advanced mode. Here, by introducing a set of search definitions that gradually increase in specificity and power, I hope to convey some of the flavour of one way in which this could be done.


Let me first introduce the search: we want to track down raw NMR FID data for the 11B nucleus associated with the chemical concepts of catalytic amidation.


To understand how to construct a search query which is specific to this set of constraints, one has to understand metadata and in particular its context of describing data. This is done via a specification known as a schema. We are going to exploit one of the better known schemas for describing data, that produced by DataCite[1] (DOI: 10.14454/f2wp-s162). It can be illustrated by just three small metadata components, which can be implemented in say an XML language and the properties controlled by their specification in the schema and shown below, with the actual value of the metadata highlighted in red.

  1. <titles>
      <title>
      16b. 2-((2-aminoethyl)-λ4-azaneyl)-2,4,6-tris(3,4,5-trifluorophenyl)-1,3,5,2,4,6-trioxatriborinan-2-uide
      </title>
    </titles>
    
  2. <descriptions>
      <description descriptionType="Other">NMR spectra for 1H, 13C, 19F and 11B nuclei.</description>
    </descriptions>
    
  3. <subjects>
      <subject subjectScheme="inchi" schemeURI="http://www.inchi-trust.org/">
      InChI=1S/C20H14B3F9N2O3/c24-12-3-9(4-13(25)18(12)30)21-35-22(10-5-14(26)19(31)15(27)6-10)37-23(36-21,34-2-1-33)11-7-16(28)20(32)17(29)8-11/h3-8H,1-2,33-34H2/q-1
       </subject>
      <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">BHYQUOWHUMNGMD-UHFFFAOYSA-N
       </subject>
      <subject subjectScheme="NMR_Nucleus">11B</subject>
      <subject subjectScheme="NMR_Solvent">CDCl3</subject>
    </subjects>
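To show how such components can be read by a program, here is a minimal sketch using Python's standard XML parser. The fragment is a cut-down stand-in: the enclosing resource element is written without the XML namespace that a full DataCite record would normally declare, and the title is a placeholder, so treat it as an illustration rather than a valid record.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free stand-in for a metadata record (title is a placeholder).
FRAGMENT = """\
<resource>
  <titles>
    <title>Placeholder dataset title</title>
  </titles>
  <descriptions>
    <description descriptionType="Other">NMR spectra for 1H, 13C, 19F and 11B nuclei.</description>
  </descriptions>
  <subjects>
    <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">BHYQUOWHUMNGMD-UHFFFAOYSA-N</subject>
    <subject subjectScheme="NMR_Nucleus">11B</subject>
    <subject subjectScheme="NMR_Solvent">CDCl3</subject>
  </subjects>
</resource>
"""

root = ET.fromstring(FRAGMENT)

# Map each subjectScheme attribute to the value held by that subject element.
subjects = {s.get("subjectScheme"): s.text.strip() for s in root.iter("subject")}
description = root.findtext("descriptions/description")

print(subjects)
print(description)
```

The path descriptions/description mirrors the hierarchy of the metadata, which is the same hierarchy that the search syntax exploits.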
    

The metadata is registered with a store (MDS, DataCite in this instance) in this form and then indexed there. To search that index, we need to learn the query syntax and expression. This is illustrated below for various examples, which can be broken down into components:

  1. The prefix https://search.datacite.org/works?query= is common to all the queries, and hence is only shown for example 1.
  2. The syntax e.g. titles.title: derives from the hierarchy of the metadata, as in 1 above.
  3. Immediately followed by a search string. The * character means the string may be part of a longer string, both preceding and following the actual search string. A literal string would be enclosed in quotes, “…”
  4. Two or more separate queries can be related by a Boolean operator, as +AND+ or +OR+.
  5. The Boolean operations can be grouped using (…) to ensure the logic is unambiguous.
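The five components above can be assembled mechanically. The following is a minimal sketch of one way to do so; the helper names are my own invention for illustration and are not part of any DataCite software. It reproduces the query for example 3 in the table.

```python
PREFIX = "https://search.datacite.org/works?query="


def term(field, value, wildcard=True):
    """One field:value component; '*' either side marks a substring match."""
    return f"{field}:*{value}*" if wildcard else f"{field}:{value}"


def any_of(*parts):
    """Group alternatives with +OR+ inside brackets."""
    return "(" + "+OR+".join(parts) + ")"


def all_of(*parts):
    """Require every component, joined with +AND+."""
    return "+AND+".join(parts)


def title_or_description(word):
    """Allow a keyword to appear in either the title or the description."""
    return any_of(term("titles.title", word),
                  term("descriptions.description", word))


query = all_of(title_or_description("amidation"),
               title_or_description("catalytic"))
print(PREFIX + query)
```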

With the syntax dealt with, we can now proceed to some actual queries. The hits shown were obtained on the day this post was written, and may change with time (hopefully but not necessarily upwards). A brief attempt at a natural language expression of each search appears in the table below, with the Boolean operators indicated in red. Each example is elaborated below to show the logic of their evolution.

Examples 1-9 deal with keywords typically found in either the title or the description metadata fields. Because there are no hard and fast rules as to which of these two any particular keyword might be found in, searches have to be defined which allow both possibilities. Search 2 seeks to find datasets where both keywords are found in a title (or indeed titles, since multiple titles for the same dataset are allowed). Search 3 allows each term to be found in either the title(s) or the description(s) using grouping operators; the difference in hits shows the necessity of doing this. The search outlined at top also indicated we specifically wanted NMR data. Searches 4-5 look for this term in the description, or in either the title or the description. We are now assuming that NMR really does relate to spectroscopy and not some other acronym in use by another community. This can be a real problem if the same term has different meanings across different subject areas. In examples 6-9, we now turn to boron, since 11B NMR requires a boron compound! Allowing any of the terms to appear either as a title or a description increases the hits compared to more restricted searches.

Time now to restrict the searches even more. In the previous searches, we had identified a potential discovery lead (i.e. one we might wish to follow up in more detail). Looking this lead up, we find its molecular formula, a very useful chemical search term. Because this is quite subject specific, we now turn to <subject> rather than <title> or <description>. Search 10 illustrates how this might be done. Search 11 is even more specific; whereas it is possible that two different chemical species might share a common molecular formula (as isomers), their chemical identifiers (InChI and InChIKey) should be much more specific. These latter two can be generated algorithmically for any given compound and so should return information about that specific molecule. Search 12 now combines this search with the 11B nucleus specified as a description, and search 13 generalises it to title as well.
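Search 10 works because the molecular formula is embedded in the InChI string itself, as the layer immediately after the version prefix. Here is a small sketch that pulls the formula out of the InChI shown earlier and checks that a string has the general shape of an InChIKey (14, 10 and 1 upper-case letters separated by hyphens). It only inspects strings; generating an identifier from a structure needs cheminformatics software, and some unusual InChIs have no formula layer at all.

```python
import re

INCHI = ("InChI=1S/C20H14B3F9N2O3/c24-12-3-9(4-13(25)18(12)30)21-35-22"
         "(10-5-14(26)19(31)15(27)6-10)37-23(36-21,34-2-1-33)11-7-16(28)"
         "20(32)17(29)8-11/h3-8H,1-2,33-34H2/q-1")

INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def formula_from_inchi(inchi):
    """Return the second '/'-separated field, normally the molecular formula."""
    parts = inchi.split("/")
    if len(parts) < 2 or not parts[0].startswith("InChI="):
        raise ValueError("not an InChI string")
    return parts[1]


def looks_like_inchikey(text):
    """True if text has the 14-10-1 upper-case block shape of an InChIKey."""
    return bool(INCHIKEY_PATTERN.match(text))


print(formula_from_inchi(INCHI))
print(looks_like_inchikey("BHYQUOWHUMNGMD-UHFFFAOYSA-N"))
```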

We are now ready to go to the next level of refinement, that of media types. These are descriptors which identify the type of document in which the data is held. We are all familiar with e.g. .docx as belonging to the Microsoft Word family, originating in early computer operating systems where each document or file name had two components, with the suffix indicating the application (family) likely to be able to process it or the application to be used when the document is double clicked on the desktop. So in search 14, we combine a search of NMR in the title or description with the media type application/zip. We know that Bruker spectrometers export their data in a folder containing about 24 components and this is generally packaged up as a ZIP archive to make it tractable for submission and exchange. We do not know for sure what will be in the ZIP archive, but in combination with the title/description we may be reasonably optimistic (but not certain). However, a ZIP file identified and downloaded by this procedure still has to be accessed in a manner that will recognise any NMR data therein. This function must now be devolved to whatever program is used to access the ZIP file. 
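As a hint of what "recognising NMR data therein" might involve, here is a minimal sketch that lists the members of a ZIP archive whose file names suggest raw Bruker time-domain data. It assumes the usual Bruker convention of files called fid (1D) or ser (multi-dimensional) sitting alongside parameter files such as acqus; the archive below is built in memory with invented contents purely so that the example runs.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Assumed Bruker names for raw 1D and multi-dimensional time-domain data.
BRUKER_RAW_NAMES = {"fid", "ser"}


def bruker_raw_members(archive):
    """List archive members whose file name suggests raw Bruker data."""
    with zipfile.ZipFile(archive) as zf:
        return [name for name in zf.namelist()
                if PurePosixPath(name).name in BRUKER_RAW_NAMES]


# Build a tiny stand-in archive in memory (paths and contents are made up).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("11B/1/fid", b"\x00" * 16)
    zf.writestr("11B/1/acqus", "##TITLE= placeholder\n")
    zf.writestr("11B/1/pdata/1/1r", b"\x00" * 16)
buffer.seek(0)

print(bruker_raw_members(buffer))
```

A name match is of course only a hint; a program that actually processes the data still has to read the accompanying parameter files.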

In search 15, we try to be a bit more specific by combining the molecular identifier (InChIKey) with 11B (an NMR active nucleus) in a title or description and a JCAMP-DX media type. This latter type is more clearly associated with NMR spectroscopic data in JCAMP format, so the expectation is that any hits for this search sequence should provide us with an actual NMR spectrum! There is a slight spanner in the works; we do not yet know whether to expect processed NMR data (i.e. a spectrum) or raw NMR data (i.e. an FID), since JCAMP can hold either (but not both; most examples in fact relate to spectra). Example 16 takes us to a media type which IS known to hold both raw and spectral data concurrently, the Mnova format. But this again leads to a new issue. Mnova is commercial software and to use it you need a license. It would be indeed cruel if you managed to find some data, but then had to pay money to view it in its commercial format (although of course that is how some journals operate). Example 17 addresses that problem. The media type is associated not with a data file as such, but with a single-use license file which can be read by Mnova to license the program to read the actual data file. You can now view the data in either FID or spectral form and process the data to your heart’s content. This largely encapsulates the aspiration of the acronym FAIR. We have Found and Accessed the data, Interoperated (i.e. converted an FID to a spectrum) and Re-used it (having checked the re-use license in the metadata) to e.g. analyze the spectrum.
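Once a JCAMP-DX file has been downloaded, its header can usually settle the raw-versus-processed question. The sketch below reads the labelled records at the top of a file and inspects the DATA TYPE record. It assumes values along the lines of "NMR FID" and "NMR SPECTRUM", which is my understanding of the JCAMP-DX NMR convention; files written by particular vendors may differ, and the sample text is invented.

```python
def jcamp_labels(text):
    """Collect ##LABEL=value records from JCAMP-DX text."""
    labels = {}
    for line in text.splitlines():
        if line.startswith("##") and "=" in line:
            label, value = line[2:].split("=", 1)
            # Normalise the label: upper case, letters and digits only.
            key = "".join(ch for ch in label.upper() if ch.isalnum())
            labels.setdefault(key, value.strip())
    return labels


def raw_or_processed(text):
    """Classify a file from its DATA TYPE record (assumed value conventions)."""
    data_type = jcamp_labels(text).get("DATATYPE", "").upper()
    if "FID" in data_type:
        return "raw (FID)"
    if "SPECTRUM" in data_type:
        return "processed (spectrum)"
    return "unknown"


# Invented header for illustration only.
SAMPLE = """\
##TITLE=hypothetical 11B example
##JCAMP-DX=5.01
##DATA TYPE=NMR SPECTRUM
##END=
"""

print(raw_or_processed(SAMPLE))
```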

Example 18 takes us to our final level. Previously the acronym NMR was used as a search term. You might be surprised to learn that it can have up to 33 meanings! In this context, we are interested in only one of them (nuclear magnetic resonance). So rather than imprecisely specify it in a title or a description, we are now going to (also) give it a more precise meaning using <subject>. The exact way in which to do this is still being debated; here is one possibility. Elaborating list item 3 above, we get
subjects.subjectScheme:NMR_Nucleus+AND+subjects.subject:11B
which is used to disambiguate from the other 32 possible meanings of NMR. Hence we are interested specifically in the 11B nucleus. We are controlling the data itself to relate to NMR data about that nucleus, using the media type. And example 19 now specifies also that the measurement must be made in a particular solvent. There are of course many other parameters which could be used.
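Each scheme/value pair can be generated by one small helper and the pairs then chained together. This is only a sketch with an invented helper name; it reproduces the strings for examples 18 and 19 in the table, and like them it relies on the scheme names NMR_Nucleus and NMR_Solvent, which as noted are still under discussion.

```python
def subject(scheme, value, wildcard=False):
    """Pair a subjectScheme with a subject value inside one bracketed group."""
    value = f"*{value}*" if wildcard else value
    return f"(subjects.subjectScheme:{scheme}+AND+subjects.subject:{value})"


parts = [
    "media.media_type:chemical/x-mnpub*",
    subject("inchikey", "BHYQUOWHUMNGMD-UHFFFAOYSA-N", wildcard=True),
    subject("NMR_Nucleus", "11B"),
    subject("NMR_Solvent", "CDCl3"),
]

query_18 = "+AND+".join(parts[:3])   # identifier + nucleus
query_19 = "+AND+".join(parts)       # ... + solvent

print(query_18)
print(query_19)
```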

# Search query Hits Plain(er) English description
General keywords such as Title and Description
1 https://search.datacite.org/works?query=titles.title:*amidation* 161 Amidation in title.
2 titles.title:*amidation*+AND+titles.title:*catalytic* 2 Amidation AND catalytic in title.
3 (titles.title:*amidation*+OR+descriptions.description:*amidation*)+AND+(titles.title:*catalytic*+OR+descriptions.description:*catalytic*) 28 Amidation in either title OR description AND Catalytic in either title OR description.
4 descriptions.description:*NMR* 17,978 NMR in description
5 descriptions.description:*NMR*+OR+titles.title:*NMR* 26,152 NMR in either title OR description.
6 titles.title:*boron*+AND+titles.title:*catalysed* 20 Boron AND Catalysed in title.
7 titles.title:*boron*+AND+titles.title:*catalysed*+AND+titles.title:*NMR* 1 Boron AND Catalysed AND NMR in title.
8 titles.title:*boron*+AND+titles.title:*catalysed*+AND+(titles.title:*NMR*+OR+descriptions.description:*NMR*) 3 Boron AND Catalysed in Title and NMR in either title OR description.
9 (titles.title:*boron*+OR+descriptions.description:*boron*)+AND+(titles.title:*catalysed*+OR+descriptions.description:*catalysed*)+AND+(titles.title:*NMR*+OR+descriptions.description:*NMR*) 6 Boron AND Catalysed AND NMR in either title OR description.
Discovery lead: 10.14469/hpc/2247
Subject keywords
10 subjects.subjectScheme:inchi+AND+subjects.subject:*C20H14B3F9N2O3* 1 Molecular formula in subject.
11 subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N* 1 InChIkey in subject.
12 subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+descriptions.description:*11B* 1 InChIkey in subject AND 11B in description.
13 subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+(descriptions.description:*11B*+OR+titles.title:*11B*) 1 InChIkey in subject AND 11B in either description OR title.
Discovery lead:10.14469/hpc/2365
14 media.media_type:application/zip+AND+(descriptions.description:*NMR*+OR+titles.title:*NMR*) 219 NMR in either title OR description AND media type which might contain (Bruker spectrometer) FID data. As it happens, all 219 ZIP files in this instance do.
15 media.media_type:chemical/x-jcamp*+AND+subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+(descriptions.description:*11B*+OR+titles.title:*11B*) 1 InChIkey in subject AND 11B in either description OR title AND Media type known to contain spectral NMR data (and possibly raw NMR data).
16 media.media_type:chemical/x-mnova*+AND+subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+descriptions.description:*11B* 1 InChIkey in subject AND 11B in description AND Media type known to contain both raw and spectral data (probably NMR).
17 media.media_type:chemical/x-mnpub*+AND+subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+descriptions.description:*11B* 1 InChIkey in subject AND 11B in description AND Media type known to contain a license for use of MestreNova.
18 media.media_type:chemical/x-mnpub*+AND+(subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*)+AND+(subjects.subjectScheme:NMR_Nucleus+AND+subjects.subject:11B) 1 InChIkey in subject AND 11B Nucleus in Subject AND Media type known to contain a license for use of MestreNova for the dataset.
19 media.media_type:chemical/x-mnpub*+AND+(subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*)+AND+(subjects.subjectScheme:NMR_Nucleus+AND+subjects.subject:11B)+AND+(subjects.subjectScheme:NMR_Solvent+AND+subjects.subject:CDCl3) 1 InChIkey in subject AND 11B Nucleus in Subject AND Media type known to contain both raw and spectral data AND solvent chloroform in subject.

The searches above are meant to be illustrative and to serve as a tutorial showing one way of constraining a data search to have very specific, in this example chemical, properties. Many of the examples could be tightened up further (thus making them look even more intimidating). Also, some of the precise ways of defining such constraints are still being debated. In the above, I use the definitions found in the schema together with the media types property. It would also be possible to e.g. dispense with the media types and achieve this using the other properties obtained from the schema. When the dust settles (if it ever does) on this, it is quite possible the searches will look rather different from the above. The purpose here was not to set any standards in stone, but simply to illustrate the potential of searching for data in this manner. Other methods may emerge; the Google dataset search system does not use the same schema for example and so the searches themselves would also look different.

It should also be mentioned that the examples in the table above are not likely, in their present form, to be willingly used by most chemists. These queries are largely formulated in a syntax more suited for machines than for humans. But there is nothing to prevent a more human-friendly “front end” being written that takes the quite complex syntax above and renders it more usable by people. Such a front end could also absorb queries formulated against different schemas and unify them for the user.


You can see a more complete set here.

Of course, the 11B nucleus can have many properties other than NMR.

Programs such as MestreNova can do this, but you will need a commercial license to process in this way. If there is a media type chemical/x-mnpub also associated with the ZIP file, then this can be used in lieu of such a license key for that dataset only. See examples 17-19.

BagIt is one schema for adding metadata to a container such as ZIP to indicate the contents, albeit with the requirement that the software reading the ZIP file must process this information for it to be of use.

This post has DOI: drrm.

References

  1. DataCite Metadata Working Group., "DataCite Metadata Schema for the Publication and Citation of Research Data v4.3", DataCite, 2019. https://doi.org/10.14454/f2wp-s162

New generations of globally aggregating search engines – for (chemical) data.

April 7th, 2020

Chemists have long been familiar with search engines that aspire to index a large proportion of the chemical literature. Think for example of the old-generation (and commercial) SciFinder (Scholar) and Reaxys or those that arrived in the 2000s in the online era such as the non-commercial PubChem or ChemSpider (there are more). But you may not be as familiar with the latest generation of global search engines and here I will focus on three relatively new ones that specialise specifically in tracking down data rather than just publications.

I will illustrate first using a regular or non-advanced search. The keyword will be obtusallene, which is selected largely because it is a fairly distinctive string which is likely to result in fewer false positives. It is a family of marine natural products containing, unusually, bromine and/or chlorine[1] and the citation here is to a journal article describing some of its chemistry. But what if you want to find data associated with such molecules?

  1. DataCite (the name gives a clue) specialises in finding data. It was launched ten years ago and has been rapidly expanding its index since. A regular search can be formulated using the string

    As these three advanced queries imply, there are many more ways of constraining the search, which I will describe at a later time.

  2. A more recent introduction is DataSetSearch from Google.
    • https://datasetsearch.research.google.com/search?query=obtusallene (20 hits). Google cites as its sources DataCite itself and the specific repository Figshare (for this search query). 
    • Which leaves a slight mystery. Whilst there is considerable overlap between the DataCite and Google searches, the latter should clearly be potentially a superset of the former, but in fact it is slightly less comprehensive (by at least 5 hits).
  3. My third new engine is OpenAIRE (a European project supporting Open Science). It is also the search engine provided by Zenodo.
    • https://explore.openaire.eu/search/find?keyword=obtusallene (20 hits on research data, 6 hits on publications, 5 hits on “other research products” and zero hits on “software”).
    • Which introduces not just data but other concepts associated with “research objects”, clearly more useful than data alone. One of these may well shortly be Instruments (as eg used to acquire data) and another is e.g. the software used to analyze the data.
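
The two query links quoted in the list above share a simple pattern, which can be sketched in a few lines of Python. This is only an illustration: the base URLs and parameter names are copied from the links shown, while the dictionary `ENGINES` and the function `search_urls` are names invented here.

```python
from urllib.parse import urlencode

# Base URL and query parameter for each engine, copied from the links in the list above.
ENGINES = {
    "Google Dataset Search": ("https://datasetsearch.research.google.com/search", "query"),
    "OpenAIRE": ("https://explore.openaire.eu/search/find", "keyword"),
}

def search_urls(term):
    """Return one simple (non-advanced) search URL per engine for the given keyword."""
    return {
        name: f"{base}?{urlencode({param: term})}"
        for name, (base, param) in ENGINES.items()
    }

urls = search_urls("obtusallene")
for name, url in urls.items():
    print(f"{name}: {url}")
```

Using `urlencode` rather than plain string concatenation means that a keyword containing spaces or other special characters is still encoded safely.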

I think these new-generation search engines specialising in data have lots of exciting potential. They are still maturing and I hope we will see some interesting new capabilities emerge which we have not had before.


All are on-line nowadays, but engines such as SciFinder had two previous existences, from about 1980 as CAS online using merely a terminal interface, and prior to that as printed copies to be searched manually.

References

  1. J. Clarke, K.J. Bonney, M. Yaqoob, S. Solanki, H.S. Rzepa, A.J.P. White, D.S. Millan, and D.C. Braddock, "Epimeric Face-Selective Oxidations and Diastereodivergent Transannular Oxonium Ion Formation Fragmentations: Computational Modeling and Total Syntheses of 12-Epoxyobtusallene IV, 12-Epoxyobtusallene II, Obtusallene X, Marilzabicycloallene C, and Marilzabicycloallene D", The Journal of Organic Chemistry, vol. 81, pp. 9539-9552, 2016. https://doi.org/10.1021/acs.joc.6b02008

Substituent effects on the mechanism of Michael 1,4-Nucleophilic addition.

March 29th, 2020

In the previous post, I looked at the mechanism for 1,4-nucleophilic addition to an activated alkene (the Michael reaction). The model nucleophile was malonaldehyde after deprotonation and the model electrophile was acrolein (prop-2-enal), with the rate determining transition state being carbon-carbon bond formation between the two, accompanied by proton transfer to the oxygen of the acrolein.

Here I look at the effect of changing one of the aldehyde groups on the malonaldehyde to a variety of others and in particular how this might affect the relative timing of the C-C formation and the accompanying proton transfer to oxygen. Will this vary with substituents?

The activation free energies for TS2 are shown below, showing that as the acidity of the proton on the incipient nucleophile decreases along the series R=NO2 to R=H, the free energy barrier goes up. 

Substituent ΔΔG298 (TS2), kcal/mol
NO2 11.5
CHO 16.3
CN 16.7
OMe 31.9
H 35.8

The asynchrony of the C-C formation and the PT is clearly shown for R=NO2. This can be seen most clearly when the gradient norm along the reaction path is plotted. This has TWO maxima at IRC 0.5 and 1.4, with a hidden (zwitterionic) intermediate in-between.

For R=H the gradient norm peaks are at IRC 0.8 and 2.1; the reaction is equally asynchronous. If you are wondering why the barrier looks smaller for R=H than for R=NO2 it is because Int1 is a lot less stable for R=H (= more reactive) than for nitro.

So this was a surprise in the end. Unlike substituent effects on electrophilic peracid epoxidation of an alkene,[1] nucleophilic addition to an alkene does not seem to exhibit a large substituent effect on its choreography.

References

  1. J.E.M.N. Klein, G. Knizia, and H.S. Rzepa, "Epoxidation of Alkenes by Peracids: From Textbook Mechanisms to a Quantum Mechanically Derived Curly‐Arrow Depiction", ChemistryOpen, vol. 8, pp. 1244-1250, 2019. https://doi.org/10.1002/open.201900099

The mechanism of Michael 1,4-Nucleophilic addition: a computationally derived reaction pathway.

March 25th, 2020

In 2013, I created an iTunesU library of 115 mechanistic types in organic and organometallic chemistry, illustrated using video animations of the intrinsic reaction coordinate (IRC) computed using a high level quantum mechanical procedure. Many of those examples first derived from posts here. That collection  is still available and is viewable  in the iTunesU app on an iPhone or an iPad. The realisation struck me now that one of the types not described in that library was Michael-type 1,4-nucleophilic addition to an activated alkene, as described at Wikipedia. So here is that addition.

The base used will be NH3 and the activating groups R will all be formyl. The DFT computational method will be ωB97XD/Def2-TZVPP/SCRF=water and the FAIR data will collect at DOI: 10.14469/hpc/7027

The full reaction mechanism can be represented as below

Species ΔΔG298, kcal/mol FAIR Data DOI
Reactant 0.0 7028
TS1 6.7 7029
Int1 -7.7 7036
TS2 16.3 7031
Int2 -8.7 7033
Int3 -8.7 7035
TS3 9.6 7030
Product -13.2 7034

The rate-limiting step of C-C bond formation is coupled with almost synchronous protonation on the remote oxygen. It is driven by reducing the dipole moment of the zwitterion Int1, as shown below.
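
To put the numbers in the table into perspective, here is a minimal sketch that extracts a barrier for each transition state and converts it to a rate constant using the Eyring equation. Two assumptions are made that are not in the original analysis: each barrier is measured from the lowest minimum that precedes the transition state on the profile, and the transmission coefficient is taken as 1. The function names are illustrative only.

```python
import math

KB_OVER_H = 2.083661912e10  # Boltzmann constant / Planck constant, s^-1 K^-1
R_KCAL = 1.987204e-3        # gas constant, kcal mol^-1 K^-1

# Relative free energies (kcal/mol) in the order given in the table above.
PROFILE = [
    ("Reactant", 0.0), ("TS1", 6.7), ("Int1", -7.7), ("TS2", 16.3),
    ("Int2", -8.7), ("Int3", -8.7), ("TS3", 9.6), ("Product", -13.2),
]

def barriers(profile):
    """Barrier for each TS, measured from the lowest preceding minimum (an assumption)."""
    lowest = float("inf")
    result = {}
    for name, energy in profile:
        if name.startswith("TS"):
            result[name] = energy - lowest
        else:
            lowest = min(lowest, energy)
    return result

def eyring_rate(barrier, temperature=298.15):
    """First-order Eyring rate constant in s^-1, transmission coefficient assumed to be 1."""
    return KB_OVER_H * temperature * math.exp(-barrier / (R_KCAL * temperature))

found = barriers(PROFILE)
for ts, barrier in found.items():
    print(f"{ts}: barrier {barrier:.1f} kcal/mol, k = {eyring_rate(barrier):.2e} s^-1")
```

Under these assumptions the largest barrier is the one leading to TS2, measured from Int1, which is consistent with TS2 being described as rate-limiting.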

Attempts to find an analogous route with carbon protonation leading directly to the product did not succeed.

By varying parameters such as the nature of the R groups or the base, one might be able to control the choreography of the C-C bond formation relative to the accompanying proton transfer to oxygen in TS2 (in the manner that was possible for e.g. peracid epoxidation[1]). These changes could then be subjected to e.g. the measurement of kinetic isotope effects and comparison with values calculated from the computational mechanism.


With Coronavirus now changing our lives and our work patterns, and having done my allowed quota of one exercise walk for the day at 06.30 (to avoid social contact, although in fact the park we went to had lots of other people exercising, even at that time) I settled down to think about what else could be done. The Michael reaction suddenly appeared! Locating transition states is one of those things that gives me considerable pleasure, and I have not reported any for a few posts now.

References

  1. J.E.M.N. Klein, G. Knizia, and H.S. Rzepa, "Epoxidation of Alkenes by Peracids: From Textbook Mechanisms to a Quantum Mechanically Derived Curly‐Arrow Depiction", ChemistryOpen, vol. 8, pp. 1244-1250, 2019. https://doi.org/10.1002/open.201900099

The Persistent Identifier ecosystem expands – to instruments!

March 21st, 2020

A PID or persistent identifier has been in common use in scientific publishing for around 20 years now. It was introduced as a DOI (Digital Object Identifier), and the digital object in this case was the journal article. From 2000 onwards, DOIs started appearing for most journal articles, journals having obtained them from a registration agency, CrossRef. This is a not-for-profit organisation set up by a publishers association for the purpose. Most readers of journal articles started to use this DOI as an easier way of navigating through invariably different and sometimes confusing metaphors set up by any given journal to navigate through its issues. Readers slowly learnt to prepend the URL http://dx.doi.org/ to the DOI to “resolve” it directly to what is known as the “landing page” of the article. More recently, the prefix recommendation has changed to the slightly shorter https://doi.org/ form. Few readers are aware  however that the DOI can serve a much more interesting purpose than just taking you to the article landing page. This post will explore a few of these extras.

  1. Firstly, a DOI has something called metadata associated with it, and you can view this metadata by prepending a different prefix, such as https://api.crossref.org/works/ to a DOI (as in https://api.crossref.org/works/10.1021/acsomega.8b03005) This returns a “machine response”, since this is very much the audience this version of a resolved DOI is intended for. A simple example of why this can be useful can be seen at the end of this blog post.
  2. An alternative prefix is https://data.datacite.org/application/vnd.datacite.datacite+xml/ and this brings us to the next big deployment of persistent identifiers, starting around 2010 with the focus now on data. The PID is still called a DOI, but the digital object is now data (or software) rather than a journal article and the agency registering the metadata is now DataCite rather than CrossRef. So e.g. https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/4844 now returns metadata about data. The usefulness of this has in recent times become encapsulated by the expression FAIR data. The metadata can help you Find the data, Access it, how it might be Interoperable and how to Reuse it.
  3. In 2012 a third prepend of the type https://orcid.org/ was introduced to provide metadata about researchers, as in https://orcid.org/0000-0002-8635-8390
  4. Then in 2019, the growing ecosystem expanded to organisations, as with the new resolver  https://ror.org/ and with PID e.g. 041kmwe10, hence https://ror.org/041kmwe10
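
The four prefixes listed above, together with the standard landing-page resolver, can be collected into a small lookup. The sketch below only builds the URLs and does not fetch anything; the dictionary keys and the function name `resolve` are labels invented for this illustration.

```python
# Resolver prefixes quoted in the post; the short keys are labels chosen for this sketch.
RESOLVERS = {
    "landing": "https://doi.org/",
    "crossref": "https://api.crossref.org/works/",
    "datacite": "https://data.datacite.org/application/vnd.datacite.datacite+xml/",
    "orcid": "https://orcid.org/",
    "ror": "https://ror.org/",
}

def resolve(kind, pid):
    """Prepend the resolver prefix of the given kind to a persistent identifier."""
    return RESOLVERS[kind] + pid

examples = [
    ("crossref", "10.1021/acsomega.8b03005"),
    ("datacite", "10.14469/hpc/4844"),
    ("orcid", "0000-0002-8635-8390"),
    ("ror", "041kmwe10"),
]
for kind, pid in examples:
    print(resolve(kind, pid))
```

Any of the printed URLs can then be passed to an HTTP client to retrieve the corresponding metadata record.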

After this long introduction, it is time to turn to the latest proposed PID type. As the title suggests, it is for instrumentation and it is introduced at https://doi.org/10.5438/tdk2-2g94 (and yes, metadata at https://data.datacite.org/application/vnd.datacite.datacite+xml/10.5438/tdk2-2g94). An example describing the properties of an instrument can be found at DOI: 10.7914/SN/SH and in the chemistry community we can already start asking ourselves questions such as what types of instrument deserve their own PID, and what sort of information about the instrument might be usefully associated with the data and be of interest to other researchers.

These are early days yet for this latest proposal, but already one can start to see how this ecosystem might operate in the future. Consider the scenario. A research team at a specified institute (PID) consisting of say four individuals (PID for each) uses a recently funded (PID) NMR spectrometer (PID) fitted with a special ultra-sensitive low temperature probe (PID), records a collection of individual solution spectra and then publishes both the collection and the raw data from which the spectra are derived, each with their own PID. With the help of quantum simulations (PID) of the spectra, they interpret the molecular structures and confirm this with a crystal structure (PID). A student graduates with a PhD based on this work (PID). Finally they publish their story (PID) in a journal that releases open citations (PID), thank their instrument funders (PID) and perhaps blog about it (PID). Since machines can access the metadata records of all these PIDs, the entire endeavour becomes linked with exchanged information. Starting at any single PID, one should easily be able to trace all the others and locate the data and other information associated with all the aspects of the project.
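
The idea of starting at any single PID and tracing all the others amounts to a graph traversal. The sketch below is purely illustrative: the node names are hypothetical labels standing in for real PIDs, and the links are an invented reading of the scenario above, not the content of any actual metadata record.

```python
from collections import deque

# Hypothetical links between research objects; each label stands in for a real PID.
LINKS = [
    ("article", "dataset"), ("article", "researcher"), ("researcher", "institute"),
    ("dataset", "instrument"), ("instrument", "probe"), ("instrument", "funder"),
    ("dataset", "simulation"), ("article", "crystal-structure"),
    ("article", "thesis"), ("blog", "article"),
]

def build_graph(links):
    """Treat every metadata link as bidirectional."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def trace(graph, start):
    """Breadth-first search: every object reachable from the starting PID."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

GRAPH = build_graph(LINKS)
print(sorted(trace(GRAPH, "probe")))
```

In practice the links would have to be harvested from the metadata records themselves, so the traversal only works as well as those records are populated.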

I used the term future above, but in fact much of the above infrastructure is already operating, albeit in early days mode. So this is one to keep an eye out for; things might happen more quickly than you might think!


Documented at e.g. https://github.com/CrossRef/rest-api-doc#queries
If you want a more human-readable version, use this JSON to XML converter.
This is how the citations at the end of this post are generated. In the post itself they are inserted using e.g. ⌈cite⌉10.1021/acsomega.8b03005⌈/cite⌉ and a plug-in then expands this to a query of the above resource and formats the response to generate the bibliographic details at the end.[1]
Documentation for how to implement this is found at https://github.com/rdawg-pidinst/schema/blob/master/schema.rst and before you ask, no, this one does NOT have a PID!
This blog post has PID: 10.14469/hpc/7016
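
As a rough sketch of what such a citation plug-in has to do (this is not the plug-in's actual code), the example below finds the cite markers in a piece of text and formats a metadata record into the reference style used on this blog. The record is a hand-built stand-in assembled from the reference at the end of this post and laid out with the field names used in CrossRef's JSON responses; no network query is made.

```python
import re

def extract_dois(text):
    """Find every DOI wrapped in the cite markers used in the post."""
    return re.findall(r"⌈cite⌉(.+?)⌈/cite⌉", text)

def format_citation(message):
    """Format a CrossRef-style 'message' dictionary as a one-line reference."""
    names = [f"{a['given']} {a['family']}" for a in message["author"]]
    authors = names[0] if len(names) == 1 else ", ".join(names[:-1]) + ", and " + names[-1]
    year = message["issued"]["date-parts"][0][0]
    return (f'{authors}, "{message["title"][0]}", {message["container-title"][0]}, '
            f'vol. {message["volume"]}, pp. {message["page"]}, {year}. '
            f'https://doi.org/{message["DOI"]}')

# Hand-built stand-in for a metadata response, taken from the reference list of this post.
RECORD = {"message": {
    "DOI": "10.1021/acsomega.8b03005",
    "author": [
        {"given": "A.", "family": "Barba"}, {"given": "S.", "family": "Dominguez"},
        {"given": "C.", "family": "Cobas"}, {"given": "D.P.", "family": "Martinsen"},
        {"given": "C.", "family": "Romain"}, {"given": "H.S.", "family": "Rzepa"},
        {"given": "F.", "family": "Seoane"},
    ],
    "title": ["Workflows Allowing Creation of Journal Article Supporting Information and "
              "Findable, Accessible, Interoperable, and Reusable (FAIR)-Enabled Publication "
              "of Spectroscopic Data"],
    "container-title": ["ACS Omega"],
    "volume": "4",
    "page": "3280-3286",
    "issued": {"date-parts": [[2019]]},
}}

dois = extract_dois("as described earlier ⌈cite⌉10.1021/acsomega.8b03005⌈/cite⌉ in detail")
citation = format_citation(RECORD["message"])
print(dois)
print(citation)
```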

References

  1. A. Barba, S. Dominguez, C. Cobas, D.P. Martinsen, C. Romain, H.S. Rzepa, and F. Seoane, "Workflows Allowing Creation of Journal Article Supporting Information and Findable, Accessible, Interoperable, and Reusable (FAIR)-Enabled Publication of Spectroscopic Data", ACS Omega, vol. 4, pp. 3280-3286, 2019. https://doi.org/10.1021/acsomega.8b03005

The singlet and open shell higher-spin states of [4], [6] and [8]-annulenes and their Kekulé vibrational modes

March 11th, 2020

In 2001, Shaik and co-workers published the first of several famous review articles on the topic A Different Story of π-Delocalization. The Distortivity of π-Electrons and Its Chemical Manifestations[1]. The main premise was that the delocalized π-electronic component of benzene is unstable toward a localizing distortion and is at the same time stabilized by resonance relative to a localized reference structure.  Put more simply, the specific case of benzene has six-fold symmetry because of the twelve C-C σ-electrons and not the six π-electrons. In 2009, I commented here on this concept, via a calculation of the quintet state of benzene in which two of the six π-electrons are excited from bonding into anti-bonding π-orbitals, thus reducing the total formal π-bond orders around the ring from three to one. I focused on a particular vibrational normal mode, which is usefully referred to as the Kekulé mode, since it lengthens three bonds in benzene whilst shortening the other three. In this case the stretching wavenumber increased by ~207 cm-1 when the total π-bond order of benzene was reduced from three to one by spin excitation. In other words, each C-C bond gets longer when the π-electrons are excited, but the C-C bond itself gets stronger (in terms at least of the Kekulé mode). This behaviour is called a violation of Badger’s rule[2] for the relationship between the length of a bond and its stretching force constant. 

This blog has come about because I wanted to revisit my original calculations and complete them with a calculation for a heptet state of benzene in which three π-electrons are promoted from bonding into anti-bonding π-orbitals, thus resulting in a total π-bond order of zero. For completeness, I here present the results not only for benzene but for some other small-annulene systems, both charged and neutral. These are all done at the coupled-cluster level of theory, both CCSD and CCSD(T), along with two basis set levels (see DOI: 10.14469/hpc/6624 for the whole collection of calculations).

Before discussing the other systems, let your eye drop down the table below to the entries in red. These show the force constants for the singlet, quintet and heptet states of benzene vs the optimized C-C bond length for each (at the same level of theory). These confirm the earlier result in revealing that the quintet state (total ring π-bond order 1) has a longer bond but a stronger force constant for the Kekulé mode than the singlet state (total ring π-bond order 3). The heptet state now has a normal length C-C single bond (total ring π-bond order 0) but a Kekulé distortion force constant higher than benzene itself!

Things now start to get more complicated. Firstly, for benzene itself, reducing the remaining π-bond order from 1 to 0 on exciting from quintet to heptet substantially reduces the force constant. So one might conclude that reducing an annulene total π-bond order does not always result in an increase in force constant. Badger’s rule is not always violated and the distortivity of π-electrons may not be a linear phenomenon.

State Method “Kekulé” mode, cm-1 FC, mDyne/Å Reduced mass, AMU Bond length, Å Data DOI

Cyclobutadiene, dication
1A1g CCSD(T)/Def2-TZVPP 1383a 5.4102 6.5340 1.449 6920
1329b 8.0442 7.7338
3B1g CCSD/Def2-TZVPP -2195a -33.0943 11.6515 1.589 6944
-2211b -17.6740 6.1313
3A1g CCSD/Def2-TZVPP -2171a -32.2296 11.6023 1.593 6933
-2189b -16.0614 5.6844
Cyclobutadiene
3A1g CCSD/Def2-SVP 1422a 7.7848 6.5340 1.444 6634
1373b 7.9700 7.1807
CCSD(T)/Def-SVP 1395 7.2647 6.3381 1.449 6643
1345 7.6931 7.2202
CCSD/Def2-TZVPP 1392 6.9173 6.0600 1.438 6671
1342 8.0252 7.5582
CCSD(T)/Def2-TZVPP 1360 7.2647 6.3381 1.449 6672
1310 7.7044 7.6151

5B1g CCSD/Def2-SVP 1192 2.2102 2.6391 1.566 6635
1088 4.9656 7.1153
CCSD(T)/Def-SVP 1176 2.0739 2.5452 1.569 6644
1069 4.7829 7.0983
CCSD/Def-TZVPP 1177 1.8783 2.3023 1.563 6636
1067 5.0294 7.5000
CCSD(T)/Def-TZVPP 1157 1.7555 2.2253 1.568 6678
1045 4.8322 7.5073
Cyclobutadiene Di-anion (isoelectronic with benzene)
1A1g CCSD/Def2-SVP 1283 5.9023 6.0872 1.470 6652
1233 6.3831 7.1172
CCSD(T)/Def2-SVP 1258 5.5888 5.9952 1.475 6653
1209 6.1657 7.1565
CCSD/Def2-TZVPP 1216 4.3487 4.9883 1.467 6676
1165 6.2282 7.7827
CCSD(T)/Def2-TZVPP 1187 4.1223 4.9621 1.473 6679
1138 6.0103 7.8734
Benzene
1A1g CCSD/Def2-SVP 1337 4.9922 4.7425 1.401 6623
1308 10.7126 10.6244
CCSD(T)/Def2-SVP 1359 6.9812 6.4190 1.405 6647
1339 11.3860 10.7785
CCSD/Def2-TZVPP 1309 3.5714 3.5358 1.392 6646
1273 10.1936 10.6709
CCSD(T)/Def2-TZVPP 1328 5.6130 5.4004 1.398 6710
1306 10.9097 10.8588

5A1g CCSD/Def2-SVP 1600 17.4689 11.5855 1.463 6626
1597 17.4799 11.6274
CCSD/Def2-TZVPP 1572 17.1041 11.7511 1.455 6669
1571 17.1770 11.8198

7B1u CCSD/Def2-SVP 1361 11.0797 10.1574 1.550 6632
1355 12.4274 11.4839
Cyclo-octatetraene Dication (isoelectronic with benzene)
1A1g CCSD/Def2-SVP 1750 21.4531 11.8876 1.414 6648
1750 21.5770 11.9620
CCSD(T)/Def2-SVP 1707 20.4695 11.9226 1.420 6673
1707 20.5570 11.9773

5A1g CCSD/Def2-SVP 1702 20.4349 11.9684 1.444 6663
1702 20.4690 11.9899

7B1g CCSD/Def2-SVP 1460 14.8761 11.8407 1.511 6675
1460 15.0578 11.9890
Cyclo-octatetraene
3B2g CCSD/Def2-SVP 1777 22.3143 11.9956 1.409 6637
1777 22.3176 11.9976

7A1g CCSD/Def2-SVP 1637 18.9335 11.9851 1.484 6639
1637 18.9464 11.9941

9B1g CCSD/Def2-SVP 1441 14.6711 11.9988 1.553 6640
1441 14.6723 11.9998
CCSD(T)/Def2-SVP 1421 13.4795 11.3256 1.555 6677
     
Cyclo-octatetraene dianion
1A1g CCSD/Def2-SVP 1731 21.1621 11.9903 1.419 6695
1731 21.1668 11.9937

5A1g CCSD/Def2-SVP 1642 19.0214 11.9711 1.461 6698
1642 19.0408 11.9852

9B2u CCSD/Def2-SVP 1517 16.0290 11.8174 1.528 6700
1517 16.1516 11.9186

a Unprojected, with possible Duschinsky coupling. b Projected from Duschinsky coupling. In all cases, the excited states show -ve force constants for out of plane deformations, but the in-plane Kekulé modes are all +ve except for the first entry.
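
The wavenumber, force constant and reduced mass columns of the table are linked by the harmonic oscillator relation, wavenumber = sqrt(k/μ)/(2πc). The short sketch below, which is not part of the original calculations, checks this for the unprojected CCSD/Def2-SVP benzene entries; the function name is illustrative and the physical constants are standard values.

```python
import math

C_CM_PER_S = 2.99792458e10    # speed of light, cm/s
AMU_KG = 1.66053907e-27       # atomic mass unit, kg
MDYNE_PER_A_TO_N_PER_M = 100.0

def harmonic_wavenumber(fc_mdyne_per_angstrom, reduced_mass_amu):
    """Harmonic wavenumber (cm^-1) from a force constant (mDyne/Å) and a reduced mass (AMU)."""
    k = fc_mdyne_per_angstrom * MDYNE_PER_A_TO_N_PER_M
    mu = reduced_mass_amu * AMU_KG
    return math.sqrt(k / mu) / (2.0 * math.pi * C_CM_PER_S)

# (state, FC, reduced mass, tabulated wavenumber) for benzene, CCSD/Def2-SVP, unprojected rows.
BENZENE = [
    ("singlet", 4.9922, 4.7425, 1337),
    ("quintet", 17.4689, 11.5855, 1600),
    ("heptet", 11.0797, 10.1574, 1361),
]
for state, fc, mu, tabulated in BENZENE:
    print(f"{state}: computed {harmonic_wavenumber(fc, mu):.1f} cm^-1, tabulated {tabulated} cm^-1")
```

The relation also shows why the wavenumber alone can mislead: a change in reduced mass between the projected and unprojected modes shifts the wavenumber even when the force constant is similar, which is why the discussion focuses on the force constants.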

I will now make some short comments about the other ring systems reported above.


  1. Cyclobutadiene, dication. The Kekulé mode is very similar to benzene, but based clearly on just two π-electrons rather than six. There are two ways of forming a triplet state by exciting one of the two π-electrons to give a total π-bond order of zero. Both give a C-C distance a little longer than that typical of cyclobutanes (1.56Å). These triplet states however are not equilibrium species but transition states for the dissociation into two molecules of acetylene radical cation, a reaction driven no doubt by the large coulomb repulsions found for a di-cation.
  2. Cyclobutadiene. The singlet ground state has Jahn-Teller effects (which by the way are absent from all the other excited states reported here), but the triplet state again has a Kekulé mode very similar to that of benzene. Removing all the π-bond orders in the quintet reduces the Kekulé force constant. This is in contrast to benzene itself.
  3. Cyclobutadiene, di-anion. Only the singlet state was calculable (the excited states did not converge), and now the Kekulé force constant is distinctly lower than benzene, probably again due to coulombic repulsions of the di-anion coupled with greater Pauli repulsions of the additional electrons. One other vibrational mode is worth showing here, the Eu mode (ν 1267 cm-1) which shows interesting charge localisation into a carbon-centred anion and a delocalised allylic anion.
  4. Cyclo-octatetraene Dication. Although isoelectronic with benzene, it shows very different behaviour. As the spin-multiplicity increases, so the Kekulé force constant decreases and the bond length increases, in accordance with Badger's rule. Again, another vibration (E3u, ν 1544 cm-1) shows charge localisation to give a 1,4-separated di-cation.
  5. Cyclo-octatetraene. The triplet has a total ring π-bond order 3 (with two electrons in non-bonded orbitals) and a C-C bond length similar to benzene itself. The nonet state total ring π-bond order is reduced to 0, with a C-C length again identical to a single bond. As with the di-cation, the force constant is reduced as the bond length increases, in accordance with Badger's rule.
  6. Cyclo-octatetraene di-anion is similar to the neutral system in following Badger's rule.
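
The comparisons in the list above can be reproduced directly from the table. The sketch below uses the unprojected CCSD/Def2-SVP entries (the first row of each pair) and applies a deliberately crude test, invented for this illustration: a series is taken to follow Badger's rule if the force constant falls every time the bond length increases. The rule itself is a quantitative relationship, so this only captures its qualitative direction.

```python
# (bond length in Å, Kekulé force constant in mDyne/Å), CCSD/Def2-SVP, unprojected rows,
# listed in order of increasing spin multiplicity.
SERIES = {
    "benzene": [(1.401, 4.9922), (1.463, 17.4689), (1.550, 11.0797)],
    "cyclo-octatetraene dication": [(1.414, 21.4531), (1.444, 20.4349), (1.511, 14.8761)],
    "cyclo-octatetraene": [(1.409, 22.3143), (1.484, 18.9335), (1.553, 14.6711)],
    "cyclo-octatetraene dianion": [(1.419, 21.1621), (1.461, 19.0214), (1.528, 16.0290)],
}

def follows_badger(points):
    """True if the force constant decreases monotonically as the bond length increases."""
    ordered = sorted(points)  # sort by bond length
    constants = [fc for _, fc in ordered]
    return all(later < earlier for earlier, later in zip(constants, constants[1:]))

verdicts = {name: follows_badger(points) for name, points in SERIES.items()}
for name, verdict in verdicts.items():
    print(f"{name}: {'follows' if verdict else 'violates'} Badger's rule (qualitative test)")
```

With these numbers only the benzene series fails the test, in line with the comments above.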

I do need to insert some caveats here. The original hypothesis[1] of distortive π-electrons was based on the singlet states (both ground and excited) and the results reported here are based on higher spin states, the assumption being that these are well-described by single reference states or configurations. It may well be that these higher spin states need more complex multi-reference determinants to describe them properly, in which case the coupled-cluster calculations reported here would be inappropriate. Thus for the larger rings, some of the CC calculations either failed to converge for large basis sets or gave some unphysical force constants (i.e. huge). This does tend to suggest that the internal MP expansion performed for coupled-cluster calculations is failing to converge, a well known propensity for systems where a multi-reference determinant is needed.

So one should not conclude too firmly that only benzene itself (in this series; there are many other examples to be found in [1]) exhibits Badger’s rule violations. Nonetheless, it would be valuable in the future to know whether the concept of distortivity of π-electrons can be applied to the small ring annulenes where the π-bond orders have been progressively reduced down to zero by specifying higher-spin π-states.


In 2014, I looked at some of the historical origins of this attribution to Kekulé, and you might also want to read the fascinating discussion by others on this topic. I thank Sason Shaik for his comments on the above results!

References

  1. S. Shaik, A. Shurki, D. Danovich, and P.C. Hiberty, "A Different Story of π-Delocalization. The Distortivity of π-Electrons and Its Chemical Manifestations", Chemical Reviews, vol. 101, pp. 1501-1540, 2001. https://doi.org/10.1021/cr990363l
  2. R.M. Badger, "A Relation Between Internuclear Distances and Bond Force Constants", The Journal of Chemical Physics, vol. 2, pp. 128-131, 1934. https://doi.org/10.1063/1.1749433

Encouraging Submission of FAIR Data at the Journal of Organic Chemistry and Organic Letters

February 14th, 2020

In a welcome move, one of the American Chemical Society journals has published an encouragement to submit what is called FAIR data to the journal.[1] A reminder that FAIR data is data that can be Found (F), Accessed (A), Interoperated (I) and Re-used (R). I thought I might try to explore this new tool here.

You start at the ACS Research Data Center  with the tag line Submit your NMR Data. By this they mean the primary or “raw” NMR data as it emerges from a spectrometer. At this point I would note that primary data is not necessarily FAIR data yet. It is however a great deal more easily inter-operated and re-used than say the more conventional form of such data, which is a visual spectrum stored as a PDF file. If you did want to re-analyse the data, the primary data is the place to start, not the PDF spectrum!

The tool next asks you to drop your FID file into the upload area. Depending on the spectrometer type, this can take the form of a ZIP archive of various instrument files (typical of Bruker spectrometers) or just a single file (JDF, typical of Jeol spectrometers). The next request is for some “metadata” such as Title, Funder and Author(s), with an additional request to provide an ORCID for the latter. All these are easily provided. It was the next step where my exploration on this occasion had to stop, since the next button takes you to the Manuscript submission page, which can only be followed if you have a manuscript to complete!

What would I expect to happen next? Well, this metadata has to be augmented with molecule metadata, such as for example an InChI of the molecule. This is what would turn our primary data into fully FAIR data. To complete the process, the data and its now completed metadata descriptors would need to be Registered, in order to facilitate its discovery and hence enable the F of FAIR. This is normally completed with the DataCite registration agency, and in exchange you get a DOI corresponding to the registered metadata and you can then infer a link of the type https://data.datacite.org/application/vnd.datacite.datacite+xml/…your-allocated…DOI which allows you to inspect the metadata and search for it (see eg DOI: 10.14469/hpc/5920 for examples of such searches). Currently I do not know if this happens with this ACS tool. I would certainly like to inspect the collected metadata before I could comment on whether the title of this post is accurate, ie the encouragement of FAIR data. It would also be interesting to see what (if any) procedures are used to generate an InChI for the molecule and its NMR data, and exactly how that is also included in the metadata.
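
To make the distinction between the metadata the tool collects and the metadata needed for FAIR data a little more concrete, here is a hypothetical completeness check. Everything in it is an assumption for illustration: the field names, the example record and the placeholder ORCID are invented and do not describe how the ACS tool or DataCite actually store this information. The InChI shown is that of benzene.

```python
# Fields the post argues a submission needs; the names are invented for this sketch.
REQUIRED_FIELDS = ("title", "funder", "authors", "inchi")

def missing_metadata(record):
    """List what is still missing before a record could be registered as FAIR data."""
    missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
    for author in record.get("authors", []):
        if not author.get("orcid"):
            missing.append(f"orcid for {author.get('name', 'unknown author')}")
    return missing

# Hypothetical record as it might stand after the upload step (title, funder, authors only).
submitted = {
    "title": "Example NMR dataset",
    "funder": "Example funder",
    "authors": [{"name": "A. N. Author", "orcid": "0000-0000-0000-0000"}],  # placeholder ORCID
}
# The same record augmented with molecule metadata.
augmented = dict(submitted, inchi="InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H")

print(missing_metadata(submitted))
print(missing_metadata(augmented))
```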

I would also note one other crucial aspect of this process, how to enable the A of FAIR. Primary or raw NMR data is entirely opaque (the files themselves are often binary encoded files) and you do need a tool to transform this data into visual or spectral form. So you will need to acquire such a tool, most often in the form of software such as MestreNova or Topspin. This can be a complex process, and may well involve paying the vendors money. In this context, I would note the Mpublish tool,[2] which allows a single free-to-use license to be generated so that e.g. MestreNova can be freely used for that dataset only. Some form of suitable Access to a FAIR dataset is an essential (if often unmentioned) component of the process.

At this stage therefore, there are quite a few questions about this new ACS system which I cannot provide answers to. On these answers will depend whether the process can be truly described as the submission of FAIR data. If anyone reading this manages to complete the process above, do please describe the subsequent experiences. I fancy there will have to be a future follow up to this post! Meanwhile, if you do have a manuscript you are ready to submit, give it a go and perchance report your experiences here!

References

  1. A.M. Hunter, E.M. Carreira, and S.J. Miller, "Encouraging Submission of FAIR Data at The Journal of Organic Chemistry and Organic Letters", Organic Letters, vol. 22, pp. 1231-1232, 2020. https://doi.org/10.1021/acs.orglett.0c00383
  2. A. Barba, S. Dominguez, C. Cobas, D.P. Martinsen, C. Romain, H.S. Rzepa, and F. Seoane, "Workflows Allowing Creation of Journal Article Supporting Information and Findable, Accessible, Interoperable, and Reusable (FAIR)-Enabled Publication of Spectroscopic Data", ACS Omega, vol. 4, pp. 3280-3286, 2019. https://doi.org/10.1021/acsomega.8b03005