Archive for the ‘Chemical IT’ Category

How many of the compounds that appear in the chemical literature are mentioned just once?

Friday, June 6th, 2025

Tom recently emailed me this question: Do you know how to find out how many of the compounds that appear in the chemical literature are mentioned just once? Intrigued, I first set out to find out how many substances, as Chemical Abstracts refers to them, there were as of 5 June 2025. There is a static estimate here (219 million), but to get the most up-to-date information I asked CAS directly. They responded immediately (thanks Lee!) with 294,778,693 on the date mentioned above. It is not actually possible to answer the first question itself using CAS SciFinder, but again CAS came up with a value: “there are 113,383,649 substances in CAS Registry with only one CAplus citation”, equivalent to “38.5% of the current substances have only 1 reference.” I should add this estimate was qualified by “that can be misleading, since that includes salts, multicomponents, etc. But that’s a first pass.” I am actually impressed that as many as 61.5% are mentioned more than once, since before learning the answer I had intuitively guessed that percentage to be much lower.

My mind then went back to the year 1974, when my PhD thesis was published.[1] As part of this research, I had managed to synthesize several sterically hindered indoles, culminating in the preparation of 2-methyl-3,5-di-t-butylindole (3, R=Me) and 2,4,6-tri-t-butylindole (3, R=t-Butyl) by the route shown below (R = Me, t-Butyl; a different route also gave the same product). I was very proud of this, since my research supervisor intimated to me a few years later that he had not believed I would succeed, on the grounds that making sterically hindered systems can be quite challenging! This work was published in a journal in 1975.[2]

Next, I wanted to find out what “impact” this work has had in the intervening 50 years. Well, a CAS SciFinder search revealed that 2-methyl-3,5-di-t-butylindole (3, R=Me) was one of the 38.5% of current substances that have only one reference – to just our own work. Zero impact then! But worse was to come – 2,4,6-tri-t-butylindole (3, R=t-Butyl) did not even have one reference; as far as CAS was concerned, it was an unknown compound! So too were the precursors 2-methyl-3,5-di-t-butylaniline (1) and the anilides 2 (R=Me, t-Butyl).

The explanation can be found – at least in part – by reading our article[2] and from the computational modelling I did some forty years later.[3] We were measuring kinetic isotope effects on the rate of diazo-coupling of these indoles and had noted in the article that 2,4,6-tri-t-butylindole was so hindered it simply did not diazo-couple at any measurable rate. As a result, we did not include an experimental section detailing its synthesis (we really should have). The absence of the anilides 2 in the CAS database is perhaps understandable, since they are merely precursors to the final cyclisation and these are not always characterised as fully as final products. I have retrieved the experimental information from my PhD thesis[1] and reproduce it here so that you can see it as well. I note that the anilide (2, R=Me) is mentioned only in passing (red text below), whilst for (2, R=t-Butyl) only an m.p. and a mass spectral weight are included.

I have now set myself the challenge of finding out whether substances 1 and especially 3 (R=t-Butyl) can at least be retrospectively added to the CAS database. Watch this space!


2-Methyl-3,5-di-t-butylaniline.

Bromine (8g) was added to dimethylsulfide (3.2g) in dichloromethane (40 ml) at -46° (chlorobenzene/N2 cooling bath) with no precautions taken to exclude moisture. A yellow crystalline precipitate of bromosulfonium bromide salt was formed. 3,5-Di-t-butyl aniline (10g) and triethylamine (5g) in dichloromethane (10 ml) were added dropwise, during the course of which the yellow salt dissolved and white crystals of triethylammonium bromide were deposited. After 2 hours at -46°, a solution of sodium (2.5g) in methanol (15 ml) was added, resulting in the production of a white precipitate of sodium bromide. After 8 hours at 20° the rearrangement was essentially complete and the solution was shaken with water, the solvent separated and evaporated to give a yellow oil (12g, 95%) which crystallised on standing. δ 1.30 (9H, s), 1.47 (9H, s) 2.13 (3H, s), 4.12 (4H, br), 6.53, 6.83 (2H, dd, JAB 2Hz). m/e 265 (M+), 218 (M+-CH3S+).

Raney nickel (prepared from 210g of 50% Ni/Al alloy) was stirred with a solution of the 2-methylthiomethyl-3,5-di-t-butylaniline (32g) in ethanol (150 ml) at 70° for 1 hour. Filtration and evaporation of the solvent gave an oil which on distillation gave 2-methyl-3,5-di-t-butylaniline (66%), b.p. 126°/2.7 mm. δ 1.25, 1.38 (18H, d), 2.17 (3H, s), 3.27 (2H, s), 6.43, 6.75 (2H, dd, JAB 2Hz).

2-Methyl-3,5-di-t-butylindole.

2-Methyl-3,5-di-t-butylaniline (2g) in ether (20 ml) and triethylamine (1g) was mixed with acetyl chloride (1.2 g) in ether. After 1 hour the ether was washed with 0.01N HCl and the solvent removed to give the acetyl derivative (90%). The acetyl derivative was cyclised by potassium t-butoxide at 360° to give a melt which was boiled up with water. Ether extraction followed by crystallisation from hexane gave 2-methyl-4,6-di-t-butylindole (30%), m.p. 176°. νmax 3370, 1617, 1538, 849, 784, 755 cm-1. δ 1.35, 1.45 (18H, d), 2.37 (3H, s), 6.27 (1H, m), 6.97 (1H, s), 7.4 (1H, br, exchanges with D2O). λmax (log ε) 223 (4.35), 272 (3.95). m/e 243 (M+), 225 (M+-15). Found C, 81.95; H, 11.41; N, 6.19%. C15H25N requires C, 82.12; H, 11.48; N 6.38%.

2,4,6-Tri-t-butylindole.

2-Methyl-3,5-di-t-butylaniline was acylated with trimethylacetyl chloride in ether to give the anilide (97%), m.p. (ether) 215°, m/e 303 (M+). Fusion with potassium t-butoxide at 350° gave on cooling a solid which was treated with water, giving brown crystals of the 1:1 t-butanol complex. These were dried and sublimed very slowly at 70° to give a colourless glass (25%), pure by nmr and tlc. νmax 3450, 3310, 2960, 2870, 1645, 1600, 1370, 800 cm-1. δ 1.30, 1.35, 1.48 (27H, t), 6.25 (1H, d, 2Hz), 6.95 (1H, d, 2Hz), 7.72 (1H, s, exchanges with D2O). m/e 285 (M+), 270 (M+-15). Found C, 84.12; H, 10.97; N, 4.76%. C20H31N requires C, 84.14; H, 10.94; N, 4.90%.


References

  1. H.S. Rzepa, "Hydrogen transfer reactions of indoles", 1974. https://doi.org/10.14469/spiral/20860
  2. B.C. Challis, and H.S. Rzepa, "The mechanism of diazo-coupling to indoles and the effect of steric hindrance on the rate-limiting step", Journal of the Chemical Society, Perkin Transactions 2, pp. 1209, 1975. https://doi.org/10.1039/p29750001209
  3. H. Rzepa, "I've started so I'll finish. The mechanism of diazo coupling to indoles – forty (three) years on!", 2015. https://doi.org/10.59350/1jhn9-9v717

Referencing and citing a science-based blog post.

Tuesday, April 8th, 2025

Back in early 2012, I pondered the relationship between a science-based blog post and a science-based journal article.[1] This was in part prompted by my discovering a blog plugin called Kcite, which allowed journal articles to be appended to a blog post in the form of a numbered reference list. The only required input for Kcite was the DOI of the article (as you can see earlier in this paragraph). For around 500 posts after that moment, I always strove to add such references to my posts. Around 2016, I started including references to data in the form of repository DOIs to sit alongside the journal references, but this feature stopped working a year or two later because of changes in the metadata resolved by the DOI. Kcite itself lasted until January 2024 for this blog, when a required update to the software running the blog (WordPress) meant that it no longer worked and had to be removed as a plugin. Two years ago, Rogue Scholar (“Science blogging on steroids”) came along to the rescue.[2],[3] It provides some amazing automated features and infrastructure to blogs; I will illustrate with those listed on the top page of Rogue Scholar itself:

  1. No waiting time — blogs can join via a simple form. Blog posts are automatically archived within minutes after publication on your blog.
  2. No fees — blog posts are archived without fees to either readers or authors. Rogue Scholar is sustained by donations and sponsorships.
  3. Archived — blog posts are archived by Rogue Scholar, and semiannually by the Internet Archive Archive-It service.
  4. Findable — every blog post is searchable via rich metadata and full-text search.
  5. Citeable — every blog post is assigned a Digital Object Identifier (DOI), to make them citable and trackable. Rogue Scholar shows citations to blog posts found by Crossref.
  6. Interoperable — metadata are distributed via Crossref and ORCID, and downstream services using their metadata catalogs.
  7. Reusable — the full-text of every blog post is distributed under the terms of the Creative Commons Attribution 4.0 license.
  8. Communities — blog posts automatically become part of communities for your blog, the blog subject area, and topic communities based on blog post tags.

Part of what goes on behind the scenes is integration with CrossRef (which handles information about journal articles), and that in turn enables insights into how blogs abstracted by Rogue Scholar are cited within journal articles and other blogs, giving some idea of the impact that these blogs are making. Here I illustrate some searches enabled by having Rogue Scholar abstract a blog (a small scripted example follows the list):

  1. https://rogue-scholar.org/search?q=references:*&sort=newest This shows that Rogue Scholar has captured (currently) 2003 references abstracted from blogs.
  2. https://rogue-scholar.org/communities/rzepa/records?q=references:*&sort=newest Of these, (currently) 504 have come mostly from identifying the [4] entries in my own blog posts.
  3. https://rogue-scholar.org/search?q=citations:*&sort=newest shows all citations of the blogs in the Rogue Scholar community, currently at 519.
  4. https://rogue-scholar.org/search?q=citations:10.59350/*&sort=newest This lists the number of citations originating from the DOI prefix 10.59350 (which is that of the Rogue Scholar community).
  5. https://docs.rogue-scholar.org/dashboard lists other statistics. These are revealing, indicating that currently only 6% of posts have references, although the uptake of institutional origins (ROR) and researcher IDs (ORCID) is much better.
  6. The distribution amongst subject areas shows 6.8% in the chemical sciences.
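The scripted example mentioned above: a minimal sketch, assuming Python with the requests library, which asks the public Crossref REST API how often a given blog-post DOI has been cited (the is-referenced-by-count field). The DOI used here is simply reference [1] of this post; any Crossref-registered DOI would do, although whether citations of a particular blog post have yet propagated into Crossref is of course not guaranteed.

```python
# Minimal citation-count lookup for a blog-post DOI via the Crossref REST API.
import requests

doi = "10.59350/3pbz1-vcd67"   # reference [1] of this post, as an example
r = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
r.raise_for_status()
msg = r.json()["message"]

print(msg.get("title", ["(no title)"])[0])
print("citations recorded by Crossref (is-referenced-by-count):",
      msg.get("is-referenced-by-count", 0))
```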

Work is under way to resuscitate the Kcite plugin, so that references are once again collected at the bottom of each post. Meanwhile, such a list can instead be found in the archived version of the posts at Rogue Scholar, as for example for this post itself. Also for the future is identifying how many of the references cited in blogs relate to research objects such as journal articles, and how many are instead to data held in e.g. data repositories. Such data-reference richness in journal articles themselves is gradually increasing,[5],[6] and it is to be hoped that the same will happen in science-based blogs.

References

  1. H. Rzepa, "The blog post as a scientific article: citation management", 2012. https://doi.org/10.59350/3pbz1-vcd67
  2. M. Fenner, "Automatically list all your publications in your blog", 2013. https://doi.org/10.53731/axtz227-73n18e7
  3. M. Fenner, "Rogue Scholar now shows citations of science blog posts", 2025. https://doi.org/10.53731/4bvt3-hmd07
  4. https://doi.org/
  5. H. Rzepa, "Finding and Discovery Aids as part of data availability statements for research articles.", 2025. https://doi.org/10.59350/th26w-gev67
  6. D.C. Braddock, S. Lee, and H.S. Rzepa, "Modelling kinetic isotope effects for Swern oxidation using DFT-based transition state theory", Digital Discovery, vol. 3, pp. 1496-1508, 2024. https://doi.org/10.1039/d3dd00246b

Crystallography meets DFT Quantum modelling.

Monday, March 17th, 2025

X-ray crystallography is the technique of using the diffraction of X-rays by the electrons in a molecule to determine the positions of all the atoms in that molecule. Quantum theory teaches us that the electrons are to be found in shells around the atomic nuclei. There are two broad types, the outermost shell (also called the valence shell) and all the inner or core shells. The density of the core electrons is much higher (more compact) than the more diffuse valence shell for all but the hydrogen atom, which only has valence electrons. How does this relate to X-ray diffraction by electrons? Well, core electrons, because of their relative compactness, diffract X-rays more strongly than the valence electrons. This compactness of the core also means that its electron density distribution can be well (but not exactly) approximated by a sphere, with the nucleus at the centre of that sphere. And from this it follows that the density for each atom can be treated independently, the so-called IAM or independent atom model. For example, all the carbon atoms in a molecule are approximated as having the same value for the electron density of their core shell. But the IAM approximation is much less good for hydrogen atoms, especially when they are attached to very polar atoms (Li, O, F, etc), and even atoms such as carbon or oxygen show noticeable deviations, as illustrated in Figure 1 below.[1]


Figure 1 from [1] with caption: Deformation Hirshfeld densities for the carbon (left) and oxygen (right) atoms in the carboxylate group of Gly-l-Ala, i.e. difference between the spherical atomic electron density used in the IAM and the non-spherical Hirshfeld atom density used in Hirshfeld atom refinement=HAR (IAM minus HAR). Red = negative, blue = positive. Isovalue = 0.17 eÅ−3.

X-ray crystallography is all about matching the electron density map of a model structure with the electron density map derived from the diffraction data. In “conventional” X-ray crystallography – i.e. that used by most crystallographers – the electron density map of the model is calculated using the IAM approach, where no consideration is given to any distortion of the electron density distribution caused by things like bonds – each atom is treated independently (hence the name). This method especially struggles with hydrogens, and hence the inferred position of the hydrogen nucleus at the centre of an assumed spherical distribution is often difficult to obtain accurately. Enter quantum crystallography, whereby a model of the electron density distribution in a molecule can be calculated by solving the Schrödinger equation, nowadays to a very reasonable approximation in a reasonable time (minutes) using so-called density functional theory, or DFT. The resulting electron density map for the model structure might be expected to match reality more closely than the IAM approach. Most obviously affected by this change is the handling of hydrogen atoms. If one considers a C–H bond from an sp3 carbon atom, using the IAM approach the hydrogen atom (i.e. its nucleus or proton) would be placed at the centre of maximum electron density, in the full knowledge that this is not actually where the hydrogen nucleus itself is. The direction of the C–H vector would be correct, but the distance would be too short. In the quantum crystallography approach, the positions of e.g. hydrogen atom nuclei are not assumed to coincide with the electron density maxima, amounting in effect to non-spherical atoms, thus avoiding the systematic errors seen in the IAM approach. Smaller, but possibly still significant, such errors might be expected for e.g. the 2nd-row elements and beyond.

Getting reliable hydrogen atom positions has previously required a neutron diffraction study, which is difficult, expensive and time consuming. So the idea of using the non-spherical DFT densities rather than the spherical IAM approach to build a model from X-ray diffraction data is very appealing. But does it work? To test this, we decided to go back to some previously published structures that were handled using the IAM approach, and re-refine them using quantum crystallography. We do not have the corresponding neutron studies to check the answers against, but we can still see how well the structures themselves refine and what new problems this approach might throw up.

Method

The original published structures[2] were refined with SHELX-2014,[3] which uses an independent atom model (IAM) approach. The results reported here employed NoSpherA2[1],[4] using Hirshfeld atom refinement,[5] selecting Def2-SVP as the (all-atom) basis set and ωB97X-V as the DFT method (the results seem relatively insensitive to either), implemented in the ORCA program.[6] For the first attempts, no changes were made to the structures beyond the anisotropic refinement of the now unconstrained hydrogen atoms. For four of the structures, a number of the hydrogen atoms went non-positive definite (i.e. one of the radii of a thermal ellipsoid refined to a negative length), which is physically nonsensical and would be a significant barrier to publication. (We don’t quite want to say “unpublishable”, as there are almost always exceptions, but a non-positive definite thermal parameter is pretty close to being unacceptable.) For these cases, a second version was created (V2) in which all of the hydrogen atoms were refined isotropically, but with the distances and thermal parameters still allowed to refine. For AB1709 (18b), this still had the isotropic thermal parameter of one of the hydrogen atoms (H11) go non-positive definite, so for that one hydrogen atom the free isotropic thermal parameter was replaced with a riding one.

The results

We chose a set of seven structures published in 2017[7] and refined as noted above using conventional methods. These seven also comprise one of the very first sets of crystal structures for which full diffraction data were made available,[2] rather than just the refined structure in the form of a CIF file. The  new results have also been deposited[8] to augment the record for these compounds. Spreadsheets corresponding to the images below can be obtained by clicking on the image.

  1. All seven structures saw a reduction in the final R-factor.[8] However, all of the structures also saw a significant increase in the number of parameters (as the hydrogen atoms went from using zero parameters each in a fully riding model to nine parameters each in a fully free anisotropic model). Nevertheless, all the QM refinements passed the Hamilton test (a generic sketch of such a test follows this list), suggesting that the reduced R-factors do indeed reflect a better model, rather than just being a consequence of the significantly increased number of parameters.
  2. All four of the structures containing bromine atoms had a number of the hydrogen atoms go non-positive definite when refined anisotropically. It is not clear exactly why this happened – there does not appear to be any correlation with data quality or intensity (as crudely measured by R(int) and mean I/σ respectively), and though the redundancy for these structures is fairly low (between 1.5 and 1.7), those for the non-bromine structures are not much better (1.5, 2.3 and 4.9). These data sets were the result of experiments designed to collect 98.5% of the symmetry-unique data, with no consideration for redundancy at all. However, comparison of the initial and secondary versions of the refinements of these four structures does show that the substantial majority of the observed R-factor decrease can be achieved without using anisotropic hydrogen atoms.
  3. As regards the precision of the structures, using one C(sp2)–C(sp3) bond as a proxy (the C7–C8 bond) we can see that the estimated standard deviation is either the same or only slightly lower in all seven structures, suggesting that getting lower e.s.d.s would not be a motivating factor for using quantum crystallography.
  4. One of the more unexpected results was the variation in F(000). In X-ray crystallography (deliberate emphasis on X-ray, as neutron diffraction is different) F(000) is supposed to be the total number of electrons present in the unit cell, and is used as an overall scale factor for the electron density map. It is very much not supposed to be variable, and any discrepancy would indicate an error in the calculated or reported formula and should be corrected. We do not understand why the QM refinements give a different answer than the IAM ones (some up and some down — normalised to a per-molecule basis the range is –1.1 to +2.2), though it seems likely to be associated with cut-offs (boundaries) in measuring the “smeared out” electron density in the QM models. The IAM models all give the expected “correct” values.
  5. Based on the checkCIF reports for the QM structures, if quantum crystallography catches on in a big way, then checkCIF will probably need to be updated, there now being a number of high level alerts for long X–H bonds.
  6. One of the major areas of uncertainty with quantum crystallography is what/how much data needs to be collected. Symmetry unique data to 0.84 Å seems insufficient, but what would be sufficient — full sphere, redundancy, higher resolution? Would the final results be worth the extra time investment? None of the above aspects are clear at this stage, but it will be interesting to see how the technique develops.
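As promised in the first item above, here is a generic sketch of the Hamilton R-factor ratio test. The formula is the textbook one (Hamilton, 1965); the reflection counts, parameter counts and R-factors below are placeholders for illustration, not values taken from any of the seven structures, and scipy is assumed to be available.

```python
# Hamilton R-factor ratio test: are the extra parameters of a larger model justified?
from math import sqrt
from scipy.stats import f

def hamilton_critical_ratio(b, ndof, alpha=0.05):
    """Critical ratio R(b, ndof, alpha) = sqrt(1 + (b/ndof) * F(b, ndof, alpha))."""
    return sqrt(1.0 + (b / ndof) * f.ppf(1.0 - alpha, b, ndof))

# Placeholder numbers, purely for illustration.
n = 4500                 # reflections (observations)
m_free = 600             # parameters in the larger (free anisotropic-H) model
b = 180                  # extra parameters relative to the simpler (riding-H) model
Rw_restricted = 0.095    # weighted R of the simpler model
Rw_free = 0.088          # weighted R of the larger model

ratio = Rw_restricted / Rw_free
critical = hamilton_critical_ratio(b, n - m_free)
print(f"observed ratio {ratio:.4f} vs critical {critical:.4f}")
if ratio > critical:
    print("The R-factor drop is significant: the extra parameters are justified.")
else:
    print("The simpler model cannot be rejected at this significance level.")
```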

These seven crystal structures also occupy an interesting position for posterity. Data for them has been made available over a span of eight years, illustrating two significantly different refinement methods used during this period, and the original complete diffraction image data remains accessible to allow a completely new analysis to be made in the future. Who knows, maybe in eight years’ time an even better method will become available for comparison with the results reported here.


To put this into context, 0.17 eÅ−3 would generally be regarded as pretty low-level background noise, similar to the value of the maximum residual electron density a crystallographer might be happy with. The structure which showed the smallest change in R-factor on using quantum crystallography, i.e. AB1608b, was re-run with the triple-ζ Def2-TZVPP basis set. This did give lower R-factors, but by very little (3.38% to 3.36% aniso with npd; 3.39 to 3.38 iso).

References

  1. F. Kleemiss, O.V. Dolomanov, M. Bodensteiner, N. Peyerimhoff, L. Midgley, L.J. Bourhis, A. Genoni, L.A. Malaspina, D. Jayatilaka, J.L. Spencer, F. White, B. Grundkötter-Stock, S. Steinhauer, D. Lentz, H. Puschmann, and S. Grabowsky, "Accurate crystal structures and chemical properties from NoSpherA2", Chemical Science, vol. 12, pp. 1675-1692, 2021. https://doi.org/10.1039/d0sc05526c
  2. J. Almond-Thynne, "Crystal structure data for Synthesis and Reactions of Benzannulated Spiroaminals; Tetrahydrospirobiquinolines", 2017. https://doi.org/10.14469/hpc/2297
  3. G.M. Sheldrick, "Crystal structure refinement with SHELXL", Acta Crystallographica Section C Structural Chemistry, vol. 71, pp. 3-8, 2015. https://doi.org/10.1107/s2053229614024218
  4. O.V. Dolomanov, L.J. Bourhis, R.J. Gildea, J.A.K. Howard, and H. Puschmann, "OLEX2: a complete structure solution, refinement and analysis program", Journal of Applied Crystallography, vol. 42, pp. 339-341, 2009. https://doi.org/10.1107/s0021889808042726
  5. S.C. Capelli, H. Bürgi, B. Dittrich, S. Grabowsky, and D. Jayatilaka, "Hirshfeld atom refinement", IUCrJ, vol. 1, pp. 361-379, 2014. https://doi.org/10.1107/s2052252514014845
  6. F. Neese, "The ORCA program system", WIREs Computational Molecular Science, vol. 2, pp. 73-78, 2011. https://doi.org/10.1002/wcms.81
  7. J. Almond-Thynne, A.J.P. White, A. Polyzos, H.S. Rzepa, P.J. Parsons, and A.G.M. Barrett, "Synthesis and Reactions of Benzannulated Spiroaminals: Tetrahydrospirobiquinolines", ACS Omega, vol. 2, pp. 3241-3249, 2017. https://doi.org/10.1021/acsomega.7b00482
  8. H. Rzepa, "Crystallography meets DFT Quantum modelling", 2025. https://doi.org/10.14469/hpc/15030

Finding and Discovery Aids as part of data availability statements for research articles.

Wednesday, February 19th, 2025

Around 2016, journal publishers began including mandatory “Data Availability” statements as part of research articles; a typical (dated) example is linked here, including guidelines for how to cite the data itself. I wrote about these aspects last year in a blog post for the RSC journal Digital Discovery,[1] and here I follow up with more news.

In a recently published article about Direct Amidation Reactions,[2] the following version of a data availability statement appears: An IUPAC FAIRSpec Finding Aid for the NMR spectroscopic data is available at DOI: 10.14469/hpc/14884. A selection of data discovery searches can be found at DOI: 10.14469/hpc/14822. This introduces the concept of a Finding Aid. Put simply, knowing where the data supporting a research article is available will not necessarily lead you to the particular datum you might be looking for, especially if there is a lot of data. Data is still frequently made available in the form of a supporting document called the ESI (electronic supplementary information), and such documents can contain many tens of compounds and possibly hundreds of associated spectra. The aim of a Finding Aid is to help you find the ones you are interested in.

If you are interested in how this works, go and explore either of the two links given above. The Finding Aid tool was created by Bob Hanson as part of an IUPAC working party on how to create spectroscopic data in so-called FAIR form (the F of FAIR and the F of Finding Aid are one and the same, of course!). This represents its first deployment for a newly published article. The creation tool itself is still at the α-stage – further tools are being developed – of which more later.

References

  1. H. Rzepa, "The evolving roles of data and citations in journal articles", 2024. https://doi.org/10.26434/chemrxiv-2024-dz2dv
  2. R.J. Procter, C. Alamillo-Ferrer, U. Shabbir, P. Britton, D. Bučar, A.S. Dumon, H.S. Rzepa, J. Burés, A. Whiting, and T.D. Sheppard, "Borate-catalysed direct amidation reactions of coordinating substrates", Chemical Science, vol. 16, pp. 4718-4724, 2025. https://doi.org/10.1039/d4sc07744j

The secrets of FAIR Metadata: optimisation for Chemical Compounds.

Wednesday, December 11th, 2024

The idea of so-called FAIR (Findable, Accessible, Interoperable and Reusable) data is that each object has an associated metadata record which serves to enable the four aspects of FAIR. Each such record is itself identified by a persistent identifier known as a DOI. The trick in producing useful FAIR data is defining what might be termed the “granularity” of the data objects: the granularity that makes them most readily findable and that most usefully enables the other three attributes of FAIR.

To set the scene for how to do this optimally, I first set out two extreme examples of FAIR objects relating to chemical spectroscopy such as NMR. These will be directly associated with a journal article describing, for argument’s sake, say 50 compounds new to science, with the existence of these data objects identified via a data availability statement appended to the article. Each compound might be characterised by, say, spectroscopic and crystallographic information and perhaps some computational analysis. For the spectroscopic analysis, perhaps 5 types of NMR experiment might be included, giving a total of around 10 separate types of dataset for each compound, or in round numbers let’s say 500 data sets for the 50 compounds reported in such an article.

  • Method A: The data associated with an article takes the form of a ZIP (or other type of compressed) archive containing all 500 of the intended FAIR data sets. The resulting ZIP file is then described with a single metadata record and assigned a single DOI using e.g. the tools of a data repository. That one metadata record has the (mammoth) task of describing all of these datasets, across perhaps ten different kinds of experiment. This type of monolithic object is in fact not unusual, for several reasons. Some repositories impose a significant charge for each deposition, and so the temptation to reduce costs would be to adopt this expedient.
  • Method B: The other extreme is to literally deposit all 500 data sets separately and assign 500 DOIs, each with a separate metadata record. The issue now is less how well the metadata record can describe each dataset and more how to establish the relationships between these 501 objects (the journal article and each dataset). Such relationships could include:
    • that between the compound molecular structure and the dataset
    • that between say the dataset and the type of spectroscopic experiment (e.g. IR, MS, NMR, XRD, Comp)
    • that between different eg NMR experiments for the same compound (the nucleus, the pulse sequence, the solvent, etc).
    • These could in total represent a great many individual relationships between the 500 data sets and the article itself (formally around 501²/2, i.e. roughly 125,000!)

Before setting out our solution, I show below how a typical repository such as Zenodo handles the relationships between data objects noted above.


The relation type is selected from a controlled list of about 30, and is entered for each individual metadata record associated with a DOI. So clearly, relationships in the second category would have to be individually entered, hardly feasible for 501²/2 entries. And in the first category, only one relationship between the single large archive of data and the journal DOI can be added. Among the more important relationships in this context are the “Has part” and “Is part of” ones (diagram above).

The use of this now constitutes Method C.

  1. One starts by creating what could be called a top or level 1 entry, which will contain important core metadata such as the contributing authors, the institute where the data was obtained, the title and overall description of the datasets to come, a license, a date, a declaration of the published article associated with the data and finally the DOI of this metadata record. This top-level entry also lists all the compounds on level 2 for which data is available, each referenced by a “Has part” declaration via a DOI for that compound.
  2. Each compound on level 2 would in turn point back to level 1 via an “Is part of” metadata declaration. Each compound on level 2 would also list the spectroscopic experiments available for that compound, for example the NMR method, as part of level 3; the level 3 entry has an “Is part of” declaration pointing back to the compound’s level 2 entry.
  3. The list of the different NMR experiments on level 3 also has “Has part” declarations pointing to the individual NMR experiments on level 4.
  4. Each NMR experiment conducted on level 4 would contain an “Is part of” declaration back to level 3 and a list of “Has part” entries which describe the individual data files available for that experiment in the metadata record for level 4.

If you wish, you can inspect all “Has part”/”Is part of” declarations in the metadata records for these various levels by invoking e.g. https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/11446 (replacing e.g. 11446 by any of the DOI suffixes shown in red in the diagram below). They are all associated with this published article.[1]
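For those who prefer a programmatic route to the same information, here is a minimal sketch assuming Python with the requests library and the public DataCite REST API (the endpoint and field names reflect my understanding of that API): it lists the “Has part”/“Is part of” declarations recorded for the top-level entry used as the example above.

```python
# List the HasPart / IsPartOf relations in the DataCite metadata of a repository entry.
import requests

doi = "10.14469/hpc/11446"   # the level-1 (top) entry used as the example above
r = requests.get(f"https://api.datacite.org/dois/{doi}", timeout=30)
r.raise_for_status()
attrs = r.json()["data"]["attributes"]

print(attrs["titles"][0]["title"])
for rel in attrs.get("relatedIdentifiers") or []:
    if rel["relationType"] in ("HasPart", "IsPartOf"):
        print(f'{rel["relationType"]:9s} -> {rel["relatedIdentifier"]}')
```

Walking the “Has part” DOIs returned here down through levels 2, 3 and 4 would, in the same way, recover the whole tree described above.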

What does this use of relational parts declarations achieve? Well, compared to Method A, where everything had to be achieved within a single metadata record (and in practice never is), or Method B, where a very large number of relationships would have to be declared (and again never are), Method C achieves a good balance between the two. By collecting the metadata information into groups, one can achieve a more readily navigable structure for the information and also allow sub-groups to effectively inherit properties from the higher group.

I end by noting that far too few FAIR data collections associated with published journal articles adopt such procedures, in large part because relationships between the data items, such as the “Has part”/“Is part of” pair used above, are currently so little exploited. The repository itself has to be carefully designed to do this as automatically as possible, rather than requiring the human depositor to invoke each instance by hand (as shown for e.g. Zenodo above). An example of just such a repository is described here.[2]


The data sets themselves might be made available in more than one form (for NMR, a Bruker ZIP archive, an Mnova file, a JCAMP-DX format or just a PDF spectrum), thus increasing the number even further.
It reminds me of when I used to teach molecular orbital theory using the Hückel method, which requires a secular matrix to be diagonalised. For e.g. naphthalene, this operation would have to be conducted on a 10×10 matrix, something almost impossible by hand. However, one could use group theory to block-diagonalise this matrix into much smaller matrices with the off-diagonal elements between them set to zero, thus considerably reducing the task at hand.

References

  1. T. Mies, A.J.P. White, H.S. Rzepa, L. Barluzzi, M. Devgan, R.A. Layfield, and A.G.M. Barrett, "Syntheses and Characterization of Main Group, Transition Metal, Lanthanide, and Actinide Complexes of Bidentate Acylpyrazolone Ligands", Inorganic Chemistry, vol. 62, pp. 13253-13276, 2023. https://doi.org/10.1021/acs.inorgchem.3c01506
  2. M.J. Harvey, A. McLean, and H.S. Rzepa, "A metadata-driven approach to data repository design", Journal of Cheminformatics, vol. 9, 2017. https://doi.org/10.1186/s13321-017-0190-6

Raw data and the evolution of crystallographic FAIR data. Journals, processed and raw structure data.

Monday, March 28th, 2022

In my previous post on the topic, I introduced the concept that data can come in several forms, most commonly as “raw” or primary data and as a “processed” version of this data that has added value. In crystallography, the chemist is interested in this processed version, carried by a CIF file. However, on the rare occasions when a query arises about the processed component, this can in principle at least be resolved by taking a look at the original raw data, expressed as diffraction images. I established, with much appreciated help from the CCDC, that since 2016 around 65 datasets in the CSD (Cambridge Structural Database) have appeared with such associated raw data. The problem is how to easily reconcile the two sets of data (the raw data is not stored on the CSD), and one way of doing this is via the metadata associated with the datasets. In turn, if this metadata is suitably registered, one can query the metadata store for such associations, as was illustrated in the previous post on the topic. Here I explore the metadata records for five of these 65 sets to find out their properties, selected to illustrate the five data repositories that thus far host such data for compounds in the CSD database.

| Raw data repository | Raw data DOI | Raw data→CSD? | CSD→Raw data? | ⇐Journal⇒ |
|---|---|---|---|---|
| Zenodo | 10.5281/zenodo.4271549 | No | No | 10.1039/C6RA28567H |
| Imperial College research data repository | 10.14469/hpc/2298 | Yes | Yes | 10.1021/acsomega.7b00482 |
| RepoD, a Harvard Dataverse instance | 10.18150/repod.6628285 | No | No | 10.1021/acs.cgd.0c01252 |
| Cambridge University repository | 10.17863/CAM.21968 | No | No | 10.1016/j.inoche.2018.08.024 |
| ISIS neutron and muon source data journal | 10.5286/ISIS.E.RB1620465 | No | No | 10.1039/D0CC02418J |

Ideally, one is looking for links between the raw and processed data, expressed in the metadata in both directions. As you can see from the above, such bidirectional links are present in only one of the five sets. More common is that both the raw and the processed data contain links to the journal article where the data is discussed. Very much less common are links from the journal article to the raw data, although such links are slightly more likely to exist from the journal to the processed data. If you click on the link in any of the last three columns, a copy of the metadata will download for you to inspect. There you can verify whether the assertions made above are correct.

What the metadata records above demonstrate is a very small-scale so-called PID graph,[1] where each DOI is a node in that graph and, if a connection exists, it is shown by a line connecting the nodes. The PID graph can be extended to include a third type of node, the journal article, and then it starts to get interesting! I will investigate whether I can generate the PID graph for the above, although be prepared: it will not (yet) contain very many lines between nodes!
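As a hint of how such a graph might be generated, here is a minimal sketch, assuming Python with the requests and networkx libraries and the public DataCite REST API: it takes the five raw-data DOIs from the table above and adds an edge for every DOI-type related identifier found in their registered metadata. Whether every one of these DOIs is registered with DataCite is itself an assumption, so anything the API does not return is simply skipped.

```python
# Build a tiny PID graph from the five raw-data DOIs in the table above.
import requests
import networkx as nx

raw_data_dois = [
    "10.5281/zenodo.4271549",
    "10.14469/hpc/2298",
    "10.18150/repod.6628285",
    "10.17863/CAM.21968",
    "10.5286/ISIS.E.RB1620465",
]

G = nx.DiGraph()
for doi in raw_data_dois:
    r = requests.get(f"https://api.datacite.org/dois/{doi}", timeout=30)
    if r.status_code != 200:      # skip anything not registered with DataCite
        continue
    for rel in r.json()["data"]["attributes"].get("relatedIdentifiers") or []:
        if rel.get("relatedIdentifierType") == "DOI":
            G.add_edge(doi.lower(), rel["relatedIdentifier"].lower(),
                       relation=rel["relationType"])

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
for source, target, attrs in G.edges(data=True):
    print(f"{source} --{attrs['relation']}--> {target}")
```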

References

  1. M. Fenner, and A. Aryani, "Introducing the PID Graph", 2019. https://doi.org/10.5438/jwvf-8a66

Raw data: the evolution of FAIR data and crystallography.

Tuesday, March 1st, 2022

Scientific data in chemistry has come a long way in the last few decades. Originally entangled into scientific articles in the form of tables of numbers or diagrams, it was (partially) disentangled into supporting information when journals became electronic in the late 1990s. The next phase was the introduction of data repositories in the early noughties. Now associated with innovative commercial companies such as Figshare and later the non-commercial Zenodo, such repositories are also gradually spreading to institutional forms, such as the earlier SPECTRa project of 2006,[1] and are still evolving.[2] Perhaps the best known, and certainly the oldest, example of curated data in chemistry is the CCDC (Cambridge Crystallographic Data Centre) CSD (Cambridge Structural Database), which has been operating for more than 55 years now. Curation here is the important context, since there you will find crystal diffraction data which has been refined into a structural model, firstly by the authors reporting the structure and then by the CSD, which, amongst other operations, validates the associated data using a utility called CheckCIF.[3] What perhaps is not realised by most users of this data source is that the original or “raw” data, as obtained from an X-ray diffractometer and from which the CSD data is derived, is not actually available from the CSD. This primary form of crystallographic data is the topic of this post.

Most chemical data now emerges from an instrument, where it is already partially processed internally before being offered. Such raw/primary data is perhaps best known in the form of NMR information, where it is offered by the instrument as an FID or free induction decay. Its transformation from this form into what all chemists know as a spectrum requires further software processing, including other operations such as peak integration. It is this processed spectrum that has traditionally been offered as part of a scientific article (often only in visual, or peak-listed, form), and rarely has the FID form been made available to anyone interested. It is important to state that the transformation to a spectrum also incurs significant loss of data. An interesting project led by the editors of two organic chemistry journals[4],[5] had the aim of encouraging the submission of FAIR data to the journal, although in fact the project concentrated on the submission of raw NMR data. As it turned out, only a very small proportion of all the submissions to these journals over the period of a year actually provided such data (~113 datasets), in the form of ZIP archives containing anywhere between one and ~100 actual sets of raw NMR data per archive. One should make the point that raw data is not necessarily FAIR data. The latter requires rich metadata describing the data to become findable, accessible, interoperable and reusable (FAIR), and such metadata was not actually generated as part of this project.

Here I will take a closer look at potentially FAIR raw data in the area of crystallography. This project is perhaps less well known than the previous one,[4],[5] hence the present post strives to make it better known. As with NMR, a useful starting point is to describe the various stages in the lifecycle of crystal data.

  1. A crystal is mounted in the diffractometer and X-ray diffraction images are recorded. These are considered the raw data and, as with most instruments, their form is determined both by the instrument itself and by the software used to start the refinement process into a molecular structure.
  2. This refinement then assigns a space group to the data and derives so-called structure factors or hkl data. This data can now be captured in a much more standard form known as a CIF (crystallographic information file) and is nowadays the format that is deposited with the CSD.
  3. A reduced form of the CIF file, containing a sub-set of the information but lacking the hkl data, is much the more common, and was the form originally sent to the CSD until a few years ago.
  4. Very often an image of the resulting model for the molecular structure is also included. Whilst it is based on the data in the CIF file, it does not contain reusable data as such and is considered as being made available only for human use and perception.

It is form 1 that is missing from the CSD datasets. Because it can be quite large (~0.5-9 Gbyte), the current recommendation is that it is not stored on the CSD but on local data repositories. So now we see a need to establish, if possible, bidirectional links between type 1 and types 2-4 and to identify what characteristics of FAIR each has. Primarily, the F (findable) of FAIR will be explored here. This is done by illustrating some searches for this data, based on the metadata registered for it with DataCite.

  1. https://commons.datacite.org/?query=relatedIdentifiers.relatedIdentifier:10.5517/ccdc.csd*  (72 works)
    This simple search identifies any entry in any repository which cites in its metadata record the DOI for an entry in the CSD, taking the form 10.5517/ccdc.csd*, which is common to all entries (a scripted equivalent of this search is sketched after this list).
  2. https://commons.datacite.org/?query=relatedIdentifiers.relatedIdentifier:*10.5517/ccdc.csd*+AND+(media.media_type:chemical/x-cif+OR+media.media_type:application/x-7z-compressed+OR+media.media_type:application/gzip+OR+media.media_type:application/zip) (8 works).
    This further constrains the previous search by requiring one of four media types to ALSO be present in the repository metadata record. These types are the standard compressed archives in which the raw crystal data is likely to be stored, along with a CIF entry that is clearly associated with crystal structure data. The Boolean OR indicates that any one of them can be present! One can now be a little more certain that these entries contain crystal structure data. That we cannot be absolutely certain is clearly a current deficiency of the metadata present for the entries!
  3. https://commons.datacite.org/?query=identifier:*10.5517/ccdc.csd*+AND+(relatedIdentifiers.relatedIdentifier:*10.14469/hpc/*) (7 works)
    The 8 works from the previous search originate from a repository with the prefix 10.14469/hpc/*, and so now one can reverse the direction and ask how many are referenced in the metadata for each published item in the CSD. Around 327,064 entries in the CSD currently have a persistent DOI identifier associated with them, all starting with 10.5517/ccdc.csd (this is only around 25% of the total depositions there, however), and so now one can search for how many of these also reference a related identifier at 10.14469/hpc/*. Seven of them show up there.
  4. Also in the CSD metadata records is an item with the attribute relationType=”IsDerivedFrom” carrying the meaning that the CSD data is itself derived from (raw) data held elsewhere. This information is captured during the deposition process with CCDC as per below.

    It should be possible to incorporate this property into a search as above, but it is currently not working. When that is sorted, I will add it as a further search here. This will give more idea of how many datasets in the CSD are actually associated with additional raw data (CCDC tell me it is around 65).
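The scripted equivalent promised in the first item of the list: a minimal sketch, assuming Python with the requests library and the public DataCite REST API. The query string mirrors the DataCite Commons search above; the parameter names used here reflect my understanding of that API rather than anything taken from this post.

```python
# The same query as search 1 above, scripted against the DataCite REST API.
import requests

query = "relatedIdentifiers.relatedIdentifier:10.5517/ccdc.csd*"
r = requests.get("https://api.datacite.org/dois",
                 params={"query": query, "page[size]": 25}, timeout=30)
r.raise_for_status()
payload = r.json()

print("total matches:", payload["meta"]["total"])
for rec in payload["data"]:
    attrs = rec["attributes"]
    title = attrs["titles"][0]["title"] if attrs.get("titles") else "(untitled)"
    print(rec["id"], "-", title)
```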

So these projects aiming to capture data from chemical instrumentation are just starting to reveal the potential of this modern system for storing data in two or more locations and reconciling the various forms of this data, from raw form to derived or processed data. The interested user can then use whichever form is most relevant to their needs and, having found one form, can then trace back to the other form(s). We might anticipate many developments in this area in the near future.


One has to expand the archive to find out how many actual raw datasets are inside, which is not ideal. 


This post has DOI: 10.14469/hpc/10177


References

  1. J. Downing, P. Murray-Rust, A.P. Tonge, P. Morgan, H.S. Rzepa, F. Cotterill, N. Day, and M.J. Harvey, "SPECTRa: The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories", Journal of Chemical Information and Modeling, vol. 48, pp. 1571-1581, 2008. https://doi.org/10.1021/ci7004737
  2. M.J. Harvey, A. McLean, and H.S. Rzepa, "A metadata-driven approach to data repository design", Journal of Cheminformatics, vol. 9, 2017. https://doi.org/10.1186/s13321-017-0190-6
  3. A.L. Spek, "Structure validation in chemical crystallography", Acta Crystallographica Section D Biological Crystallography, vol. 65, pp. 148-155, 2009. https://doi.org/10.1107/s090744490804362x
  4. A.M. Hunter, E.M. Carreira, and S.J. Miller, "Encouraging Submission of FAIR Data at <i>The Journal of Organic Chemistry</i> and <i>Organic Letters</i>", The Journal of Organic Chemistry, vol. 85, pp. 1773-1774, 2020. https://doi.org/10.1021/acs.joc.0c00248
  5. A.M. Hunter, E.M. Carreira, and S.J. Miller, "Encouraging Submission of FAIR Data at <i>The Journal of Organic Chemistry</i> and <i>Organic Letters</i>", Organic Letters, vol. 22, pp. 1231-1232, 2020. https://doi.org/10.1021/acs.orglett.0c00383

Data base or Data repository? – A brief and very selective history of data management in chemistry.

Wednesday, January 26th, 2022

Way back in the late 1980s or so, research groups in chemistry started to replace the filing of their paper-based research data by storing it in an easily retrievable digital form. This required a computer database and initially these were accessible only on specific dedicated computers in the laboratory. These gradually changed from the 1990s onwards into being accessible online, so that more than one person could use them in different locations. At least where I worked, the infrastructures to set up such databases were mostly not then available as part of the standard research provisions and so had to be installed and maintained by the group itself. The database software took many different forms and it was not uncommon for each group in a department to come up with a different solution that suited its needs best. The result was a proliferation of largely non-interoperable solutions which did not communicate with each other. Each database had to be searched locally and there could be ten or more such resources in a department. The knowledge of how the system operated also often resided in just one person, which tended to evaporate when this guru left the group.

After the millennium, two newcomers started to appear, one being called an ELN (electronic laboratory notebook) and the second a data repository. The first was a heavily customised database containing research data as obtained from instruments, computers, images/video, chemical structure drawings etc. ELNs, even to this day, have limitations of interoperability with other ELNs, and the contents of an ELN are often closed, requiring authentication credentials to access. The data repository also started to appear in chemistry around this period. Even in its early incarnations, it could be associated with an ELN “front end” as part of the data pipeline; an early example of this coupling is described here.[1] Another key phrase that became associated with repositories starting around 2014 was the concept of FAIR, including ideas such as the Findability (discoverability) and Interoperability of data, a theme often explored and illustrated on this blog.

These last seventeen years have seen organisations such as funding agencies and publishers increasingly mandating the use of such data management methods, using either a repository on its own or a combination of an ELN and repository as routine operations in research activity and publication processes. The close coupling of an ELN and repository is still, however, uncommon.

A colleague recently alerted me to a computational chemistry repository first launched in 2014: www.iochem-bd.org. Reading the about text, I found these statements:

  • Chem-BD is a digital repository aimed to manage and store Computational Chemistry files.
  • Goals: Build a distributed database of computational chemistry results: reduce size and increase value.
  • Set a common data standard among all quantum chemistry legacy formats (XML – CML[2])

So this is both a database and a data repository, as well as espousing a commendable common data standard![2] I decided to explore the first two aspects here using this resource as an example.

  • Whilst the absolute distinction between the two types can be blurry, the crucial difference between the two is that a database functions on curation via a structured index of the data, whilst a repository aspires to having FAIR attributes primarily through its metadata as exposed by registration (metadata is data describing the data).
  • A database holds this data index locally and the Findability of the data is associated purely with the functionality of the database. The data structures are defined by a database schema, describing in detail all the terms indexed (a key and its value) and searched using the values of these key pairs. This schema is unlikely to be exactly the same as that of e.g. databases on related topics, largely because the database is self-contained and self-consistent.
  • A data repository also uses a schema (DOI: 10.14454/3w3z-sa82 and[3]) to express the key pairs, but this time it is expressed as metadata. Now, this metadata is registered externally to the repository using a registration agency.[3] The metadata for each deposited object is assigned a persistent identifier known as a DOI. Although it might be indexed and searchable locally, it must also be capable of being searched in aggregated/federated form using services provided by registration or other agencies. This independence of the metadata is part of those FAIR criteria.
  • Whereas a database can be very finely grained in order to describe individual properties of an object, repository metadata tends to be more coarsely grained to describe the object as a whole, to place it in context and to impart provenance.
  • Both databases and repositories can have what is called an API (application programmer interface) to allow machine access (the A of FAIR) to the contents. Accessing the former would normally require bespoke code to be written and possibly authentication credentials, whereas the information needed to access repository-held data is provided via the registered metadata (which does not normally require credentials). Access to the repository may also require code, but if the metadata is carefully standardised by adherence to the schema, the code can be made more general than that required for a database.
  • A typical entry in the www.iochem-bd.org repository has a DOI of 10.19061/iochem-bd-4-36
  • This DOI is registered with the CrossRef agency, the one normally used for registering journal articles, rather than DataCite, which is used for registering data and other research objects. The metadata for this DOI can be viewed using the resolution service https://api.crossref.org/works/10.19061/iochem-bd-4-36/transform/application/vnd.crossref.unixsd+xml and shows that it largely contains the bibliographic information typical of a journal article. So in this sense it is certainly a repository, but one using a metadata schema that is more frequently used for journal articles than for data sets (a short scripted example follows this list).
  • The CrossRef metadata record also has an item <resource>https://www.iochem-bd.org/handle/10/235025</resource> which points to the so-called landing page for that item, but information about the properties of the actual data itself must instead be obtained directly from the repository.
  • Because the metadata describing the data is only held at this repository and not elsewhere (a local metadata record), it can only be queried locally, and the query cannot be made upon aggregated metadata provided by the registration agency. A machine query would have to be constructed by coding a suitable request using the API provided for the database aspect of this repository.
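The short scripted example promised above: a minimal sketch, assuming Python with the requests library, which retrieves the registered metadata for the Crossref-registered iochem-bd entry mentioned above and for a DataCite-registered data DOI (one taken from the footnotes below) via DOI content negotiation. The CSL-JSON media type is used because, as I understand it, both registration agencies support it; treat the exact fields printed as assumptions.

```python
# Retrieve registered metadata for two DOIs via doi.org content negotiation,
# one registered with Crossref (journal-article-style schema) and one with DataCite.
import requests

ACCEPT = {"Accept": "application/vnd.citationstyles.csl+json"}
for doi in ("10.19061/iochem-bd-4-36",   # the iochem-bd entry discussed above (Crossref)
            "10.14469/hpc/10059"):       # a data DOI from the footnotes (DataCite)
    r = requests.get(f"https://doi.org/{doi}", headers=ACCEPT, timeout=30)
    r.raise_for_status()
    meta = r.json()
    print(doi, "->", meta.get("title"), "| publisher:", meta.get("publisher"))
```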

This example has served to highlight just a few of the often quite subtle distinctions between e.g. a database and a data repository, and that some examples can indeed be both. It also highlights that repositories can have the attributes of FAIR, which in themselves are driven by asking “what could a machine do to obtain data?” rather than what a human could achieve by browsing. So another question that arises when evaluating the characteristics of a repository is whether each item held there has a FAIR-enabling metadata record describing the data, a record which is registered in a manner that can be aggregated and hence used to find and access content across multiple independent repositories.


This post has DOI 10.14469/hpc/10043


Indeed in that era, few online/Internet infrastructures were available as part of departmental resources. See also here.

In this last regard, I note a workshop devoted largely to such interoperability and machine access in chemistry coming up soon: https://www.cecam.org/workshop-details/1165

The CrossRef schema is not referenced using an assigned DOI: data.crossref.org/reports/help/schema_doc/5.3.1/.

An example can be seen at DOI: 10.14469/hpc/10059. Here, invoking a hyperlink based purely on the data DOI and the required data media type in turn calls code (Javascript) which retrieves the metadata held for that DOI and parses it to identify whether it indicates the presence of a file manifest. If it does, it identifies the type of manifest (ORE in this case) and the media types the manifest points to, and finally uses that manifest to retrieve data filtered by media type and pipe it into a visualiser (JSmol). In this case the endpoint is visualisation, but it could also be e.g. piped into an AI/ML program for analysis. In this case only one instance of data is machine retrieved, but in principle it could be a multitude of data files obtained from a multitude of different locations and based on a multitude of criteria as filtered by suitable searches of registered metadata.[4]


References

  1. M.J. Harvey, N.J. Mason, and H.S. Rzepa, "Digital Data Repositories in Chemistry and Their Integration with Journals and Electronic Notebooks", Journal of Chemical Information and Modeling, vol. 54, pp. 2627-2635, 2014. https://doi.org/10.1021/ci500302p
  2. P. Murray-Rust, and H.S. Rzepa, "Chemical Markup, XML, and the Worldwide Web. 1. Basic Principles", Journal of Chemical Information and Computer Sciences, vol. 39, pp. 928-942, 1999. https://doi.org/10.1021/ci990052b
  3. H. Cousijn, T. Habermann, E. Krznarich, and A. Meadows, "Beyond data: Sharing related research outputs to make data reusable", Learned Publishing, vol. 35, pp. 75-80, 2022. https://doi.org/10.1002/leap.1429
  4. H.S. Rzepa, and S. Kuhn, "A data‐oriented approach to making new molecules as a student experiment: artificial intelligence‐enabling FAIR publication of NMR data for organic esters", Magnetic Resonance in Chemistry, vol. 60, pp. 93-103, 2021. https://doi.org/10.1002/mrc.5186

Quantum chemistry interoperability (library): another step towards FAIR data.

Saturday, January 1st, 2022

To be FAIR, data has to be not only Findable and Accessible, but straightforwardly Interoperable. One of the best examples of interoperability in chemistry comes from the domain of quantum chemistry. This strives to describe a molecule by its electron density distribution, from which many interesting properties can then be computed. The process is split into two parts:

  1. Computation of the wavefunction. This can be a very compute-intensive process, which can take quite a few days even using 64 or more processors in parallel, and requires highly specialised programs to achieve this.
  2. Analysis of the wavefunction. The range of properties that can be computed is impressively large, but again this requires specialised algorithms and programs.

So one can see that the ability to pass (Interoperate) the wavefunction data computed in step 1 into the analysis of step 2 is crucial. This is normally achieved using intermediate data files, and clearly the semantics of the data in these files must be communicated perfectly between the two processes.

With this introduction over, my attention was drawn to a recent post on the CCL (Computational Chemistry List, http://www.ccl.net), a veritable resource that has been running for many decades and where many aspects of computational chemistry are discussed. One recent thread relates to quantum chemistry interoperability (http://www.ccl.net/cgi-bin/ccl/day-index.cgi?2021+12+30), where many interesting points were made. I highlight just a few here, but urge you to read the entire thread.

  1. The first, by Mike Frisch (http://www.ccl.net/cgi-bin/ccl/message-new?2021+12+30+003), introduces two interoperability formats (the binary array file formats), along with a library of routines in both Fortran and Python which facilitates interoperability between wavefunction-calculating programs and post-processing analysis programs. The advantages of this include “Like the fchk file, this is a self-defining file, but it is binary so that full precision can be retained and reading/writing the file is much faster”, and it is described at https://gaussian.com/interfacing/. Output in this format is controlled by the keyword Output=MatrixElement or by use of environment variables. As a long-time user of an older interoperability mechanism, the so-called WFN and WFX formats for use with programs such as AIMALL and MultiWFN, I have often set this keyword to e.g. Output=wfn; when generated, such files are routinely included in our FAIR data publications, which are often mentioned both in this blog and in the journal articles we write. If you read the post by Mike, you will understand both the deficiencies of these earlier formats and how the binary array file is an important advance. (A minimal sketch illustrating the self-defining layout of the older fchk format appears after this list.)
    • I make one “user interface plea” here, in the hope that Gaussian might be able to do something about it. By default, the output keyword is not set, so no wavefunction data is produced other than a binary .CHK file, which in turn requires an extra step to convert it into the interoperable non-binary .FCHK file. When I need a WFN file I very often forget to set the output keyword and have to re-run the program to obtain it. So my plea is to consider setting the program defaults to write out some form of the binary array file when the job completes. There are additional flags that can be set for specialised applications, but if a default option were practical it would be good to have.
  2. The second email is a response to Mike’s post by Tian Lu, who is well known for his amazing “swiss army knife” program Multiwfn, which can compute a large variety of molecular properties from wavefunction files. He had in fact already proposed his own interoperability format, called mwfn (documented here[1]), to eliminate many of the recognised issues with the older WFN, FCHK and WFX formats. Currently this format is not yet widely supported by wavefunction-computing programs such as Gaussian, but perhaps Output=mwfn will come one day!
  3. The third, a later email, describes the Trexio project (https://trex-coe.github.io/trexio/ and specifically https://trex-coe.github.io/trexio/trex.html), in which a metadata group is specifically identified because “we need to give the possibility to the users to store some metadata inside the files.” In fact, metadata is also useful for registration with metadata agencies.
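
As a concrete, if much simplified, illustration of what “self-defining” means in practice, here is a short Python sketch of my own that reads just the scalar entries of a formatted checkpoint (.fchk) file. It is emphatically not the Fortran/Python interoperability library referred to above; it merely shows the label/type/value layout that lets a program read such a file without any external documentation.

# Minimal sketch: read the scalar (single-value) entries of a Gaussian .fchk file.
# Each entry is a descriptive label, a type code (I = integer, R = real) and either
# a value or an array length (N=...); only the scalar entries are collected here.
def read_fchk_scalars(path):
    scalars = {}
    with open(path) as fh:
        for line in fh:
            tokens = line.split()
            # array headers contain "N=" and data blocks are bare numbers; both fail this test
            if len(tokens) < 3 or tokens[-2] not in ("I", "R"):
                continue
            label = " ".join(tokens[:-2])
            try:
                scalars[label] = int(tokens[-1]) if tokens[-2] == "I" else float(tokens[-1])
            except ValueError:
                pass
    return scalars

# Example usage (file name hypothetical):
# info = read_fchk_scalars("mychk.fchk")
# print(info["Number of basis functions"], info["Total Energy"])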

This increasing discussion of Interoperability in Quantum Chemistry has to be warmly welcomed. It directly feeds into FAIR data and may even set a trend for other areas of chemistry, such as NMR spectroscopy!


I have now learnt that inserting one of the environment variable settings below into the job script will achieve this:

export GAUSS_OMDEF=fortranbinaryarray.faf
or
export GAUSS_ORDEF=rawbinaryarray.baf

(The proposed media types are chemical/x-rawbinaryarray for .baf and chemical/x-fortranbinaryarray for .faf.)

Currently writing both at the same time is not supported (G16 rev. C.01), so the second file can instead be generated from a .chk file using one of the post-processing commands appended to the job script:

formchk -raw mychk.chk rawbinaryarray.baf
or
formchk -mat mychk.chk fortranbinaryarray.faf


This post has DOI: 10.14469/hpc/10043


References

  1. T. Lu, and Q. Chen, "mwfn: A Strict, Concise and Extensible Format for Electronic Wavefunction Storage and Exchange", 2021. https://doi.org/10.26434/chemrxiv-2021-lt04f-v5

First came Molnupiravir – now there is Paxlovid as a SARS-CoV-2 protease inhibitor. An NCI analysis of the ligand.

Saturday, November 13th, 2021

Earlier this year, Molnupiravir hit the headlines as a promising antiviral drug. It is now followed by Paxlovid, the first small molecule aimed by design at a SARS-CoV-2 protein (its main protease), and which is reported as greatly reducing the risk of hospitalization or death when given within three days of symptoms appearing in high-risk patients.

The Wikipedia page (first created in 2021) will display a pretty good JSmol 3D model of this, the coordinates being generated automatically on the fly from a SMILES string, which specifies only which atoms in the structure are connected by bonds. Given that the structure of this molecule as embedded in the SARS-CoV-2 main protease[1] has been determined (and can be viewed here), I thought I might display those crystallographic coordinates as an alternative to the Wikipedia/JSmol-generated structure.
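
As an aside, the kind of on-the-fly 3D generation performed for the Wikipedia page can be reproduced offline. Below is a minimal sketch using RDKit, which is my choice of tool and not necessarily what the wiki infrastructure itself uses; the SMILES string is deliberately left as a placeholder to be pasted in from the Wikipedia page.

# Minimal sketch: generate approximate 3D coordinates from a SMILES string with RDKit.
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "..."  # placeholder: paste the SMILES given on the Wikipedia page here
mol = Chem.MolFromSmiles(smiles)
if mol is None:
    raise ValueError("missing or invalid SMILES string")
mol = Chem.AddHs(mol)                      # explicit hydrogens before embedding
AllChem.EmbedMolecule(mol, randomSeed=1)   # distance-geometry 3D embedding
AllChem.MMFFOptimizeMolecule(mol)          # quick force-field clean-up of the geometry
print(Chem.MolToMolBlock(mol))             # MOL-format coordinates, e.g. for display in JSmol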

Click to get 3D model

I extracted the ligand from the PDB structure file and then added hydrogens manually to obtain the above result. There are three noteworthy points about these representations:

  1. A mystery concerns the nominal C≡N group at the top right, which displays an angle at the carbon of 117°. A cyano group is of course linear (180°). This is not a defect of the crystal structure determination, but an indication of a rather stronger interaction occurring (as indeed noted[1]). The distance between the carbon of the cyano group and an adjacent sulfur is 1.814 Å, which indicates a covalent bond has formed to the cyano group. The nitrogen of the erstwhile cyano group is 3.013 Å away from an adjacent NH group, which suggests it is stabilised by a hydrogen bond. (A short sketch showing how such distances and angles can be measured from the extracted coordinates follows this list.)
  2. Crystal structure searching of units with S…C…N in which the N has only one bond reveals zero hits, but searches of S…C…NH reveal nine hits, with S…C distances in the range 1.74–1.80 Å and C…N distances in the region 1.25–1.27 Å. The reported C-N distance is 1.251 Å, confirming that when bound to the protein, the cyano group is replaced by an S-C=NH group and hence is clearly an important component of the mode of action of Paxlovid.
  3. The conformation of Paxlovid is in one respect not fully represented by the Wikipedia diagram, as shown below, which implies that the t-butyl group (on the left) is well separated from the pyrrolidinone ring system at the right of the molecule.

    In fact the two groups are adjacent, held in that conformation probably by a combination of weak dispersion forces and a contribution from the surrounding protein in the crystal structure. This is shown more graphically by the NCI (non-covalent-interaction) diagram below (DOI: 10.14469/hpc/9964), where the green areas in the region between the two groups (ringed in red) represent stabilising interactions between them. You might also spot other green/cyan regions indicating additional weak hydrogen bonds between C-H groups and oxygen!
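
As noted in point 1 above, geometric measurements such as the 117° angle at the nitrile carbon and the S…C contact distance can be made directly from the extracted ligand coordinates. The sketch below shows the simple vector arithmetic involved; the coordinates given are placeholders, not the crystallographic values.

# Minimal sketch: distance and angle from Cartesian coordinates (in Angstrom).
import numpy as np

def distance(a, b):
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def angle(a, b, c):
    # angle a-b-c in degrees, with b as the vertex atom
    v1 = np.asarray(a) - np.asarray(b)
    v2 = np.asarray(c) - np.asarray(b)
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))))

# placeholder coordinates for the ring carbon, nitrile carbon, nitrile nitrogen and cysteine sulfur
C_ring, C_nitrile, N_nitrile, S_cys = [0.0, 0, 0], [1.45, 0, 0], [2.1, 1.0, 0], [1.9, -1.6, 0.4]
print(f"angle at C: {angle(C_ring, C_nitrile, N_nitrile):.1f} degrees")
print(f"S...C distance: {distance(S_cys, C_nitrile):.3f} Angstrom")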

PAXLOVID NCI analysis

There are only a small number of crystal structures of small molecules containing the S-C=NH motif. I will try to find out how common this is in protein-ligand structures.


There are many tools for performing this operation; I used the following procedure. I downloaded the structure file from the PDB (https://files.rcsb.org/download/7vh8.cif), opened it in CSD Mercury, selected the ligand (by identifying the CF3 group and clicking on one atom), inverted the selection so that everything but the ligand was selected and then, using Edit/Structure, deleted the selected atoms, leaving only the ligand.
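
A scripted alternative to this interactive procedure is sketched below, using Biopython to extract just the ligand from the downloaded mmCIF file. The three-letter ligand code is a placeholder and would need to be replaced by the actual chemical component identifier used in the 7VH8 entry; hydrogens would still need to be added separately, as described above.

# Minimal sketch: write out only the ligand from the deposited mmCIF file using Biopython.
from Bio.PDB.MMCIFParser import MMCIFParser
from Bio.PDB.PDBIO import PDBIO, Select

LIGAND_CODE = "XXX"  # placeholder: the three-letter chemical component ID of the ligand in 7VH8

class LigandOnly(Select):
    def accept_residue(self, residue):
        # hetero residues have ids of the form ("H_XXX", sequence number, insertion code)
        return residue.id[0] == f"H_{LIGAND_CODE}"

parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("7vh8", "7vh8.cif")  # file downloaded from files.rcsb.org

io = PDBIO()
io.set_structure(structure)
io.save("ligand_only.pdb", select=LigandOnly())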

Postscript

The cyanopyrrolidine group, such as that in Paxlovid, is well known as a specific probe.[2],[3],[4] CovalentInDB is a comprehensive database facilitating the discovery of such covalent inhibitors[5] and is available here. There is also a program called DataWarrior that is potentially able to find such probes.

References

  1. Y. Zhao, C. Fang, Q. Zhang, R. Zhang, X. Zhao, Y. Duan, H. Wang, Y. Zhu, L. Feng, J. Zhao, M. Shao, X. Yang, L. Zhang, C. Peng, K. Yang, D. Ma, Z. Rao, and H. Yang, "Crystal structure of SARS-CoV-2 main protease in complex with protease inhibitor PF-07321332", Protein & Cell, vol. 13, pp. 689-693, 2021. https://doi.org/10.1007/s13238-021-00883-2
  2. N. Panyain, A. Godinat, A.R. Thawani, S. Lachiondo-Ortega, K. Mason, S. Elkhalifa, L.M. Smith, J.A. Harrigan, and E.W. Tate, "Activity-based protein profiling reveals deubiquitinase and aldehyde dehydrogenase targets of a cyanopyrrolidine probe", RSC Medicinal Chemistry, vol. 12, pp. 1935-1943, 2021. https://doi.org/10.1039/d1md00218j
  3. N. Panyain, A. Godinat, T. Lanyon-Hogg, S. Lachiondo-Ortega, E.J. Will, C. Soudy, M. Mondal, K. Mason, S. Elkhalifa, L.M. Smith, J.A. Harrigan, and E.W. Tate, "Discovery of a Potent and Selective Covalent Inhibitor and Activity-Based Probe for the Deubiquitylating Enzyme UCHL1, with Antifibrotic Activity", Journal of the American Chemical Society, vol. 142, pp. 12020-12026, 2020. https://doi.org/10.1021/jacs.0c04527
  4. C. Bashore, P. Jaishankar, N.J. Skelton, J. Fuhrmann, B.R. Hearn, P.S. Liu, A.R. Renslo, and E.C. Dueber, "Cyanopyrrolidine Inhibitors of Ubiquitin Specific Protease 7 Mediate Desulfhydration of the Active-Site Cysteine", ACS Chemical Biology, vol. 15, pp. 1392-1400, 2020. https://doi.org/10.1021/acschembio.0c00031
  5. H. Du, J. Gao, G. Weng, J. Ding, X. Chai, J. Pang, Y. Kang, D. Li, D. Cao, and T. Hou, "CovalentInDB: a comprehensive database facilitating the discovery of covalent inhibitors", Nucleic Acids Research, vol. 49, pp. D1122-D1129, 2020. https://doi.org/10.1093/nar/gkaa876