Quantum chemistry interoperability (library): another step towards FAIR data.

January 1st, 2022

To be FAIR, data has to be not only Findable and Accessible, but straightforwardly Interoperable. One of the best examples of interoperability in chemistry comes from the domain of quantum chemistry. This strives to describe a molecule by its electron density distribution, from which many interesting properties can then be computed. The process is split into two parts:

  1. Computation of the wavefunction. This can be a very compute-intensive process, which can take quite a few days even using 64 or more processors in parallel, and it requires highly specialised programs.
  2. Analysis of the wavefunction. The range of properties that can be computed is impressively large, but again this requires specialised algorithms and programs.

So one can see that the need to Interoperate wavefunction data computed during process 1 into analysis in process 2 is crucial. This is normally achieved using intermediate data files, and clearly the semantics of the data in these files must be perfectly communicated between the two processes.

With this introduction over, my attention was drawn to a recent thread on the CCL (Computational Chemistry List, http://www.ccl.net), a veritable resource that has been running for many decades and where many aspects of computational chemistry are discussed. This thread relates to quantum chemistry interoperability (http://www.ccl.net/cgi-bin/ccl/day-index.cgi?2021+12+30) and many interesting points were made in it. I highlight just two here (but urge you to read the entire thread).

  1. The first, by Mike Frisch (http://www.ccl.net/cgi-bin/ccl/message-new?2021+12+30+003), introduces two interoperability formats (the binary array file format) along with a library of routines in both Fortran and Python which facilitate interoperability between wavefunction-calculating programs and post-processing analysis programs. The advantages of this include “Like the fchk file, this is a self-defining file, but it is binary so that full precision can be retained and reading/writing the file is much faster”, and the format is described at https://gaussian.com/interfacing/. Output in this format is controlled by the keyword Output=MatrixElement or by the use of environment variables. As a long-time user of an older interoperability mechanism, the so-called WFN and WFX formats for use with programs such as AIMALL and Multiwfn, I have often set this keyword to e.g. Output=wfn, and when generated, such files are routinely included in our FAIR data publications, which are often mentioned both in this blog and in the journal articles we write. If you read the post by Mike, you will understand both the deficiencies of these earlier formats and how the binary array file is an important advance.
    • I make one “user interface plea” here in the hope that Gaussian might be able to do something about it. By default, the output keyword is not set and so no wavefunction data is produced other than a binary .CHK file. This in turn requires an extra step to convert it into the interoperable non-binary .FCHK file. When needing a WFN file, I very often forget to set the output keyword to a value and have to re-run the program to obtain it. So my plea is to consider setting the program defaults to write out some form of the binary array file when the job completes. There are additional flags that can be set for specialised applications, but assuming a default option would be practical, it would be good to have.
  2. The second email is a response to Mike’s post by Tian Lu, who is well known for his amazing “swiss army knife” program Multiwfn, which can compute a large variety of molecular properties using wavefunction files. He had in fact proposed his own interoperability format, called MWFN (documented here[1]), to eliminate many of the recognised issues with the older WFN, FCHK and WFX formats. Currently this particular format is not yet widely supported by wavefunction-computing programs such as Gaussian, but perhaps Output=mwfn will come one day!
  3. This is a later email describing the Trexio Project (https://trex-coe.github.io/trexio/ and specifically https://trex-coe.github.io/trexio/trex.html) in which a metadata group is specifically identified because “we need to give the possibility to the users to store some metadata inside the files.” In fact, metadata is also useful for registration with metadata agencies.

This increasing discussion of Interoperability in Quantum Chemistry has to be warmly welcomed. It directly feeds into FAIR data and may even set a trend for other areas of chemistry, such as NMR spectroscopy!


I have now learnt that inserting one of the environment variables below as per

export GAUSS_OMDEF=fortranbinaryarray.faf
or
export GAUSS_ORDEF=rawbinaryarray.baf

into job scripts will achieve this (proposed media types chemical/x-rawbinaryarray  .baf and chemical/x-fortranwbinaryarray  .faf).

Currently doing both at the same time is not supported (G16 C.01), so the second file can be generated from a .chk file using one of the post-processing commands below, appended to the job script:

formchk -raw mychk.chk rawbinaryarray.baf
or
formchk -mat mychk.chk fortranbinaryarray.faf
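
The choice described above can be sketched as a small helper which, for one desired format, returns the export line to place in the job script and the formchk command that produces the other format afterwards. The environment variable names, formchk flags and file names are those quoted above; the function name, the dictionary and the default checkpoint name are illustrative assumptions, and nothing here invokes Gaussian itself.

```python
# Settings as quoted in the post; the pairing of each environment
# variable with a file type follows the file names used there.
FORMATS = {
    "faf": {"env": "GAUSS_OMDEF", "file": "fortranbinaryarray.faf", "flag": "-mat"},
    "baf": {"env": "GAUSS_ORDEF", "file": "rawbinaryarray.baf", "flag": "-raw"},
}


def job_script_lines(primary, chk="mychk.chk"):
    """Return (export line, formchk line) for a job script.

    `primary` ("faf" or "baf") is the format written by the program via
    its environment variable; the other format is then generated from
    the checkpoint file with formchk.
    """
    other = "baf" if primary == "faf" else "faf"
    p, o = FORMATS[primary], FORMATS[other]
    export_line = f"export {p['env']}={p['file']}"
    formchk_line = f"formchk {o['flag']} {chk} {o['file']}"
    return export_line, formchk_line


for line in job_script_lines("faf"):
    print(line)
```

Keeping the two formats in one table makes it harder to pair the wrong flag with the wrong file extension when editing job scripts by hand.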


This post has DOI: 10.14469/hpc/10043


References

  1. T. Lu, and Q. Chen, "mwfn: A Strict, Concise and Extensible Format for Electronic Wavefunction Storage and Exchange", 2021. https://doi.org/10.26434/chemrxiv-2021-lt04f-v5

Molecule of the year 2021: Infinitene.

December 16th, 2021

The annual “molecule of the year” results for 2021 are now available … and the winner is Infinitene.[1] This is a benzocirculene in the form of a figure eight loop (the infinity symbol), a shape which is also called a lemniscate [2] after the mathematical (2D) function due to Bernoulli. The most common class of molecule which exhibits this (well known) motif are hexaphyrins (hexaporphyrins; porphyrin is a tetraphyrin)[3],[4],[5], many of which exhibit lemniscular topology as determined from a crystal structure. Straightforward annulenes have also been noted to display this[6] (as first suggested here for a [14]annulene[7]) and other molecules show higher-order Möbius forms such as trefoil knots.[8],[9] This new example uses twelve benzo groups instead of six porphyrin units to construct the lemniscate. So the motif is not new, but this is the first time it has been constructed purely from benzene rings.

The molecule has D2 chiral symmetry and is shown below (click on the image for the 3D model obtained from the crystal structure).

The authors suggest that the aromaticity in a D2-symmetric [12]-circulene is confined to six “Clar” rings each of six electrons, and is not delocalised around the entire molecule. For a molecule with this topology (defined by a linking number, Lk = 2π[10]) the entire system would be defined as aromatic (delocalised) for 4n+2 electrons and antiaromatic for 4n electrons around a continuous annulene loop. In this example, outer annulene circuits of either 34 or 38 carbons can be constructed which retain D2-symmetry and which both follow the 4n+2 rule, whilst a small inner circuit of 14 carbons can also be constructed. There are probably other D2-symmetric circuits that could be constructed.
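
The electron counting for these circuits is easy to check. The sketch below assumes one π electron per carbon in each annulene circuit (an assumption made for illustration) and simply reports whether the count is of the 4n+2 or the 4n type; the function name is hypothetical.

```python
def electron_count_class(n_electrons):
    """Classify a cyclic pi-electron count as '4n+2', '4n' or 'odd'."""
    if n_electrons % 2:
        return "odd"
    return "4n+2" if n_electrons % 4 == 2 else "4n"


# Annulene circuits mentioned in the text, one pi electron per carbon assumed.
for carbons in (34, 38, 14):
    print(carbons, electron_count_class(carbons))
```

For the topology discussed here the 4n+2 counts are the ones associated with aromaticity, so all three circuits qualify on electron count alone.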

When I saw the molecule, I asked myself what the calculated chiroptical properties for the molecule might be; the optical rotations of the two (separated) enantiomers of [12]-circulene were reported as +1130° (P,P) and -1112° (M,M). The calculated value (ωB97XD/Def2-TZVPP) is in excellent agreement. I have also included versions of this system with [11] and [10] benzo rings, which will be discussed in a future post.

Benzene units   Optical rotation (589 nm), °   DOI
12 (P,P)        +1143                          10.14469/hpc/10000
11 (P,P)        +1025                          10.14469/hpc/10037
10 (P,P)        -163                           10.14469/hpc/10001

For good measure, the calculated VCD spectrum

Now to the geometry, as obtained from the crystal structure. The [12]circulene shows in total 12 short bond lengths of 1.348±0.014Å, indicating significant localisation in the system. The D2-symmetric C34 path through the system shows a mean length for each bond of 1.405Å, with a maximum value of 1.443Å and a minimum of 1.334Å. For this path, the topology of the system indicates Lk = 2π = 0.393Tw + 1.607Wr.[11] This means that most of the coiling of the molecule that results in that figure eight actually arises from a topological property known as writhe (Wr) rather than adjacent twisting (Tw) of the p-orbitals. This retains much p(π)-p(π) overlap and hence stabilisation. The values for the inner C14 route are Lk = 2π = 1.256Tw + 0.744Wr, which is more highly twisted than the larger outer pathway, and so aromaticity via this route is less favoured due to less favourable p(π)-p(π) overlaps.
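
The two decompositions quoted above can be checked with a few lines of arithmetic. The sketch assumes that the quoted coefficients are the twist and writhe contributions expressed in units of π, so that their sum should equal the linking number of 2; the function names and route labels are illustrative only.

```python
def linking_number(tw, wr):
    """Linking number as the sum of twist and writhe (units of pi)."""
    return tw + wr


def writhe_fraction(tw, wr):
    """Share of the linking number carried by writhe."""
    return wr / (tw + wr)


# (Tw, Wr) pairs as quoted for the two annulene routes.
routes = {"C34 outer": (0.393, 1.607), "C14 inner": (1.256, 0.744)}
for name, (tw, wr) in routes.items():
    total = linking_number(tw, wr)
    share = writhe_fraction(tw, wr)
    print(f"{name}: Lk = {total:.3f} pi, writhe share = {share:.2f}")
```

On this reading, writhe dominates the outer route whilst twist dominates the inner one, which is the contrast drawn in the text.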

I also note that Lk = 2π is an alternative chiral descriptor to the helical notation of (P,P). The (M,M) form would have Lk = -2π. The linking number is more general for more complex helical forms such as trefoils, cinquefoils, hexafoils etc.

So it turns out that this molecule poses a fascinating challenge in describing its extended delocalised aromaticity (rather than localised six-membered Clar rings), since more than one “annulene route” exists for which the “Hückel/Möbius rules” might apply.[10] Given that the maximum bond length for one of those routes (the [34]annulene) is 1.443Å, there may well be a contribution from this mode of aromaticity in addition to that from the Clar rings.

I hope to take a look at the [11] and [10]circulenes in a future post.


The explanation for this sign inversion is delightful but too complex to give here.[12]


This post has DOI: 10.14469/hpc/10036


References

  1. K. Itami, M. Krzeszewski, and H. Ito, "Infinitene: A Helically Twisted Figure-Eight [12]Circulene Topoisomer", 2021. https://doi.org/10.26434/chemrxiv-2021-pcwcc
  2. C.S.M. Allan, and H.S. Rzepa, "Chiral Aromaticities. AIM and ELF Critical Point and NICS Magnetic Analyses of Möbius-Type Aromaticity and Homoaromaticity in Lemniscular Annulenes and Hexaphyrins", The Journal of Organic Chemistry, vol. 73, pp. 6615-6622, 2008. https://doi.org/10.1021/jo801022b
  3. H. Rath, J. Sankar, V. PrabhuRaja, T.K. Chandrashekar, B.S. Joshi, and R. Roy, "Figure-eight aromatic core-modified octaphyrins with six meso links: syntheses and structural characterization", Chemical Communications, pp. 3343, 2005. https://doi.org/10.1039/b502327k
  4. H. Rath, J. Sankar, V. PrabhuRaja, T.K. Chandrashekar, and B.S. Joshi, "Aromatic Core-Modified Twisted Heptaphyrins[1.1.1.1.1.1.0]:  Syntheses and Structural Characterization", Organic Letters, vol. 7, pp. 5445-5448, 2005. https://doi.org/10.1021/ol0521937
  5. S. Shimizu, N. Aratani, and A. Osuka, "meso‐Trifluoromethyl‐Substituted Expanded Porphyrins", Chemistry – A European Journal, vol. 12, pp. 4909-4918, 2006. https://doi.org/10.1002/chem.200600158
  6. T. Perera, F.R. Fronczek, and S.F. Watkins, "2,9,16,23-Tetrakis(1-methylethyl)-5,6,11,12,13,14,19,20,25,26,27,28-dodecadehydrotetrabenzo[a,e,k,o]cycloeicosene", Acta Crystallographica Section E Structure Reports Online, vol. 67, pp. o3493-o3493, 2011. https://doi.org/10.1107/s1600536811048604
  7. H.S. Rzepa, "A Double-Twist Möbius-Aromatic Conformation of [14]Annulene", Organic Letters, vol. 7, pp. 4637-4639, 2005. https://doi.org/10.1021/ol0518333
  8. G.R. Schaller, F. Topić, K. Rissanen, Y. Okamoto, J. Shen, and R. Herges, "Design and synthesis of the first triply twisted Möbius annulene", Nature Chemistry, vol. 6, pp. 608-613, 2014. https://doi.org/10.1038/nchem.1955
  9. S.M. Bachrach, and H.S. Rzepa, "Cycloparaphenylene Möbius trefoils", Chemical Communications, vol. 56, pp. 13567-13570, 2020. https://doi.org/10.1039/d0cc04190d
  10. P.L. Ayers, R.J. Boyd, P. Bultinck, M. Caffarel, R. Carbó-Dorca, M. Causá, J. Cioslowski, J. Contreras-Garcia, D.L. Cooper, P. Coppens, C. Gatti, S. Grabowsky, P. Lazzeretti, P. Macchi, Á. Martín Pendás, P.L. Popelier, K. Ruedenberg, H. Rzepa, A. Savin, A. Sax, W.E. Schwarz, S. Shahbazian, B. Silvi, M. Solà, and V. Tsirelson, "Six questions on topology in theoretical chemistry", Computational and Theoretical Chemistry, vol. 1053, pp. 2-16, 2015. https://doi.org/10.1016/j.comptc.2014.09.028
  11. S.M. Rappaport, and H.S. Rzepa, "Intrinsically Chiral Aromaticity. Rules Incorporating Linking Number, Twist, and Writhe for Higher-Twist Möbius Annulenes", Journal of the American Chemical Society, vol. 130, pp. 7613-7619, 2008. https://doi.org/10.1021/ja710438j
  12. M.S. Andrade, V.S. Silva, A.M. Lourenço, A.M. Lobo, and H.S. Rzepa, "Chiroptical Properties of Streptorubin B: The Synergy Between Theory and Experiment", Chirality, vol. 27, pp. 745-751, 2015. https://doi.org/10.1002/chir.22486

Protein-Biotin complexes. Crystal structure mining.

December 12th, 2021

In the previous post, I showed some of the diverse “non-classical” interactions between Biotin and a protein where it binds very strongly. Here I take a look at two of these interactions to discover how common they are in small molecule structures.

The first search is of a CH hydrogen bond to the face of the aromatic ring in a tryptophan residue.

The search is shown below, in which the distance of the hydrogen to the ring centroid is defined, as is the angle subtended at that centroid, constrained to lie within 20° of a vertical approach.

The resulting heat plot shows 2772 entries (no disorder, no errors, R < 0.05), with a rather diffuse red spot at around 2.7-2.8Å (but which can be as short as 2.3Å) and an angle of approach of ~90±5°. This matches the concept of a region of interaction rather than a more focused “hydrogen bond”. It is seen as a relatively common motif!
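
The two geometric parameters used in this search can be computed from atomic coordinates as sketched below. This is a minimal illustration rather than the procedure used by the search software: the ring is an idealised planar hexagon, the angle is taken as hydrogen-centroid-ring atom (my reading of “the angle subtended at that centroid”), and all function names and coordinates are hypothetical.

```python
import math


def centroid(points):
    """Arithmetic mean of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def angle_deg(a, vertex, b):
    """Angle a-vertex-b in degrees."""
    v1 = [a[i] - vertex[i] for i in range(3)]
    v2 = [b[i] - vertex[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    cos_angle = dot / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp to guard against rounding just outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def ch_pi_geometry(h, ring):
    """Return (H...centroid distance, H-centroid-ring atom angle)."""
    c = centroid(ring)
    return math.dist(h, c), angle_deg(h, c, ring[0])


# Idealised planar six-membered ring (radius 1.39 Å) in the xy plane,
# with a hydrogen placed 2.7 Å directly above the centroid.
ring = [(1.39 * math.cos(math.radians(60 * k)),
         1.39 * math.sin(math.radians(60 * k)), 0.0) for k in range(6)]
distance, angle = ch_pi_geometry((0.0, 0.0, 2.7), ring)
print(f"{distance:.2f} Å, {angle:.1f}°")
```

A hydrogen sitting directly over the ring centre gives an angle of 90° on this definition, which corresponds to the “vertical approach” in the search.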


The next search is for “hydrogen bonding” between the sulfur of a C-S-C unit (as found in Biotin) and an OH group.
This is less common, with 151 entries in the Cambridge small molecule database, the red spot having a relatively short S…H distance of 1.65Å and a slightly non-linear angle.

The NH analogue of this search (422 hits) shows two clusters. The one with a large angle at H is more concentrated and reveals a distance of ~2.9Å, whilst the second cluster has a smaller angle and a long tail out to ~2.5Å.

So we conclude there is ample evidence in small molecule crystal structures for the types of interaction mooted for Biotin with proteins.

Biotin’s biggest lesson is the importance of nonclassical H-bonds in protein−ligand complexes.

November 27th, 2021

The title comes from the abstract of an article[1] analysing why Biotin (vitamin B7) is such a strong and effective binder to proteins, with a free energy of (non-covalent) binding approaching 21 kcal/mol. The author argues that an accumulation of both CH-π and CH-O together with more classical hydrogen bonds and augmented by a sulfur centered hydrogen bond, oxyanion holes and water solvation, accounts for this large binding energy.

Here, I thought I would present a visualisation of the surroundings of biotin using the method of NCI (non-covalent-interaction) analysis, which looks at the behaviour of the electron density in the “weak” (i.e. non-covalent) regions of the biotin. This provides a more objective measure of the important interactions, independent of what we might consider important by virtue of having labels attached (such as e.g. “hydrogen bond”).

  1. I started by getting the coordinates of streptavidin (DOI: 10.2210/pdb3RY2/pdb), a protein with which biotin has been co-crystallised.[2]
  2. After loading the structure into the CCDC Mercury program, I selected the molecule biotin itself and then added to the selection its close contacts with various groups in the streptavidin protein. These additions were truncated and capped with a methyl group to allow a wavefunction for the assembly to be calculated.
  3. Hydrogens were then added to this structure to complete atom valencies, using “idealised” positions and ensuring that when rotamers were possible, they were set up to form hydrogen bonds.
  4. A calculation (DOI: 10.14469/hpc/9982 at the ωB97XD/Def2-TZVPP/SCRF=water level) was performed.
  5. The heavy atom coordinates (i.e. not hydrogens) are unaltered from the X-ray structure. Since atom positions as measured by X-ray diffraction and as computed using a DFT procedure are slightly different, the original coordinates were also subjected to three cycles of DFT-based geometry optimisation (DOI: 10.14469/hpc/9983) to better reflect the electron density in the molecule.
  6. The resulting wavefunctions in the form of an .fchk file (for both unoptimised and partially optimised geometries) were then used to compute a grid of total electron density points.
  7. The density, in the form of a cube of points, was fed to Jmol using the commands
    load biotin_den.cub; isosurface parameters [0.5 1 0.0005 0.05 0.95 1.00] NCI ""; color isosurface "bgyor" range -0.04 0.04;
    and the resulting NCI surface was written out using the command write biotin.jvxl for inclusion here.
  8. This is the NCI plot obtained from the raw coordinates from the PDB file.
  9. This is the NCI plot obtained from the coordinates from the PDB file after three geometry optimisation cycles. Can you spot any differences?

  10. These models are now available for you to explore by clicking on the images above.
    • Blue regions represent “strong” or classical hydrogen bonds. There are four of these in the NCI diagrams above and they are all compact, another characteristic of strong hydrogen bonds.
    • The hydrogen bond to sulfur is somewhat weaker, and appears in the display as a compact, albeit now cyan-coloured surface.
    • The remaining regions are both diffuse and green and represent weaker “interactions”. They are less compact than the classical hydrogen bonds. They do not represent a bond so much as an attractive region in the molecule and hence the term non-classical. Most are CH groups close to the π-surface of an aromatic ring, but some are also CH…O interactions.
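
As background to the surfaces described above, NCI analysis as commonly formulated identifies non-covalent regions as those where both the electron density and its reduced gradient are small. The sketch below evaluates the standard reduced density gradient at a single grid point in atomic units. The formula is the usual textbook definition rather than something taken from this post, the function name is hypothetical, and nothing is implied about how Jmol implements its NCI option.

```python
import math


def reduced_density_gradient(rho, grad_norm):
    """s = |grad rho| / (2 (3 pi^2)^(1/3) rho^(4/3)), in atomic units."""
    if rho <= 0:
        raise ValueError("density must be positive")
    prefactor = 2.0 * (3.0 * math.pi ** 2) ** (1.0 / 3.0)
    return grad_norm / (prefactor * rho ** (4.0 / 3.0))


# Illustrative values only: a low-density point with a small gradient.
print(reduced_density_gradient(0.01, 0.002))
```

Applied over a cube of density points, an isosurface of this quantity restricted to low densities is what produces the compact and diffuse regions discussed in the bullets above.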

Do go ahead and load the 3D surface. You should particularly explore the CH-π regions and note that they are not necessarily associated with a particular CH bond, but with several of these combining to form an interaction with an aromatic π region.

What might emerge is the realisation that binding interactions are not always between specific atoms as in classical hydrogen “bonds”, but also constitute “stabilising regions” between the ligand and the protein. You will probably spot several of these regions that are not actually listed in the article itself.[1] I suggest that we do not refer to CH…π bonds such as in the quoted title of this post but instead as CH…π regions.

It would be great if the entire complex could be subjected to an NCI analysis. Wavefunctions for >2000 atoms can be obtained nowadays, but it would require a bit of work to ensure the density can be computed accurately enough and at high enough cubic resolution to be useful in the context of NCI analysis.


This blog has DOI: 10.14469/hpc/9984


References

  1. D.B. McConnell, "Biotin’s Lessons in Drug Design", Journal of Medicinal Chemistry, vol. 64, pp. 16319-16327, 2021. https://doi.org/10.1021/acs.jmedchem.1c00975
  2. I. Le Trong, Z. Wang, D.E. Hyre, T.P. Lybrand, P.S. Stayton, and R.E. Stenkamp, "Streptavidin and its biotin complex at atomic resolution", Acta Crystallographica Section D Biological Crystallography, vol. 67, pp. 813-821, 2011. https://doi.org/10.1107/s0907444911027806

First came Molnupiravir – now there is Paxlovid as a SARS-CoV-2 protease inhibitor. An NCI analysis of the ligand.

November 13th, 2021

Earlier this year, Molnupiravir hit the headlines as a promising antiviral drug. This is now followed by Paxlovid, which is the first small molecule to be aimed by design at the SARS-CoV-2 protein and which is reported as greatly reducing the risk of hospitalization or death when given within three days of symptoms appearing in high risk patients.

The Wikipedia page (first created in 2021) will display a pretty good JSmol 3D model of this; the coordinates being generated automatically on the fly from a SMILES string, which specifies only what atoms are connected in the structure by bonds. Given that the structure of this molecule as embedded in the SARS-CoV-2 main protease[1] has been determined (and can be viewed here), I thought I might display those coordinates as an alternative to the Wikipedia/JSmol generated structure.

Click to get 3D model

I extracted the ligand from the PDB file and then added hydrogens manually to obtain the above result. There are two noteworthy points about these representations:

  1. A mystery concerns the nominal C≡N group on the top right, which displays an angle at the carbon of 117°. A cyano group is of course linear (180°). This is not a defect of the crystal structure determination, but an indication of a rather stronger interaction occurring (as indeed noted[1]). The distance between the carbon of the cyano group and an adjacent sulfur is 1.814Å, which indicates a covalent bond has formed to the cyano group. The nitrogen of the erstwhile cyano group is 3.013Å away from an adjacent NH group, which suggests it is stabilised by a hydrogen bond.
  2. Crystal structure searching of units with S…C…N in which the N has only one bond reveals zero hits, but searches of S…C…NH reveal nine hits, with S…C distances in the range 1.74-1.80Å and C…N distances in the region 1.25-1.27Å. The reported CN distance is 1.251Å, confirming that when bound to the protein, the cyano group is replaced by an S-C=NH group, which is hence clearly an important component of the mode of action of Paxlovid.
  3. The conformation of Paxlovid is in one respect not fully represented by the Wikipedia diagram, as shown below. This implies the t-butyl group (on the left) as being well separated from the pyrrolidinone ring system at the right of the molecule.

    In fact the two groups are adjacent, being held in that conformation by probably a combination of weak dispersion forces and a contribution from the surrounding protein in the crystal structure. This is more graphically shown by the NCI (non-covalent-interaction) diagram below (DOI: 10.14469/hpc/9964), where the green areas in the region between the two groups (ringed in red) represent stabilising interactions between them. You might also spot other green/cyan regions indicating additional weak hydrogen bonds between C-H groups and oxygen!

PAXLOVID NCI analysis

There are only a small number of crystal structures of small molecules containing the S-C=NH motif. I will try to find out how common this is in protein-ligand structures.
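
The bond-length argument made above for the S-C=NH group can be illustrated by comparing a measured C…N distance with typical values for each bond order. The reference lengths below are approximate textbook figures that I am assuming for illustration, not values taken from the crystal structure searches, and the function name is hypothetical.

```python
# Approximate typical C-N bond lengths in Å (assumed reference values).
TYPICAL_CN = {"single": 1.47, "double": 1.28, "triple": 1.16}


def nearest_bond_type(length):
    """Return the bond type whose typical length is closest to `length`."""
    return min(TYPICAL_CN, key=lambda kind: abs(TYPICAL_CN[kind] - length))


# The CN distance reported for the protein-bound ligand.
print(nearest_bond_type(1.251))
```

A nearest-value lookup like this is crude, but it makes the point that 1.251Å sits much closer to a C=N double bond than to a cyano triple bond.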


There are many tools for performing this operation. I used the following procedure. I downloaded the PDB file (https://files.rcsb.org/download/7vh8.cif), opened it in CSD Mercury, selected the ligand (by identifying the CF3 group and clicking on one atom), inverted the selection so that everything but the ligand was then selected and using edit/structure, I deleted the selected atoms, leaving only the ligand.

Postscript

The cyanopyrrolidine group such as in Paxlovid is well known as a specific probe.[2],[3],[4] CovalentInDB is a comprehensive database facilitating the discovery of such covalent inhibitors[5] and is available here. There is also a program called DataWarrior that is potentially able to find such probes.

References

  1. Y. Zhao, C. Fang, Q. Zhang, R. Zhang, X. Zhao, Y. Duan, H. Wang, Y. Zhu, L. Feng, J. Zhao, M. Shao, X. Yang, L. Zhang, C. Peng, K. Yang, D. Ma, Z. Rao, and H. Yang, "Crystal structure of SARS-CoV-2 main protease in complex with protease inhibitor PF-07321332", Protein & Cell, vol. 13, pp. 689-693, 2021. https://doi.org/10.1007/s13238-021-00883-2
  2. N. Panyain, A. Godinat, A.R. Thawani, S. Lachiondo-Ortega, K. Mason, S. Elkhalifa, L.M. Smith, J.A. Harrigan, and E.W. Tate, "Activity-based protein profiling reveals deubiquitinase and aldehyde dehydrogenase targets of a cyanopyrrolidine probe", RSC Medicinal Chemistry, vol. 12, pp. 1935-1943, 2021. https://doi.org/10.1039/d1md00218j
  3. N. Panyain, A. Godinat, T. Lanyon-Hogg, S. Lachiondo-Ortega, E.J. Will, C. Soudy, M. Mondal, K. Mason, S. Elkhalifa, L.M. Smith, J.A. Harrigan, and E.W. Tate, "Discovery of a Potent and Selective Covalent Inhibitor and Activity-Based Probe for the Deubiquitylating Enzyme UCHL1, with Antifibrotic Activity", Journal of the American Chemical Society, vol. 142, pp. 12020-12026, 2020. https://doi.org/10.1021/jacs.0c04527
  4. C. Bashore, P. Jaishankar, N.J. Skelton, J. Fuhrmann, B.R. Hearn, P.S. Liu, A.R. Renslo, and E.C. Dueber, "Cyanopyrrolidine Inhibitors of Ubiquitin Specific Protease 7 Mediate Desulfhydration of the Active-Site Cysteine", ACS Chemical Biology, vol. 15, pp. 1392-1400, 2020. https://doi.org/10.1021/acschembio.0c00031
  5. H. Du, J. Gao, G. Weng, J. Ding, X. Chai, J. Pang, Y. Kang, D. Li, D. Cao, and T. Hou, "CovalentInDB: a comprehensive database facilitating the discovery of covalent inhibitors", Nucleic Acids Research, vol. 49, pp. D1122-D1129, 2020. https://doi.org/10.1093/nar/gkaa876

More examples of crystal structures containing embedded linear chains of iodines.

October 17th, 2021

The previous post described the fascinating 170-year history of a crystalline compound known as Herapathite and its connection to the mechanism of the Finkelstein reaction via the complex of Na⁺I₂ (or Na₂²⁺I₄²⁻). Both compounds exhibit (approximately) linear chains of iodine atoms in their crystal structures, a connection which was discovered serendipitously. Here I pursue a rather more systematic way of tracking down similar compounds.

Here is one search query which can be used in the CSD database of crystal structures. A chain of eight iodine atoms is defined, and the six angles subtended at iodine restricted to the range 150-180° (i.e. linear). The inner six iodines are also defined as having only two bonded atoms.
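
The angular criterion in this query can be sketched for a chain of atomic coordinates as below. It tests only the angle condition (every interior angle of the chain at least 150°) and not the requirement that the inner atoms have exactly two bonded neighbours; the function names and the test coordinates are hypothetical.

```python
import math


def angle_deg(a, vertex, b):
    """Angle a-vertex-b in degrees."""
    v1 = [a[i] - vertex[i] for i in range(3)]
    v2 = [b[i] - vertex[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    cos_angle = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def chain_angles(coords):
    """Interior angles of a chain: n atoms give n - 2 angles."""
    return [angle_deg(coords[i - 1], coords[i], coords[i + 1])
            for i in range(1, len(coords) - 1)]


def is_near_linear(coords, min_angle=150.0):
    """True if every interior angle is at least `min_angle` degrees."""
    return all(angle >= min_angle for angle in chain_angles(coords))


# Eight atoms spaced 3.0 Å apart along x: six interior angles of 180°.
straight = [(3.0 * i, 0.0, 0.0) for i in range(8)]
print(len(chain_angles(straight)), is_near_linear(straight))
```

An eight-atom chain yields the six angles mentioned in the query, so a single sharply bent atom is enough to exclude a structure.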

This results in four hits (October 2021), three of which are shown below (the fourth, JOPLEH, contains chains of I₈²⁻ anions which do not appear to be infinitely repeating).

  1. IQIVIP, containing the heterocyclic unit pyrroloperylene and connected chains of I29.[1] See also DOI: 10.5517/ccdc.csd.cc1m1tj0

    Click to load 3D model of IQIVIP



    The truly remarkable feature is that the iodine chain appears to adopt a gentle right-handed helix in this isomer. One has to wonder how this might respond to light!
  2. IQIVOV, closely related to IQIVIP, this time containing connected chains of gently spiralling I10 groups.[1] See also DOI: 10.5517/ccdc.csd.cc1m1tk1

    Click to load 3D model of IQIVOV

  3. WEVFAE, containing a tetramethylstibonium cation (an analogue of a tetramethylammonium cation) and this time infinite chains of I₈³⁻ anions.[2]

    Click to load 3D model of WEVFAE

The list is not long, but contains some fascinating examples of how iodine can catenate into infinitely long chains, sometimes linear (on the time-averaged scale at the temperature of the data recording), sometimes gently helical and, as with Herapathite, sometimes a rather more undulating motif. Again, how the crystals of these compounds respond to light remains to be established. However, since these three molecules are reported variously as being black-green, black and golden, it may be that some are opaque to light in any orientation. I also note that linear chains of Ag, Ga, In and Tl have been reported in inorganic metal nitrides.[3]


The same result is obtained if the specification of iodine in this search is replaced by “any” element. This post has DOI: 10.14469/hpc/9540. See also DOI: 10.1016/j.hm.2005.11.005 for a connection between coiled chains of iodine atoms and Einstein’s theory of teleparallel spacetime, invoking torsional geometries.

References

  1. S. Madhu, H.A. Evans, V.V.T. Doan‐Nguyen, J.G. Labram, G. Wu, M.L. Chabinyc, R. Seshadri, and F. Wudl, "Infinite Polyiodide Chains in the Pyrroloperylene–Iodine Complex: Insights into the Starch–Iodine and Perylene–Iodine Complexes", Angewandte Chemie International Edition, vol. 55, pp. 8032-8035, 2016. https://doi.org/10.1002/anie.201601585
  2. U. Behrens, H.J. Breunig, M. Denker, and K.H. Ebert, "Iodine Chains in (Me₄Sb)₃I₈ and Discrete Triiodide Ions in Me₄AsI₃", Angewandte Chemie International Edition in English, vol. 33, pp. 987-989, 1994. https://doi.org/10.1002/anie.199409871
  3. P. Höhn, G. Auffermann, R. Ramlau, H. Rosner, W. Schnelle, and R. Kniep, "(Ca₇N₄)[Mₓ] (M=Ag, Ga, In, Tl): Linear Metal Chains as Guests in a Subnitride Host", Angewandte Chemie International Edition, vol. 45, pp. 6681-6685, 2006. https://doi.org/10.1002/anie.200601726

Herapathite: an example of (double?) serendipity.

October 14th, 2021

On October 13, 2021, the historical group of the Royal Society of Chemistry organised a symposium celebrating ~150 years of the history of (molecular) chirality. We met for the first time in person for more than 18 months and were treated to a splendid and diverse program about the subject. The first speaker was Professor John Steeds from Bristol, talking about the early history of light and the discovery of its polarisation. When a slide was shown about herapathite[1] my “antennae” started vibrating. This is a crystalline substance made by combining elemental iodine with quinine in acidic conditions and was first discovered by William Herapath as long ago as 1852[2] in unusual circumstances. Now to the serendipity!

Herapath was able to get small crystals of this substance and discovered that when he placed one crystal upon another at “right angles”, the combination went “black as midnight”. He recognised that it was functioning as an excellent linear light polarizer, absorbing virtually all the light polarized along the shorter axis of the best-developed facet of the crystal. A number of well known scientists investigated this substance at the time, but by about 1951 it had largely been forgotten. The person to rediscover it was Edwin Land, of Polaroid camera fame.[3] He oriented the microcrystals into an extruded polymer to stabilize them and hence produce the first large-aperture light polarizer, which enabled him to manufacture his first camera. The serendipity resulted from him spotting the by then forgotten properties of Herapathite (I wonder if he recorded how this actually came about) and recognising how to exploit it.

In 2009 Bart Kahr noticed that the crystal structure of this material had never been reported. It was a challenging structure to solve[1] but established that the polarizing property of the crystals was in large measure due to the presence of infinite chains of I3 units aligned in an almost linear channel in the crystal structure. And so it was that in October 2021, John Steeds showed the structure containing these iodine chains in his slide on the topic. The crystal structure is in the CCDC database as WEYDOV and can be seen here at DOI: 10.5517/ccsdg7v. I show below part of the extended lattice, showing that chain of iodines.

Click to view 3D model of WEYDOV

So to the next (possible) instance of serendipity. From the audience, I immediately recognised this structural motif as being related to the crystal structure of both Na⁺I (NAIACE) and Na⁺I₂ (GADMOO)[4], which I discussed in one of the very first posts on this blog in 2009 as part of a story about the Finkelstein reaction. Both these structures were obtained from acetone solution, and this solvent very much forms part of the crystal structures, serving to coordinate the sodium cations and playing the role of the quinine in herapathite. The iodine chains, comprising in GADMOO units of I3 and I, are almost exactly linear!

Click to view 3D model of NAIACE

Click to view 3D model of GADMOO

So, the question arises as to whether crystals of Na+I2 have ever been examined for light polarisation? One might also ask whether eg the chiral quinine imparts a critical property to the herapathite crystal, or could the achiral acetone also serve the purpose? What would happen if substituted versions of acetone were used (halo, methyl etc)? Would they destroy those linear chains, or would they survive? Are repeating chains of I3 units essential, or can chains of alternating units of I3 and I also serve the purpose? All questions that can only be answered by experiments! Anyone up for trying?


This post has DOI: 10.14469/hpc/9537


References

  1. B. Kahr, J. Freudenthal, S. Phillips, and W. Kaminsky, "Herapathite", Science, vol. 324, pp. 1407-1407, 2009. https://doi.org/10.1126/science.1173605
  2. W.B. Herapath, "XXVI. On the optical properties of a newly-discovered salt of quinine, which crystalline substance possesses the power of polarizing a ray of light, like tourmaline, and at certain angles of rotation of depolarizing it, like selenite", The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 3, pp. 161-173, 1852. https://doi.org/10.1080/14786445208646983
  3. E.H. Land, "Some Aspects of the Development of Sheet Polarizers*", Journal of the Optical Society of America, vol. 41, pp. 957, 1951. https://doi.org/10.1364/josa.41.000957
  4. R.A. Howie, and J.L. Wardell, "Polymeric tris(μ2-acetone-κ2O:O)sodium polyiodide at 120 K", Acta Crystallographica Section C Crystal Structure Communications, vol. 59, pp. m184-m186, 2003. https://doi.org/10.1107/s0108270103006395

A comparison of searches based on metadata records from three (update: five) research repositories.

September 28th, 2021

In the previous blog post, I looked at the metadata records registered with DataCite for some chemical computational modelling files as published in three different repositories. Here I take it one stage further, by looking at how searches of the DataCite metadata store for three particular values of the metadata associated with this dataset compare.

Search 1: The metadata value of -1705.490787 is actually the Gibbs Free energy computed for the molecule associated with the data set, a molecule which featured in this blog post. The query https://commons.datacite.org/?query=*\-170* is an un-fielded search for the truncated string -170* (where * is a wild card character and \ is said to “escape” the minus sign, since on its own a minus can also indicate a Boolean NOT operator), resulting in 70,918 works matching the query. From what we know about the dataset in question, this is a vast number of false positives. How can we reduce them?

Search 1a: https://commons.datacite.org/?query=subjects.subject:\-170* is a fielded search, specifying that the string must occur in the subject field (62 works) but this still has 57 false positives.

Search 1b: https://commons.datacite.org/?query=subjects.subject:\-1705.490787* (in fact precision of -1705.4* is also sufficient) removes all the false positives (5 works). But are there any false negatives? In fact, for other reasons, we know that there are two works in the Figshare repository where the value of -1705.490787 appears in the keyword items on the landing page of e.g. 10.6084/m9.figshare.16685497 and is indexed and searchable locally, but does not appear in the registered metadata and hence is not included in the results of the above searches.

Search 2: A further, formally much stronger constraint on the search is https://commons.datacite.org/?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:\-1705.490787* whereby a subjectScheme is added to search 1b, constrained to the value Gibbs_Energy. This now returns 3 works, two fewer than search 1b. There are two further false negatives because, as noted previously, the subjectScheme term is not defined in the Zenodo repository metadata record, where the missing two items are located.

Search 2a: https://commons.datacite.org/?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:*1705.490787*+AND+subjects.schemeUri:*goldbook* is even further constrained to specify a Gibbs_Energy according to the IUPAC Gold book definition.

Search 2b: https://commons.datacite.org/?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:*1705.490787*+AND+subjects.schemeUri:*goldbook*+AND+subjects.valueUri:*gaussian* is the highest level of constraint, implying not only that the term Gibbs_Energy is specified by the IUPAC Gold book definition, but that its value is that determined by (in this example) the Gaussian (implementation).

So to summarise what we have thus far established, we can successfully eliminate false positives by specifying a fielded search with a requirement that the field specifically relates to Gibbs_Energy. But because of omissions in the metadata records, we also have four false negatives resulting from doing this.
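
The way these constraints stack up can be sketched in a few lines of Python. This is only an illustration of how the query strings quoted above are assembled: the helper name fielded_query is hypothetical, the field names are those used in the searches above, and the values are assumed to be already escaped where necessary.

```python
BASE = "https://commons.datacite.org/?query="

def fielded_query(constraints):
    """Join (field, value) pairs into one AND-ed query URL.

    Each extra pair narrows the search, which is how searches
    1b, 2 and 2a differ from one another.
    """
    terms = [f"{field}:{value}" for field, value in constraints]
    return BASE + "+AND+".join(terms)

# Search 1b: the value alone, restricted to the subject field.
search_1b = fielded_query([
    ("subjects.subject", r"\-1705.490787*"),
])

# Search 2: the same value, but only where the scheme is Gibbs_Energy.
search_2 = fielded_query([
    ("subjects.subjectScheme", "Gibbs_Energy"),
    ("subjects.subject", r"\-1705.490787*"),
])

print(search_1b)
print(search_2)
```

Each added pair removes false positives, but it also silently drops any record whose metadata lacks that field, which is where the false negatives come from.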

Search 3: https://commons.datacite.org/?query=subjects.subject:VELNVPXNOKVVTC-VJKZSTDTSA-N searches for another subject term, the InChI key for the molecule relating to the data (5 works). Here again however context for the string VELNVPXNOKVVTC-VJKZSTDTSA-N is missing, although again the string is long enough to ensure it is unique. But we could go one step further.

Search 4: https://commons.datacite.org/?query=subjects.subjectScheme:inchikey+AND+subjects.subject:VELNVPXNOKVVTC-VJKZSTDTSA-N constrains the subject term to only those strings describing an InChIkey (3 works). This again is due to Zenodo not specifying the subjectScheme and Figshare not even containing the InChIkey in its metadata record.

Search 4a: https://commons.datacite.org/?query=subjects.subjectScheme:inchikey+AND+subjects.schemeUri:*inchi-trust*+AND+subjects.subject:VELNVPXNOKVVTC-VJKZSTDTSA-N constrains the inchikey further by specifying the authority for the scheme definition as the InChI Trust. 

Search 5: https://commons.datacite.org/?query=subjects.subject:InChI=1S/C25H39NO9* is query 1, but on the InChI string rather than the InChI key, and with the same results as before (5 works). Here, the string is deliberately truncated to return only the molecular formula of the molecule.

Search 5a: https://commons.datacite.org/?query=subjects.subjectScheme:inchi+AND+subjects.subject:InChI=1S/C25H39NO9* is query 4, with the subjectScheme changed to only the molecular formula component of an InChI (3 works). 

Search 5b: https://commons.datacite.org/?query=subjects.subject:InChI=1S/C25H39NO9/c1-6-26-20-24-13-9-12-14\(31-2\)10-23\(29,16\(13\)17\(12\)33-4\)25\(26,30* truncates much less of the InChI string, extending it to the molecular connection table. Notice how characters such as ( or ) have been escaped with a \ prefix. Such characters are used for grouping in the search query and so must be escaped to be included in the query.

Search 5c: https://commons.datacite.org/?query=subjects.subject:InChI=1S/C25H39NO9/c1-6-26-20-24-13-9-12-14\(31-2\)10-23\(29,16\(13\)17\(12\)33-4\)25\(26,30\)19\(34-5\)18\(24\)22\(11-27,21\(28\)35-20\)8-7-15\(24\)32-3\/h12-20,27,29-30H,6-11H2,1-5H3* For this length string (and InChI strings can get very long!) an unidentified error can occur, suggesting that the full InChI string is best not used for such searches.
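
The escaping used in searches 1 and 5b can be sketched as a small Python helper. This is a minimal sketch that covers only the two cases seen in the examples above (a leading minus sign and parentheses); the function name escape_term is hypothetical, and the full set of characters that the search service treats as operators is not established here.

```python
def escape_term(term: str) -> str:
    """Prefix characters that act as query operators with a backslash.

    Only the cases met in the searches above are handled:
    parentheses (grouping) and a leading minus (Boolean NOT).
    """
    escaped = term.replace("(", r"\(").replace(")", r"\)")
    if escaped.startswith("-"):
        escaped = "\\" + escaped
    return escaped

# The Gibbs energy value from search 1b.
print(escape_term("-1705.490787"))

# A fragment of the InChI connection table from search 5b.
print(escape_term("14(31-2)10-23(29,16(13)17(12)33-4)25(26,30"))
```

Hyphens inside the InChI string are left alone, as they are in the URLs quoted above.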

Search 6: 

From these experiments, we learn that the quality and completeness/richness of the metadata record is vital to ensure no false positives or negatives are returned by the search. Ensuring such metadata richness is something that a repository should do, and it is interesting that two of the best known repositories both currently have failings in this regard. I might try one or two other popular repositories to see how they behave and will report back if I find anything interesting.


Thus https://commons.datacite.org/doi.org?query=subjects.subjectScheme:*inchikey* reveals all entries that specify an InChIkey in the subject metadata (185,414 works) but https://commons.datacite.org/doi.org?query=subjects.subjectScheme:*inchikey*+AND+subjects.schemeUri:*inchi-trust* reveals only 1748 of these further specify the InChI trust as the authority. Two more depositories, Mendeley Data and Harvard Dataverse have been populated with the same data. See here.


This post has DOI: 10.14469/hpc/9162

A comparison of descriptive metadata across different data repositories.

September 28th, 2021

The number of repositories which accept research data across a wide spectrum of disciplines is on the up. Here I report the results of conducting an experiment in which chemical modelling data was deposited in three such repositories and comparing the richness of the metadata describing the essential properties of the three depositions.

The three repositories are as follows:

  1. Figshare as a repository dates from 2012. The computational chemistry dataset used was manually uploaded. Most of the metadata was entered manually by copy/paste operations and included three keywords which comprised the InChI key for the molecule, the corresponding InChI string and the calculated Gibbs Energy obtained from the computed vibrational frequencies.
  2. Zenodo started in 2013 and has been updated several times since then. The same data and metadata were used as for Figshare, including the same keywords, but with the difference that the upload was not manual but automated using the Zenodo API as implemented in the new computational portal described in the previous post (DOI: 10.14469/hpc/9010). Publication here was a simple button click and so is a much shorter process than that for Figshare.
  3. The original 2006 version of the Imperial College data repository was based on DSpace, and updated to version 2 in 2016 with entirely new code. It too is populated by publication from the same portal as used for Zenodo.
  4. Mendeley data:
  5. Harvard Dataverse:

Each deposition results in the generation of a DOI, and these, together with the link that allows access to the associated metadata can be seen in the table below.

Repository          Dataset DOI                     Dataset metadata
Figshare            10.6084/m9.figshare.16685497    XML, JSON
Zenodo              10.5281/zenodo.5511966          XML, JSON
Imperial College    10.14469/hpc/9031               XML, JSON
Harvard Dataverse   10.7910/DVN/4BWOYK              XML, XML, Codebook
Mendeley Data       10.17632/dgtvds3xn5.1           XML, JSON

I would note that manual deposition can be rather dependent on how fastidious the depositor is and how they interpret the descriptive keywords that Figshare and Zenodo accept. Automated deposition is a more controlled process, in which the required keywords are a property programmed into the submitting portal tool. Such a process also allows metadata to describe relationships between different datasets, such as a dataset collection, which is inherited from the project descriptor on the portal. Additionally, the automated process can then be augmented by manual editing of the metadata record, as for example, the addition of the DOI for this descriptive post which can be added to the metadata records retrospectively. In the case of e.g. Zenodo, retrospective changes to the metadata record require a new DOI to be generated to reflect the changes.

You can inspect the results of these three depositions yourself by downloading the respective metadata records and viewing the downloaded file using a simple text or XML editor. 

  1. All three repositories contain the ORCID of the depositor, as e.g. from Figshare:
    <creator> 
    <creatorName>Rzepa, Henry S.</creatorName> 
    <givenName>Henry S.</givenName> 
    <familyName>Rzepa</familyName> 
    <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="https://orcid.org">
    https://orcid.org/0000-0002-8635-8390
    </nameIdentifier>
    </creator>

    The widespread addition of the unique ORCID researcher identifier is very welcome.

  2. The more interesting component is keyword metadata, populated manually in Figshare and using the automated API in the other two repositories.
    1. Below is the Figshare metadata entry, which displays the assigned categories (from a controlled list) in the <subject> container:
      <subjects>
          <subject>Computational Chemistry</subject>
          <subject>Organic Chemistry</subject>
          <subject subjectScheme="Fields of Science and Technology (FOS)" schemeURI="http://www.oecd.org/science/inno/38235147.pdf">FOS: Chemical sciences</subject>
        </subjects>

      The context of these keywords is clearly defined by the value of the subjectScheme (chemical sciences) but this term is very broad and does not relate very specifically to the deposited data. The more chemically specific keywords themselves are only displayed on the landing page for the entry as shown below and are not expressed in any metadata container, which means that they are not indexed and hence not searchable using the DataCite metadata store.

    2. Zenodo interprets this differently, with the keywords now included in the <subject> container.
      <subjects>
          <subject>-1705.490787</subject>
          <subject>InChI=1S/C25H39NO9/c1-6-26-20-24-13-9-12-14(31-2)10-23(29,16(13)17(12)33-4)25(26,30)19(34-5)18(24)22(11-27,21(28)35-20)8-7-15(24)32-3/h12-20,27,29-30H,6-11H2,1-5H3/t12-,13-,14+,15+,16-,17+,18-,19+,20+,22+,23-,24+,25+/m1/s1</subject>
          <subject>VELNVPXNOKVVTC-VJKZSTDTSA-N</subject>
        </subjects>

      However, you might be wondering what the keyword -1705.490787 is all about. Put simply, in this form of expression it has absolutely no context. I previously explained why it might be useful if context is added, it being a persistent identifier for (some) quantum chemical calculations in the form of a computed total energy corrected thermally into a Gibbs energy. The persistence in this case is acquired not by registration with an agency but generation by an algorithm. That algorithm in turn would require additional metadata for its specification, but that is something I will not address in this post. At any rate, because it is part of the metadata record, it is search-enabled in the Zenodo version.

    3. Imperial follows the Zenodo approach, with further addition of context:
      <subjects>
          <subject subjectScheme="Gibbs_Energy" schemeURI="https://doi.org/10.1351/goldbook.G02629" valueURI="http://gaussian.com/thermo/">-1705.490787</subject>
          <subject subjectScheme="inchi" schemeURI="http://www.inchi-trust.org/">InChI=1S/C25H39NO9/c1-6-26-20-24-13-9-12-14(31-2)10-23(29,16(13)17(12)33-4)25(26,30)19(34-5)18(24)22(11-27,21(28)35-20)8-7-15(24)32-3/h12-20,27,29-30H,6-11H2,1-5H3/t12-,13-,14+,15+,16-,17+,18-,19+,20+,22+,23-,24+,25+/m1/s1</subject>
          <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">VELNVPXNOKVVTC-VJKZSTDTSA-N</subject>
      </subjects>

      The context is added by addition of the attributes subjectScheme, schemeURI and valueURI. The top level context is the definition provided by the IUPAC Gold Book, and the actual implementation of the algorithm is described on the Gaussian site (although the algorithm there is not explicit in a machine implementable sense). These additions allow an indexed search not only of the numerical value (as a simple string and not as a floating point number) but which can be constrained by specifying the value of e.g. the subjectScheme so that any other random number specified as a keyword which does not have this attribute is excluded. This also allows a search where the floating point number is replaced by wild-cards (*), which would then retrieve ANY reported Gibbs energy, which could in turn be constrained by say the nature of the molecule as expressed using  InChI. 

  3. The final aspect of the metadata analysed here is the relatedIdentifier record. This is increasingly recognised as a crucial component for the construction of so-called PID graphs, which are generated to reveal connections between entities in the research landscape such as data, people, organisations, funders, publications and any other object that is assigned a registered PID (such as perhaps in the future connecting data to its origins from a large instrument). So here are these records for the three repositories:
    1. Although the landing page for the Figshare record has three such entries, including pointers to the other two depositions being discussed here, they are not propagated to the metadata record and so cannot participate in any generated PID graph.
    2. Zenodo has the following record
      <relatedIdentifiers>
          <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.5511965</relatedIdentifier>
        </relatedIdentifiers>

      which relates to an earlier version of the metadata for this entry.

    3. The Imperial record is:
      <relatedIdentifiers>
          <relatedIdentifier relatedIdentifierType="URL" relationType="HasMetadata">https://data.hpc.imperial.ac.uk/resolve/?ore=9031</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://data.hpc.imperial.ac.uk/resolve/?doi=9031&file=1</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://data.hpc.imperial.ac.uk/resolve/?doi=9031&file=2</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://data.hpc.imperial.ac.uk/resolve/?doi=9031&file=3</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://data.hpc.imperial.ac.uk/resolve/?doi=9031&file=4</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="DOI" relationType="References">10.5281/zenodo.5511966</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="DOI" relationType="References">10.6084/m9.figshare.16685497</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="DOI" relationType="IsPartOf">10.14469/hpc/9158</relatedIdentifier>
        </relatedIdentifiers>

      where a large number of related PIDs would result in a rich PID graph. These entries include relationType=”HasMetadata” which is a pointer to additional metadata expressed using a different schema (ORE) and which provides a machine-actionable manifest for the files present, specifying the Media Types of each file and a machine method of accessing them. relationType=”HasPart” provides an access URL for each specific item in the fileset. relationType=”References”  is the analogue of the Figshare entries above, citing the other two repositories we are discussing here and finally relationType=”IsPartOf” indicates the deposition is part of a larger collection (in this case the collection generated for this blog) and which could also correspond to e.g. a project comprising multiple researchers at multiple institutions, or say a PhD dissertation containing multiple chapters. The extensive nature of this list of identifiers means that the PID graph would reveal many connections.
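
The difference that the subjectScheme attribute makes can be illustrated with a short Python sketch using only the standard library. It parses a cut-down copy of the subject entries shown above (one with context, one without) and groups the values by scheme. The snippet omits the XML namespace that a full registered record would carry, and the function name subjects_by_scheme is hypothetical.

```python
import xml.etree.ElementTree as ET

SNIPPET = """
<subjects>
  <subject subjectScheme="Gibbs_Energy" schemeURI="https://doi.org/10.1351/goldbook.G02629" valueURI="http://gaussian.com/thermo/">-1705.490787</subject>
  <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">VELNVPXNOKVVTC-VJKZSTDTSA-N</subject>
  <subject>-1705.490787</subject>
</subjects>
"""

def subjects_by_scheme(xml_text):
    """Group subject values by their subjectScheme (None if absent)."""
    grouped = {}
    for node in ET.fromstring(xml_text.strip()).iter("subject"):
        scheme = node.get("subjectScheme")
        grouped.setdefault(scheme, []).append((node.text or "").strip())
    return grouped

grouped = subjects_by_scheme(SNIPPET)

# Only values carrying the Gibbs_Energy scheme can safely be read as energies.
gibbs = [float(value) for value in grouped.get("Gibbs_Energy", [])]
print(gibbs)

# The same number without a scheme is just an uninterpreted string.
print(grouped[None])
```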

I have only covered three repositories here; many more could be added to the list and analyzed for their metadata records. The bottom line is that generally the more metadata that is added, the richer the resulting services and analyses based on PIDs can become. It can only be hoped that this aspect of the operation of repositories continues to improve over time and eventually most will broadcast very rich metadata, including at the very specific subject level. This should enrich the research landscapes, especially at the finely grained subject level.
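
To make the idea of a PID graph a little more concrete, here is a minimal Python sketch that turns relatedIdentifier entries into graph edges of the form (source, relation, target). It uses only the three DOI-typed entries from the Imperial record quoted above, again without the namespace of a full record; the function name pid_edges and the edge representation are illustrative assumptions, not how any particular PID graph service works.

```python
import xml.etree.ElementTree as ET

RECORD_DOI = "10.14469/hpc/9031"

RELATED = """<relatedIdentifiers>
  <relatedIdentifier relatedIdentifierType="DOI" relationType="References">10.5281/zenodo.5511966</relatedIdentifier>
  <relatedIdentifier relatedIdentifierType="DOI" relationType="References">10.6084/m9.figshare.16685497</relatedIdentifier>
  <relatedIdentifier relatedIdentifierType="DOI" relationType="IsPartOf">10.14469/hpc/9158</relatedIdentifier>
</relatedIdentifiers>"""

def pid_edges(source, xml_text):
    """Return (source, relationType, target) for each DOI-typed entry."""
    edges = []
    for node in ET.fromstring(xml_text).iter("relatedIdentifier"):
        if node.get("relatedIdentifierType") == "DOI":
            target = (node.text or "").strip()
            edges.append((source, node.get("relationType"), target))
    return edges

edges = pid_edges(RECORD_DOI, RELATED)
for edge in edges:
    print(edge)
```

A record with no relatedIdentifier entries in its registered metadata contributes no edges at all, which is why the Figshare deposition cannot take part in such a graph.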

In the next post, I will analyse the results of searches enabled by this metadata.


Figshare also has an available API, which has not been implemented in the current version of this portal.

Policies regarding editing of metadata vary. Some repositories allow editing updates to the record held by DataCite against the existing DOI. Others require the generation of a new DOI for each new version of the metadata, no matter how small a change (e.g. spelling mistakes in the title etc).

An unsolved problem in DataCite metadata is datatypes and units. This entry is a floating point data type, with units of Hartree. How this information can be added is still being discussed.

HPC Access and Metadata Portal (CHAMP).

September 13th, 2021

You might have noticed, if you have read any of my posts here, that many of them have been accompanied since 2006 by supporting calculations, normally based on density functional theory (DFT), and that these calculations are accompanied by a persistent identifier pointer to a data repository publication. I have hitherto not gone into the detail here of the infrastructures required to do this sort of thing, but recently one of the two components has been updated to V2, after being at V1 for some fourteen years[1], and this provides a timely opportunity to describe the system a little more.

The original design was based on what we called a portal to access the high performance computing (HPC) resources available centrally. These are controlled by a commercial package called PBS which provides a command line driven interface to batch queues. Whilst powerful, PBS can also be complex, and for everyday routine use it seemed more convenient to package up this interface into a Web-accessed portal. This also included the ability to specify the resources needed (such as memory, number of CPUs, etc) to run the desired compute program, in our case the Gaussian 16 package, and completed things by adding a simple interface to a data repository for use when the calculation was finished.
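
As a rough illustration of what such a portal hides from the user, the following Python sketch assembles a batch script from a handful of resource choices. It is not the portal's own code: the function name pbs_script, the job name and the file names are hypothetical, and the directives follow common PBS Professional conventions, which may differ from those used on any particular cluster.

```python
def pbs_script(job_name, ncpus, mem_gb, walltime, input_file):
    """Return the text of a batch script for a single Gaussian job."""
    output_file = input_file.rsplit(".", 1)[0] + ".log"
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",
        # Resources requested: one node with the given CPUs and memory.
        f"#PBS -l select=1:ncpus={ncpus}:mem={mem_gb}gb",
        f"#PBS -l walltime={walltime}",
        # Run from the directory the job was submitted from.
        "cd $PBS_O_WORKDIR",
        f"g16 < {input_file} > {output_file}",
    ]
    return "\n".join(lines) + "\n"

script = pbs_script(
    "example_job", ncpus=8, mem_gb=16,
    walltime="24:00:00", input_file="job.gjf",
)
print(script)
```

A Web form that collects these few values and writes the script on the user's behalf removes most of the need to learn the command line interface.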

The process of using this tool, which functions in essence as an Electronic Laboratory Notebook or ELN for computational chemistry, can be summarised as a workflow, which occurs horizontally in the screenshot of V1 above. Each job is assigned an internal ID, which is associated with a pre-configured project and given a searchable description. Its status in the PBS-controlled queues is indicated and when finished the associated input and output files become available for download, with an option to delete these if they are not in fact needed, and a final option to publish to the accompanying tool which is a data repository. V1 of this portal was in fact written in the PHP scripting language and controlled behind the scenes using a MySQL database, which allows the entries to be filtered by search terms such as the assigned project or the description. This proved particularly useful when the number of entries reached large numbers (> 100,000 eventually) and meant that even 15-year old entries could be easily found and inspected!

Although this workflow proved highly robust, the underlying PHP system and associated code became increasingly unmaintainable and in 2021 we decided to refactor it for greater sustainability. We had noticed that in 2018, another group had taken the basic concept we had used in 2006, written a more flexible and portable open-source toolkit for building such a portal, called it Open OnDemand: A Web-based client portal for HPC centers, and published a description.[2] In effect, a lot of the work in maintenance is now divested to a separate group and accordingly our software engineering group here at Imperial were far happier using such a tool. So now enter V2 of our own portal, which we now call HPC Access and Metadata Portal or CHAMP.

The workflow is very much the same as before, but with added flexibility that allows custom resources to be selected which might include eg special grant-funded priority queues. Additionally, a new directory tool allows inspection of any job inputs or outputs, provided by the Open OnDemand package and which greatly facilitates minute-to-minute management/inspection of jobs to ensure the outputs are those expected for a properly functioning job.

If the job is deemed suitable for sharing, the publish button is pressed. This induces a workflow which, inter alia, converts the system-specific checkpoint file to a formatted version which can be used on any system and generates a number of extra files needed for publication of the job.

Also of interest is the METADATA file, which generates calculation-specific metadata suitable for injection into the data repository. Currently, this includes the InChI string and Key for the molecule calculated and the Gibbs_Energy, the purpose of which was described in this post. In the future we plan to make this metadata even richer with further information. This calculation-specific metadata will later be conflated with generic metadata for the final publication on the actual repository. That full metadata record includes information about the person who ran the job (their ORCID etc), the institution they are at, the data licensing etc., garnered in part from the profile entry for that user on the CHAMP portal.
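
The conflation of the two kinds of metadata can be sketched as follows. The layout of the METADATA file is not described in this post, so the dictionary structure, the function names merge_metadata and subjects_xml, and the choice of fields are all assumptions made for illustration; the values and scheme URIs are those quoted in the earlier posts on this dataset.

```python
import xml.etree.ElementTree as ET

# Calculation-specific metadata (assumed structure).
calculation = {
    "subjects": [
        {"scheme": "Gibbs_Energy",
         "uri": "https://doi.org/10.1351/goldbook.G02629",
         "value": "-1705.490787"},
        {"scheme": "inchikey",
         "uri": "http://www.inchi-trust.org/",
         "value": "VELNVPXNOKVVTC-VJKZSTDTSA-N"},
    ],
}

# Generic metadata taken from the user's profile (assumed structure).
generic = {
    "creator": "Rzepa, Henry S.",
    "orcid": "https://orcid.org/0000-0002-8635-8390",
}

def merge_metadata(generic, calculation):
    """Combine profile-level and calculation-level metadata."""
    record = dict(generic)
    record.update(calculation)
    return record

def subjects_xml(record):
    """Express the subject entries as a <subjects> XML fragment."""
    root = ET.Element("subjects")
    for item in record["subjects"]:
        node = ET.SubElement(
            root, "subject",
            {"subjectScheme": item["scheme"], "schemeURI": item["uri"]},
        )
        node.text = item["value"]
    return ET.tostring(root, encoding="unicode")

record = merge_metadata(generic, calculation)
xml_fragment = subjects_xml(record)
print(xml_fragment)
```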

After publication, the CHAMP entry for the job is updated to include the DOI for the data publication, and hyperlinked to allow immediate access to this entry in the repository.

An information page about the job also includes a link to the final full published metadata record(s).

CHAMP currently includes workflows to publish to the Imperial College repository. Zenodo has also now been added and possibly other repositories in the future as demand requires.

You can see here that I have described how an ELN was originally designed from scratch to control quantum calculations, and how an essential symbiotic partner to this resource was considered to be a data repository at the outset, even way back in 2006. Now, the first of these resources has been refactored into modern form and no doubt the repository end will also be in the future. The code is available for anyone to create a similar compute portal for themselves.

A different version of this description, including more details of the software engineering, will shortly be submitted to the Journal of Open Source Software, along with source code suitable for use with Open OnDemand at https://github.com/ImperialCollegeLondon/hpc_portal/.


Originally in the form of a Handle, which was replaced by the use of a DOI. The DOI for this post itself is 10.14469/hpc/9010

References

  1. M.J. Harvey, N.J. Mason, and H.S. Rzepa, "Digital Data Repositories in Chemistry and Their Integration with Journals and Electronic Notebooks", Journal of Chemical Information and Modeling, vol. 54, pp. 2627-2635, 2014. https://doi.org/10.1021/ci500302p
  2. D. Hudak, D. Johnson, A. Chalker, J. Nicklas, E. Franz, T. Dockendorf, and B. McMichael, "Open OnDemand: A web-based client portal for HPC centers", Journal of Open Source Software, vol. 3, pp. 622, 2018. https://doi.org/10.21105/joss.00622