Chemical IT « Henry Rzepa's blog

Archive for the ‘Chemical IT’ Category

Tautomeric polymorphism.

Thursday, June 1st, 2017

Conformational polymorphism occurs when a compound crystallises in two polymorphs differing only in the relative orientations of flexible groups (e.g. Ritonavir). At the Beilstein conference, Ian Bruno mentioned another type: tautomeric polymorphism, where a compound can crystallise in two forms differing in the position of acidic protons. Here I explore three such examples.

The term occurs in the title of this article,[1] for a compound known as Omeprazole.

When the bottom structure (the 6-methoxy) is used to search the CSD, two separate series are found. The first of these is UDAVIF (DOI: 10.5517/ccp82qq, 6-Methoxy-2-((4-methoxy-3,5-dimethyl-2-pyridinyl)methylsulfinyl)-1H-benzimidazole). There is no information regarding the absolute configuration of the chiral S-centre. Although the downloaded coordinates show it as R, it is probably a racemic mixture. A note added to the structure declares disorder: “Omeprazole exists as solid solutions of the two tautomers. The structure is mixed 5-methoxy/6-methoxy with occupancies 0.078:0.922”, which indicates that 7.8% is present as the upper structure above.

The second hit is VAYXOI (DOI: 10.5517/ccp82pp, rac-6-Methoxy-2-(((4-methoxy-3,5-dimethyl-2-pyridinyl)methyl)sulfinyl)-1H-benzimidazole) which now contains no disorder; the contaminating 5-methoxy tautomer is no longer present. Perhaps not quite a true tautomeric polymorph, since the 5-methoxy tautomer is never observed in pure form.

True tautomeric polymorphism does occur in a second example. DEBFAR[2] is the keto form (on the right), which crystallises from methanol, whilst YUYDOL, the enol form (on the left), crystallises from n-hexane.

Calculations shed some light on this behaviour. DEBFAR has a computed (DOI: 10.14469/hpc/2591) dipole moment of 11D, whereas YUYDOL (DOI: 10.14469/hpc/2590) is 2.5D. In chloroform solutions (~half way between the two solvent polarities), the keto form is ~6.1 kcal/mol lower in ΔG than the enol. The crystal packing for the two forms is very different and the differences in this packing must clearly amount to >6.1 kcal/mol to override the lesser stability of the enol form (YUYDOL) in solution.
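To put a free-energy gap of ~6.1 kcal/mol into perspective, it can be converted into an equilibrium population ratio using K = exp(ΔG/RT). The following is a minimal Python sketch, assuming a simple two-state equilibrium and a temperature of 298.15 K (the temperature is an assumption; the post does not state one). The function name is illustrative only.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def population_ratio(delta_g_kcal, temperature_k=298.15):
    """Ratio [more stable form]/[less stable form] for a two-state equilibrium.

    delta_g_kcal is the (positive) free-energy gap in kcal/mol.
    The default temperature is an assumed room temperature.
    """
    return math.exp(delta_g_kcal / (R_KCAL * temperature_k))

ratio = population_ratio(6.1)
print(f"population ratio for a 6.1 kcal/mol gap: {ratio:.3g}")
```

Under these assumptions the ratio is of the order of tens of thousands to one, so the less stable tautomer would be only a trace component in solution; this is the scale of preference that differences in crystal packing would have to overcome.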

The final example [3] is illustrated using scheme 2 from that article, one entitled tautomeric species of 4-hydroxynicotinic acid:

The original diagram has two unfortunate bond errors which are NOT reproduced above (and which perhaps are a good topic for discussion in tutorials with students), along with an unusual interpretation of the term tautomerism. The blue arrows above are mine and I suggest the isomerism between the connected species is resonance isomerism, and not tautomerism. So three possible different true tautomers then. Five crystal structures are reported which I list below.

10.5517/cctswjz (KUXPUP, 4-oxo-1,4-dihydropyridine-3-carboxylic acid, no H₂O), 10.5517/ccdc.csd.cc1kfyxv (KUXPUP01 no H₂O) and 10.5517/ccdc.csd.cc1kfyzx (KUXPUP02 no H₂O)
10.5517/ccx59s4 (AVEMUK, 4-Oxo-1,4-dihydropyridine-3-carboxylic acid hemihydrate) and 10.5517/ccdc.csd.cc1kfz21 (AVEMUK01)
10.5517/ccdc.csd.cc1kfz54 (AKIHIN, 4-hydroxypyridin-1-ium-3-carboxylate monohydrate)
10.5517/ccdc.csd.cc1kfz10 (AKIHAF, 4-hydroxypyridin-1-ium-3-carboxylate)

KUXPUP and AVEMUK differ only in the presence of one solvent water molecule and both represent tautomer 2 above. AKIHIN and AKIHAF similarly represent tautomer 3 above; both are represented as 3a in the CSD and not as 3b. There are no examples of tautomer 1 in the crystal structure database; it may only exist in the gas phase. So the equilibrium 2 ⇌ 3 is another genuine example of tautomeric polymorphism, with the keto form favoured by more polar solvents, as was noted for the previous example.

With this last article,[3] comprehensive calculations at a good level were reported, including modelling the periodic cell using the Crystal program and including corrections such as BSSE (basis set superposition error) and dispersion terms. I was hopeful that this might lead me to something as simple as the computed dipole moments of the (isolated) species (as I reported above for the previous system), but these were not mentioned in the text of the article. Unfortunately, the supporting information also had no details of any such calculations, which left me frustrated again at how difficult it can be, in (it has to be said) the vast majority of articles which report calculations, to get hold of the details of those calculations.

Tautomeric polymorphism remains a very rare phenomenon. SciFinder for example only has 19 references citing it (2 of which are to conference talks). Perhaps the most intriguing[4] claims that 2-thiobarbituric acid has the richest collection of tautomeric polymorphs with five. Since no calculations are reported there, I might try these out and report back here.

Postscript: Here is some analysis of 2-thiobarbituric acid.

THBARB (DOI 10.5517/cctbxcd, 10.5517/cctbxfg and 10.5517/cctbxgh) are three polymorphs of the keto tautomer, the isolated molecule having a small calculated dipole moment (DOI: 10.14469/hpc/2632).
PABNAJ (DOI: 10.5517/cctbxbc) is a polymorph in the enol form, with a much larger calculated dipole moment (DOI: 10.14469/hpc/2633)
PABNIR (DOI: 10.5517/cctbxdf) is a mixed polymorph with one enol paired with one keto form.

The relative free energies of the isolated molecules are 0.0 (keto) and 9.0 (enol) kcal/mol. The keto-enol pair is 0.4 kcal/mol more stable than the isolated components. This again shows the effect that crystal packing can have on the relative energies and also shows that a simple inspection of the dipole moment may cast light on the polymorphism.

References

P.M. Bhatt, and G.R. Desiraju, "Tautomeric polymorphism in omeprazole", Chemical Communications, pp. 2057, 2007. https://doi.org/10.1039/b700506g
Y. Akama, M. Shiro, T. Ueda, and M. Kajitani, "Keto and Enol Tautomers of 4-Benzoyl-3-methyl-1-phenyl-5(2H)-pyrazolone", Acta Crystallographica Section C Crystal Structure Communications, vol. 51, pp. 1310-1314, 1995. https://doi.org/10.1107/s0108270194007389
S. Long, M. Zhang, P. Zhou, F. Yu, S. Parkin, and T. Li, "Tautomeric Polymorphism of 4-Hydroxynicotinic Acid", Crystal Growth & Design, vol. 16, pp. 2573-2580, 2016. https://doi.org/10.1021/acs.cgd.5b01639
M. Chierotti, L. Ferrero, N. Garino, R. Gobetto, L. Pellegrino, D. Braga, F. Grepioni, and L. Maini, "The Richest Collection of Tautomeric Polymorphs: The Case of 2‐Thiobarbituric Acid", Chemistry – A European Journal, vol. 16, pp. 4347-4358, 2010. https://doi.org/10.1002/chem.200902485


Challenges in reliably representing the chemistry of crystal structures.

Monday, May 29th, 2017

The title here is taken from a presentation made by Ian Bruno from CCDC at the recent conference on Open Science. It also addresses the theme here of the issues that might arise in assigning identifiers for any given molecule.

The structure was represented as shown[1] by the original authors, in which the bonding from S to Sn is indicated with both solid lines (a bond) and dotted lines (an “interaction”).

Why would this matter? Well, to make any entry in the Cambridge structure database findable (the F of FAIR), it has to be given a unique identifier. There are in general three such identifiers assigned by the CCDC:

The Refcode, in this case XONHIS. These six or seven letter codes are historically the oldest, and started off at least with an attempt if possible to assign some semantic inference from the name, even if only occasionally.
The CCDC deposition number, in this case 650011. This is the number that an author will receive immediately upon deposition, and you often find these identifiers quoted in supporting information files.
The DOI (digital object identifier), in this case 10.5517/ccptd3z, which can be used to view the structure even if access to the full CSD is not available to the user. In that sense, the DOI is the FAIRest of the first three of these identifiers.
However, CCDC reported that they are considering adding a 4th very common identifier, based on the InChI (International chemical identifier), which comes as a full string and with the structure of the molecule at least in part inferrable from it, together with a shortened (almost) unique string which has the advantage of being “Googlable”. Both are helpfully FAIR.
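The shortened, searchable string mentioned above is presumably the InChIKey (the post does not name it). A standard InChIKey has a fixed layout of 14 upper-case letters, a hyphen, 10 letters, a hyphen and one final letter, so its syntax is easy to check by machine. The following is a minimal Python sketch of such a check; the function name is hypothetical, and a syntactic match says nothing about whether the key corresponds to a real structure.

```python
import re

# Layout of a standard InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")

def looks_like_inchikey(text):
    """Return True if text has the syntactic shape of an InChIKey."""
    return INCHIKEY_RE.fullmatch(text.strip()) is not None

# The InChIKey of water has the expected shape; a CSD Refcode does not.
print(looks_like_inchikey("XLYOFNOQVPJJNP-UHFFFAOYSA-N"))
print(looks_like_inchikey("XONHIS"))
```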

It is this 4th identifier that is at issue here. InChIs are derived from atom connection tables; you need to define all bonds present in the molecule. And it is here that the dotted “bond”/“interaction” above becomes a problem. This is the representation shown in the CSD database, which reveals that all the Sn…S interactions are classified as “bonds”, along with some creative(!) representations of the C…S bonds.

So the InChI will very much depend on whether all the Sn…S contacts are termed as bonds or as interactions. To help clarify that, it is useful to show the typical range of lengths of such contacts. Below is a simple search for all Sn and S systems where the pair are either close in space (< 3.5Å) or have a bond specified between the two atoms.

The main cluster occurs at ~2.5Å, but there is some evidence of a second peak at about 3.0Å. The third distribution up to 3.5Å is probably a continuum of very weak dispersion interaction, which most molecules exhibit. The values for XONHIS are 2.521 and 2.996Å, which match the two clusters above.

So perhaps a quantum calculation can shed some light (DOI: 10.14469/hpc/2593)? The values on the right are the optimised bond lengths which are pretty similar to the crystal structure. On the left are the calculated Wiberg bond orders (B3LYP+D3BJ/Def2-TZVPP/chloroform calculation). These reveal both “bonds” have an order less than 1. The value of ~0.6 is probably not contentious, but it does graphically show that when a compound is indexed as having a “single bond” between two atoms, the quantitative bond order may be substantially less. What however would one make of a bond order of 0.214? Should it be classified as a bond, albeit a much weaker one than normal? Or should it instead simply be a rather strong “interaction” which is not classified as a bond? And perhaps one should have in mind the question “how sensitive is this result to the quantum mechanical procedure used?”

Why does this distinction matter? Well, the InChI algorithm is based on simple connectivity; are two atoms connected by a bond or not? There are no nuances here. At the moment, this decision can be made by an algorithm based on the distance between any atom pair (whether computed or measured), but more often I suspect it derives from a “molfile” which is often derived from a human-drawn representation using a structure drawing program. It does rather boil down to the individual preferences of the human drawing the molecule. Due in part to such uncertainties, it was estimated that only 22% of structures in the CSD can be used to generate a reliable InChI. Hydrogen bonds are almost always classified as non-bonds, which means their presence is rarely systematically flagged during the indexing of the structures. Organometallics often pose some of the greatest representational problems (there are many others).
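The all-or-nothing nature of a distance-based connectivity decision can be illustrated with a short Python sketch. It is not the procedure used by the CCDC or by the InChI software; it assumes the common rule of thumb that two atoms are bonded if their separation is no more than the sum of their covalent radii plus a tolerance. The radii (Sn 1.39 Å, S 1.05 Å) and the 0.4 Å tolerance are assumed, illustrative values.

```python
# Illustrative covalent radii in Å (assumed values for this sketch only).
COVALENT_RADII = {"Sn": 1.39, "S": 1.05}

def is_bonded(elem_a, elem_b, distance, tolerance=0.4):
    """Binary connectivity test: bonded if distance <= radii sum + tolerance."""
    cutoff = COVALENT_RADII[elem_a] + COVALENT_RADII[elem_b] + tolerance
    return distance <= cutoff

# The two Sn...S distances reported for XONHIS in the post.
for d in (2.521, 2.996):
    label = "bond" if is_bonded("Sn", "S", d) else "no bond"
    print(f"Sn...S {d:.3f} Å -> {label}")
```

With these assumed parameters the shorter contact counts as a bond and the longer one does not, but raising the tolerance to 0.6 Å would classify both as bonds. The resulting connection table, and hence the InChI, therefore depends on an essentially arbitrary threshold.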

I will end by observing another class of structure that I deal with, “reaction transition states”. As you might imagine these forms are full of pairs of atoms with ambiguous bond lengths and hence connectivity. We currently have no truly reliable method for assigning useful identifiers to them. So lots of challenges for the future then!

References

R. Reyes-Martínez, R. Mejia-Huicochea, J.A. Guerrero-Alvarez, H. Höpfl, and H. Tlahuext, "Synthesis, heteronuclear NMR and X-ray crystallographic studies of two dinuclear diorganotin(IV) dithiocarbamate macrocycles", Arkivoc, vol. 2008, pp. 19-30, 2007. https://doi.org/10.3998/ark.5550190.0009.503


Curating a nine year old journal FAIR data table.

Monday, May 29th, 2017

As the Internet and its Web-components age, so early pages start to decay as technology moves on. A few posts ago, I talked about the maintenance of a relatively simple page first hosted some 21 years ago. In my notes on the curation, I wrote the phrase “Less successful was the attempt to include buttons which could be used to annotate the structures with highlights. These buttons no longer work and will have to be entirely replaced in the future at some stage.” Well, that time has now come, for a rather more crucial page associated with a journal article published more recently in 2009.[1]

The story started a few days ago when I was contacted by the learned society publisher of that article, noting they were “just checking our updated HTML view and wanted to test some of our old exceptions”. I should perhaps explain what this refers to. The standard journal production procedures involve receiving a Word document from authors and turning that into XML markup for the internal production processes. For some years now, I have found such passive (i.e. printable only) Word content unsatisfactory for expressing what is now called FAIR (Findable, accessible, inter-operable and re-usable) data. Instead, I would create another XML expression (using HTML), which I described as Interactive Tables and then ask the publisher to host it and add that as a further link to the final published article. I have found that learned society publishers have not been unwilling to create an “exception” to their standard production workflows (the purely commercial publishers rather less so!). That exceptional link is http://www.rsc.org/suppdata/cp/b8/b810301a/Table/Table1.html but it has now “fallen foul of the java deprecation”.

Back in 2008 when the table was first created, I used the Java-based Jmol program to add the interactive component. That page, when loaded, now responds with the message:

This I must emphasise is nothing to do with the publisher, it is the Jmol certificate that has been revoked. That of itself requires explanation. Java is a powerful language which needs to be “sandboxed” to ensure system safety. But commands can be created which can access local file stores and write files out there (including potentially dangerous ones). So it started to become the practice to sign the Java code with the developer certificate to ensure provenance for the code. These certificates are time-expired and around 2015 the time came to renew it. Normally, when such a certificate is renewed, the old one is allowed to continue operation. On this occasion the agency renewing the certificate did not do this but revoked the old one instead (Certificate has been revoked, reason: CESSATION_OF_OPERATION, revocation date: Thu Oct 15 23:11:18 BST 2015). So all instances of Jmol with the old certificate now give the above error message.

The solution in this case is easy; the old Jmol code (as JmolAppletSigned.jar) is simply replaced with the new version for which the certificate is again valid. But simply doing that alone would merely have postponed the problem; Java is now indeed deprecated for many publishers, which is a warning that it will be prohibited at some stage in the future.^‡ So it was time to bite the bullet and remove the dependency on Java-Jmol, replacing it with JSmol, which uses only JavaScript.

Changing published content is in general not allowed; one instead must publish a corrigendum. But in this instance, it is not the content that needs changing but the style of its presentation (following the principle of the Web of a clear-cut separation of style and content). So I set out to update the style of presentation, but I was keen to document the procedures used. I did this by commenting out non-functional parts of the style components of my original HTML document (as <!-- comment -->) and adding new ones. I describe the changes I made below.

The old HTML contained the following initialisation code: jmolInitialize(".","JmolAppletSigned.jar");jmolSetLogLevel('0'); which was commented out.
New scripts to initialize instead JSmol were added, such as:
<script src="JSmol.min.js" type="text/javascript"> </script>
I added further scripts to set up controls to add interactivity.
The now deprecated buttons had been invoked using a Jmol instance: jmolButton('load "7-c2-h-020.jvxl";isosurface "" opaque; zoom 120;',"rho(r) H")
which was replaced by the JSmol equivalent, but this time to produce a hyperlink rather than a button (to allow the greek ρ to appear, which it could not on a button): <a href="javascript:show_jmol_window();Jmol.script(jmolApplet0,'load 7-c2-020.jvxl;isosurface "" translucent;spin 3;')">ρ(r)</a>,
Some more changes were made to another component of the table, the links to the data repository. Originally, these quoted a form of persistent identifier known as a Handle; 10042/to-800. Since the data was deposited in 2008, the data repository has licensed further functionality to add DataCite DOIs to each entry. For this entry, 10.14469/ch/775. Why? Well, the original Handle registration had very little (chemically) useful registered metadata, whereas DataCite allows far richer content. So an extra column was added to the table to indicate these alternate identifiers for the data.
We are now at the stage of preparing to replace the Java applet at the publisher's site with the JavaScript version, along with the amended HTML file. The above link, as I write this post, still invokes the old Java, but hopefully it will shortly change to function again as a fully interactive table.
I should say that the whole process, including finding a solution and implementing it, took 3-4 hours' work, of which the major part was the analysis rather than the implementation.
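Both kinds of persistent identifier mentioned in the list above (the original Handle and the later DataCite DOI) can be turned into resolvable links by prefixing them with the address of a public resolver. The following is a minimal Python sketch, assuming the public resolvers https://hdl.handle.net/ and https://doi.org/; the helper function names are hypothetical.

```python
from urllib.parse import quote

def doi_url(doi):
    """Build a resolver URL for a DOI, percent-encoding any unsafe characters."""
    return "https://doi.org/" + quote(doi, safe="/")

def handle_url(handle):
    """Build a resolver URL for a Handle."""
    return "https://hdl.handle.net/" + quote(handle, safe="/")

# The two identifiers quoted in the post for the same repository entry.
print(handle_url("10042/to-800"))
print(doi_url("10.14469/ch/775"))
```

Keeping the bare identifier in the table and generating the link from it means that only the resolver prefix needs changing if resolution services move in the future.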

It might be interesting to speculate how long the curated table will last before it too needs further curation. There are some specifics in the files which might be a cause for worry, namely the so-called JVXL isosurfaces which are displayed. These are currently only supported by Jmol/JSmol. They were originally deployed because iso-surfaces tend to be quite large datafiles and JVXL used a remarkably efficient compression algorithm (“marching cubes”) which reduces their size ten-fold or more. Should JSmol itself become non-operational at some time in the (hopefully) far future (which we take to be ~10 years!) then a replacement for the display of JVXL will need to be found. But the chances are that the table itself will decay “gracefully”, with the HTML components likely to outlive most of the other features. The data repository quoted above has itself now been available for ~12 years and it too is expected to survive in some form for perhaps another 10. Beyond that period, no-one really knows what will still remain.

You may well ask why the traditional journal model of printing articles on paper, which has survived some 350 years now, is being replaced by one which struggles to survive 10 years without expensive curation. Obviously, a 3D interactive display is not possible on paper. But one also hears that publishers are increasingly dropping printed versions entirely. One presumes that the XML content will be assiduously preserved, but re-working (transforming, as in XSLT) any particular flavour of XML into another publisher's systems is also likely to be expensive. Perhaps in the future the preservation of 100% of all currently published journals will indeed become too expensive and we might see some of the less important ones vanishing for ever?^†

^‡Nowadays it is necessary to configure your system or Web browser to allow even signed valid Java applets to operate. Thus in the Safari browser (which still allows Java to operate; other popular browsers such as Chrome and Firefox have recently removed this ability), one has to go to preferences/security/plugin-settings/Java, enter the URL of the site hosting the applet and set it to either “ask” (when a prompt will always appear asking if you want to accept the applet) or “on” (when it will always do so). How much longer this option will remain in this browser is uncertain.

^†In the area of chemistry, an early pioneer was the Internet Journal of Chemistry, where the presentation of the content took full advantage of Web-technologies and was on-line only. It no longer operates and the articles it hosted are gone.

References

H.S. Rzepa, "Wormholes in chemical space connecting torus knot and torus link π-electron density topologies", Phys. Chem. Chem. Phys., vol. 11, pp. 1340-1345, 2009. https://doi.org/10.1039/b810301a


Conference report: an example of collaborative open science (reaction IRCs).

Thursday, May 25th, 2017

It is a sign of the times that one travels to a conference well-connected. By which I mean email is on a constant drip-feed, with venue organisers ensuring each delegate receives their WiFi password even before their room key. So whilst I was at a conference espousing the benefits of open science, a nice example of open collaboration was initiated as a result of a received email.^‡

Steven Kirk contacted me with the following query: “Do you know of any open-access database of calculated IRCs with coverage of as broad a range of classes of chemical reactions as possible?” I recollected that about six years ago, I was exploring the use of iTunesU as a system for delivering course content in a rich-media format. I produced animations for about 115 reactions (many of which, as it happens, were taken from this blog, but quite a number were also unique to that project) and placed them into iTunesU, so I sent the URL https://itunes.apple.com/gb/course/id562191342 to Steven.

I should at this point explain something of the structure of such an iTunesU course.

An essential feature is the course icon, seen below on the left. Since the course is hosted by Imperial College, it had to be an officially approved icon. I am sure you can believe me if I tell you that this took a month or so to obtain, with a fair bit of persistence required!
I also had to get approval to place the iTunes app on all the teaching computers so that students could open the course. Believe me again when I tell you that I had to persuade the Apple lawyers in Cupertino to release a special license for this app to persuade our administrators here to install it on the Windows teaching clusters. Another few months had passed by.
When creating an entry (using e.g. https://itunesu.itunes.apple.com/coursemanager/ ) one has to specify values for various descriptors, also often called metadata. Thus any one entry has fields for name and description, with the popularity added by Apple. Only a few words are visible in the description field, which can be expanded in iTunes using the i button.
Steven meanwhile had replied asking if the original data that was used to generate the IRC might be available. Specifically his second question was “So the DOIs are only stamped into the animation’s bitmaps, or are they also somewhere in the metadata?”. That little i button is not easy to spot, and there is no indication, in the event, of what information it might actually contain.
Here it is expanded. The contents are unstructured text, into which I have placed the required DOI.
The lesson here is that I had fortunately had the foresight to include a link to the IRC data in anticipation of just such a question from someone in the future. But black mark to Apple here; the text cannot be selected and copied into a clipboard! It is fairly unFAIR data, since it can only be inter-operated (the I of FAIR) by a human re-typing it by hand. And the human has also to recognise the pattern of a DOI; a machine could not obtain this information easily. Moreover Steven is a Linux user; he does not readily have access to the iTunes app on this operating system!
Also, there were 115 such entries, and now the prospect was rearing that each would have to be hand processed. Moreover, because the text was unstructured, there was no guarantee that I would have adopted the same pattern for all 115 entries.
Fortunately Steven was on the ball. I quote again: “it turns out iTunes isn’t needed at all. A service I found on the web http://picklemonkey.net/feedflipper-home/ takes an ITunes URL and converts it to an RSS feed. Opening this feed in Firefox and RSSOwl respectively let me save the feed as XML and HTML (both attached).”
This is currently where we stand (Steven’s first email was two days ago), but it’s not finished yet. Depending on how assiduous I was five years ago, some DOIs to the data may be acquired from the list. Sometimes I simply wrote e.g. See http://www.ch.imperial.ac.uk/rzepa/blog/?p=6816 knowing that the links to the data were there instead. I can already see that some descriptions have neither a DOI nor a link to the blog. More detective work will be needed, unfortunately.

How might the situation described above have been avoided? Well, Apple in iTunesU provided in effect only one metadata field, and this was an unstructured one. Anything went in that field. Had they provided another field (or had the course creator been able to configure one themselves), it might have been entitled, say, “data source”. This could moreover have been made a mandatory field and a structured one. Thus it might have only accepted known types of persistent identifier, such as a DOI. Further, the system could have checked that the DOI was actually resolvable. Before you ask, I did log a “bug” with Apple asking this be done, but nothing ever was. With such a tool to hand, I might have achieved data sources for all the 115 entries. The resulting XML (as generated above) could have been used to automate the retrieval of all 115 datasets describing this course.
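Even with unstructured descriptions, part of the salvage work on an exported RSS feed can be automated by scanning each entry for strings shaped like a DOI and flagging the entries that lack one. The following is a minimal Python sketch under several assumptions: the feed uses the ordinary RSS item, title and description elements, the regular expression is a pragmatic approximation of DOI syntax rather than a complete one, and the sample feed with its example DOI is entirely hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Pragmatic DOI-like pattern (an approximation; real DOIs allow more characters).
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s"<>]+')

def dois_by_item(rss_xml):
    """Map each RSS item title to the DOI-like strings found in its description."""
    root = ET.fromstring(rss_xml)
    found = {}
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        # Strip trailing punctuation that the pattern may have swallowed.
        found[title] = [m.rstrip(".,;)") for m in DOI_RE.findall(description)]
    return found

# Hypothetical feed fragment, for illustration only.
SAMPLE = """<rss><channel>
<item><title>Reaction A</title>
<description>IRC data at doi: 10.1234/example.5678.</description></item>
<item><title>Reaction B</title>
<description>See the blog post for details</description></item>
</channel></rss>"""

for title, dois in dois_by_item(SAMPLE).items():
    print(title, dois if dois else "NO DOI FOUND")
```

Entries reported without a DOI are the ones needing manual detective work; a structured, validated field at the time of entry would have made such a scan unnecessary.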

At this stage then, Steven can follow-up his interest in building a reaction IRC library and analysing it. I will do all I can to encourage Steven not to make the mistakes I did and to ensure that any further data that is required to augment the library does not suffer the problems above. On the other hand, I console myself that in two days, much of the data for the course I created five years ago was salvageable; I wonder how many other iTunesU courses there are for which that can be said!

I will let (with some blushing) the final word be Steven’s: “You are one of the few chemists who has both pioneered and built the principles of ‘open chemistry’ into their actual scientific work. I visit your blog occasionally knowing that there is a very high probability I could download and tinker with the results of real calculations.”

^‡Might I assure all the speakers that I concentrated totally on their talks rather than incoming emails!


Conference report: OPEN SCIENCE AND THE CHEMISTRY LAB OF THE FUTURE

Tuesday, May 23rd, 2017

This is taking place in the idyllic surroundings of the Niederwald forest, Rüdesheim, Germany. Here I highlight only aspects of the first three talks.

Martin Hicks introduced the conference with concepts such as the global public good. In the area of open access, he reminded us of the term Platinum/Diamond open access, which describes journals with no article processing charges (these can reach £5000 per article for some other OA journals), but which brings the challenge of finding enough gatekeepers of this global public good to avoid being overwhelmed. He ended by asking us all to consider what the unit of knowledge is that needs to be shared.

The first talk was by Klaus Tochtermann who (amongst other topics) brought to our attention the Dutch GoFAIR initiative in the European Open Science Cloud, sub-divided into Go-train (i.e. data experts, who will build e.g. metadata tools) and Go-build (eco-systems: Internet of FAIR data and FAIR services). I think the message is that all organisations with chemistry labs should consider this as being an essential part of their future infrastructures.

Jeremy Frey’s talk, entitled Reducing Uncertainty: The Raison d’Être for Open Science, defined the fundamental principles of open science as transparency, capability and obtainability and encouraged data publication at source (as opposed to e.g. during the PhD writing-up period) to ensure fidelity in the capture of metadata.

The team of Leah McEwen, Ian Bruno, Stuart Chalk and Richard Kidd told us about Global Data Initiatives and Chemistry and the need for social and technical bridges to enable open data sharing. I learnt for example that the IUPAC Gold book of chemical terms and definitions now has DOIs for each of the terms. Thus chemical shift (DOI: 10.1351/goldbook.C01036[1]), spectroscopy (DOI: 10.1351/goldbook.S05848[2]) and electron density function (DOI: 10.1351/goldbook.ET07024[3]). I will now associate such links with e.g. deposited NMR data to help increase the semantics of the data (see e.g. DOI: 10.14469/hpc/1975).

Finally, a photo from the region, taken from the gondola adjacent to the venue and riding down to the small town on the banks of the Rhine.

References

"chemical shift", The IUPAC Compendium of Chemical Terminology, 2014. https://doi.org/10.1351/goldbook.c01036
"spectroscopy", The IUPAC Compendium of Chemical Terminology, 2014. https://doi.org/10.1351/goldbook.s05848
"electron density function", The IUPAC Compendium of Chemical Terminology, 2014. https://doi.org/10.1351/goldbook.et07024


The challenges in curating research data: one case study.

Friday, April 28th, 2017

Research data (and its management) is rapidly emerging as a focal point for the development of research dissemination practices. An important aspect of ensuring that such data remains fit for purpose is identifying what curation activities need to be associated with it. Here I revisit one particular case study associated with the molecular structure of a product identified from a photolysis reaction[1] and the curation of the crystallographic data associated with this study.

This particular dataset (CSD, dataDOI: 10.5517/cctnx5j) is associated with an article entitled “Single-Crystal X-ray Structure of 1,3-Dimethylcyclobutadiene by Confinement in a Crystalline Matrix”.[1] Data for crystal structures supporting a research article must (at least in part) be deposited in the Cambridge structure database (here with internal reference MUWMEX), where a significant level of curation is performed. Although the definition of the term curation has evolved over the last few years, here I take it to include the following:

Identification of appropriate metadata describing the data. For molecules, this would include any identifiers such as the name of the molecule and the connectivities of the atoms constituting that molecule.
The submission of this metadata to a suitable aggregator, such as DataCite, and its inclusion in any other databases associated with the data. These two tests are part of the FAIR data guidelines[2], covering the F (findable) and A (accessible).
Performing any validation tests for the data that can be identified. With crystal structure data in CIF format, this is defined by the utility checkCIF and helps to ensure the I (inter-operable) of FAIR. The R refers in part to the licenses under which the data can be re-used.

On (it has to be said, rare) occasions, these procedures can lead to a disparity between the authors’ conclusions, arrived at on the basis of their acquired data, and the metadata identified by the independent curators. This difference is most obviously illustrated in this case study by the chemical names inferred by the curation process for the structure represented by the data in the CSD:

chemical name: “tetrakis(Guanidinium) 25,26,27,28-tetrahydroxycalix(4)arene-5,11,17,23-tetrasulfonate 1,5-dimethyl-2-oxabicyclo[2.2.0]hex-5-en-3-one clathrate trihydrate“
chemical name synonym: “tetrakis(Guanidinium) tetra-p-sulfocalix(4)arene 1,3-dimethylcyclobutadiene carbon dioxide clathrate trihydrate“.

Only the synonym agrees with the title given by the original authors in their publication.[1] One might indeed strongly argue that these two names are not in fact synonyms, since they refer to quite different chemical structures with different atom connectivities. A search of the database for the sub-structure corresponding to 1,3-dimethylcyclobutadiene does not reveal any hits and so the information implied by this synonym is not recorded in the index created for the CSD database.
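One reason the two names can both be offered for the same crystallographic data is that they describe the same set of atoms connected differently. The short sketch below makes that bookkeeping explicit. The molecular formulae are my own working from the names (they are assumptions of the sketch, not values taken from the CSD entry), and the helper name `same_composition` is purely illustrative.

```python
from collections import Counter

# Formulae derived by hand from the chemical names; not read from the CSD entry.
# 1,5-dimethyl-2-oxabicyclo[2.2.0]hex-5-en-3-one
BICYCLIC_LACTONE = Counter({"C": 7, "H": 8, "O": 2})
# 1,3-dimethylcyclobutadiene
DIMETHYLCYCLOBUTADIENE = Counter({"C": 6, "H": 8})
CARBON_DIOXIDE = Counter({"C": 1, "O": 2})


def same_composition(a: Counter, b: Counter) -> bool:
    """True when two atom inventories contain exactly the same atoms."""
    return a == b


# The "synonym" describes a cyclobutadiene in contact with carbon dioxide.
synonym_species = DIMETHYLCYCLOBUTADIENE + CARBON_DIOXIDE

print(dict(synonym_species))
print(same_composition(BICYCLIC_LACTONE, synonym_species))
```

Because the atom counts are identical, a check on composition alone cannot separate the two interpretations; only the atom connectivities (and hence the sub-structure index) distinguish them.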

I asked the scientific editors of the CSD for some guidance on the curation procedures applied to crystal structure datasets and they have kindly allowed me to quote some of this.

“In cases such as this, we as editors are sometimes faced with conflicting information and have to try our best to strike a balance between the data presented in the CIF, a published interpretation and our knowledge based on the information already in the CSD”.
“In areas where there is a particular conflict between these, we often would include a comment (usually in the Remarks or Disorder field as appropriate)”. For this particular dataset, one finds the following under the Disorder field:
- “Under UV radiation the clathrated pyrone molecule converts to a disordered mixture of square-planar 1, 3-dimethylcyclobutadiene and rectangular-bent 1, 3-dimethylcyclobutadiene in van der Waals contact with a carbon dioxide molecule. The ratio of the square-planar to rectangular-bent 1, 3-dimethylcyclobutadiene clathrate is modelled with occupancies 0.6292:0.3708”.
- It is not entirely obvious however whether this last comment originates from the original authors or from the data curators. It does not resolve the difference between the assigned chemical name and the indicated chemical name synonym.
“In the case of MUWMEX, I think that the editor produced a diagram (below) which seems chemically reasonable based on the crystallographic data with which we were provided and tried to cover the situation regarding disorder, van der Waals contacts etc in the ‘Disorder’ field. At this point, it is left to the CSD user to decide for themselves.”

We have arrived at a point where the CSD user must indeed decide what the species described by this dataset actually is. Ideally, the best recourse would be to acquire the original data in full and repeat the crystallographic analysis. Such re-analysis is not conducted as part of the current curation processes; it would require as a minimum that a superset of the data, known as the hkl information, be present. Again, to quote the CSD scientific editors:

“With regard to your question: Is there any mechanism in the Conquest search to identify structures where the hkl information is present? I understand that it is not currently possible to do this in ConQuest. It is, however, possible … to access structure factor data (where available) using Access Structures.”

For MUWMEX, the hkl information is not present in the CSD dataset; in 2010, when the structure was published, it would have had to be obtained directly from the authors. By 2016 however, its presence in deposited datasets was becoming far more common. It is worth pointing out that even the hkl information is not the complete data recorded for the experiment. That is represented by the original image files recording the X-ray diffraction. These are hardly ever available as FAIR data even nowadays.
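For a CIF file already downloaded, a rough local test for the presence of hkl information is possible. The following is a minimal sketch, assuming that embedded reflection data announces itself through the data names `_shelx_hkl_file` or `_refln_index_h`; other depositions may use different names, so the marker list should be treated as a starting point rather than a definitive rule. The two sample strings are invented for illustration.

```python
import re

# Data names that commonly signal embedded reflection (hkl) data in a CIF.
# Treating these two as sufficient is an assumption of this sketch.
HKL_MARKERS = ("_shelx_hkl_file", "_refln_index_h")


def has_hkl(cif_text: str) -> bool:
    """Return True if any marker occurs as a data name at the start of a line."""
    for marker in HKL_MARKERS:
        pattern = rf"^\s*{re.escape(marker)}\b"
        if re.search(pattern, cif_text, flags=re.MULTILINE | re.IGNORECASE):
            return True
    return False


# Invented fragments, standing in for real CIF files.
with_hkl = "data_example\n_cell_length_a 10.0\n_shelx_hkl_file\n;\n   1   0   0  12.3  0.4\n;\n"
without_hkl = "data_example\n_cell_length_a 10.0\n"

print(has_hkl(with_hkl), has_hkl(without_hkl))
```

Such a check only reports whether the reflections travel with the file; it says nothing about whether they are sufficient to repeat the refinement.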

I hope I have here illustrated at least some of the challenging aspects of curating scientific data and the issues that can arise when derived metadata (in this case the name and the atom connectivities of a molecule) reveal conflicts with the original interpretations. This is for an area of chemistry where both data deposition and its curation are very mature, having operated for ~52 years now. It is still a process that requires the intervention of skilled curators of the data, but perhaps even more importantly it reveals the need to identify even more strictly what the provenance of the interpretations is. Should the CSD curation rest merely at the stage of teasing out and flagging inconsistencies and allowing the user to then take over to resolve the conflicts? Should it be more active, in re-analyzing data for each entry where conflicts have been detected? Perhaps the latter is not practical now, but it might be in the near future. What is certain is that with increasing availability of FAIR data these sorts of issues will increasingly come to the fore. And not just for the very well understood case of crystallographic data but for many other types of data.

References

Y. Legrand, A. van der Lee, and M. Barboiu, "Single-Crystal X-ray Structure of 1,3-Dimethylcyclobutadiene by Confinement in a Crystalline Matrix", Science, vol. 329, pp. 299-302, 2010. https://doi.org/10.1126/science.1188002
M.D. Wilkinson, M. Dumontier, I.J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J. Boiten, L.B. da Silva Santos, P.E. Bourne, J. Bouwman, A.J. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C.T. Evelo, R. Finkers, A. Gonzalez-Beltran, A.J. Gray, P. Groth, C. Goble, J.S. Grethe, J. Heringa, P.A. ’t Hoen, R. Hooft, T. Kuhn, R. Kok, J. Kok, S.J. Lusher, M.E. Martone, A. Mons, A.L. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van Schaik, S. Sansone, E. Schultes, T. Sengstag, T. Slater, G. Strawn, M.A. Swertz, M. Thompson, J. van der Lei, E. van Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wolstencroft, J. Zhao, and B. Mons, "The FAIR Guiding Principles for scientific data management and stewardship", Scientific Data, vol. 3, 2016. https://doi.org/10.1038/sdata.2016.18


Supporting information: chemical graveyard or invaluable resource for chemical structures.

Friday, March 31st, 2017

Nowadays, data supporting most publications relating to the synthesis of organic compounds is more likely than not to be found in associated “supporting information” rather than the (often page limited) article itself. For example, this article[1] has an SI which runs to 907 pages; almost a mini-database in its own right! Here I ponder whether such dissemination of data is FAIR (findable, accessible, interoperable and re-usable).[2]

I am going to use this article as my starting point.[3] One of the compounds discussed there is shown below; it is not explicitly discussed in the main body of the article. So how findable is it?

A search of Scifinder (Chemical abstracts) using the structure above reveals one hit, the source being the expected one.[3]
A search of Reaxys (used to be Beilstein) reveals no hits in their own database, but one hit is noted in …
Pubchem, where it occurs as substance 163835830. The source is again cited correctly[3]. One of the properties reported is the InChI key: JSLVVAICXSKSEQ-UHFFFAOYSA-N. This is the same key generated from the structure drawing programs Chemdraw or ChemDoodle.
Google on the other hand finds nothing for JSLVVAICXSKSEQ-UHFFFAOYSA-N.[4]
I also tried Google Scholar but again with no luck.
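Since the InChI key is the identifier doing the work in these searches, it can be useful to check that a string has the expected shape before pasting it into a search box. The sketch below is a format check only, assuming the usual 14-10-1 block layout of a standard InChIKey; it does not regenerate the key from a structure, and the function name `split_inchikey` is my own.

```python
import re

# 14 letters, then 8 letters + flag (S = standard, N = non-standard) + version
# letter, then a single protonation letter. A shape check, not a re-computation.
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def split_inchikey(key: str):
    """Return the blocks of a well-formed InChIKey as a dict, or None."""
    match = INCHIKEY_PATTERN.match(key.strip())
    if match is None:
        return None
    connectivity, other_layers, flag, version, protonation = match.groups()
    return {
        "connectivity": connectivity,   # hash of the molecular skeleton
        "other_layers": other_layers,   # hash of stereo, isotopes etc.
        "standard": flag == "S",
        "version": version,
        "protonation": protonation,
    }


blocks = split_inchikey("JSLVVAICXSKSEQ-UHFFFAOYSA-N")
print(blocks)
```

A search using only the first block would find any entries sharing the same connectivity, which is sometimes a more forgiving query than the full key.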

So supporting information does appear to be indexed by both Chemical Abstracts and Pubchem; it is thankfully not a graveyard![5] The chemical databases do return valuable additional information about the molecule, such as its InChI key and much else besides. Given that presumably the open PubChem resource IS indexed by Google, it must be a policy somewhere that prevents e.g. JSLVVAICXSKSEQ-UHFFFAOYSA-N from being found.

I suppose the next question might be Supporting information: chemical graveyard or invaluable resource for chemical spectra? I confess here that this post was in fact inspired by a previous one on the topic of the provenance of NMR spectra. And perhaps also with some input from the concept of sonification of spectra, in which an instrumental spectrum is converted into a sound signature to allow blind people access to such information.^‡ I wonder whether a sonified unique digital signature could be used to search for spectra, somewhat in the manner that InChI helped in tracking down (or not) the molecule above? I think it would be reasonable to say that e.g. NMR spectra as embedded in say a 907 page supporting information document are likely to be very much less FAIR[2]. The solution there of course is better provenance and better metadata, as I previously mulled.

^‡I cannot help but wonder what a carbonyl group sounds like!

References

J.M. Lopchuk, K. Fjelbye, Y. Kawamata, L.R. Malins, C. Pan, R. Gianatassio, J. Wang, L. Prieto, J. Bradow, T.A. Brandt, M.R. Collins, J. Elleraas, J. Ewanicki, W. Farrell, O.O. Fadeyi, G.M. Gallego, J.J. Mousseau, R. Oliver, N.W. Sach, J.K. Smith, J.E. Spangler, H. Zhu, J. Zhu, and P.S. Baran, "Strain-Release Heteroatom Functionalization: Development, Scope, and Stereospecificity", Journal of the American Chemical Society, vol. 139, pp. 3209-3226, 2017. https://doi.org/10.1021/jacs.6b13229
M.D. Wilkinson, M. Dumontier, I.J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J. Boiten, L.B. da Silva Santos, P.E. Bourne, J. Bouwman, A.J. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C.T. Evelo, R. Finkers, A. Gonzalez-Beltran, A.J. Gray, P. Groth, C. Goble, J.S. Grethe, J. Heringa, P.A. ’t Hoen, R. Hooft, T. Kuhn, R. Kok, J. Kok, S.J. Lusher, M.E. Martone, A. Mons, A.L. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van Schaik, S. Sansone, E. Schultes, T. Sengstag, T. Slater, G. Strawn, M.A. Swertz, M. Thompson, J. van der Lei, E. van Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wolstencroft, J. Zhao, and B. Mons, "The FAIR Guiding Principles for scientific data management and stewardship", Scientific Data, vol. 3, 2016. https://doi.org/10.1038/sdata.2016.18
G.M.S. Yip, Z. Chen, C.J. Edge, E.H. Smith, R. Dickinson, E. Hohenester, R.R. Townsend, K. Fuchs, W. Sieghart, A.S. Evers, and N.P. Franks, "A propofol binding site on mammalian GABAA receptors identified by photolabeling", Nature Chemical Biology, vol. 9, pp. 715-720, 2013. https://doi.org/10.1038/nchembio.1340
S.J. Coles, N.E. Day, P. Murray-Rust, H.S. Rzepa, and Y. Zhang, "Enhancement of the chemical semantic web through the use of InChI identifiers", Organic & Biomolecular Chemistry, vol. 3, pp. 1832, 2005. https://doi.org/10.1039/b502828k
M. Karthikeyan, and R. Vyas, "ChemEngine: harvesting 3D chemical structures of supplementary data from PDF files", Journal of Cheminformatics, vol. 8, 2016. https://doi.org/10.1186/s13321-016-0175-x


The provenance of scientific data – establishing an audit trail.

Thursday, March 30th, 2017

In an era when alternative facts and fake news afflict us, the provenance of scientific data becomes ever more important. This is especially so if that data is available as open access and exploitable by others, both for valid scientific reasons and potentially by those with other motives. Here I consider the audit trail that might serve to establish data provenance in one typical situation in chemistry, the acquisition of NMR instrumental data.

Here I describe how such data is generated in my department; details may vary elsewhere.

The prospective user of the NMR service is allocated a service ID. In our case, that ID relates to the research group rather than to individual researchers. This ID is parochial; it does not reference any other information about the user in the institute. Only the service manager has the information to associate this ID with real users and this information is normally not distributed.
When a sample is submitted, this ID is used to create a new folder containing the data as a sub-folder of the group ID and located on the NMR data servers.
The dataset itself^‡ contains a number of files that contain an audit trail (names such as audita.txt, auditp.txt) with the fields: ##AUDIT TRAIL= $$ (NUMBER, WHEN, WHO, WHERE, PROCESS, VERSION, WHAT). Typically, none of these files have propagated the original user ID under which the data was collected; to do so would require a programmatic connection between the local authentication systems and the spectrometer software used, a connection that is normally missing. Thus the first break in the provenance trail.
In principle other audit trails can be inferred from these files, such as the unique identity of the instrument provided by its manufacturer. Further information, such as the probe used to collect the data (probes can be readily changed over) or any calibration data used in setting up the instrument for the data collection, is by and large not recorded. To my knowledge, although an instrument can have a unique serial number, the serial numbers of swappable components such as probes are not recorded by the collection software. Thus the second break in the provenance trail.
This data then needs to be processed by further software. In this case we use the MestreNova system for this task. Each dataset has editable assigned properties; below I show those that can be associated with the spectrum (accessed with MestreNova using Edit/Properties). All this comes from the information collected by the instrument. The user’s identity can be inserted into the “title” field, the display of which is off by default.
There is also a section for parameters, a synonym for which might be metadata and accessed using this program from View/Tables/Parameters. If Author was entered as a parameter in the dataset by the spectrometer software, the Mnova document would retrieve that information. Equally, an ORCID identifier for the author entered at the time of data collection and thus stored in the dataset could be read by Mnova, stored and displayed if configured to do so. It would be fair to say however that this option is rarely if indeed ever systematically implemented by NMR instrument data collection software and so is never propagated to the data processing software (as highlighted in red below). Thus a third break in the provenance trail.
There is also an alternative and this time formal metadata field that can be populated, by default as shown below with the type of spectrum and nucleus. These properties are not controlled in the sense of only allowing those terms that are present in a specified dictionary. The jargon for such control is a metadata schema. This is not used here, since dissemination of this information is not intended; the software accepts whatever information it is given.
There are thus several opportunities to collect the identity of the experimenter and thus attribute provenance to the collected data, but this does very much depend on the will of researchers, institutions or publishers to enforce specific policies around this. The fourth break in the provenance trail.
The dataset can then be uploaded (DOI: 10.14469/hpc/1291), at which stage provenance can finally be added using the ORCID credentials of the person publishing the dataset, who of course may or may not be the person who actually recorded the data! The full metadata for this specific collection can be seen at data.datacite.org/10.14469/hpc/1291. Or to put it another way, this is the first point in the provenance chain where the metadata is controlled by a schema and is also discoverable in a standard programmatic manner, i.e. via the preceding link. The provenance is now formally associated with the ORCID identifier using the DataCite metadata schema. You should be aware that a local policy^† is that access to the repository at https://data.hpc.imperial.ac.uk is only allowed by cross-authentication with http://orcid.org/ using the user’s ORCID. This identifier is then automatically propagated to the metadata held at e.g. data.datacite.org/10.14469/hpc/1095. Currently however, none of the metadata originally recorded in either the instrumental file set or the processed MestreNova file is forwarded on to the metadata record held at DataCite; again a loss of information and potentially of provenance.
The peer-reviewed article resulting from the interpretation of this data however can be associated with the provenance introduced in the previous stage; see data.datacite.org/10.14469/hpc/1267 and the IsReferencedBy property.
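To make the first break in the trail concrete, here is a minimal sketch of reading the WHO field from an audit trail file. Only the field names (NUMBER, WHEN, WHO, WHERE, PROCESS, VERSION, WHAT) come from the description above; the angle-bracket layout of the sample entry, and every value in it, are assumptions invented for illustration and may not match what a given spectrometer writes.

```python
import re

FIELDS = ("NUMBER", "WHEN", "WHO", "WHERE", "PROCESS", "VERSION", "WHAT")

# Assumed entry layout: "(  n,<value>,<value>,...)" with values in angle brackets.
ENTRY = re.compile(r"\(\s*(\d+)\s*,((?:\s*<[^>]*>\s*,?)+)\s*\)")

# Hypothetical file content; the user "grp42" is a group-level service ID.
SAMPLE = """##AUDIT TRAIL=  $$ (NUMBER, WHEN, WHO, WHERE, PROCESS, VERSION, WHAT)
(   1,<2017-03-01 10:15:00.000 +0000>,<grp42>,<spect500>,<go4>,<TOPSPIN 3.2>,
      <created by zg>)
"""


def parse_audit_trail(text: str):
    """Return one dict per audit entry, keyed by the field names above."""
    entries = []
    for match in ENTRY.finditer(text):
        values = [match.group(1)]
        values += [v.strip() for v in re.findall(r"<([^>]*)>", match.group(2))]
        if len(values) == len(FIELDS):  # skip anything that does not fit
            entries.append(dict(zip(FIELDS, values)))
    return entries


entries = parse_audit_trail(SAMPLE)
print(entries[0]["WHO"])
```

In this invented example the WHO field holds only the parochial service ID, so nothing in the file itself links the data to an individual researcher.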

Now imagine if there was a common thread in all the stages of acquiring, processing and publishing this scientific data based on the ORCID.

Providing an ORCID could be made an essential requirement of access to the instrument.
This information would be propagated to the dataset …
by inclusion in one or more of the audit trail files.
At this stage, further persistent identifiers associated with the instrument manufacturer could be added, which help identify not only the instrument used, but sub-components such as the changeable probe. This would allow access to any calibration curves or probe sensitivity and other aspects.
The ORCID and other relevant information could be picked up by the software used to convert the data into spectra and propagated into the metadata containers for this software …
where its use is controlled by a specified schema.
At this stage, the ORCID and information such as the nucleus recorded, the sample temperature etc can be propagated on to the final metadata records.
And the reader of the article describing this work would have a formally defined provenance audit trail they could follow back to the start of the experiment or forward to a published article. In this case, the data claims provenance (acquired from peer review) from the article, but it should also work in reverse with the article claiming provenance from the data on which it is based. The indexing of this bidirectional exchange is one of the exciting features that we should see emerging from CrossRef (holders of metadata about articles) and DataCite (holders of metadata about research data) in the near future.
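If an ORCID is to be carried through every stage, each piece of software in the chain would at least want to reject mistyped identifiers. ORCID identifiers end in a check character computed with the ISO 7064 MOD 11-2 scheme, and the sketch below verifies it. The function names are my own, and the identifier used as the example is the sample one from ORCID's documentation rather than that of anyone involved in this work.

```python
def orcid_check_char(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """True if the string is 16 characters (hyphens ignored) with a correct check character."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_char(compact[:15]) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # True
print(is_valid_orcid("0000-0002-1825-0098"))  # False
```

A check of this kind catches typing errors only; establishing that the identifier belongs to the person at the instrument still needs authentication against ORCID itself, as the repository described above does.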

We are clearly a little way from having the infrastructures described above for establishing such data audit trails. To do so will require cooperation from instrument manufacturers, at least in the example as charted above, as well as researchers, institutions, publishers, peer-reviewers and funding bodies. The first step would be to ensure that all scientists who intend to collect, process and publish data claim an ORCID. That remark is directed specifically at undergraduate, postgraduate and post-doctoral researchers, not just at their supervisor or their PI (principal investigator). At a point when the discussion about alternative facts and perhaps even alternative data risks a general loss of confidence in science, we should be pro-active in establishing trust in the scientific processes.

^‡ You can see an example obtained by this process at DOI: 10.14469/hpc/1095

^† This requirement is a strong driver for the uptake of ORCID amongst our student population.


A nice example of open data (in London).

Sunday, March 5th, 2017

Living in London, travelling using public transport is often the best way to get around. Before setting out on a journey one checks the status of the network. Doing so today I came across this page: our open data from Transport for London.

I learnt that by making TFL travel data openly available, some 11,000 developers (sic!) have registered for access, out of which some 600 travel apps have emerged.
The data is in XML, which makes it readily inter-operable.[1]
This encourages crowd-sourced innovation.
They have taken the trouble to produce an API (application programming interface) which allows rich access to the data and information about e.g. AccidentStats, AirQuality, BikePoint, Journey, Line, Mode, Occupancy, Place, Road, Search, StopPoint, Vehicle.
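To give a flavour of how such an API is addressed, here is a small sketch that only constructs request URLs from the resource names listed above; it makes no network calls. The base URL and the example path (`Line/victoria/Status`) are my assumptions about how the service is laid out, and the helper `build_url` is hypothetical, so the TfL documentation should be consulted before relying on any of it.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.tfl.gov.uk"  # assumed base address of the open API

# Resource names as listed on the TfL open data page.
RESOURCES = {
    "AccidentStats", "AirQuality", "BikePoint", "Journey", "Line", "Mode",
    "Occupancy", "Place", "Road", "Search", "StopPoint", "Vehicle",
}


def build_url(resource: str, *path: str, **query: str) -> str:
    """Assemble a request URL for one of the listed resources."""
    if resource not in RESOURCES:
        raise ValueError(f"unknown resource: {resource}")
    parts = [resource] + [quote(str(p), safe="") for p in path]
    url = BASE_URL + "/" + "/".join(parts)
    if query:
        url += "?" + urlencode(sorted(query.items()))
    return url


print(build_url("Line", "victoria", "Status"))
```

The point is less the code than the pattern: a predictable, documented address for each kind of data is what lets so many developers build on it.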

Chemists could learn some lessons here! Of course, there are quite a few chemical databases with APIs that are examples of open data, but the “ESI” (electronic supporting information) sources which almost all published articles rely upon to disseminate data are clearly struggling to cope. Take for example this recent article[2], where much of the data has been dropped into the inevitable PDF “coffin” and which is a breathtaking 907 pages long. To give the authors their due, they also provide 20 CIF files which ARE good sources of data. Rarely commented on, but clearly missing from the information associated with this article (and indeed with most articles) is the metadata about the data. Thus the metadata for these CIF files amounts to just e.g. 229. To find out the context, one has to scour the article (or the 907 pages of the ESI) to identify compound 229 (I strongly suspect it’s a molecule because of the implied semantics of the term, not because it’s been explicitly declared). You will not find the metadata at e.g. data.datacite.org which is one open aggregator and global search engine based on deposited metadata.

I have commented elsewhere on this blog that other types of data could also be enhanced in the manner that CIF crystallographic files represent. For example the Mpublish NMR project,^‡ examples of which are shown here, and for which typical data AND its metadata can be seen at DOI: 10.14469/hpc/1053. I fancy that if this method had been adopted,[2] those 907 pages might have shrunk somewhat, although of course not entirely. But my hope is that gradually the innovative chemistry community will find ways of exhuming more and more data from the PDF coffin and in the process reducing the paginated lengths of the PDF-based ESI further, perchance eventually even to zero?

If you are yourself preparing an article and sweating over the ESI at this very moment, do please take a look at the Mpublish method and how perhaps it can help make your NMR data at least more useful to others.

^‡I understand an article describing this project is in preparation. If you cannot wait, this recent application of the Mpublish project has some details.[3]

References

P. Murray-Rust, and H.S. Rzepa, "Chemical Markup, XML, and the Worldwide Web. 1. Basic Principles", Journal of Chemical Information and Computer Sciences, vol. 39, pp. 928-942, 1999. https://doi.org/10.1021/ci990052b
J.M. Lopchuk, K. Fjelbye, Y. Kawamata, L.R. Malins, C. Pan, R. Gianatassio, J. Wang, L. Prieto, J. Bradow, T.A. Brandt, M.R. Collins, J. Elleraas, J. Ewanicki, W. Farrell, O.O. Fadeyi, G.M. Gallego, J.J. Mousseau, R. Oliver, N.W. Sach, J.K. Smith, J.E. Spangler, H. Zhu, J. Zhu, and P.S. Baran, "Strain-Release Heteroatom Functionalization: Development, Scope, and Stereospecificity", Journal of the American Chemical Society, vol. 139, pp. 3209-3226, 2017. https://doi.org/10.1021/jacs.6b13229
M.J. Harvey, A. McLean, and H.S. Rzepa, "A metadata-driven approach to data repository design", Journal of Cheminformatics, vol. 9, 2017. https://doi.org/10.1186/s13321-017-0190-6

