Posts Tagged ‘semantic web’

Shared space (in science).

Friday, January 6th, 2012

I thought I would launch the 2012 edition of this blog by writing about shared space. If you have not come across it before, it is (to quote Wikipedia), “an urban design concept aimed at integrated use of public spaces.” The BBC here in the UK ran a feature on it recently, and prominent in examples of shared space in the UK was Exhibition Road. I note this here on the blog since it is about 100m from my office.

Shared space is the Mornington Crescent of urban design, where you have to work out the rules of the game by, in effect, participating in it. Thus the new “rules” of travelling down Exhibition Road (by either foot, car, bike, bus or indeed motorbike as I do each day) are not declared, and each participant works them out on the fly. This is supposed to lead to fewer misunderstandings, although the practice does seem rather different (at least at the moment). But where is the chemistry? Well, these thoughts were triggered by two colleagues independently asking me about how chemists use metaphors, and how chemists use representations. I have in fact touched upon both of these previously, and it struck me that this last example, of arrow pushing in organic chemistry, was in fact a nice example of a shared space in chemistry. The rules of arrow pushing are not formally set out (in an IUPAC rule book or similar) but are worked out on the hoof so to speak. Except that the space is shared only by organic chemists. I have observed over the years that e.g. physical or inorganic chemists will mostly not dare venture into that shared space; they often give a rather good impression of not understanding the rules. I also know from experience that mathematicians and physicists regard arrow pushing as anything other than a shared (scientific) space.

Yet the modern scope and ethos of science is that we should all venture into shared spaces (whether they are in or out of our comfort zones). Perhaps, in science, the problem is that so much of what we do has what I refer to as “implicit semantics” (it's part of our DNA of e.g. being a chemist). Take for example the diagram below (which I used previously) which sets out four possible sets of rules for this particular shared space. Even so, without further explanation, you might be struggling to infer what message is carried by this diagram. That is because so much of it contains implicit semantics, and unless you recognise the missing features, how can you go about finding out what is invisible?

Curly arrow pushing

My concluding thought would be that shared space is what the semantic web is surely striving for. And if Exhibition Road is anything to go by, it is clearly quite a challenge. But if I (and particularly the pedestrians I encounter there each day) end up surviving 2012, perhaps the Semantic Web may one day come about as well!

Validating the chemical literature heritage. Eudesma-1,3-dien-6,13-olide.

Thursday, December 8th, 2011

Previously, I had noted that Corey reported in 1963/65 the total synthesis of the sesquiterpene dihydrocostunolide. Compound 16, known as Eudesma-1,3-dien-6,13-olide, was represented as shown below in black; the hydrogen shown in red was implicit in Corey’s representation, as was its stereochemistry. As of this instant, this compound is just one of 64,688,893 molecules recorded by Chemical Abstracts. How can we, in 2011, validate this particular entry, and resolve the stereochemical ambiguity? Here I discuss one approach (a vision if you like of the semantic web).

The following facts are asserted about 16:

  1. Its connection table, namely what atoms are connected by at least a single bond.
  2. The (presumed) absolute stereochemistry at four stereogenic centres, leaving the 5th (in red) either unknown or implicit. I say presumed because, when it is not known which of two possible enantiomers a scalemic molecule exists as, often just one is drawn, in essence as a guess.
  3. The 1H NMR chemical shifts of 13 of the 20 hydrogen atoms present in the molecule (the solvent used is unreported, and may be implicitly chloroform).
  4. [α]D +375° (no solvent reported)
  5. m.p. 69.5-70.5° (note by the way that the units represented by the symbol ° are quite different for these two facts! A scientist of course can easily recognise the implicit difference)
  6. λmax (methanol) 265 mµ, ε 4800 (note again the ambiguity in the units, in fact 265 mµ is nowadays written 265 nm and the molar extinction coefficient ε is assumed to be expressed in units of L mol−1 cm−1).
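
The implicit units in facts 4 to 6 can be made explicit in a small record. What follows is only a minimal sketch in Python: the class, the field names and the unit strings are my own illustrative choices rather than any standard schema, and reading the melting point “°” as degrees Celsius is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """A number that carries its unit explicitly."""
    value: float
    unit: str


def millimicron_to_nm(q: Quantity) -> Quantity:
    """1 mµ (millimicron) is 1e-3 µm = 1e-9 m, i.e. exactly 1 nm."""
    if q.unit != "mµ":
        raise ValueError(f"expected mµ, got {q.unit}")
    return Quantity(q.value, "nm")


# Facts 4-6 for compound 16 with the implicit semantics spelled out.
facts_16 = {
    # Specific rotation, quoted in degrees of angle by convention.
    "specific_rotation": Quantity(375.0, "deg (angle)"),
    # Melting point range; degC is assumed, the source only writes "°".
    "melting_point_low": Quantity(69.5, "degC"),
    "melting_point_high": Quantity(70.5, "degC"),
    "lambda_max": millimicron_to_nm(Quantity(265.0, "mµ")),
    # Unit assumed, as noted in fact 6.
    "epsilon": Quantity(4800.0, "L mol-1 cm-1"),
}

print(facts_16["lambda_max"])
```

With the units attached, a machine no longer has to guess that the two “°” symbols mean different things.
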
Can we use these facts to validate the structure of 16 and to resolve its stereochemical ambiguity? Well, modern computational quantum chemistry can (inter alia) supply the following:
  1. From a given connection table, an accurate prediction of the 3D coordinates of all the atoms for, in this case, either of the stereoisomers involving the hydrogen shown in red.
  2. The 1H NMR shifts relative to TMS, to an accuracy of better than 0.5 ppm (often very much better).
  3. [α]D
  4. λmax (methanol) and an approximate estimate of ε.
How do things pan out? We model the more specific stereoisomer shown below, with complete stereochemical notation (CIP) now annotated in.

  1. The 1H NMR was calculated at a ωB97XD/6-311G(d,p) optimised geometry and a single point 6-311++G(d,p) wavefunction. I have linked the “DOI” identified for this calculation to this post so that the calculation itself can be verified by others. It comes out (in ppm) δ 1.02 [0.98, 3H,s], 1.17 [1.15, 3H, d], 2.11 [1.95, 3H, s], 3.85 [3.79, 1H, dd], 5.75, 6.13, 6.30 [5.2-6.0, vinyl], the reported experimental values being in square brackets […].
  2. The spin-spin couplings were calculated using the NMR(spinspin,mixed) model implemented in Gaussian (a specification for which is found in the online documentation of the NMR keyword). For δ 3.79, two couplings of 10 Hz are reported. The calculation predicts 9.77 and 9.53 Hz (for assignments, click on the image above to get a 3D model).
  3. [α]D +391° (calculated for chloroform)
  4. λmax 265 nm (calculated for methanol; ε ~4800 for a linewidth of 3600 cm−1).
  5. Strictly speaking, all of the above should be repeated for the other possible stereoisomer, and the results for the two together analysed statistically.
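
The agreement between calculated and reported 1H shifts in item 1 can be summarised numerically. This is a small sketch in Python using only the four shifts for which a single experimental value is quoted; the choice of statistics (mean absolute error and maximum deviation) is mine, and a proper analysis would also include the other stereoisomer, as item 5 notes.

```python
# (calculated, experimental) 1H shifts in ppm, taken from item 1 above.
# The vinyl protons are omitted because only a range (5.2-6.0) was reported.
shifts = [
    (1.02, 0.98),
    (1.17, 1.15),
    (2.11, 1.95),
    (3.85, 3.79),
]


def deviations(pairs):
    """Absolute calculated-minus-experimental deviation for each pair."""
    return [abs(calc - exp) for calc, exp in pairs]


devs = deviations(shifts)
mae = sum(devs) / len(devs)
max_dev = max(devs)
print(f"MAE = {mae:.2f} ppm, max deviation = {max_dev:.2f} ppm")
# prints: MAE = 0.07 ppm, max deviation = 0.16 ppm
```

Both numbers sit comfortably inside the 0.5 ppm accuracy quoted earlier.
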
Can we add data to the original information (a process which might be called curation)? Well, we can use the above calculations to:
  1. provide estimated chemical shifts and coupling constants for ALL the protons in the molecule, not just the 13 reported by Corey, and for all the carbons (no 13C spectrum was reported). Advances in spectrometer sensitivity and resolution mean that if these spectra were ever to be (re)measured, the additional protons could probably be easily identified, and both homo and heteronuclear spin-spin couplings measured.
  2. predict the electronic circular dichroism spectrum for 16 (not previously measured) and in particular the Cotton effect on the λmax 265 nm absorption as being positive (Δε ~+20). This would allow the absolute configuration of this scalemic molecule to be independently validated. We could add to this a prediction of the vibrational circular dichroism spectrum if need be.
  3. What we cannot easily do is predict the melting point (or indeed the crystal packing), although no doubt this will become more reliable in the future.
So what is the big picture? In the earlier post, I had identified a key article in the development of the electronic theory of pericyclic reactions, and in particular how the inferred stereochemistry of 10, 13 and 16 could have been used as the spark that ignited that theory. It would have been essential to ensure that these stereochemical foundations were absolutely sound. In this case of course, the compounds were related to many others by synthetic transformations, and the very fabric of the connections between these molecules served as a validation of the nature of the molecules.

But think how many (millions) of such molecules have been discovered, and how the majority of these have probably not been subjected to such rigorous scrutiny. It is entirely possible that much of the chemical literature is sprinkled with errors in assignments (and many more have unresolved ambiguities, such as the stereochemistry of the hydrogen shown in red at the top of this post). However, for the first time in the history of chemistry, we can now (almost routinely) use quantum modelling to provide independent validation of the chemical literature, as illustrated above. Of course, the validation is not absolute, merely probable to some degree (the above example we might agree shows a very high level of probability that the structure shown is in fact correct). More importantly, in computational validation, we have the potential for automation. One might strive for an infrastructure where much of the validation can be performed automatically, by tireless machines that operate 24/7, and that only flag probable errors when they discover them. This is the vision of the chemical semantic web!
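
To give a flavour of what such a tireless machine might do, here is a deliberately naive sketch in Python. The function name and the tolerances are hypothetical and chosen purely for illustration; only the reported and calculated values for 16 come from this post.

```python
def flag_probable_errors(entry, tolerances):
    """Return the names of properties whose reported and calculated
    values disagree by more than the allowed tolerance."""
    flagged = []
    for name, (reported, calculated) in entry.items():
        if abs(reported - calculated) > tolerances[name]:
            flagged.append(name)
    return flagged


# (reported, calculated) values for compound 16, from the lists above.
compound_16 = {
    "specific_rotation_deg": (375.0, 391.0),
    "lambda_max_nm": (265.0, 265.0),
}

# Hypothetical tolerances, not taken from any published protocol.
tolerances = {
    "specific_rotation_deg": 50.0,
    "lambda_max_nm": 15.0,
}

print(flag_probable_errors(compound_16, tolerances))  # []

# An invented case: the calculation is run on the opposite enantiomer,
# so the sign of the rotation flips and the entry is flagged.
wrong_enantiomer = dict(compound_16, specific_rotation_deg=(375.0, -391.0))
print(flag_probable_errors(wrong_enantiomer, tolerances))
# ['specific_rotation_deg']
```

A real system would need properly calibrated error estimates for each property, which is where most of the hard work lies.
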

Scalemic molecules: a cheminformatics challenge!

Wednesday, July 6th, 2011

A scalemic molecule is the term used by Eliel to describe any non-racemic chiral compound. Synthetic chemists imply it when they describe a synthetic product with an observable enantiomeric excess or ee (which can range from close to 0% to almost 100%). There are two cheminformatics questions of interest to me:

  1. How many non-trivial scalemic molecules have been reported in the literature (let’s assume their ee is significantly greater than 0%)?
    • The distribution function for the ee of these molecules would be most interesting!
  2. Of those, how many have the absolute configuration of the predominant enantiomer established with high confidence?
    • Or, to put this another way, how many may prove to be mis-assigned?

Note the careful qualification in the above questions. Thus by non-trivial, I mean compounds whose scalemic attributes persist in solution for a chemically useful duration. That could be taken to mean configurationally stable chiral molecules, rather than those that might be conformationally chiral (an example of a trivial scalemic molecule would be e.g. the twist-boat conformation of cyclohexane, which having D2 symmetry is dissymmetric, but which would only retain its scalemic property for a trivially short timescale).
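
For reference, the enantiomeric excess mentioned at the start is simply the difference between the amounts of the two enantiomers as a percentage of their total. A minimal Python sketch follows (the function and argument names are mine, and the amounts are invented examples):

```python
def enantiomeric_excess(amount_a: float, amount_b: float) -> float:
    """Return ee as a percentage from the amounts of the two enantiomers."""
    total = amount_a + amount_b
    if total <= 0:
        raise ValueError("amounts must sum to a positive number")
    return 100.0 * abs(amount_a - amount_b) / total


print(enantiomeric_excess(95.0, 5.0))   # 90.0
print(enantiomeric_excess(50.0, 50.0))  # 0.0 (racemic, so not scalemic)
```
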

What are boundary values? These are some:

  • As I write this, CAS records 61,257,703 chemical substances. Needless to say (unless I missed it), the answer to my first question is not to be found there.
  • Beilstein (Reaxys) records 1,126,995 compounds as having one or more reported chiroptical properties (which is the most direct way of establishing a molecule is scalemic, although strictly, having say an optical rotation of 0° does not necessarily mean the molecule is not scalemic). We have no way of knowing how many molecules are scalemic for which no chiroptical measurement has been made (but one would hope it's a small proportion). Perhaps that is a good answer to question 1?
    • of which 1,097,094 relate to optical rotatory power, 17,515 to optical rotatory dispersion and 62,248 to electronic circular dichroism.
    • it is more difficult to answer how many of these 1,126,995 substances have a firmly established absolute configuration. Measuring a chiroptical property per se does NOT in itself establish the absolute configuration. Doing so is a fascinating exercise in sequential logical argument, and how one does it has changed quite a lot over time. And what might I mean with high confidence? An older assignment (made say > 40 years ago) might be less confident than one established in 2011 (fortunately, we can probably trust the absolute configurations of the amino acids!). A bit of a can of worms, nevertheless. But it interests me because it is a good example of what the semantic web is supposed to be all about.
  • The Cambridge crystallographic database reports 560,307 entries, of which 72,340 are in chiral space groups (in which a chiral molecule can crystallise) and exhibit no disorder or other errors. We do not know how many of these are non-trivial, since all manner of small (and low energy) distortions can create a chiral species (in the solid state), but which would not persist for a chemically useful duration in solution (i.e. it might for example immediately racemize and become non-scalemic).
  • The Flack parameter has been used since 1983 for enantiomorph estimation (a value of ~≤ 0.10(10) would be considered meaningful). This could in principle provide an answer of known confidence to my question 2 above (but would not address the issue of non-triviality).
    • The challenge now is to quantify how many compounds have a meaningful reported Flack parameter (presumably a sub-set of 72,340?)
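
A little arithmetic on the boundary values above is instructive. The sketch below (Python, using only the counts quoted in this post) shows that the three chiroptical sub-totals add up to more than the overall total, so some compounds must carry more than one kind of measurement, and that roughly one in eight of the crystallographic entries falls in a chiral space group.

```python
total_chiroptical = 1_126_995
by_property = {
    "optical rotatory power": 1_097_094,
    "optical rotatory dispersion": 17_515,
    "electronic circular dichroism": 62_248,
}

# Property records in excess of one per compound.
excess = sum(by_property.values()) - total_chiroptical
print(excess)  # 49862

csd_entries = 560_307
csd_chiral = 72_340
print(f"{100 * csd_chiral / csd_entries:.1f}%")  # 12.9%
```
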

Let me declare one personal interest. Over the last four years or so, we have been asked to confirm the absolute configuration of around eight scalemic molecules. After a detailed study, we concluded three were mis-assigned. Now this in no way implies anything about what the answer to question 2 above might be! But it does make one think!

(re)Use of data from chemical journals.

Wednesday, December 22nd, 2010

If you visit this blog you will see a scientific discourse in action. One of the commentators there notes how they would like to access some data made available in a journal article via the (still quite rare) format of an interactive table, but they are not familiar with how to handle that kind of data (file). The topic in question deals with various kinds of (chemical) data, including crystallographic information, computational modelling, and spectroscopic parameters. It could potentially deal with much more. It is indeed difficult for any one chemist to be familiar with how data is handled in such diverse areas. So I thought I would put up a short tutorial/illustration in this post of how one might go about extracting and re-using data from this one particular source.

Interactive Journal table

The above is a snapshot of part of the table in question, with a box in the middle set aside for a Jmol applet to appear. What might be both less obvious, and less familiar to many who might have seen such a display is the very rich environment available for manipulating the data. To expose some of this, proceed as follows:

  1. Firstly, load a molecule into the Jmol window by clicking on e.g. the hyperlink shown below.

    Loading a molecule

  2. The display shown below will appear, in this case a set of coordinates used to present a 3D model of a molecule, which can be rotated, zoomed, etc. It also has been labelled with various selected bond lengths etc.

    Interactive table with molecule loaded

  3. To extract data, right-click anywhere in the molecule area. Navigate through the menus which appear as shown below. In this case, the data is present in the form of a Gaussian log file. This can contain the history of the particular calculation performed (e.g. a geometry optimisation) or as in this case, all 3N-6 calculated normal vibrational modes. The one of interest here is number 318, being an O=C=O stretching mode.

    An Interactive table in a chemistry journal.

  4. This mode can now be manipulated visually by selecting various parameters:

    Manipulating a vibrational mode

  5. Jmol has a scintillating display of other options, and more are being added all the time, so the above display is by no means the limit of what one can do.
  6. Now to the most important bit. Invoke the menu as shown below, whereupon a copy of the relevant file (gzipped in this case to reduce its size) will be downloaded to your local system. You will now need to use a program on your own computer capable of reading and processing such a file (after unzipping).

    Downloading a data file.

  7. There may be a bewildering variety of programs and toolkits which may perform the operation you wish on such a file. Some are commercial, some are open source. To help people get going, I link to one of the latter type here. You might also want to visit the Quixote project for ideas.
  8. We are not quite finished yet. Perhaps a Gaussian log file does not suit your purpose. Well, now try clicking on this link.

    Link to a digital repository

  9. This produces a page such as below, which contains more files. In this example, several molecular identifiers are present (InChI and InChI key) to help identify the uniqueness of the system, the molecular coordinates are available as a .cml file which itself can be processed by a variety of software tools, the original file used to run the calculation can be inspected (if you want to e.g. repeat it) as input.gjf, the logfile we have seen above, and a checkpoint file, which is most useful when using either the Gaussian program system or a visualiser (Gaussview, ChemBio3D etc, both commercial programs). A SMILES string is also offered, and sometimes (not in this example) a so-called wavefunction file which can be used by some programs to analyse the wavefunction, and perform e.g. QTAIM, ELF, NCI analyses.

    A digital repository page.

    It is now up to the user to identify suitable processing programs on their computer which fit their purpose.

  10. There is one other file present which I have not yet explained, the mets.xml manifest. This is a metadata file, containing (along with much else) an RDF declaration of (some of) the properties of the molecule. In theory at least, this file could be automatically harvested for the RDF, which could be injected into a triple store, and queried semantically using e.g. SPARQL. That is part of the semantic web.
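
To illustrate the idea in step 10 without committing to any particular toolkit, here is a toy stand-in for a triple store in plain Python. Everything in it is hypothetical: the prefixes, predicate names, the InChI key and the file name coordinates.cml are invented for illustration, and a real mets.xml manifest would use its own vocabulary. The pattern matching only mimics the simplest kind of SPARQL query.

```python
# Subject-predicate-object triples; all identifiers are hypothetical.
triples = {
    ("mol:example", "chem:hasInChIKey", "HYPOTHETICAL-KEY"),
    ("mol:example", "chem:hasFile", "input.gjf"),
    ("mol:example", "chem:hasFile", "coordinates.cml"),
    ("mol:other", "chem:hasFile", "other.cml"),
}


def match(store, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard,
    playing the role of a SPARQL variable."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Roughly: SELECT ?file WHERE { mol:example chem:hasFile ?file }
files = [o for _, _, o in match(triples, s="mol:example", p="chem:hasFile")]
print(files)  # ['coordinates.cml', 'input.gjf']
```

A genuine triple store adds inference and joins across many such patterns, which is where the semantic part comes in.
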

I hope some of the screenshots here make the process of extracting data from an interactive table article a little more obvious. I must declare that this way of doing it is just one of the ways being explored and also (much to my regret) is not yet particularly common. But hopefully you might capture a little of what some of us believe to be the future of scientific journals.
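
As a small example of re-use once the gzipped log file from step 6 has been downloaded, the sketch below (Python, standard library only) decompresses it and pulls out the vibrational frequencies. It assumes the frequencies appear on lines starting with “Frequencies --”, as is typical of Gaussian output; the log fragment used here is made up, and the function names are mine.

```python
import gzip
import re


def extract_frequencies(gz_bytes: bytes) -> list:
    """Decompress a gzipped log and collect every value found on
    lines beginning with 'Frequencies --'."""
    text = gzip.decompress(gz_bytes).decode("utf-8", errors="replace")
    freqs = []
    for line in text.splitlines():
        if line.strip().startswith("Frequencies --"):
            freqs.extend(float(x) for x in re.findall(r"-?\d+\.\d+", line))
    return freqs


def expected_mode_count(n_atoms: int, linear: bool = False) -> int:
    """3N-6 normal modes for a non-linear molecule, 3N-5 for a linear one."""
    return 3 * n_atoms - (5 if linear else 6)


# A made-up fragment for a three-atom molecule, standing in for a real log.
fake_log = b" Frequencies --   1650.1234   3800.5678   3900.9012\n"
freqs = extract_frequencies(gzip.compress(fake_log))
print(freqs)                   # [1650.1234, 3800.5678, 3900.9012]
print(expected_mode_count(3))  # 3
```

By the same 3N-6 count, a non-linear molecule needs at least 108 atoms for a mode numbered 318 to exist.
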

Semantically rich molecules

Sunday, May 2nd, 2010

Peter Murray-Rust in his blog asks for examples of the Scientific Semantic Web, a topic we have both been banging on about for ten years or more (DOI: 10.1021/ci000406v). What we are seeking of course is an example of how scientific connections have been made using inference logic from semantically rich statements to be found on the Web (ideally connections that might not have previously been spotted by humans, and lie overlooked and unloved in the scientific literature). It's a tough cookie, and I look forward to the examples that Peter identifies. Meanwhile, I thought I might share here a semantically rich molecule. OK, I identified this as such not by using the Web, but as someone who is in the process of delivering an undergraduate lecture course on the topic of conformational analysis. This course takes the form of presenting a set of rules or principles which relate to the conformations of molecules, and which themselves derive from quantum mechanics, and then illustrating them with selected annotated examples. To do this, a great many semantic connections have to be made, and in the current state of play, only a human can really hope to make most of these. We really look to the semantic web as it currently is to perhaps spot a few connections that might have been overlooked in this process. So, below is a molecule, and I have made a few semantic connections for it (but have not actually fully formalised them in this blog; that is a different topic I might return to at some time). I feel in my bones that more connections could be made, and offer the molecule here as the fuse!

Two chair conformations of the molecule DULSAE. Click here for 3D. Note the (attractive) short H...H contacts.

To list all the likely semantics that a chemist would perceive in the graphic above would take far too long (by the time one would have finished, a text book would have been written). So here is a very very short summary in the context of conformational analysis.

  1. The molecule has a six membered ring as its backbone
  2. which can adopt two possible chair conformations
  3. which can interconvert by exchanging the axial and equatorial group pair for each of the four carbon atoms in the ring.
  4. An organic chemist will immediately notice a very unusual group, Fe(CO)2Cp, which itself is a semantic goldmine,
  5. but for the purposes here we will regard merely as a C-Fe bond!

The (semantic) question to be posed is “which of the two conformations shown above is the most stable”? That too of course has an abundance of implicit semantics, but most human chemists will probably know that this refers to asking which of the two geometries represents the lowest thermodynamic free energy (and we leave aside the issue of what medium the molecule is in, i.e. solid, solution or gas). A far trickier question is “why”?

So to (some interim) answers. Well, a ωB97XD/6-311G(d) calculation (wow, think of what is implied in that concise notation) predicts conformation (a) to be more stable by 2.3 kcal/mol (2.1 in ΔG, see DOI: 10042/to-4911). Now to the why. What connections would someone well versed in conformational analysis spot?

  1. The molecule has two methyl groups on adjacent atoms. They may prefer to be di-axial rather than di-equatorial to avoid excessive steric repulsions (whatever we mean by that!). That might prefer (b).
  2. The molecule has one carbon with both a cyano and an ether linkage. Well, that is susceptible to an anomeric effect (although, as I pointed out in an earlier post here, this connection has in fact often NOT been made in the literature). Only in conformation (a) is one of the oxygen lone pairs aligned anti-periplanar to the axis of the C-CN bond. The reasons why this is important are outlined in my Lecture course.
  3. Having spotted the last, the human might ask whether there is any possibility of an anomeric effect between an oxygen lone pair and the axis of the C-Fe bond? Well, I rather think that not a single human ever has asked that question! (I cannot know that of course, and perhaps someone has speculated upon this in the literature; this is where a full semantic web would help. That question could be posed of it! The reason I suspect the connection might not have been made is that the anomeric effect is the domain of the organic chemist, and C-Fe bonds are those of the organometallic chemist. They do tend to see the chemical world rather differently, these two groups of chemists). If there was such an effect, it would favour (a).
  4. Then we have an X-C-C-Y motif. Depending on the nature of X and Y, the molecule might actually prefer a gauche conformation, i.e. the dihedral angle XCCY would be around 60°. There are several such motifs one can detect; X=Y=O (twice). It might be that other permutations such as X=CN, Y=Fe(CO)2Cp, favour anti-periplanar. There are other permutations whose orientational preference may not even be recorded (in text books). Suddenly it's gotten complicated!
  5. There are a number of short (~2.4 Å) H…H contacts.
  6. We are starting to understand that to unravel the conformation of this molecule, one may have to identify quite a number of different “rules”, and then to quantify each, and add up the numbers to get the final result. That energy of 2.3 kcal/mol may be composed of the result of applying quite a number of different rules. Hence the title of this post, a semantically rich molecule!
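
Detecting the X-C-C-Y motifs of point 4 in a set of 3D coordinates comes down to measuring a dihedral angle. Here is a minimal sketch in Python: the coordinates are idealised, made-up points rather than those of the molecule above, and the ±30° window used to label an angle gauche or anti-periplanar is an illustrative choice of mine.

```python
import math


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four points X, C, C, Y."""
    def sub(a, b):
        return tuple(x - y for x, y in zip(a, b))

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)   # normals to the two planes
    norm_b1 = math.sqrt(dot(b1, b1))
    m1 = cross(n1, tuple(x / norm_b1 for x in b1))
    return math.degrees(math.atan2(dot(m1, n2), dot(n1, n2)))


def classify(angle_deg, tol=30.0):
    """Label a dihedral by its magnitude; the sign is ignored."""
    a = abs(angle_deg)
    if abs(a - 60.0) <= tol:
        return "gauche"
    if abs(a - 180.0) <= tol:
        return "anti-periplanar"
    return "other"


# Idealised test points: the central bond lies along the z axis.
x, c1, c2 = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
y_gauche = (math.cos(math.radians(60)), math.sin(math.radians(60)), 1.0)
y_anti = (-1.0, 0.0, 1.0)

print(classify(dihedral(x, c1, c2, y_gauche)))  # gauche
print(classify(dihedral(x, c1, c2, y_anti)))    # anti-periplanar
```
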

Well, I will leave it here for this post, without giving answers to the six points listed above, or really answering my main question posed above. That would make the post too complex (but I will follow this up!). I do want to end by planting the idea that answering this question involves making a great many chemical connections about the properties of this molecule, and then identifying quantitative ways (algorithms) in which an answer can be formulated. The molecule above is presented as a challenge for the Semantic Web to address!
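
As one tiny instance of such a quantitative step, the calculated ΔG of 2.1 kcal/mol quoted above can be turned into relative populations of the two conformations. The sketch below (Python) assumes a simple two-state Boltzmann model and a temperature of 298.15 K, neither of which is specified in this post.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal mol^-1 K^-1


def populations(delta_g_kcal, temperature_k=298.15):
    """Two-state Boltzmann populations for a free energy gap delta_g
    (less stable minus more stable), returned as (major, minor) fractions."""
    ratio = math.exp(-delta_g_kcal / (R_KCAL * temperature_k))
    minor = ratio / (1.0 + ratio)
    return 1.0 - minor, minor


major, minor = populations(2.1)
print(f"(a): {100 * major:.1f}%  (b): {100 * minor:.1f}%")
```

Under these assumptions conformation (a) would account for roughly 97% of the molecules at equilibrium.
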