Posts Tagged ‘search definition’

How to search data repositories for FAIR chemical content and data: SubjectScheme

Thursday, June 8th, 2017

As data repositories start to flourish, it is reasonable to ask questions such as what sort of chemistry can be found there and how can I find it? Here I give an updated[1] worked example of a digital repository search for chemical content and also pose an important issue for the chemistry domain.

Firstly, I should say this search is restricted just to those data repositories that submit indexing terms (metadata) to DataCite, which is the agency that will be used to conduct the searches. Each type of metadata is defined by a prefix or operator field (much in the same way that an advanced Google search can be prefixed with an operator, e.g. author:). I will use just two such DataCite field prefixes here as exemplars (there are many more).

  1. media: This specifies the media type for the data being searched. For restriction to chemistry one takes advantage of the chemical/x- media type, as described previously.[2]
  2. SubjectScheme: This is a new declaration, as specified in the DataCite V4 metadata schema.[3] The subject scheme in effect declares a subject-specific term, and is designed to be used by domains such as chemistry.

This latter is best illustrated by one specific example of a search which I will dissect here:
https://search.datacite.org/works?query=media:chemical\/x\-gaussian*+SubjectScheme:inchikey+subject:XZYDALXOGPZGNV-UHFFFAOYSA-M+media:chemical\/x\-mnpub*

  1. https://search.datacite.org/works?query= queries the DataCite MDS (metadata store).
  2. media:chemical\/x\-gaussian* matches any media type beginning with the string chemical/x-gaussian; the * is a wild-card which allows any characters to follow this string. This selects any data repository where Gaussian files have been deposited and assigned this media type.
  3. + represents a Boolean AND operator.
  4. SubjectScheme:inchikey restricts a subject search to a subjectScheme having the value inchikey, whilst
  5. subject:XZYDALXOGPZGNV-UHFFFAOYSA-M defines the value of the subject itself.
  6. media:chemical\/x\-mnpub* completes the search definition; it requires the additional presence of an Mpublish[4] file, indicating (spectroscopic, probably NMR) data readable by the MestReNova program.
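The dissection above can be sketched in a few lines of Python. This is only an illustration of how the query string is assembled: the helper names escape_term and clause are my own inventions, no network request is made, and the InChIKey is deliberately passed through unescaped because that is how it appears in the URL quoted above.

```python
# Base URL taken from the post; everything else here is a hypothetical helper.
BASE = "https://search.datacite.org/works?query="

# Characters that the search syntax treats specially and that need a backslash.
SPECIAL = set('+-&|!(){}[]^"~*?:\\/')


def escape_term(term):
    """Prefix every special character in a search term with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in term)


def clause(field, value, wildcard=False, escape=True):
    """Build one field:value clause, optionally followed by a * wild-card."""
    text = escape_term(value) if escape else value
    return f"{field}:{text}{'*' if wildcard else ''}"


# Clauses are joined with +, which acts as the Boolean AND.
url = BASE + "+".join([
    clause("media", "chemical/x-gaussian", wildcard=True),
    clause("SubjectScheme", "inchikey"),
    # Left unescaped to reproduce the URL exactly as quoted in the post.
    clause("subject", "XZYDALXOGPZGNV-UHFFFAOYSA-M", escape=False),
    clause("media", "chemical/x-mnpub", wildcard=True),
])
print(url)
```

The wild-card is appended after escaping, since escaping the * itself would turn it into a literal character rather than a pattern.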

One hit with these restrictions has doi: 10.14469/HPC/2635 and clicking the button on the landing page for this object labelled metadata resolves to e.g.
https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/2635,
and downloads the metadata record for this object. Part of this record looks a bit like:

This brings me to the important issue for the chemistry domain, which is to agree upon a core set of SubjectSchemes for implementation in data repositories with domain-specific chemical content. The two subjects above, the InChI and the InChIKey, seem obvious candidates for inclusion. But how the list is extended and how the SubjectScheme is specified are now matters for the community to discuss. Perhaps the IUPAC GoldBook is one starting point for the SubjectScheme URIs. Watch this space.


The \ syntax indicates an “escaped” character. Thus in chemical\/x\-gaussian the \ ensures that the following / is treated as part of the search string, and not as part of the search syntax. Likewise \- ensures the minus character is part of the string and not a syntactic negation. The current list of characters requiring escaping is + - & | ! ( ) { } [ ] ^ " ~ * ? : \ /

The documentation lists common fields, but there are far more specified in V4 of their schema. The ones you see used here are not (yet?) documented at https://search.datacite.org/help.html

This Google page has a rich plethora of powerful searches, which I suggest almost no-one knows about!


References

  1. H.S. Rzepa, A. McLean, and M.J. Harvey, "InChI As a Research Data Management Tool", Chemistry International, vol. 38, pp. 24-26, 2016. https://doi.org/10.1515/ci-2016-3-408
  2. H.S. Rzepa, P. Murray-Rust, and B.J. Whitaker, "The Application of Chemical Multipurpose Internet Mail Extensions (Chemical MIME) Internet Standards to Electronic Mail and World Wide Web Information Exchange", Journal of Chemical Information and Computer Sciences, vol. 38, pp. 976-982, 1998. https://doi.org/10.1021/ci9803233
  3. DataCite Metadata Working Group, "DataCite Metadata Schema Documentation for the Publication and Citation of Research Data v4.0", DataCite e.V., 2016. https://doi.org/10.5438/0012
  4. M.J. Harvey, A. McLean, and H.S. Rzepa, "A metadata-driven approach to data repository design", Journal of Cheminformatics, vol. 9, 2017. https://doi.org/10.1186/s13321-017-0190-6

Deviations from tetrahedral four-coordinate carbon: a statistical exploration.

Sunday, September 6th, 2015

An article entitled “Four Decades of the Chemistry of Planar Hypercoordinate Compounds”[1] was recently reviewed by Steven Bachrach on his blog, where you can also see comments. Given the recent crystallographic themes here, I thought I might try a search of the CSD (Cambridge Structural Database) to see whether anything interesting might emerge for tetracoordinate carbon.

The search definition is shown below using a simple carbon with four ligands, the ligands themselves also being tetracoordinate carbon. The search is restricted to data collected below temperatures of 140K, as well as R-factor < 5%, no errors and no disorder. Cyclic species are allowed and a statistically reasonable 2773 hits emerged from the search.

Scheme

Recollect that the idealised angle subtended at the centre is 109.47°. I show below three separate heat plots of the search results. Why three? The way the search software (Conquest) works is that one could define four C-C distances and six angles, and then plot any combination of one distance and one angle. I show just three combinations here, but could have included many more.

There appear to be four distinct clusters of values for this angle that emerge from the three plots shown below (the “bin size” is 100, and the frequency colour code indicates how many hits there are in each bin).

  1. The hotspot is unsurprisingly ~109° with a corresponding C-C distance of ~1.54Å.
  2. There may be two clusters at angles of ~60° (cyclopropane), with C-C values ranging from ~1.47 to ~1.55Å.
  3. A collection at ~90° (mostly cyclobutane?), with C-C values up to 1.6Å.
  4. A collection at ~140° (again small rings), now with much shorter C-C values of ~1.46Å. This is a reminder of the approximation that the hybridisation in e.g. cyclopropane is a combination of sp5 and sp3.

Scheme

Scheme

Scheme

Ideally, what one might want to plot would be sums of four angles; for a pure tetrahedral carbon the sum would always be ~438° (4*109.47°) but for a pure planar carbon it could be as low as 360° (4*90°). One could then see how closely the distribution approaches the latter and hence reveal whether there are any true planar tetracoordinate carbon species known. Although the Conquest software cannot analyse in such terms, a Python-based API has recently been released that should allow this to be done, although I should state that this requires a commercial license and it is not open access code. If we manage to get it working, I will report!
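The angle-sum idea can be sketched independently of any crystallographic software. The snippet below is a minimal illustration in plain Python and does not use the CSD API. It assumes the Cartesian coordinates of the central carbon and its four ligands are already in hand, and it takes the sum of the four smallest of the six ligand-C-ligand angles as the descriptor (my own choice, so that the two trans angles of a planar centre are excluded). The function names and the idealised test geometries are hypothetical.

```python
import itertools
import math


def angle_deg(centre, a, b):
    """Angle a-centre-b in degrees, from Cartesian coordinates."""
    va = [p - c for p, c in zip(a, centre)]
    vb = [p - c for p, c in zip(b, centre)]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(x * x for x in vb))
    # Clamp to guard against rounding just outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def four_smallest_angle_sum(centre, ligands):
    """Sum of the four smallest of the six ligand-centre-ligand angles."""
    angles = sorted(
        angle_deg(centre, a, b) for a, b in itertools.combinations(ligands, 2)
    )
    return sum(angles[:4])


origin = (0.0, 0.0, 0.0)
# Idealised geometries, used only to check the two limiting values.
tetrahedral = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
planar = [(1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0)]

print(four_smallest_angle_sum(origin, tetrahedral))  # close to 4*109.47
print(four_smallest_angle_sum(origin, planar))       # close to 4*90
```

Applied to each hit from a search, a histogram of this sum would show how far the distribution extends from the tetrahedral limit towards 360°.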


As a teaser I also include a plot of six-coordinate carbon, in which the ligands can be any non-metal. Note the clusters at angles of 60, ~112 and ~120-130°. It is worth pointing out that the definition of the connection between a carbon and a ligand as a “bond” becomes increasingly arbitrary as the coordination becomes “hyper”. Because crystallography does not measure electron densities in “bonds”, we know nothing of its topology in this region. It is therefore quite possible that the appearance of the heat plot below might be related just as much to whatever convention is being used in creating the entry in the CSD as it would be to a quantum analysis of the bonding.

Scheme

References

  1. L. Yang, E. Ganz, Z. Chen, Z. Wang, and P.V.R. Schleyer, "Four Decades of the Chemistry of Planar Hypercoordinate Compounds", Angewandte Chemie International Edition, vol. 54, pp. 9468-9501, 2015. https://doi.org/10.1002/anie.201410407

More simple experiments with crystal data. The pyramidalisation of nitrogen.

Saturday, November 1st, 2014

We are approaching 1 million recorded crystal structures (actually, around 716,000 in the CCDC and just over 300,000 in COD). One delight of having this wealth of information is the simple little explorations that can take just a minute or so to do. This one was sparked by my helping a colleague update a set of interactive lecture demos dealing with stereochemistry. Three of the examples included molecules where chirality originates in stereogenic centres with just three attached groups. An example might be a sulfoxide, for which the priority rule treats the lone pair as a fourth substituent with atomic number zero. The issue then arises as to whether this centre is configurationally stable, i.e. does it invert in an umbrella motion slowly or quickly? My initial intention was to see if crystal structures could cast any light at all on this aspect.

pyramidal

The central atom has three bonded carbon atoms; either all three must themselves have four attached atoms, or one may have just three attached atoms, as shown above. The three bonds to the central atom are acyclic, with R ≤ 0.1, no disorder and no errors.

Using the search definition above for R3N one gets the result below. It shows a hot spot for an angle subtended at the nitrogen of ~111°, indicating a pyramidal nitrogen. But how easily is that perturbed? (which is almost like asking how easily can it invert its configuration?).

R3N, all sp3 attached carbons

A perturbation can be applied by specifying that just one of the attached carbons has three attached atoms of its own (sp2 hybridised). The response is that the hot spot moves to 120° (below). Of course this now includes compounds such as amides and the like. But we have learnt that it takes just one such attached sp2 hybridised carbon to planarize an adjacent nitrogen.

R3N-1sp2-2sp3

The control experiment is now to apply the same test to phosphorus. The hot spot moves from ~99° (P with three sp3 carbons attached) to ~103° (P with two sp3 and one sp2). This reminds us that the overlap and energy-match between a p-orbital on carbon and an adjacent p-orbital on nitrogen is good, whereas the same overlap/energy match to a p-orbital on P is significantly less so.

R3P-sp3

R3P-1sp2-2sp3

One gets the same result when the central atom is S; the hotspot moves from ~102° to ~105°. Unfortunately, not enough tri-substituted oxygen compounds are known to see how this element responds.

R3S-sp3

R3S-1sp2-2sp3

My point in illustrating these statistics is to show how much text-book chemistry can be recovered simply by a few quick explorations of crystal structures. One could even argue that much introductory chemistry could be taught by reference to the statistics of such structures.

Amides and inverting the electronics of the Bürgi–Dunitz trajectory.

Thursday, June 26th, 2014

The Bürgi–Dunitz angle describes the trajectory of an approaching nucleophile towards the carbon atom of a carbonyl group. A colleague recently came to my office to ask about the inverse: at what angle would an electrophile approach an amide? It might approach either syn or anti with respect to the nitrogen, which is a feature not found with nucleophilic attack.

amide

My first thought was to calculate the wavefunction and identify the location and energy (= electrophilicity) of the lone pairs (the presumed attractor of an electrophile). But a better, more direct approach soon dawned: a search of the crystal structure database. Here is the search definition, with the C=O-E angle, the O-E distance and the N-C=O-E torsion defined (also specified for R factor < 5%, no errors and no disorder).

search

The first plot is of the torsion vs the distance, for E = H-X (X = O, F, Cl).

amides

  1. The first observation is to note the prominent “hotspot” at a torsion of 180° and a (hydrogen bonding) distance of ~1.60-1.65Å. Amides, so it seems, prefer the electrophile (a proton) to approach anti to the nitrogen.
  2. There is a smaller hotspot at a torsion of 0° and a rather longer distance of ~1.8Å, corresponding to syn approach.
  3. And finally a barely discernible (but real) one at ~90°, corresponding to the proton attaching itself to the carbonyl π-bond.
  4. A plot of the angles involved reveals that the anti hotspot occurs at ~100° whilst the syn hotspot is at about 120°.
  5. Replacing the proton as electrophile by any metal results in a distinct change.
  6. Syn approach now holds the (red) hotspot, and the angle opens up to ~135°, whilst the anti approach covers a wider angle range of 130-150°.
  7. A third hotspot region occurs for the 90° torsion, again corresponding to metal-π-bond interactions.
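For anyone wanting to reproduce the N-C=O-E torsion outside the search software, here is a minimal sketch of the standard four-point dihedral calculation in plain Python. The function names, the 30° tolerance used to label an approach as syn, anti or π, and the idealised coordinates are all my own assumptions for illustration; they are not part of the search definition.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion_deg(p1, p2, p3, p4):
    """Signed torsion p1-p2-p3-p4 in degrees, in the range -180 to 180."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals to the two planes
    b2_len = math.sqrt(_dot(b2, b2))
    m1 = _cross(n1, tuple(x / b2_len for x in b2))
    return math.degrees(math.atan2(_dot(m1, n2), _dot(n1, n2)))


def approach_label(torsion, tol=30.0):
    """Label an approach from |torsion|; tol is an arbitrary illustrative window."""
    t = abs(torsion)
    if t <= tol:
        return "syn"
    if t >= 180.0 - tol:
        return "anti"
    if abs(t - 90.0) <= tol:
        return "pi"
    return "intermediate"


# Idealised, made-up coordinates: N and C=O in the xy plane.
N, C, O = (-0.5, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
print(approach_label(torsion_deg(N, C, O, (1.5, 1.0, 0.0))))   # E on the N side
print(approach_label(torsion_deg(N, C, O, (1.5, -1.0, 0.0))))  # E opposite to N
print(approach_label(torsion_deg(N, C, O, (1.0, 0.0, 1.0))))   # E above the plane
```

Only the magnitude of the torsion is used for the label, so the result does not depend on which sign convention is adopted.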

The above is a very general statistical survey. As with most bonding effects, one really should investigate every example to discover any perturbing circumstances or structural motifs that might distort the outcome. But for a ten minute exercise in response to a fascinating question from a colleague, it’s not bad! And it certainly nicely inverts the usual Bürgi–Dunitz view of carbonyl groups.

Trigonal bipyramidal or square pyramidal: Another ten minute exploration.

Friday, May 2nd, 2014

This is rather cranking the handle, but taking my previous post and altering the search definition of the crystal structure database from 4- to 5-coordinate metals, one gets the following.

Fe ...

Co ...

Ni ...

Cu ...

Trigonal bipyramidal coordination has angles of 90, 120 and 180°. Square pyramidal has no 120° angles, and the 180° angles might be somewhat reduced. Thus the Fe and Co series have plenty of 120° values, whereas the Ni and Cu series have hardly any. The Ni series has many 160° values. Clearly, attempting any correlation with the spin states is going to be a lot of really hard work (I might next do another simple search where bond lengths can be shown to correlate very closely with low/medium/high spin states). I will not be trying a more finely grained analysis of the above plots; I just wanted to point out how very simple and quick they are to generate.

Tetrahedral or square planar? A ten minute exploration.

Wednesday, April 30th, 2014

I love experiments where the insight-to-time-taken ratio is high. This one pertains to exploring the coordination chemistry of the transition metal region of the periodic table; specifically the tetra-coordination of the series headed by Mn-Ni. Is the geometry tetrahedral, square planar, or other? One can get a statistical answer in about ten minutes.

Tet-SP.jpg

The (CCDC database) search definition required is shown above. The central atom defines the column of the periodic table; it is specified to have precisely four other atoms bonded to it, which can be any other element. These four bonds are specified as acyclic (to avoid any bias introduced by rings), and two angles are defined subtending the central atom. And off we go, defining on the way that the hits must be refined to an R-factor of < 0.05, have no disorder, and no errors.

Mn, (Tc), Re

Fe, Ru, Os

Co, Rh, Ir

Ni, Pd, Pt

Square planar coordination will manifest with pairs of angles of either 90° or 180°, whilst tetrahedral coordination will reveal only 109°.

  1. Both the Mn and the Fe series show a (red) hotspot at the tetrahedral value.
  2. The Co series shows a tetrahedral hot spot AND a somewhat less abundant square planar double-hot spot for the combination 90/180 and 180/90.
  3. The Ni series reveals the hottest spots to correspond to square planar, but with a significant tetrahedral cluster.
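The angle criterion used to read these plots (90°/180° pairs for square planar, ~109° for tetrahedral) can be written down as a very crude classifier. This is only a sketch: the function name and the 8° tolerance are arbitrary choices of mine, picked so that the windows around 90° and 109.47° do not overlap, and real structures will of course populate the region in between.

```python
TETRAHEDRAL = 109.47  # idealised tetrahedral angle in degrees


def _near(angle, target, tol):
    return abs(angle - target) <= tol


def classify_angle_pair(angle1, angle2, tol=8.0):
    """Classify one (angle1, angle2) point from the heat plot.

    tol is an arbitrary window in degrees, not a crystallographic standard.
    """
    if _near(angle1, TETRAHEDRAL, tol) and _near(angle2, TETRAHEDRAL, tol):
        return "tetrahedral"
    if all(_near(a, 90.0, tol) or _near(a, 180.0, tol) for a in (angle1, angle2)):
        return "square planar"
    return "other"


for pair in [(109.0, 110.5), (90.0, 180.0), (178.0, 91.5), (120.0, 160.0)]:
    print(pair, classify_angle_pair(*pair))
```

Counting the labels over all hits for each series would put rough numbers on the relative sizes of the hot spots, and the "other" bin is a natural place to start looking for outliers.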

This quick survey can be followed up by more detailed explorations of the clusters. For example, can one go to the literature and find out the typical spin state for e.g. the Ni series in each of the geometries? Unfortunately, the CCDC database does not record what the spin state of any individual compound is; one will have to go to the original literature to find out. What a shame that the linkage between two quite different properties is (as far as I know) not available in any easily searchable form. Alternatively, one can narrow down the searches to individual searches of row 1, 2 or 3 of the transition series and then compare the behaviour. The possibilities are considerable.

Then there are the outliers in each plot. Some (many?) may prove to be due to faulty data (whilst we have specified no errors, they can still occur) but others may be due to an unusual structural feature, or perhaps even an as yet unrecognized phenomenon! Set as a student experiment, one might ask each student to explore say 3 outliers and express an opinion as to what causes them to deviate. Enjoy!

A to-and-fro of electrons operating in s-cis esters.

Thursday, February 21st, 2013

I conclude my exploration of conformational preferences by taking a look at esters. As before, I start with a search definition, the ester being restricted to one bearing only sp3 carbon centers.

s-cis-ester-torsion-search

The result of such a search is pretty clear-cut; they all exist in just one conformation, the s-cis, in which a lone pair of electrons on the alkyl-oxygen is aligned quite precisely anti-periplanar with the axis of the C=O bond. This very narrow distribution suggests a relatively large energy preference for this orientation, and we need to seek its origins.

s-cis-ester-torsion

This arises from two electronic alignments. The first orients the in-plane alkyl oxygen lone pair (orange-purple below) anti-periplanar with the C=O σ* empty orbital (red-blue; orange=red, blue=purple), an interaction mapping to 7.7 kcal/mol in the NBO E(2) energy. The second reinforcement (not shown) aligns the (O=)C-Me donor bond with the antiperiplanar O-Me acceptor (5.3 kcal/mol). These two interactions are weaker in the s-trans ester, which is 8.1 kcal/mol higher in ΔG298 and for which the E(2) terms are respectively 3.0 and 0.6 kcal/mol. 
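To put the 8.1 kcal/mol free energy difference into perspective, one can convert it into a Boltzmann population ratio. The sketch below assumes a simple two-state model (only s-cis and s-trans) and a temperature of 298.15 K; the function name is hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1


def population_ratio(delta_g_kcal, temperature_k=298.15):
    """Ratio [higher-energy form]/[lower-energy form] for a free energy gap."""
    return math.exp(-delta_g_kcal / (R_KCAL * temperature_k))


ratio = population_ratio(8.1)            # s-trans relative to s-cis
fraction_s_trans = ratio / (1.0 + ratio)
print(f"s-trans : s-cis = {ratio:.2e}")
print(f"fraction s-trans = {fraction_s_trans:.2e}")
```

The ratio comes out at the order of one part in a million, consistent with the crystal structure search finding essentially only the s-cis form.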

Lp(alkyl-O)/C=O σ*

But wait, this interaction has electrons moving from the alkyl oxygen to the acyl oxygen (red arrows below) and apparently weakening the C=O bond in the process. But in an entirely different context, we learn that the C=O vibrational stretching wavenumber for an ester (1750 cm-1) is higher than that of a ketone (~1715 cm-1); the C=O is stronger rather than weaker in the ester. So now we have to move the σ-electrons back again (green arrows below).

s-cis-ester

This strengthening of the C=O bond arises from the following overlap of the σ-lone pair on the carbonyl oxygen with the alkyl-O-C σ* empty orbital, for which E(2) is 41.5 kcal/mol, much larger than the previous effect. It does NOT, however, discriminate between the s-cis and s-trans conformations, since this interaction is almost the same in the latter (41.8). So we have a to-of-(red)-electrons which promotes the s-cis conformation, and a rather stronger fro-of-(green)-electrons which strengthens the C=O bond. But they do not cancel each other; each has its own job to do!

Lp(acyl-O)/C-O σ*

There is one other overlap which may differentiate between s-cis and s-trans, but a rather less obvious one. That is the alkyl-Oπ donating to the acyl C=Oπ* which has E(2) 64.7 for the former and 59.4 kcal/mol for the latter. It is not immediately apparent why this overlap should favour s-cis. It is however the effect that induces a significant rotational barrier about the C-O bond (~12 kcal/mol).

Lp (alkyl-O π)/C=O π*

Here is the result of another search of the crystal database; namely the C=O distance (DIST1) vs the C-O distance (DIST2). You can see that the red hot spot (~1400 examples) is very isolated (the blue squares represent < 200 hits), and there seems to be no significant correlation between the two lengths and the structure.

s-cis-ester-distance

I will conclude with a brief discussion of the carbonyl lone pairs. There are two, and one of them has been shown above in the Lp(acyl-O)/C-O σ* interaction. There is another, but it plays no role in the conformation, and is of quite a different character. Although a low-lying orbital, it is clearly non-bonding; indeed it might be slightly anti-bonding along the C=O axis. These two carbonyl lone pairs are quite different in character, since each performs a different role in the molecule.


So the conformational analysis of this simple little molecule reveals some interesting toos-and-fros in the electrons. I will deal with the issue of the carbonyl stretching frequencies in another post.