Posts Tagged ‘Jan Jensen’

150,000,000 DFT calculations on 2,300,000 compounds!

Friday, July 5th, 2013

The title of this post summarises the contents of a new molecular database, www.molecularspace.org[1], which I picked up on by following the post by Jan Jensen at www.compchemhighlights.org (a wonderful overlay journal that tracks recent interesting articles). The molecularspace project is more formally called “The Harvard Clean Energy Project: Large-scale computational screening and design of organic photovoltaics on the world community grid”. It reminds me of a 2005 project by Peter Murray-Rust et al. built on the same sort of concept[2] (the World-Wide-Molecular-Matrix, or WWMM[3]), although the new scale is certainly impressive. Here I report my initial experiences looking through molecularspace.org.

The 150,000,000 calculations are released under the CC-BY license, which is an encouraging (open) start. One does, however, need to log in to the site, which I was able to do using my Google credentials. Shown below is a screenshot of a typical result in a search (of power conversion efficiency in my case).

[Screenshot: CEPDB1, a typical search result on molecularspace.org]

It comes in two parts, the first being the structure (given as a SMILES string and a 2D layout) with the principal predicted energy levels and predicted photovoltaic performance listed below that. This is then followed by what might be called an annotation, with further computed/predicted properties using the algorithms applied by Chemicalize.org. This idea that a data set could accrete via semantically powerful annotations using other tools was also very much part of the concept of the WWMM (the matrix had at its heart a molecule in one dimension and a property, measured or computed, in the other. The matrix is of course very sparse, which is why it needs annotation!).
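
The molecule-by-property matrix described above can be pictured with a small sketch. It is only an illustration, assuming a dictionary keyed by SMILES; the names `matrix`, `annotate` and `missing`, the property labels and the source string are all hypothetical and are not taken from the WWMM or from molecularspace.

```python
# Hypothetical sketch of a sparse molecule x property matrix.
# Rows are molecules (keyed by SMILES); columns are properties.
# Only cells that some tool has filled in are stored at all.
matrix = {}


def annotate(smiles, prop, value, source):
    """Add one cell to the matrix, recording which tool supplied it."""
    matrix.setdefault(smiles, {})[prop] = {"value": value, "source": source}


def missing(smiles, wanted):
    """List the wanted properties that nobody has yet supplied."""
    row = matrix.get(smiles, {})
    return [p for p in wanted if p not in row]


if __name__ == "__main__":
    # Methane, with its molar mass as the only annotation so far.
    annotate("C", "molar_mass_g_per_mol", 16.043, "illustrative annotation tool")
    wanted = ["molar_mass_g_per_mol", "homo_lumo_gap_eV", "coordinates_3d"]
    print(missing("C", wanted))
```

Each later tool fills in whichever empty cells it can, which is the sense in which such a data set accretes by annotation.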

It was at this point however that I started to wonder how I might add other annotations, based perhaps on other types of calculations. But thus far at least, I have not found any trace of something which I could immediately use for my own calculations; 3D coordinates specifically. The HOMO-LUMO energy gap is the key property which makes molecularspace unique and valuable (to someone working in the field of photovoltaics). But HOMO/LUMO gaps can be calculated in many different ways, and it can always be valuable to calibrate/validate the reported values against other methods. Perhaps if I continue to look, I might find these 3D coordinates (which, for 2,300,000 molecules, would be a very valuable resource). Certainly, should I wish to do so, I could not at the moment readily replicate the calculation for any specific entry on the molecularspace site (and replication can be regarded as an essential component of scientific validation). When I use the first person, I mean of course either myself as a human or a software agent acting on my behalf (the latter having the endurance to repeat its procedures millions of times if necessary).

The reader of this blog may have noticed that whenever I report a calculation here, I like to cite its DOI (more formally its handle), which links to a digital repository. In my case, the repository certainly carries the 3D coordinates, and also the full wavefunction, should the reader wish to derive other properties from it. Now if molecularspace is able to provide that in the fullness of time, it truly would be an impressive resource.

But the important take-home message from molecularspace is that archiving (under a CC-BY license) the “big” data from any given research in a manner which makes it readily re-usable by others (perhaps from quite different fields of science) is now an essential requisite of doing science. And it is really nice to see good examples of this in practice!


Generally, the calculations I perform for this blog are published in a DSpace repository (the original one, started in 2006[4]), and more recently in Chempound (a project by Peter Murray-Rust and colleagues which emerged out of the WWMM experiments) as well as Figshare[5]. The first and the third assign unique handles (i.e. a DOI) to the data; Chempound does not (and neither does molecularspace).

References

  1. J. Hachmann, R. Olivares-Amaya, S. Atahan-Evrenk, C. Amador-Bedolla, R.S. Sánchez-Carrera, A. Gold-Parker, L. Vogt, A.M. Brockway, and A. Aspuru-Guzik, "The Harvard Clean Energy Project: Large-Scale Computational Screening and Design of Organic Photovoltaics on the World Community Grid", The Journal of Physical Chemistry Letters, vol. 2, pp. 2241-2251, 2011. https://doi.org/10.1021/jz200866s
  2. P. Murray-Rust, H.S. Rzepa, J.J.P. Stewart, and Y. Zhang, "A global resource for computational chemistry", Journal of Molecular Modeling, vol. 11, pp. 532-541, 2005. https://doi.org/10.1007/s00894-005-0278-1
  3. P. Murray-Rust, S.E. Adams, J. Downing, J.A. Townsend, and Y. Zhang, "The semantic architecture of the World-Wide Molecular Matrix (WWMM)", Journal of Cheminformatics, vol. 3, 2011. https://doi.org/10.1186/1758-2946-3-42
  4. J. Downing, P. Murray-Rust, A.P. Tonge, P. Morgan, H.S. Rzepa, F. Cotterill, N. Day, and M.J. Harvey, "SPECTRa: The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories", Journal of Chemical Information and Modeling, vol. 48, pp. 1571-1581, 2008. https://doi.org/10.1021/ci7004737
  5. H.S. Rzepa, "Gaussian Job Archive for CLi6", 2013. https://doi.org/10.6084/m9.figshare.739310

Combichem: an introductory example of the complexity of chemistry

Sunday, December 19th, 2010

Chemistry gets complex very rapidly. Consider the formula CH3NO as the topic for a tutorial in introductory chemistry. I challenge my group (of about 8 students) to draw as many different molecules as they can using exactly those atoms. I imply that perhaps each of them might find a different structure; this normally brings disbelieving expressions to their faces.

Click on image to see molecules constructed from these atoms. The list is not comprehensive!

Amongst the useful concepts that can be introduced are:

  1. How to determine how many double bond equivalents (or degrees of unsaturation) are implied by the formula.
    1. Students spot one dbe in the above formula, but can take a little longer to notice that it can reside in a ring.
    2. Few (and I count tutors in this) will add sub-valent atoms (here, the possibility of a carbene or a nitrene) to the list.
  2. What is meant by “different”? This can be reduced to the equations $\ln(k/T) = 23.76 - \Delta G/RT$ and $t_{1/2} = (\ln 2)/k$, where $k$ is the rate constant, $T$ the temperature and $t_{1/2}$ the half life (in seconds) of any species constrained by a free energy barrier of $\Delta G$. A nice illustration of this equation is to be found on Jan Jensen’s blog (and a worthwhile calculation would be to find the barrier required to achieve a half life based on the age of the universe). This can be boiled down to three ranges.
    1. Half lives of ~$10^{-15}$ s, or vibrations (and this includes transition states themselves). Arguably, resonance isomers, which involve the (nominal) motions of electrons and not nuclei, fall into this class as well.
    2. Half lives of < $10^{1}$ s, which would include most conformational isomers (excepting atropisomers) and highly unstable isomers, and which cannot be bottled and labelled as such.
    3. Compounds with half lives > $10^{2}$ s, up to of course the age of the universe. This would include configurational isomers (and if the students are up to it, you can ask them to identify any compounds constructed above which can exhibit optical isomerism).
  3. One might be inclined to (approximately) use arrows to indicate the timescales above. Thus electronic resonance is represented by a double-headed arrow, conformational and E/Z isomers by an equilibrium arrow, and a single-headed arrow indicates a reaction (which may in fact have a very low barrier) connecting two isomers.
  4. It's normally now time to count the electrons. This includes the “invisible ones”, the lone pairs, and is also the occasion to introduce the valence shell octet.
  5. Putting the appropriate charges onto any atoms which require them is always fun (the dative bond is avoided). The blue structure revealed in the click above is an extreme interpretation of this! Gernot Frenking has pioneered the class of compound he calls carbones. For his latest article on the theme, see DOI: 10.1002/anie.201002773. The green compound would belong to this class, if it did not fall apart (probably with no barrier) to something which is not actually one molecule, but two (separable) molecules (purple). This brings us to the question of what a molecule actually is. Could it be two molecules unconnected by any bonds, but nevertheless also inseparable (such as catenanes, rotaxanes, and many other entwined systems)? Two molecules can also interact weakly, through interactions which are not normally referred to as bonds. In this case, the two molecules would be held together by a hydrogen bond.
  6. Quite a number of the isomers can also be called tautomers. This involves the movement of one type of atom in particular, the hydrogen (or proton). In terms of lifetime, they would fall into class 2 above (although if one takes extreme care to remove all traces of acids or bases, particularly from the surface of any glass container, one can extend the lifetimes quite considerably).
  7. The peptide bond is included in the isomers, and its ionic resonance formulation, which can lead the discussion to the molecular basis of life and how finely-tuned this bond in fact is.
  8. One might speculate about what the most stable of all the isomers might be, and how many are indeed bottleable. One might introduce quantum mechanics as nowadays a very reliable way of estimating this (and whilst you are at it, introduce free energies, entropies etc). For example, which of the two red geometrical isomers is the more stable, and why? What is the best resonance representation (i.e. where does one put the charges? On this specific point, a CCSD/6-311G(d,p) ELF calculation does come up with a very definitive answer: on the nitrogen rather than the oxygen).
  9. This might be followed up by introducing arrow pushing as a means of interconverting two isomers, and with one of the pair of isomers, one can introduce pericyclic selection rules, transition state aromaticity and other advanced stereochemical concepts.
  10. Now we are well into stereoelectronics. One can introduce anomeric effects via the NBO technique. Thus in the red compounds, there is an interesting interaction between the lone pair on carbon and the anti N-H bond (but, spectacularly, not the syn N-H bond). There is another particularly strong one between the oxygen lone pair and the C-N bond.
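
Two of the quantitative points in the list above lend themselves to a short calculation: the double bond equivalent count of item 1 and the barrier/half-life relation of item 2. What follows is a minimal sketch and not part of the original tutorial. It assumes the standard counting rule DBE = (2C + 2 + N - H - X)/2, a temperature of 298.15 K, and an age of the universe of about 13.8 billion years; the function names are my own.

```python
import math

# Physical constants in SI units.
K_B = 1.380649e-23   # Boltzmann constant, J/K
H = 6.62607015e-34   # Planck constant, J s
R = 8.314462618      # molar gas constant, J/(mol K)

# Assumption: roughly 13.8 billion years, expressed in seconds.
AGE_OF_UNIVERSE_S = 4.35e17


def double_bond_equivalents(c, h, n=0, halogens=0):
    """DBE = (2C + 2 + N - H - X) / 2; divalent atoms such as O do not count."""
    return (2 * c + 2 + n - h - halogens) / 2


def half_life(dg_kj_mol, temp=298.15):
    """Half life in seconds for a free energy barrier given in kJ/mol."""
    # Eyring equation; ln(K_B / H) is the 23.76 quoted in the text.
    k = (K_B * temp / H) * math.exp(-dg_kj_mol * 1000.0 / (R * temp))
    return math.log(2) / k


def barrier_for_half_life(t_half, temp=298.15):
    """Free energy barrier in kJ/mol that gives the requested half life in s."""
    k = math.log(2) / t_half
    return R * temp * (math.log(K_B * temp / H) - math.log(k)) / 1000.0


if __name__ == "__main__":
    print("DBE of CH3NO:", double_bond_equivalents(c=1, h=3, n=1))
    for t in (1e1, 1e2, AGE_OF_UNIVERSE_S):
        dg = barrier_for_half_life(t)
        print(f"t1/2 = {t:.2e} s  ->  barrier = {dg:.1f} kJ/mol")
```

Under these assumptions, a half life equal to the age of the universe corresponds to a barrier of roughly 175 kJ/mol at room temperature, which gives students a feel for how modest a barrier is needed to make a compound effectively permanent.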

I dare say I have only scratched the surface, but covering the above should be enough for one tutorial, I should imagine 🙂


PS For the (calculated) relative energies of some of these isomers, see DOI: 10.1021/jo010671v