The title of this post summarises the contents of a new molecular database, www.molecularspace.org[1]. I picked up on it by following the post by Jan Jensen at www.compchemhighlights.org (a wonderful overlay journal that tracks recent interesting articles). The molecularspace project is more formally called “The Harvard Clean Energy Project: Large-scale computational screening and design of organic photovoltaics on the world community grid”. It reminds me of a 2005 project by Peter Murray-Rust et al built around the same sort of concept[2] (the World-Wide-Molecular-Matrix, or WWMM[3]), although the new scale is certainly impressive. Here I report my initial experiences looking through molecularspace.org.
The 150,000,000 calculations are released under the CC-BY license, which is an encouraging (open) start. One does, however, need to log in to the site, which I was able to do using my Google credentials. Shown below is a screenshot of a typical result from a search (on power conversion efficiency in my case).

It comes in two parts. The first is the structure (given as a SMILES string and a 2D layout), with the principal predicted energy levels and predicted photovoltaic performance listed below it. This is followed by what might be called an annotation, with further computed/predicted properties obtained using the algorithms applied by Chemicalize.org. This idea that a data set could accrete via semantically powerful annotations using other tools was also very much part of the concept of the WWMM (the matrix had at its heart a molecule in one dimension and a property, measured or computed, in the other. The matrix is of course very sparse, which is why it needs annotation!).
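The molecule-by-property matrix just described can be pictured as a sparse mapping in which only the filled cells are stored. What follows is a minimal sketch of that idea and not the WWMM or molecularspace implementation; the class name `PropertyMatrix`, its methods, the property names and the numerical values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyMatrix:
    """Sparse molecule x property matrix; only filled cells are stored."""

    # Hypothetical layout: (molecule id, property name) -> (value, source tool)
    cells: dict = field(default_factory=dict)

    def annotate(self, molecule, prop, value, source):
        """Fill one cell, recording which tool supplied the value."""
        self.cells[(molecule, prop)] = (value, source)

    def row(self, molecule):
        """Return every known property for one molecule."""
        return {p: v for (m, p), v in self.cells.items() if m == molecule}

    def missing(self, molecule, wanted):
        """List the wanted properties that have not been annotated yet."""
        return [p for p in wanted if (molecule, p) not in self.cells]


matrix = PropertyMatrix()
# Placeholder values for illustration only, not taken from any database.
matrix.annotate("c1ccsc1", "homo_lumo_gap_eV", 5.0, "quantum calculation")
matrix.annotate("c1ccsc1", "logP", 1.8, "annotation tool")

print(matrix.row("c1ccsc1"))
print(matrix.missing("c1ccsc1", ["logP", "coordinates_3d"]))
```

Keeping the source alongside each value is what lets a second tool add its own annotation without obscuring where the first one came from.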
It was at this point, however, that I started to wonder how I might add other annotations, based perhaps on other types of calculation. But thus far at least, I have not found any trace of something I could immediately use for my own calculations, specifically 3D coordinates. The HOMO-LUMO energy gap is the key property which makes molecularspace unique and valuable (to someone working in the field of photovoltaics). But HOMO-LUMO gaps can be calculated in many different ways, and it can always be valuable to calibrate/validate the reported values against other methods. Perhaps if I continue to look, I might find these 3D coordinates (which, for 2,300,000 molecules, would be a very valuable resource). Certainly, should I wish to do so, I could not at the moment readily replicate the calculation for any specific entry on the molecularspace site (and such replication can be regarded as an essential component of scientific validation). When I use the first person, I mean of course either myself as a human or a software agent acting on my behalf (the latter having the endurance to repeat its procedures millions of times if necessary).
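To make the calibration step concrete, here is a minimal sketch of how a software agent might compare reported gaps with values recomputed by another method, were the coordinates available. Everything in it is an assumption for illustration: the function name `compare_gaps`, the molecule identifiers and the numbers are invented placeholders, not data from molecularspace.

```python
from statistics import mean


def compare_gaps(reported, recomputed):
    """Compare two sets of HOMO-LUMO gaps (in eV), keyed by molecule id.

    Returns (mean signed deviation, mean absolute deviation, n), taken
    over the molecules present in both sets.
    """
    shared = sorted(set(reported) & set(recomputed))
    if not shared:
        raise ValueError("no molecules in common")
    diffs = [recomputed[m] - reported[m] for m in shared]
    return mean(diffs), mean(abs(d) for d in diffs), len(shared)


# Placeholder numbers for illustration only, not real results.
reported = {"mol_a": 2.0, "mol_b": 3.0, "mol_c": 2.5}
recomputed = {"mol_a": 2.5, "mol_b": 3.25, "mol_d": 1.0}

msd, mad, n = compare_gaps(reported, recomputed)
print(f"n={n}  mean signed={msd:+.3f} eV  mean absolute={mad:.3f} eV")
```

A consistent sign in the mean signed deviation would point to a systematic offset between the two methods, which is the sort of thing a calibration is meant to reveal.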
The reader of this blog may have noticed that whenever I report a calculation here, I like to cite its doi (more formally its handle), which links to a digital repository. In my case, the repository certainly carries the 3D coordinates, and also the full wavefunction, provided in case the reader wishes to derive other properties from it.‡ Now if molecularspace is able to provide that in the fullness of time, it truly would be an impressive resource.
But the important take-home message from molecularspace is that archiving (under a CC-BY license) the “big” data from any given research in a manner which makes it readily re-usable by others (perhaps from quite different fields of science) is now an essential requisite of doing science. And it is really nice to see good examples of this in practice!
‡ Generally, the calculations I perform for this blog are published in a DSpace repository (the original one, started in 2006[4]), and more recently in Chempound (a project by Peter Murray-Rust and colleagues which emerged out of the WWMM experiments) as well as Figshare[5]. The first and the third assign unique handles (i.e. a doi) to the data; Chempound does not (and neither does molecularspace).
References
1. J. Hachmann, R. Olivares-Amaya, S. Atahan-Evrenk, C. Amador-Bedolla, R.S. Sánchez-Carrera, A. Gold-Parker, L. Vogt, A.M. Brockway, and A. Aspuru-Guzik, "The Harvard Clean Energy Project: Large-Scale Computational Screening and Design of Organic Photovoltaics on the World Community Grid", The Journal of Physical Chemistry Letters, vol. 2, pp. 2241-2251, 2011. https://doi.org/10.1021/jz200866s
2. P. Murray-Rust, H.S. Rzepa, J.J.P. Stewart, and Y. Zhang, "A global resource for computational chemistry", Journal of Molecular Modeling, vol. 11, pp. 532-541, 2005. https://doi.org/10.1007/s00894-005-0278-1
3. P. Murray-Rust, S.E. Adams, J. Downing, J.A. Townsend, and Y. Zhang, "The semantic architecture of the World-Wide Molecular Matrix (WWMM)", Journal of Cheminformatics, vol. 3, 2011. https://doi.org/10.1186/1758-2946-3-42
4. J. Downing, P. Murray-Rust, A.P. Tonge, P. Morgan, H.S. Rzepa, F. Cotterill, N. Day, and M.J. Harvey, "SPECTRa: The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories", Journal of Chemical Information and Modeling, vol. 48, pp. 1571-1581, 2008. https://doi.org/10.1021/ci7004737
5. H.S. Rzepa, "Gaussian Job Archive for CLi6", 2013. https://doi.org/10.6084/m9.figshare.739310
