
Data galore! 134 kilomolecules.

Wednesday, August 6th, 2014

I do go on a lot about the importance of having modern access to data, and so the appearance of this article[1] immediately struck me as important. It appears, appropriately enough, in the new journal Scientific Data. The data set contains computed properties at the B3LYP/6-31G(2df,p) level for 133,885 species with up to nine heavy atoms, and the entire data set has its own DOI[2]. The data were generated by subjecting a set of molecules to a number of validation protocols, including obtaining relaxed (optimised) geometries at the B3LYP/6-31G(2df,p) level. It would be good to replicate this set using a functional that also includes dispersion, and of course making all the coordinates available in this manner greatly facilitates this. The collection also includes data for e.g. 6095 constitutional isomers of C7H10O2, which reminds me of an early, delightfully entitled, article adopting such an approach in quantum chemistry[3]. Such collections are an important part of the process of validating computational methods[4]. This way of publishing data does raise some interesting discussion points.

  1. In this case, we have coordinates for 134 kilo molecules, but the individual molecules in this collection do not have formalised metadata. The InChI key is an example of such metadata; when it is exposed, an individual molecule can be specifically searched for. Where you have a monolithic collection of 134k molecules, no such structured, exposed metadata exists for individual entries and you will have to generate it yourself in order to search it.
  2. Each of the molecules in this collection is revealed (once you have downloaded the compressed archive as above and decompressed it into a 548 Mbyte folder‡) as a separate XYZ file. This syntax has the merit of being very simple, and can easily be processed by a human. Computed molecular properties in the form of metadata are missing from this particular (relatively ancient) format. To recover them, you would have to repeat the calculation.
  3. In fact the XYZ files in this example do seem to have some (unformalised) properties appended to the bottom of the XYZ file (the SMILES and InChI strings are recognizably there), as shown in the example below:
    27
    gdb 57483   2.68237 1.10148 0.98017 0.0557  94.95   -0.2958 0.073 ...
    C   -0.0805964233    1.5844710741    0.1983967506   -0.41097
    .........
    29.7376 87.1304 196.1576    216.856 ...
    CC(C)(C)C1CCCC1 CC(C)(C)C1CCCC1 
    InChI=1S/C9H18/c1-9(2,3)8-6-4-5-7-8/h8H,4-7H2,1-3H3 InChI=1S/C9H18/c1-9(2,3)8-6-4-5-7-8/h8H,4-7H2,1-3H3
    

    This of itself does raise some issues.

    1. The title line (starting gdb) has extra numbers, but it is not immediately obvious what these are.
    2. The XYZ file is no longer standard because extra information is appended, both to each atom line (the charge? shown above as -0.41097) and to the bottom. Much software will not recognise this non-standard XYZ file, and is likely to discard the additional information. Thus I tried wxMacMolPlt (a long-time reader of XYZ files) with no success. Human editing of the file was required to remove the additional information before a sensible molecule loaded. Only at this point could one progress to (re)compute the molecular properties.
    3. The extra information is not formally described. As a human I can recognise it as an atom coordinate list with appended charges (I think), followed by a list of normal coordinate harmonic wavenumbers in units of cm⁻¹, and then SMILES and InChI strings on separate lines. That is really informed guesswork (a human is very good at such pattern recognition) but I cannot be absolutely certain, and a machine seeing this for the first time would certainly struggle.
    4. The last lines contain repetitions of the SMILES and InChI strings. I am guessing that this is the connectivity determined before and after geometry optimisation (using quantum mechanics, bonds can indeed break or form during such a process) but I may be quite wrong about that. I have not tried to resolve this issue by actually reading the depths of the article, since the file itself really should carry such information.
    5. The XYZ file itself carries no provenance, such as who created the file, which software and version was used to create it, the date of creation etc.
  4. An alternative approach is the one adopted here on this blog. Each individual molecule is assigned a DOI and its own metadata and provenance. It is presented to the user in a variety of syntactical forms, each designed for a different purpose and adapted to that purpose. Thus the syntax and semantics of a CML file are clearly defined by a Schema, and this format can easily absorb additional information without “breaking the standard”. It too can be scaled to 134 kilo molecules[4] although this does require a suitable container (repository) to handle this scale (and I am not entirely sure that DataCite would approve of the generation of 134 kiloDOIs).
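
The informed guesswork about the file layout can at least be written down explicitly, so that a machine rather than a human does the editing. What follows is only a minimal sketch in Python, assuming the layout guessed at above (an atom count, a title line, one line per atom with an extra trailing column, then some trailing lines). The function name `split_extended_xyz` and the keys of the returned dictionary are hypothetical, invented here for illustration; nothing in it is a documented specification of the format.

```python
def split_extended_xyz(text):
    """Split a non-standard XYZ file into standard XYZ text plus the extras.

    Assumed layout (a guess, not a documented format):
      line 1        atom count n
      line 2        title line carrying extra numbers
      next n lines  element x y z [extra column, perhaps a charge]
      remainder     trailing lines (wavenumbers?, SMILES, InChI)
    """
    lines = text.splitlines()
    n_atoms = int(lines[0])
    title = lines[1]
    atom_lines = lines[2:2 + n_atoms]
    if len(atom_lines) != n_atoms:
        raise ValueError("fewer atom lines than the declared atom count")

    standard = [str(n_atoms), title]
    per_atom_extra = []
    for line in atom_lines:
        fields = line.split()
        if len(fields) < 4:
            raise ValueError(f"malformed atom line: {line!r}")
        # Keep element and x, y, z as text so no precision is lost.
        standard.append("  ".join(fields[:4]))
        # Anything beyond the fourth column is set aside, not interpreted.
        per_atom_extra.append(fields[4:])

    # Whatever follows the atom block is kept verbatim, minus blank lines.
    trailing = [ln for ln in lines[2 + n_atoms:] if ln.strip()]
    extras = {
        "title": title,
        "per_atom_extra": per_atom_extra,
        "trailing_lines": trailing,
    }
    return "\n".join(standard) + "\n", extras
```

The first return value is a plain XYZ file that conventional readers should accept; the second keeps the appended information without claiming to know what it means.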

Overall, this sort of data publication must be warmly welcomed by the community, and I do hope that more chemistry data is routinely made available in an appropriate manner. The presentation in ready-to-reuse form will no doubt improve as the value of such data becomes more fully appreciated. And ultimately, humans need to be excluded from much of this process (editing the 133,885 sets of XYZ coordinates as described above is not for humans to do).
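
As one illustration of taking the human out of the loop, the missing per-molecule search metadata could be generated by a script that walks the decompressed folder. This is a hedged sketch only: it assumes that the files end in `.xyz` and that the last non-blank line of each file carries the InChI strings, as in the example shown earlier, and the function names `inchi_from_file` and `build_inchi_index` are hypothetical. Where a line carries two InChI strings, the first is taken, without any claim about which of the two is the more meaningful.

```python
import os


def inchi_from_file(path):
    """Return the first InChI string on the last non-blank line, or None."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        non_blank = [line.strip() for line in handle if line.strip()]
    if not non_blank:
        return None
    for token in non_blank[-1].split():
        if token.startswith("InChI="):
            return token
    return None


def build_inchi_index(folder):
    """Map each InChI found to the list of .xyz file names that carry it."""
    index = {}
    # os.scandir yields entries one at a time rather than building
    # the full directory listing first.
    with os.scandir(folder) as entries:
        for entry in entries:
            # Assumption: the molecule files use the .xyz extension.
            if not entry.is_file() or not entry.name.endswith(".xyz"):
                continue
            inchi = inchi_from_file(entry.path)
            if inchi is not None:
                index.setdefault(inchi, []).append(entry.name)
    return index
```

The resulting dictionary can be searched by InChI without opening the folder in a file browser at all.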


‡Your computer however might balk at opening a folder with 133,885 items in it. Try this only on a very fast machine with lots of memory and ideally an SSD!


References

  1. R. Ramakrishnan, P.O. Dral, M. Rupp, and O.A. von Lilienfeld, "Quantum chemistry structures and properties of 134 kilo molecules", Scientific Data, vol. 1, 2014. https://doi.org/10.1038/sdata.2014.22
  2. R. Ramakrishnan, P.O. Dral, M. Rupp, and O.A. von Lilienfeld, "Quantum chemistry structures and properties of 134 kilo molecules", 2014. https://doi.org/10.6084/m9.figshare.978904
  3. P.P. Bera, K.W. Sattelmeyer, M. Saunders, H.F. Schaefer, and P.V.R. Schleyer, "Mindless Chemistry", The Journal of Physical Chemistry A, vol. 110, pp. 4287-4290, 2006. https://doi.org/10.1021/jp057107z
  4. P. Murray-Rust, H.S. Rzepa, J.J.P. Stewart, and Y. Zhang, "A global resource for computational chemistry", Journal of Molecular Modeling, vol. 11, pp. 532-541, 2005. https://doi.org/10.1007/s00894-005-0278-1