Posts Tagged ‘chemical databases’

Supporting information: chemical graveyard or invaluable resource for chemical structures.

Friday, March 31st, 2017

Nowadays, the data supporting most publications on the synthesis of organic compounds are more likely than not to be found in the associated “supporting information” (SI) rather than in the (often page-limited) article itself. For example, this article[1] has an SI that runs to 907 pages; almost a mini-database in its own right! Here I ponder whether such dissemination of data is FAIR (Findable, Accessible, Interoperable and Re-usable).[2]

I am going to use this article as my starting point.[3] One of the compounds reported there is shown below; it is not explicitly discussed in the main body of the article. So how findable is it?

  1. A search of SciFinder (Chemical Abstracts) using the structure above reveals one hit, the source being the expected one.[3]
  2. A search of Reaxys (formerly Beilstein) reveals no hits in its own database, but one hit is noted in …
  3. PubChem, where it occurs as substance 163835830. The source is again cited correctly.[3] One of the properties reported is the InChI key: JSLVVAICXSKSEQ-UHFFFAOYSA-N. This is the same key as that generated by the structure-drawing programs ChemDraw and ChemDoodle.
  4. Google, on the other hand, finds nothing for JSLVVAICXSKSEQ-UHFFFAOYSA-N.[4]
  5. I also tried Google Scholar, but again with no luck.
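
For readers who want to repeat this kind of search programmatically, here is a minimal Python sketch. It only checks that a string has the layout of an InChI key (14 letters, a hyphen, 10 letters, a hyphen, 1 letter) and builds a lookup URL for PubChem's PUG REST service. The function names are my own, and the sketch makes no network request, so it says nothing about whether this particular key resolves to a compound record (the entry noted above is a substance record).

```python
import re
from urllib.parse import quote

# Layout of an InChI key: 14 letters, hyphen, 8 letters + standard flag
# + version letter, hyphen, 1 protonation letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def parse_inchikey(key):
    """Split an InChI key into its blocks; raise ValueError if malformed."""
    match = INCHIKEY_RE.match(key.strip())
    if match is None:
        raise ValueError("not a well-formed InChI key: " + repr(key))
    connectivity, second, flag, version, protonation = match.groups()
    return {
        "connectivity_block": connectivity,  # hash of the molecular skeleton
        "second_block": second,              # hash of the remaining layers
        "is_standard": flag == "S",          # 'S' marks a standard InChI
        "version": version,
        "protonation": protonation,          # 'N' means neutral
    }


def pubchem_lookup_url(key):
    """Build (but do not fetch) a PUG REST URL that looks up CIDs by InChI key."""
    parse_inchikey(key)  # validate first
    base = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/"
    return base + quote(key.strip()) + "/cids/JSON"


if __name__ == "__main__":
    key = "JSLVVAICXSKSEQ-UHFFFAOYSA-N"
    print(parse_inchikey(key))
    print(pubchem_lookup_url(key))
```

The same validated key could equally be pasted into a general search engine, which is exactly the test that failed in items 4 and 5.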

So supporting information does appear to be indexed by both Chemical Abstracts and PubChem; it is thankfully not a graveyard![5] The chemical databases do return valuable additional information about the molecule, such as its InChI key and much else besides. Given that the open PubChem resource presumably IS indexed by Google, it must be a policy somewhere that prevents e.g. JSLVVAICXSKSEQ-UHFFFAOYSA-N from being found.

I suppose the next question might be: Supporting information: chemical graveyard or invaluable resource for chemical spectra? I confess that this post was in fact inspired by a previous one on the provenance of NMR spectra, and perhaps also by the concept of sonification of spectra, in which an instrumental spectrum is converted into a sound signature to give blind people access to such information. I wonder whether a sonified unique digital signature could be used to search for spectra, somewhat in the manner that the InChI key helped in tracking down (or not) the molecule above? I think it would be reasonable to say that NMR spectra embedded in, say, a 907-page supporting information document are likely to be very much less FAIR.[2] The solution there of course is better provenance and better metadata, as I previously mulled.


I cannot help but wonder what a carbonyl group sounds like!

References

  1. J.M. Lopchuk, K. Fjelbye, Y. Kawamata, L.R. Malins, C. Pan, R. Gianatassio, J. Wang, L. Prieto, J. Bradow, T.A. Brandt, M.R. Collins, J. Elleraas, J. Ewanicki, W. Farrell, O.O. Fadeyi, G.M. Gallego, J.J. Mousseau, R. Oliver, N.W. Sach, J.K. Smith, J.E. Spangler, H. Zhu, J. Zhu, and P.S. Baran, "Strain-Release Heteroatom Functionalization: Development, Scope, and Stereospecificity", Journal of the American Chemical Society, vol. 139, pp. 3209-3226, 2017. https://doi.org/10.1021/jacs.6b13229
  2. M.D. Wilkinson, M. Dumontier, I.J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J. Boiten, L.B. da Silva Santos, P.E. Bourne, J. Bouwman, A.J. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C.T. Evelo, R. Finkers, A. Gonzalez-Beltran, A.J. Gray, P. Groth, C. Goble, J.S. Grethe, J. Heringa, P.A. ’t Hoen, R. Hooft, T. Kuhn, R. Kok, J. Kok, S.J. Lusher, M.E. Martone, A. Mons, A.L. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van Schaik, S. Sansone, E. Schultes, T. Sengstag, T. Slater, G. Strawn, M.A. Swertz, M. Thompson, J. van der Lei, E. van Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wolstencroft, J. Zhao, and B. Mons, "The FAIR Guiding Principles for scientific data management and stewardship", Scientific Data, vol. 3, 2016. https://doi.org/10.1038/sdata.2016.18
  3. G.M.S. Yip, Z. Chen, C.J. Edge, E.H. Smith, R. Dickinson, E. Hohenester, R.R. Townsend, K. Fuchs, W. Sieghart, A.S. Evers, and N.P. Franks, "A propofol binding site on mammalian GABAA receptors identified by photolabeling", Nature Chemical Biology, vol. 9, pp. 715-720, 2013. https://doi.org/10.1038/nchembio.1340
  4. S.J. Coles, N.E. Day, P. Murray-Rust, H.S. Rzepa, and Y. Zhang, "Enhancement of the chemical semantic web through the use of InChI identifiers", Organic & Biomolecular Chemistry, vol. 3, pp. 1832, 2005. https://doi.org/10.1039/b502828k
  5. M. Karthikeyan, and R. Vyas, "ChemEngine: harvesting 3D chemical structures of supplementary data from PDF files", Journal of Cheminformatics, vol. 8, 2016. https://doi.org/10.1186/s13321-016-0175-x

A nice example of open data (in London).

Sunday, March 5th, 2017

Living in London, travelling using public transport is often the best way to get around. Before setting out on a journey one checks the status of the network. Doing so today I came across this page: our open data from Transport for London. 

  1. I learnt that by making TfL travel data openly available, some 11,000 developers (sic!) have registered for access, out of which some 600 travel apps have emerged.
  2. The data is in XML, which makes it readily interoperable.[1]
  3. This encourages crowd-sourced innovation.
  4. They have taken the trouble to produce an API (application programming interface) which allows rich access to the data and information about e.g. AccidentStats, AirQuality, BikePoint, Journey, Line, Mode, Occupancy, Place, Road, Search, StopPoint, Vehicle.
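
To illustrate why data in XML is so easy for those developers to re-use, here is a minimal Python sketch that parses a small XML fragment with the standard library. The fragment, its element names and its attribute names are invented purely for illustration; they are not TfL's actual schema or data.

```python
import xml.etree.ElementTree as ET

# A made-up fragment in the spirit of a line-status feed.
# Element and attribute names are assumptions, not the real TfL format.
SAMPLE_XML = """
<LineStatuses>
  <Line id="victoria" mode="tube"><Status>Good Service</Status></Line>
  <Line id="central" mode="tube"><Status>Minor Delays</Status></Line>
</LineStatuses>
"""


def disrupted_lines(xml_text):
    """Return the ids of all lines whose status is not 'Good Service'."""
    root = ET.fromstring(xml_text.strip())
    return [
        line.get("id")
        for line in root.iter("Line")
        if line.findtext("Status") != "Good Service"
    ]


if __name__ == "__main__":
    print(disrupted_lines(SAMPLE_XML))
```

The point is less the dozen lines of code than the fact that no knowledge of page layout is needed: the structure travels with the data.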

Chemists could learn some lessons here! Of course, there are quite a few chemical databases with APIs that are examples of open data, but the “ESI” (electronic supporting information) sources which almost all published articles rely upon to disseminate data are clearly struggling to cope. Take for example this recent article[2], where much of the data has been dropped into the inevitable PDF “coffin”, which is a breathtaking 907 pages long. To give the authors their due, they also provide 20 CIF files which ARE good sources of data. Rarely commented on, but clearly missing from the information associated with this (and indeed most) articles, is the metadata about the data. Thus the metadata for these CIF files amounts to just a bare label, e.g. 229. To find out the context, one has to scour the article (or the 907 pages of the ESI) to identify compound 229 (I strongly suspect it’s a molecule because of the implied semantics of the term, not because it’s been explicitly declared). You will not find the metadata at e.g. data.datacite.org, which is one open aggregator and global search engine based on deposited metadata.
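
To make the gap concrete, here is a minimal Python sketch comparing a dataset described only by a bare label with one carrying the six properties that the DataCite metadata schema treats as mandatory (identifier, creators, titles, publisher, publication year, resource type). The record layout, the function name and every value in the second record are hypothetical placeholders of my own, not real deposited metadata.

```python
# Mandatory DataCite properties, written here as plain dictionary keys.
# The flat-dictionary layout is an assumption made for illustration only.
REQUIRED_FIELDS = (
    "identifier",
    "creators",
    "titles",
    "publisher",
    "publicationYear",
    "resourceType",
)


def missing_metadata(record):
    """List the mandatory fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# What a CIF file labelled only "229" effectively offers a search engine.
bare_label = {"titles": ["229"]}

# Placeholder values only; nothing here is taken from a real deposition.
described = {
    "identifier": "10.xxxx/placeholder",
    "creators": ["A. N. Author"],
    "titles": ["Crystal structure of compound 229 (placeholder title)"],
    "publisher": "Placeholder Repository",
    "publicationYear": 2017,
    "resourceType": "Dataset",
}

if __name__ == "__main__":
    print(missing_metadata(bare_label))
    print(missing_metadata(described))
```

A record that passes this kind of check is something an aggregator can index; a bare label is not.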

I have commented elsewhere on this blog that other types of data could also be enhanced in the way that CIF files enhance crystallographic data. Take for instance the Mpublish NMR project, examples of which are shown here; typical data AND its metadata can be seen at DOI: 10.14469/hpc/1053. I fancy that if this method had been adopted,[2] those 907 pages might have shrunk somewhat, although of course not entirely. But my hope is that gradually the innovative chemistry community will find ways of exhuming more and more data from the PDF coffin, in the process reducing the paginated length of the PDF-based ESI further, perchance eventually even to zero?

If you are yourself preparing an article and sweating over the ESI at this very moment, do please take a look at the Mpublish method and how perhaps it can help make your NMR data at least more useful to others.


I understand an article describing this project is in preparation. If you cannot wait, this recent application of the Mpublish project has some details.[3]

References

  1. P. Murray-Rust, and H.S. Rzepa, "Chemical Markup, XML, and the Worldwide Web. 1. Basic Principles", Journal of Chemical Information and Computer Sciences, vol. 39, pp. 928-942, 1999. https://doi.org/10.1021/ci990052b
  2. J.M. Lopchuk, K. Fjelbye, Y. Kawamata, L.R. Malins, C. Pan, R. Gianatassio, J. Wang, L. Prieto, J. Bradow, T.A. Brandt, M.R. Collins, J. Elleraas, J. Ewanicki, W. Farrell, O.O. Fadeyi, G.M. Gallego, J.J. Mousseau, R. Oliver, N.W. Sach, J.K. Smith, J.E. Spangler, H. Zhu, J. Zhu, and P.S. Baran, "Strain-Release Heteroatom Functionalization: Development, Scope, and Stereospecificity", Journal of the American Chemical Society, vol. 139, pp. 3209-3226, 2017. https://doi.org/10.1021/jacs.6b13229
  3. M.J. Harvey, A. McLean, and H.S. Rzepa, "A metadata-driven approach to data repository design", Journal of Cheminformatics, vol. 9, 2017. https://doi.org/10.1186/s13321-017-0190-6

Bonds.

Thursday, October 13th, 2011

Bonds are a good example of something all chemists think they can recognise when they see them. But they are also remarkably dependent on context. We are running a molecular modelling course at the moment, and I found myself explaining to someone how very context-sensitive they can be. I thought it might be useful to collect my thoughts here.

  1. The most primitive bond is the connection type. This is used in chemical informatics to define a connection table for a molecule, which is used by all the major chemical databases to index and hence search for molecules. It is also used by the InChI identifier to create the InChI key, and of course SMILES strings. The connection bond has no other properties (such as its bond order etc), but it is assumed to be covalent rather than ionic.
  2. The next is the display bond. This is used by chemical visualisation programs; it is normally created by the code based on very simple rules, such as how far apart the two (or more) atoms are. Such bonds are normally drawn with straight lines, of which there can be up to five (or six at a pinch) nowadays. There is however only a fuzzy convention for how non-integer bond orders are represented. A dashed line can be added (and it might be the only line for weaker types such as hydrogen bonds), but it’s clear this display convention is suffering at this stage.
    • Perhaps to keep the synthetic chemists happy, I should add two flavours to this category: the stereochemical display bond, which attempts to add a 3D context but in truth does this less than perfectly, and the retrosynthetic bond. I will not dwell.
  3. Then there is what I call the mechanical bond. This is used in molecular mechanics force fields. It is a declared bond, i.e. you declare where you want the bond to be, and once that is done, it remains there (it is thus never broken). Each declaration is associated with (quadratic) force constants, which taken as a whole define the force field.
  4. Next comes the quantum chemical bond. This is defined by a wavefunction, which in turn tells us about the electron density. This, to be frank, can be a can of worms. There must be dozens of ways of interpreting the electron density in terms of a bond type. I have used just one of these on this blog, the ELF procedure, which gives an estimate of how many electrons are involved in any bond (and these are always non-integers). Books could be written about this topic, but I will mention just three varieties which indicate how confusing quantum bonds can become. These are the homo(aromatic) bond, which itself comes in two varieties, bond and no-bond types (DOI: 10.1021/jp026521l), bent bonds and transition state bonds. Phew!
  5. The quantum topological bond emerges from Bader’s QTAIM procedure, which provides a formal topological framework for defining what a bond is. As I noted in earlier posts, it is controversial, since it does not always reflect what chemists might regard as a useful definition that helps them do chemistry.
  6. Finally (?), I could add Rydberg bonds, which are mysterious formations on excited state surfaces, and which can be extraordinarily long (> 500Å), thus defying application of simple distance rules as noted in type 2 above.
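
Types 1 to 3 above lend themselves to a small illustration. The Python sketch below uses a simple distance rule (the sum of two covalent radii plus a tolerance) to decide where display bonds go, stores the result as a connection table of atom-index pairs, and then treats those pairs as declared bonds with a quadratic energy term. The radii are approximate, the tolerance and force constant are arbitrary values chosen for illustration, the factor of one half is one common convention among several, and the function names are my own.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (illustrative values).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "O": 0.66}
TOLERANCE = 0.4  # assumed slack in angstroms; programs differ on this


def perceive_bonds(atoms):
    """Distance rule: bond two atoms if they are closer than the sum of
    their covalent radii plus a tolerance. Returns a connection table
    as a list of (i, j) atom-index pairs, with no bond orders."""
    bonds = []
    for (i, (el_i, pos_i)), (j, (el_j, pos_j)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADII[el_i] + COVALENT_RADII[el_j] + TOLERANCE
        if math.dist(pos_i, pos_j) <= cutoff:
            bonds.append((i, j))
    return bonds


def harmonic_bond_energy(r, r0, k):
    """Quadratic (harmonic) energy of a declared bond of length r,
    with reference length r0 and force constant k."""
    return 0.5 * k * (r - r0) ** 2


# A water-like geometry in angstroms: one O and two H atoms.
water = [
    ("O", (0.0, 0.0, 0.0)),
    ("H", (0.9572, 0.0, 0.0)),
    ("H", (-0.24, 0.927, 0.0)),
]

if __name__ == "__main__":
    table = perceive_bonds(water)
    print(table)
    # Once declared, the bonds stay in the table whatever the geometry does.
    for i, j in table:
        r = math.dist(water[i][1], water[j][1])
        print(i, j, harmonic_bond_energy(r, r0=0.9572, k=500.0))  # k is arbitrary
```

Note that the distance rule finds the two O–H bonds but (rightly) no H–H bond, and that the very same rule would never find a bond of type 6.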
It is a given that the moment anyone tries to define boundaries and rules for bonds, people will argue against the scheme. But if you have your own type which is missing above, do let me know!