Posts Tagged ‘software tools’

(re)Use of data from chemical journals.

Wednesday, December 22nd, 2010

If you visit this blog you will see a scientific discourse in action. One of the commentators there notes how they would like to access some data made available in a journal article via the (still quite rare) format of an interactive table, but they are not familiar with how to handle that kind of data (file). The topic in question deals with various kinds of (chemical) data, including crystallographic information, computational modelling, and spectroscopic parameters. It could potentially deal with much more. It is indeed difficult for any one chemist to be familiar with how data is handled in such diverse areas. So I thought I would put up a short tutorial/illustration in this post of how one might go about extracting and re-using data from this one particular source.

Interactive Journal table

The above is a snapshot of part of the table in question, with a box in the middle set aside for a Jmol applet to appear. What might be less obvious, and less familiar even to many who have seen such a display, is the very rich environment available for manipulating the data. To expose some of this, proceed as follows:

  1. Firstly, load a molecule into the Jmol window by clicking on e.g. the hyperlink shown below.

    Loading a molecule

  2. The display shown below will appear, in this case a set of coordinates used to present a 3D model of a molecule, which can be rotated, zoomed, etc. It also has been labelled with various selected bond lengths etc.

    Interactive table with molecule loaded

  3. To extract data, right-click anywhere in the molecule area. Navigate through the menus which appear as shown below. In this case, the data is present in the form of a Gaussian log file. This can contain the history of the particular calculation performed (e.g. a geometry optimisation) or as in this case, all 3N-6 calculated normal vibrational modes. The one of interest here is number 318, being an O=C=O stretching mode.

    An Interactive table in a chemistry journal.

  4. This mode can now be manipulated visually by selecting various parameters:

    Manipulating a vibrational mode

  5. Jmol has a scintillating display of other options, and more are being added all the time, so the above display is by no means the limit of what one can do.
  6. Now to the most important bit. Invoke the menu as shown below, whereupon a copy of the relevant file (gzipped in this case to reduce its size) will be downloaded to your local system. You will now need to use a program on your own computer capable of reading and processing such a file (after unzipping).

    Downloading a data file.

  7. There is a bewildering variety of programs and toolkits which may perform the operation you wish on such a file. Some are commercial, some are open source. To help people get going, I link to one of the latter type here. You might also want to visit the Quixote project for ideas.
  8. We are not quite finished yet. Perhaps a Gaussian log file does not suit your purpose. Well, now try clicking on this link:

    Link to a digital repository

  9. This produces a page such as the one below, which contains more files. In this example, several molecular identifiers are present (InChI and InChIKey) to help establish the uniqueness of the system. The molecular coordinates are available as a .cml file, which can itself be processed by a variety of software tools. The original file used to run the calculation can be inspected (if you want to e.g. repeat it) as input.gjf. Also present are the log file we have seen above and a checkpoint file, which is most useful when using either the Gaussian program system or a visualiser (GaussView, ChemBio3D etc., both commercial programs). A SMILES string is also offered, and sometimes (not in this example) a so-called wavefunction file, which can be used by some programs to analyse the wavefunction and perform e.g. QTAIM, ELF or NCI analyses.

    A digital repository page.

    It is now up to the user to identify suitable processing programs on their computer which fit their purpose.

  10. There is one other file present which I have not yet explained, the mets.xml manifest. This is a metadata file, containing (along with much else) an RDF declaration of (some of) the properties of the molecule. In theory at least, this file could be automatically harvested for the RDF, which could be injected into a triple store and queried semantically using e.g. SPARQL. That is part of the semantic web.
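The harvesting idea in step 10 might be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not a description of how the repository's manifest is actually laid out: the sample fragment, the `ex:` namespace and the `harvest_triples` name are all invented here, and only the simplest RDF/XML form (an `rdf:Description` with an `rdf:about` attribute) is handled.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def harvest_triples(xml_text):
    """Return (subject, predicate, object) tuples from any rdf:RDF blocks found."""
    root = ET.fromstring(xml_text)
    triples = []
    # iter() also yields the root itself, so a bare RDF document works too.
    for rdf in root.iter(f"{{{RDF_NS}}}RDF"):
        for desc in rdf.findall(f"{{{RDF_NS}}}Description"):
            subject = desc.get(f"{{{RDF_NS}}}about", "")
            for prop in desc:
                # A property either points at a resource or carries literal text.
                obj = prop.get(f"{{{RDF_NS}}}resource")
                if obj is None:
                    obj = (prop.text or "").strip()
                triples.append((subject, prop.tag, obj))
    return triples


# Hypothetical manifest fragment, invented for illustration only.
SAMPLE = """<mets xmlns="http://www.loc.gov/METS/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:ex="http://example.org/chem#">
  <dmdSec>
    <rdf:RDF>
      <rdf:Description rdf:about="http://example.org/entry/1">
        <ex:inchikey>EXAMPLE-KEY</ex:inchikey>
      </rdf:Description>
    </rdf:RDF>
  </dmdSec>
</mets>"""

for triple in harvest_triples(SAMPLE):
    print(triple)
```

The resulting tuples are the sort of thing one would load into a triple store. For real manifests a full RDF library such as rdflib would be the more robust choice, since RDF/XML allows many more forms than this sketch recognises.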

I hope some of the screenshots here make the process of extracting data from an interactive table article a little more obvious. I must declare that this way of doing it is just one of the ways being explored and also (much to my regret) is not yet particularly common. But hopefully you might capture a little of what some of us believe to be the future of scientific journals.
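As a postscript to steps 6 and 7, here is a minimal sketch of what "processing such a file" could look like without any specialist toolkit. It assumes the downloaded file is a gzipped Gaussian log in which harmonic frequencies appear on lines beginning `Frequencies --`; the function names and the file name in the closing note are made up for illustration, and a dedicated parsing library will cope with far more of the log than this does.

```python
import gzip
import re

# Gaussian prints harmonic frequencies (cm^-1) on lines such as
# " Frequencies --   100.1234   200.5678   300.9012", a few values per line.
FREQ_LINE = re.compile(r"^\s*Frequencies\s+--\s+(.*)$")


def read_frequencies(lines):
    """Collect every frequency value, in order, from an iterable of log lines."""
    freqs = []
    for line in lines:
        match = FREQ_LINE.match(line)
        if match:
            freqs.extend(float(value) for value in match.group(1).split())
    return freqs


def read_frequencies_gz(path):
    """Open a gzipped log file as text and extract its frequencies."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return read_frequencies(handle)


# Invented sample lines standing in for a real log file.
sample = [
    " Frequencies --   100.5000   200.2500   300.0000\n",
    " Red. masses --     1.0000     2.0000     3.0000\n",
    " Frequencies --  2400.1250\n",
]
print(read_frequencies(sample))
```

With a real download one would call something like `read_frequencies_gz("molecule.log.gz")` (a hypothetical file name). Mode 318 from step 3 would then be element 317 of the returned list, provided the log holds a single frequency calculation.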

The Fragile Web

Monday, August 31st, 2009

One of the many clever things that clever people can do with the Web is harvest it, aggregate it, classify it etc. It's not just Google that does this sort of thing! Egon Willighagen is one of those clever people. He runs the Chemical blogspace which does all sorts of amazing things with blogs.

He sent me a message recently, saying that unfortunately he was not able to do any amazing things to my blog, since it was not failsafe any more. Apparently, deep down in the software he was using to harvest the details of my blog, an error along the lines of Bytes: 0xA0 0x0A 0x49 0x74 was causing grief. This is the sort of message that would make most people quake. In this instance, the excellent W3C comes to the rescue. By putting this blog feed into their RSS Validator, one can narrow down the error. It proved to be on a single line of an earlier blog posting. Remove this line, and all becomes well. In fact, if the line is displayed in a regular text editor, one eventually notices that the end of the line (which looks just like a space) might be the suspect. Remove just that one character, and the RSS Validator is (almost perfectly) happy. I hope that Egon will be too now!
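The byte sequence in that error message can be read directly: 0xA0 is a non-breaking space in Latin-1, which is not valid on its own in UTF-8; 0x0A is a line feed; and 0x49 0x74 spell "It", the start of the next line. Assuming the feed was meant to be UTF-8, a rough way to hunt for such a character locally is sketched below. The function name and sample bytes are invented for illustration, and only the first bad byte on each line is reported.

```python
def find_invalid_utf8(data):
    """Return (line_number, byte_offset, byte_value) for the first bad byte on each line."""
    problems = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the offending byte within this line.
            problems.append((number, err.start, line[err.start]))
    return problems


# Invented sample: the second line ends in 0xA0, followed by a newline and "It".
feed = b"A clean line\nA line with a stray ending\xa0\nIt carries on here\n"

for number, offset, value in find_invalid_utf8(feed):
    print(f"line {number}, offset {offset}: byte 0x{value:02X}")
```

Reading the file as raw bytes (for example with `open(path, "rb").read()`) matters here, since opening it as text would either raise the same error or silently replace the character.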

But the lesson of this little exercise is that a single character can still bring the whole edifice crashing down (or at least my entire blog). Single characters of course have been notorious in the past. One that springs to mind was a single (white) space, inserted by accident into a line of Fortran code. That space subverted the meaning of the code, which in fact was being used to control the navigation of a spacecraft on its way to Jupiter. Result? The probe missed Jupiter by quite a margin, and the entire cost of the mission was lost (around $1 billion!).

It is also a lesson in how an individual might operate within the modern Web. During the period 1993 to around 2001, most of the content on the Web was in the form of static HTML pages. These were written either by hand or with the help of software tools. This was scary stuff for most people. Then along came two social inventions: the Wiki and the Blog. Each of these hid (most of) the scary HTML from the user, and allowed pain-free (almost) creation of content. As time passed, everyone became accustomed to using such tools, and they started to trust them implicitly to produce valid HTML under the hood. In my case, I trusted the Blog software (WordPress) either not to produce faulty HTML or at least to detect it if it got in by accident. In this instance, it is more subtle, with an error in the character encoding. But this is the lesson. As the skills of olden times (i.e. writing native HTML) are lost, we will be more and more at the mercy of the modern tools. Will we even notice the errors, which might propagate out with our name attached? Or will the software get even smarter and fix the errors before they cause problems? Will humans become almost entirely redundant?