Posts Tagged ‘chemical data’

The 2015 Bradley-Mason prize for open chemistry.

Friday, June 26th, 2015

Open principles in the sciences in general, and chemistry in particular, are nowadays increasingly preached from the funding councils down, but it can be more of a challenge to find innovative practitioners. Part of the problem perhaps is that many of the current reward systems for scientists do not help promote openness. Jean-Claude Bradley was a young scientist who was passionately committed to practising open chemistry, even though when he started he could not have anticipated any honours for doing so. A year ago a one-day meeting was held at Cambridge to celebrate his achievements, followed up with a special issue of the Journal of Cheminformatics. Peter Murray-Rust and I both contributed, and following the meeting we decided to help promote Open Chemistry via an annual award to be called the Bradley-Mason prize. This would celebrate both “JC” himself and Nick Mason, who also made outstanding contributions to the cause whilst studying at Imperial College. The prize was initially to be given to an undergraduate student at Imperial, but was also extended to postgraduate students who have promoted and showcased open chemistry in their PhD research.

Peter and I are delighted to announce the inaugural winners of this prize.

The postgraduate winner is Tom Phillips for his open blog describing his experiences as a PhD student and for leading by example. He has published his instrument codes on GitHub (and now Zenodo[1]), deposited the data and codes for reproducing the graphs in his “lab on a chip” work in Figshare[2], and through his blog has encouraged other research students to do the same. Tom has worked assiduously to ensure that all the articles describing his PhD work are or will be open access.[3]

The undergraduate winner is Tom Arrow for his “spare time” involvement with Wikimedia (the foundation that underpins the open Wikipedia), including participating in a Wikimedia EU hackathon in Lyon, France, feeding his experiences and skills back into his undergraduate environment, and enhancing the teaching Wiki used by his fellow students. Tom took the lead in introducing us to Wikidata[4] for storing chemical data in an open Wikibase data repository, and in promoting its use for enriching Wikipedia chemistry pages and showcasing open data in undergraduate teaching environments.

References

  1. T. Phillips and S. Macbeth, "pumpy: Zenodo release", 2015. https://doi.org/10.5281/zenodo.19033
  2. T. Phillips, J.H. Bannock, and J.C. deMello, "Data for microscale extraction and phase separation using a porous capillary", 2015. https://doi.org/10.6084/m9.figshare.1447208
  3. T.W. Phillips, J.H. Bannock, and J.C. deMello, "Microscale extraction and phase separation using a porous capillary", Lab on a Chip, vol. 15, pp. 2960-2967, 2015. https://doi.org/10.1039/c5lc00430f
  4. D. Vrandečić and M. Krötzsch, "Wikidata", Communications of the ACM, vol. 57, pp. 78-85, 2014. https://doi.org/10.1145/2629489

Chemistry data round-tripping. Has there been ANY progress?

Monday, December 2nd, 2013

This is one of those topics that seems to crop up every three years or so. Since I last wrote about it, we have had new versions of operating systems, new versions of programs, mobile devices, and perhaps some progress?

Right, I will briefly recapitulate. Chemical structure diagrams are special; they contain chemical semantics (what an atom is, what a bond is, stereochemistry, charges, etc). One needs special programs to represent this. Take two well-known ones. ChemBioDraw version 13 is the latest in a long line dating back to 1985 or so. A newcomer is ChemDoodle, just updated to version 6. The idea is that you express your molecule, and capture some of its semantics, using one of these programs. You then paste the data into another venerable program, the word processor Word (also dating back to around 1984), and send the Word document to a colleague, who might want to copy the structure back out, put it back into ChemBioDraw/ChemDoodle, and put those semantics to good use by editing it or re-purposing the information. This is round-tripping the data. It's been almost 30 years, so surely the process should be seamless by now? Wrong!

One problem is that the “exchange-particle” is the clipboard, yet another ancient and presumed mature technology. It's invisible of course; we rarely get to see it. And it is very operating-system specific! So what is the current state of play? Round-tripping ChemBioDraw structures within a single operating system might work. Well, it currently does for just one of the two most common desktop operating systems (remember, Word is provided by the originator of one of these operating systems). The other program, ChemDoodle, round-trips within both operating systems.

But, and here is the key point, not across operating systems. Paste either a ChemBioDraw or a ChemDoodle structure into Word on one of these OSs, and try re-editing that diagram in the version of Word on the other OS. The data is lost unless you have the “right” operating system.

An experiment I have not tried, but on which I would welcome any feedback, is to factor in the two newest operating systems, this time for mobile devices such as tablets and phones. Let's not even worry whether different flavours of one of these mobile OSs are compatible. Apps for drawing chemical structures are available for both of these. Here, the amazing clipboard still exists. One now has four OSs to consider, with four homogeneous permutations and a minimum of six heterogeneous round trips the data could try to take for any given app. We do not even consider app2app transfers not involving discrete intermediate documents. I would predict that only a few of these permutations preserve round-tripped data and its semantics.
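The counting in the paragraph above can be checked with a few lines of Python. This is only a sketch: the four labels are placeholders of my own, since the post deliberately does not name the operating systems. It also shows that the "minimum of six" corresponds to unordered pairs, and that the number doubles if the direction of travel is counted separately.

```python
from itertools import combinations, permutations

# Placeholder labels (assumption): the post does not name the four systems.
systems = ["desktop A", "desktop B", "mobile A", "mobile B"]

# Homogeneous round trips: copy out and paste back on the same OS.
homogeneous = [(s, s) for s in systems]

# Heterogeneous pairings where direction is ignored (A<->B counted once).
unordered = list(combinations(systems, 2))

# Heterogeneous pairings where direction matters (A->B differs from B->A).
ordered = list(permutations(systems, 2))

print(len(homogeneous), len(unordered), len(ordered))  # 4 6 12
```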

Perhaps we need to look at it in a different way? One simply avoids putting data from one program into another. Chemical data is kept in its own files, never mixed with data from other programs, but always kept/sent separately. Pre-1984 and the clipboard, this might have made sense. But in an era when XML was invented around 17 years ago to allow data to fully retain semantic information in any environment it finds itself in, it seems surprising that we still have this situation.

I mention all of this, since there is a current refocusing on the importance of data; “emancipating data” is now important. But the reality is that much current software destroys the semantics in data at almost every turn. Thirty years of no progress then. But what of Chem4Word, a combination of differently namespaced XML in which the chemistry is expressed in CML (it is only available for a single operating system!)? I will perhaps devote a separate post to that one; first I have to try a few experiments!
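To illustrate what "differently namespaced XML" buys you, here is a minimal Python sketch. It is not Chem4Word's actual file format: the wrapper namespace and its `document`/`paragraph` elements are invented for the example, and the CML fragment uses the `molecule`, `atomArray`, `atom`, `bondArray` and `bond` elements as I understand the CML conventions. The point is only that a program which knows the CML namespace can pull the atoms and bonds back out of a mixed document, whatever else surrounds them.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"        # CML namespace
DOC_NS = "http://example.org/hypothetical-doc"  # invented wrapper namespace

# A mixed document: prose in one namespace, chemistry in another.
document = f"""<doc:document xmlns:doc="{DOC_NS}" xmlns:cml="{CML_NS}">
  <doc:paragraph>Methanol, drawn elsewhere and embedded here.</doc:paragraph>
  <cml:molecule id="methanol">
    <cml:atomArray>
      <cml:atom id="a1" elementType="C" hydrogenCount="3"/>
      <cml:atom id="a2" elementType="O" hydrogenCount="1"/>
    </cml:atomArray>
    <cml:bondArray>
      <cml:bond atomRefs2="a1 a2" order="1"/>
    </cml:bondArray>
  </cml:molecule>
</doc:document>"""

root = ET.fromstring(document)
ns = {"cml": CML_NS}

# Only the CML-namespaced content is read; the wrapper is ignored.
molecule = root.find("cml:molecule", ns)
atoms = {a.get("id"): a.get("elementType")
         for a in molecule.iterfind("cml:atomArray/cml:atom", ns)}
bonds = [(tuple(b.get("atomRefs2").split()), b.get("order"))
         for b in molecule.iterfind("cml:bondArray/cml:bond", ns)]

print(atoms)
print(bonds)
```

Contrast this with a bitmap of the same diagram, from which neither the element types nor the connectivity can be recovered without human (or image-recognition) intervention.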

Data-round-tripping: moving chemical data around.

Saturday, November 20th, 2010

For those of us who were around in 1985, that year brought an important chemical IT innovation. We could acquire a computer (it was called a Macintosh) which could be used to draw chemical structures in one application and, via a mysterious and mostly invisible entity called the clipboard, paste them into a word processor. Perchance even print the result on a laser printer. Most students of the present age have no idea what we used to do before this innovation! Perhaps not in 1985, but at some stage shortly thereafter, and in effect without most people noticing, the return journey also started working, the so-called round trip. It seemed natural that a chemical structure diagram subjected to this treatment could still be chemically edited, and that it could make the round trip repeatedly. Little did we realise how fragile this round trip might be. Years later, the computer and its clipboard, the chemistry software, and the word processor had all moved on many generations (it is important to flag that three different vendors were involved, all using proprietary formats to weave their magic). And (on a Mac at least) the round-tripping no longer worked. Upon its return to the chemistry program (ChemDraw in this instance), the diagram had been rendered inert, un-editable, and devoid of semantic meaning unless a human intervened.

By the way, this process of data-loss is easily demonstrated even on this blog. The chemical diagrams you see here are similarly devoid of data, being merely bit-mapped JPG images. Which is why, on many of these posts, I put in the caption Click for 3D, which gives you access to the chemical data proper (in CML or other formats). And I throw in a digital repository identifier for good measure should you want a full dataset.

It is only now that we (more specifically, this user) understand what had happened under-the-hood to break this round-tripping. In 1984, when Apple produced the Mac, they also produced a most interesting data format called PICT. A human saw the PICT as a PICTure, but the computer saw more. It (could) see additional data embedded in the PICT. The clipboard supported the PICT format, which meant that both picture and data could be transferred between programs. And ChemDraw and Word also understood this. Hence the ability to round-trip noted above (it has to be said between specifically these programs).
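The mechanism described above can be caricatured in a few lines of Python. This is a toy model only, not the PICT format or any real clipboard API: the `Picture` class, its fields and the three functions are all invented for illustration. It captures the one thing that matters for round-tripping, namely whether every link in the chain passes the embedded data along with the image.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Picture:
    """Toy stand-in for a picture that may carry hidden, machine-readable data."""
    pixels: bytes                        # what the human sees
    embedded_data: Optional[str] = None  # what the originating program can re-read


def copy_from_drawing_program(structure: str) -> Picture:
    """The drawing program renders an image and embeds the structure in it."""
    return Picture(pixels=b"rendered-image", embedded_data=structure)


def pass_through_host(picture: Picture, preserves_embedded_data: bool) -> Picture:
    """The clipboard or word processor either keeps the hidden data or drops it."""
    if preserves_embedded_data:
        return picture
    return Picture(pixels=picture.pixels)  # the image survives, the data does not


def paste_into_drawing_program(picture: Picture) -> Optional[str]:
    """Return the editable structure, or None if only the image came back."""
    return picture.embedded_data


original = "methanol connection table (placeholder)"
clip = copy_from_drawing_program(original)

# Every link preserves the data: the round trip works.
print(paste_into_drawing_program(pass_through_host(clip, True)))

# One link drops it: the picture looks identical, but it is now inert.
print(paste_into_drawing_program(pass_through_host(clip, False)))
```

In this model the user cannot tell the two outcomes apart by looking at the pasted picture, which is exactly why the breakage goes unnoticed until someone tries to edit the structure again.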

Times moved on and the limitations of PICT set in. Apple refocussed on the PDF format (related, notice, to the PostScript format that Adobe had introduced in order to allow high quality laser printing). PICT support was abandoned, and the various components no longer carried recognisable data (specifically the clipboard, or the ability of Word to recognise the data). Round-tripping broke. Does this matter? Well, one colleague where I work had accumulated more than 1000 chemical diagrams, which he decided to store in PowerPoint (and yes, he threw the original ChemDraw files away). The day came when he wanted to round trip one of them. And of course he could not. He was rather upset I have to say!

PDF was not really a format designed to carry data (see DOI: 10.1021/ci9003688). But, bless their hearts, the three vendors involved in this story all agreed to support data embedded in the PDF hamburger (and Adobe to tolerate it) and now once again, a structure diagram can move into an Office program (on Mac) and out again and retain its chemical integrity. What lessons can be learnt?

  1. Firstly, out of sight, out of mind. The clipboard is truly mostly out of sight, and it was not really designed from the outset to preserve data properly. Nowadays I wonder whether clipboards in general recognise XML (and hence CML) and preserve it. I truly do not know. But they should.
  2. Secondly, any system which relies on three or four commercial vendors who, at least in the past, devised proprietary formats which they could change without warning, is bound to be fragile.
  3. We have learnt that data is valuable. More so than the representation of it (i.e. a 2D or 3D structure diagram). But when it's lost, the users should care! And tell the vendors.
  4. Peter Murray-Rust and his team have produced CML4Word (or as Microsoft call it, Chemistry add-in for Word). At its heart is data integrity. Fantastic! But I wonder if it survives on Microsoft’s clipboard (I know it does not on Apple’s, since CML4Word is not available on that OS. And is unlikely to ever become so).
  5. And I can see history about to repeat itself. The same seems about to happen on new devices such as the Apple iPad. It too has copy/paste via a clipboard. I bet this will not round trip chemistry (or much other) data! Want to bet that the lessons of this story have not yet been learnt?

Oh, for those who wish to round-trip chemistry on a Mac, you will have to acquire ChemDraw 12.0.2 and Word 2011 (version 14.01), as well as OS X 10.6 for it to work.