Posts Tagged ‘word processor’

Chemistry data round-tripping. Has there been ANY progress?

Monday, December 2nd, 2013

This is one of those topics that seems to crop up every three years or so. Since I last wrote about it, we have had new versions of operating systems, new versions of programs, mobile devices and perhaps some progress?

Right, I will briefly recapitulate. Chemical structure diagrams are special; they contain chemical semantics (what an atom is, what a bond is, stereochemistry, charges, etc). One needs special programs to represent this. Take two well-known ones. ChemBioDraw V 13 is the latest in a long line dating back to 1985 or so. A newcomer is ChemDoodle, just updated to version 6. The idea is that you express your molecule, and capture some of its semantics, using one of these programs. You then paste the data into another venerable program, the word processor Word (also dating back to around 1984), and send the Word document to a colleague, who might want to copy the structure back out, put it back into ChemBioDraw/ChemDoodle, and put those semantics to good use by editing it or re-purposing the information. This is round-tripping the data. It’s been almost 30 years; surely the process should be seamless by now? Wrong!

One problem is that the “exchange-particle” is the clipboard, yet another ancient and presumed mature technology. It’s invisible of course; we rarely get to see it. And very operating system specific! So what is the current state of play? Round-tripping ChemBioDraw structures within a single operating system might work. Well, it currently does for just one of the two most common desktop operating systems (remember, Word is provided by the originator of one of these operating systems). The other program, ChemDoodle, round-trips within both operating systems.

But, and here is the key point, not across operating systems. Paste either a ChemBioDraw or a ChemDoodle structure into Word on one of these OSs, and try re-editing that diagram in the version of Word on the other OS. The data is lost unless you have the “right” operating system.

An experiment I have not tried, but on which I would welcome any feedback, is to factor in the two newest operating systems, this time for mobile devices such as tablets and phones. Let’s not even worry whether different flavours of one of these mobile OSs are compatible. Apps for drawing chemical structures are available for both of these. Here, the amazing clipboard still exists. One now has four OSs to consider, with four homogeneous permutations and a minimum of six heterogeneous round trips the data could try to take for any given app. We do not even consider app2app transfers not involving discrete intermediate documents. I would predict that only a few of these permutations preserve round-tripped data and its semantics.
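The counting of four and six can be checked with a minimal sketch. The OS labels are illustrative placeholders only (the post deliberately leaves the four systems unnamed), and a cross-OS round trip is counted here as an unordered pair, which is what gives the minimum of six.

```python
from itertools import combinations

# Placeholder labels, an assumption for illustration; the post names no OS here.
systems = ["desktop A", "desktop B", "mobile A", "mobile B"]

# Same-OS round trips: draw, paste and recover the structure on one system.
homogeneous = [(s, s) for s in systems]

# Cross-OS round trips, counted as unordered pairs of systems, so that
# A -> B -> A and B -> A -> B are treated as the same pairing.
heterogeneous = list(combinations(systems, 2))

print(len(homogeneous), "same-OS round trips")
print(len(heterogeneous), "cross-OS pairings")
```

If the direction of travel matters (which OS the diagram starts on), each pairing counts twice, which is why six is only a minimum.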

Perhaps we need to look at it in a different way? One simply avoids putting data from one program into another. Chemical data is kept in its own files, never mixed with data from other programs, but always kept/sent separately. Pre-1984 and the clipboard, this might have made sense. But in an era when XML was invented around 17 years ago to allow data to fully retain semantic information in any environment it finds itself in, it seems surprising that we still have this situation.

I mention all of this, since there is a current refocusing on the importance of data; “emancipating data” is now important. But the reality is that much current software destroys the semantics in data at almost every turn. Thirty years of no progress then. But what of Chem4Word, a combination of differently namespaced XML in which the chemistry is expressed in CML (it is only available for a single operating system!)? I will perhaps devote a separate post to that one; first I have to try a few experiments!

Computers 1967-2011: a personal perspective. Part 1. 1967-1985.

Thursday, July 7th, 2011

Computers and I go back a while (44 years to be precise), and it struck me (with some horror) that I have been around them for ~62% of the modern computing era (Babbage notwithstanding, ~1940 is normally taken as the start of the modern computing era). So indulge me whilst I record this perspective from the viewpoint of the computers I have used over this 62% of the computing era.

  1. 1967: I encountered (but that term has to be qualified) my first computer, suggested to me as an alternative to running quarter marathons on Wimbledon common at school by an obviously enlightened teacher! I wrote a program (in Algol) on paper tape, put the tape in an envelope, and sent it off to Imperial College (by van) to run, on an IBM 7094. A week later, printed output showed you had made a mistake on line 1 of the program. As I recollect, after about eight weeks of this, I got the program to run (and calculated π to 5 decimal places).
  2. 1970: By now I was a student (again at Imperial College), and was introduced to Fortran, then a radical new innovation to a chemistry degree. The delightfully named pufft compiler combined with the 7094 again, but this time with punched Hollerith cards as input and line printer output. I cannot remember what we were asked to program. I do remember that the punched cards were produced by a pool of punch card operators, working from code pages written by the programmer. Some students (not me!) thought it great fun to give their Fortran variables naughty names (which the punch card operators then refused to punch, thus causing the student to fail the course!).
  3. 1971: I really liked this programming lark, so when instant-turnaround was introduced that year, I decided to do a proper program. It was called NLADAD (yes, I was no good at names, even then), which stood for non-linear-analysis of donor-acceptor complexes. The idea was to take recorded NMR chemical shifts, and fit them to an equilibrium A+B ⇔ AB+B ⇔ AB2 using non-linear regression analysis. It must have been all of 200 lines of code (OK, I did not write the matrix inversion routine myself)! Instant turnaround was also great, you got to punch your own cards this time, and had the great excitement of feeding them into a card reader yourself. You then walked about 5 yards to the line printer and waited agog. No waiting one week, this was less than a minute. Or it would have been if the line printer did not paper-wreck every two minutes! (I might add that I have a dim recollection of a member of the computer centre staff standing by to recover these paper wrecks. He, by the way, is now the director of the ICT division here!).
  4. 1972: I am now doing a PhD (yes, boringly, yet again at Imperial College). I had found the one and only teletypewriter in the chemistry department. The crystallographers had secreted it away in their empire, but were very dismayed to find me occupying it constantly. Instant was now even more instant. I was now connecting to a time-sharing CDC 6400 computer, at the dazzling speed of 110 baud (110 bits per second, or roughly ten characters per second). The bytes were small by the way, since the CDC used 6 bits per byte. The result was that one did everything in UPPER CASE, since a 6-bit byte only allows 64 characters! My (still Fortran) programs reached probably 1000 lines of code now, and I was engrossed in deriving non-linear analyses of steady state chemical kinetics (about four different kinds of rate equation as I recollect). Ah, the joys of covariance analysis, and propagation of errors (I was in a kinetics lab, and all the other students plotted graphs on graph paper, and if pressed, plotted gradients of graphs, the so-called Guggenheim plots. I thought this the dark ages, but no-one volunteered to join me in this single teletypewriter room. Not even the attractive girls in the group. I was the geek of my time, no doubt about that. My kinetic analysis did however have one upside. It’s how I met my wife-to-be a few years later!).
  5. 1974: PhD completed, I was now ready to go to Texas, where everything is bigger (and in terms of computers, slightly better, a CDC 6600 now and a 300 baud teletypewriter!). I had been computing now for seven years, and finally I actually got to SEE the device for the very first time. My mentor, Michael Dewar, had a sort of special relationship with the university. His students (and possibly only his students) were allowed to go into the depths of the machine room, where behind plate glass you could see the CDC 6600. I soon learnt how to get even closer. It was not particularly exciting however. I was more entranced with the CALCOMP flatbed plotter, which was located next to the 6600. Pictures at last (you probably do not want to know that to convert my kinetics in 4 above to pictures, I got quite expert in using a french curve. Look it up before you jump to conclusions). Part of the pact I negotiated was that I was only allowed into the inner sanctum at 03:00 in the morning (sic!). Still a geek then! Oddly, I was one of the few students in Dewar’s group using the CALCOMP, but at least we now had pictures of the molecules I was now calculating (using MINDO/3). To put the computing power into context, in 1975, Paul Weiner, another group member, announced that he had completed a full geometry optimisation of LSD, this having taken about 4 days to do on that over-worked 6600. The entire group went out to celebrate. Many pitchers of beer were drunk that nite.

    Computer graphics from 1976.

  6. 1977: Back to Imperial, where we might have also now had a CDC 6600. And a Tektronix terminal running at the dizzying (hardwired end-to-end) speed of 9600 baud. I learnt to word process on this device (using a word processor, written in Fortran, although not by me) and I wrote three review articles by this means, using a fancy phototypesetter as the printer. My next program, STEK, probably ran to about 5000 lines of code, and it persuaded the Tektronix to plot all sorts of things: ball&stick diagrams, isometric potential surfaces, molecular orbitals, and the like (and jumping ahead, my experience with this program eventually led to CML, and Peter Murray-Rust, but that is indeed jumping ahead). I think I also managed to gain access to the Imperial machine room, that inner sanctum, yet again. But for reasons I will not go into, it was not as interesting as the Texan machine room.

    Chemistry Computer graphics, circa 1977-85.

  7. 1979: I encountered a Cray 1 computer, and probably also 8-bit bytes (and yes, lower case printer outputs) for the first time at the University of London Computing Centre.
  8. 1980: Remember that teletypewriter, encountered earlier. Well these were now running at 2400 baud and I started to organise the deployment of a chemistry department computer network to sprinkle several such terminals around the department. The controller was a PAD, and in that year, we introduced STN ONLINE using this network. It was the first time we could search CAS online ourselves (previously, it was a service offered by the library). Literature searching has not been the same since.
  9. 1980: I finally again encountered a real computer, which one could happily listen to without creeping into machine rooms in the middle of the night. It was the data system on a Bruker Spectrospin 250 MHz superconducting NMR spectrometer. I had many adventures on this system. It was installed, by the way, on more or less the same day as the birth of my first daughter Joana. It had a hard drive (5 Mbytes as I recollect, and cost an absolute fortune, around £10,000 if I remember correctly).

    Combining Quantum mechanics and NMR.

    Computer graphics 1982, from NMR spectrometer.

  10. 1982: More networks, this time a curious computer known as the Corvus Concept, using a networked hard drive (possibly as big as 20 Mbytes by now), and a large screen.
  11. 1985: Enter the Mac (OK, the IBM PC came a little earlier, but it was not entrancing). Now one really had a tactile computer that made noises (not always nice), produced smoke signals occasionally, and ejected its floppy disk incessantly. Yet another revolution to cope with. As I type this, I look down on that Mac, which is still underneath my desk. Wonder if it’s worth anything on eBay?

Well, a second consecutive blog, with (almost) no pictures or molecules. And I have only gotten to the half way stage of my story. Better break off then.

Data-round-tripping: wherein the future?

Tuesday, December 7th, 2010

Moving (chemical) data around in a manner which allows its (automated) use in whichever context it finds itself must be a holy grail for all scientists and chemists. I posted earlier on the fragile nature of molecular diagrams making the journey between the editing program used to create them (say ChemDraw) and the word processor used to place them into a context (say Microsoft Office), via an intermediate storage area known as the clipboard. The round trip between the Macintosh (OS X) versions of these programs had been broken a little while, but it is now fixed! A small victory. This blog reports what happened when such a Mac-created Word document is sent to someone using Microsoft Windows as an OS (or vice versa).

As you might have guessed, the molecular diagram arrives largely dead, and not re-usable. Opening the .docx archive (it is nothing more than a zip file) reveals only a JPEG file residing inside. Nothing that can be chemically repurposed. If the reverse process is undertaken, of creating a ChemDraw diagram and pasting it into Word on Windows, one finds in the .docx two components: a bit-mapped image linked to an active object containing the data. Only the first of these is recognised if the file makes its way to a Macintosh; i.e. the same story, the data is again lost. So the bottom line is that Mac users and Windows users cannot, after all, exchange repurposable molecular diagrams via Word documents with this combination of programs. This is not good.
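For anyone wanting to repeat the inspection, here is a minimal sketch using Python's standard zipfile module. The function names are my own, and the folder prefixes word/media/ and word/embeddings/ are the locations Word conventionally uses for pictures and embedded objects; treat them as an assumption, since a given document may store its parts elsewhere.

```python
import zipfile

def list_docx_parts(path):
    """Return the names of all parts stored inside a .docx (zip) archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()

def embedded_payloads(path):
    """Separate picture parts from embedded-object parts.

    Assumes the conventional folder layout: images under word/media/ and
    embedded (OLE) objects under word/embeddings/.
    """
    names = list_docx_parts(path)
    media = [n for n in names if n.startswith("word/media/")]
    objects = [n for n in names if n.startswith("word/embeddings/")]
    return media, objects
```

Calling embedded_payloads on a document (say a hypothetical "diagram.docx") and finding an empty second list would correspond to the picture-only case described above, where nothing chemically repurposable survives.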

But let me remind you what happened around 1993. The word processor was joined by a program called the Web browser. By 2000, the underlying content carrier, HTML, had become XHTML (an instance of XML). Almost from day 1, such XHTML could be, and frequently was, repurposed. A memorable example is that search engines could use it to index the Web. The XHTML easily survived trips to and from clipboards. In 1996, CML joined HTML as a way of carrying chemical information capable of round-tripping without loss (if need be). There are other chemical XML languages in use nowadays, including CDXML used by the ChemDraw program. Word itself now uses XML (the x in .docx). So, after 14 years, why am I still describing the difficulties above? I am frankly at a loss to explain why there is still a need to write this post.
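To make the point about lossless round-tripping concrete, here is a minimal sketch using Python's standard XML library. The molecule fragment is hand-written in the style of CML and has not been validated against the CML schema; the helper name read_atoms_and_bonds is my own. It parses the fragment, writes it back out as text, parses it again, and compares the atoms and bonds recovered each time.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"

# A deliberately tiny CML-style fragment: one carbon bonded to one oxygen.
cml_text = (
    f'<molecule xmlns="{CML_NS}" id="m1">'
    '<atomArray>'
    '<atom id="a1" elementType="C"/>'
    '<atom id="a2" elementType="O"/>'
    '</atomArray>'
    '<bondArray>'
    '<bond atomRefs2="a1 a2" order="1"/>'
    '</bondArray>'
    '</molecule>'
)

def read_atoms_and_bonds(text):
    """Extract (id, element) pairs and (atom refs, order) pairs from the XML."""
    root = ET.fromstring(text)
    ns = {"cml": CML_NS}
    atoms = [(a.get("id"), a.get("elementType"))
             for a in root.findall(".//cml:atom", ns)]
    bonds = [(b.get("atomRefs2"), b.get("order"))
             for b in root.findall(".//cml:bond", ns)]
    return atoms, bonds

# The "round trip": parse, serialise back to text, then parse again.
before = read_atoms_and_bonds(cml_text)
round_tripped = ET.tostring(ET.fromstring(cml_text), encoding="unicode")
after = read_atoms_and_bonds(round_tripped)

print("semantics preserved:", before == after)
```

The serialised text need not be character-for-character identical to the input (the namespace prefix may change, for instance); what matters is that the atoms and bonds read back are the same.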

All is not entirely lost. The CML4Word approach is designed to enable (chemical) data round tripping from the outset, although I do not yet know if the CML created and stored in the Word document using this mechanism is recognised anywhere outside of Word 2007 on Windows. If anyone can let me know of examples where such a CML-enabled Word document can be used in other environments, I would be very grateful (but not on OS X, as I know already).

And as I might have mentioned in the previous post on this topic, things may not however be getting better in that other carrier of information and data, the mobile phone/iPad, as exemplified by operating systems such as iOS or Android. Watch this space, as they say.

Data-round-tripping: moving chemical data around.

Saturday, November 20th, 2010

Those of us who were around in 1985 witnessed an important chemical IT innovation. We could acquire a computer (it was called a Macintosh) which could be used to draw chemical structures in one application and, via a mysterious and mostly invisible entity called the clipboard, paste them into a word processor. Perchance even print the result on a laser printer. Most students of the present age have no idea what we used to do before this innovation! Perhaps not in 1985, but at some stage shortly thereafter, and in effect without most people noticing, the return journey also started working, the so-called round trip. It seemed natural that a chemical structure diagram subjected to this treatment could still be chemically edited, and that it could make the round trip repeatedly. Little did we realise how fragile this round trip might be. Years later, the computer and its clipboard, the chemistry software, and the word processor had all moved on many generations (it is important to flag that three different vendors were involved, all using proprietary formats to weave their magic). And (on a Mac at least) the round-tripping no longer worked. Upon its return to the drawing program (ChemDraw in this instance), the diagram had been rendered inert, un-editable, and devoid of semantic meaning unless a human intervened. By the way, this process of data-loss is easily demonstrated even on this blog. The chemical diagrams you see here are similarly devoid of data, being merely bit-mapped JPG images. Which is why, on many of these posts, I put in the caption Click for 3D, which gives you access to the chemical data proper (in CML or other formats). And I throw in a digital repository identifier for good measure should you want a full dataset.

It is only now that we (more specifically, this user) understand what had happened under-the-hood to break this round-tripping. In 1984, when Apple produced the Mac, they also produced a most interesting data format called PICT. A human saw the PICT as a PICTure, but the computer saw more. It (could) see additional data embedded in the PICT. The clipboard supported the PICT format, which meant that both picture and data could be transferred between programs. And ChemDraw and Word also understood this. Hence the ability to round-trip noted above (it has to be said between specifically these programs).

Times moved on and the limitations of PICT set in. Apple refocussed on the PDF format, related, notice, to the PostScript format that Adobe had introduced in order to allow high quality laser printing. PICT support was abandoned, and the various components no longer carried recognisable data (specifically the clipboard or the ability of Word to recognise the data). Round-tripping broke. Does this matter? Well, one colleague where I work had accumulated more than 1000 chemical diagrams, which he decided to store in PowerPoint (and yes, he threw the original ChemDraw files away). The day came when he wanted to round trip one of them. And of course he could not. He was rather upset I have to say!

PDF was not really a format designed to carry data (see DOI: 10.1021/ci9003688). But, bless their hearts, the three vendors involved in this story all agreed to support data embedded in the PDF hamburger (and Adobe to tolerate it) and now once again, a structure diagram can move into an Office program (on Mac) and out again and retain its chemical integrity. What lessons can be learnt?

  1. Firstly, out of sight, out of mind. The clipboard is truly mostly out of sight, and it was not really designed from the outset to preserve data properly. Nowadays I wonder whether clipboards in general recognise XML (and hence CML) and preserve it. I truly do not know. But they should.
  2. Secondly, any system which relies on three or four commercial vendors, who at least in the past, devised proprietary formats which they could change without warning, is bound to be fragile.
  3. We have learnt that data is valuable. More so than the representation of it (i.e. a 2D or 3D structure diagram). But when it’s lost, the users should care! And tell the vendors.
  4. Peter Murray-Rust and his team have produced CML4Word (or as Microsoft call it, Chemistry add-in for Word). At its heart is data integrity. Fantastic! But I wonder if it survives on Microsoft’s clipboard (I know it does not on Apple’s, since CML4Word is not available on that OS. And is unlikely to ever become so).
  5. And I can see history about to repeat itself. The same seems about to happen on new devices such as the Apple iPad. It too has copy/paste via a clipboard. I bet this will not round trip chemistry (or much other) data! Want to bet that the lessons of this story have not yet been learnt?

Oh, for those who wish to round-trip chemistry on a Mac, you will have to acquire ChemDraw 12.0.2 and Word 2011 (version 14.01), as well as OS X 10.6 for it to work.