Chemical IT « Henry Rzepa's blog

Archive for the ‘Chemical IT’ Category

Scalemic molecules: a cheminformatics challenge!

Wednesday, July 6th, 2011

A scalemic molecule is the term used by Eliel to describe any non-racemic chiral compound. Synthetic chemists imply it when they describe a synthetic product with an observable enantiomeric excess or ee (which can range from close to 0% to almost 100%). There are two cheminformatics questions of interest to me:

How many non-trivial scalemic molecules have been reported in the literature (let’s assume their ee is significantly greater than 0%)?
- The distribution function for the ee of these molecules would be most interesting!
Of those, how many have the absolute configuration of the predominant enantiomer established with high confidence?
- Or, to put this another way, how many may prove to be mis-assigned?
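To make the ee figures concrete: the usual definition is ee = |a − b| / (a + b) × 100%, where a and b are the amounts of the two enantiomers. A minimal sketch of that arithmetic follows; the function name is hypothetical and the numbers are illustrative, not taken from any database.

```python
def enantiomeric_excess(amount_a: float, amount_b: float) -> float:
    """Return the enantiomeric excess in percent for two enantiomer amounts."""
    total = amount_a + amount_b
    if total <= 0:
        raise ValueError("total amount must be positive")
    # Multiply before dividing to keep simple ratios exact in floating point.
    return abs(amount_a - amount_b) * 100.0 / total


# A racemate has an ee of 0%; a 95:5 mixture has an ee of 90%.
print(enantiomeric_excess(50, 50))  # 0.0
print(enantiomeric_excess(95, 5))   # 90.0
```

On this definition a molecule is scalemic whenever the result is greater than zero, which is why the distribution of reported values would be so informative.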

Note the careful qualification in the above questions. Thus by non-trivial, I mean compounds whose scalemic attributes persist in solution for a chemically useful duration. That could be taken to mean configurationally stable chiral molecules, rather than those that might be conformationally chiral (an example of a trivial scalemic molecule would be the twist-boat conformation of cyclohexane, which, having D₂ symmetry, is dissymmetric, but which would only retain its scalemic property for a trivially short timescale).

What are boundary values? These are some:

As I write this, CAS records 61,257,703 chemical substances. Needless to say (unless I missed it), the answer to my first question is not to be found there.
Beilstein (Reaxys) records 1,126,995 compounds as having one or more reported chiroptical properties (which is the most direct way of establishing that a molecule is scalemic, although strictly, having say an optical rotation of 0° does not necessarily mean the molecule is not scalemic). We have no way of knowing how many molecules are scalemic for which no chiroptical measurement has been made (but one would hope it’s a small proportion). Perhaps that is a good answer to question 1?
- of which 1,097,094 relate to optical rotatory power, 17,515 to optical rotatory dispersion and 62,248 to electronic circular dichroism.
- it is more difficult to answer how many of these 1,126,995 substances have a firmly established absolute configuration. Measuring a chiroptical property per se does NOT in itself establish the absolute configuration. Doing so is a fascinating exercise in sequential logical argument, and how one does it has changed quite a lot over time. And what might I mean with high confidence? An older assignment (made say > 40 years ago) might be less confident than one established in 2011 (fortunately, we can probably trust the absolute configurations of the amino acids!). A bit of a can of worms, nevertheless. But it interests me because it is a good example of what the semantic web is supposed to be all about.
The Cambridge crystallographic database reports 560,307 entries, of which 72,340 are in chiral space groups (in which a chiral molecule can crystallise) and exhibit no disorder or other errors. We do not know how many of these are non-trivial, since all manner of small (and low energy) distortions can create a chiral species (in the solid state), but which would not persist for a chemically useful duration in solution (i.e. it might for example immediately racemize and become non-scalemic).
The Flack parameter has been used since 1983 for enantiomorph estimation (a value of ~≤ 0.10(10) would be considered meaningful). This could in principle provide an answer of known confidence to my question 2 above (but would not address the issue of non-triviality).
- The challenge now is to quantify how many compounds have a meaningful reported Flack parameter (presumably a sub-set of 72,340?)
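The notation 0.10(10) packs a value and its standard uncertainty into one token: the digits in parentheses apply to the last quoted decimal places, so 0.10(10) reads as 0.10 ± 0.10. A minimal sketch of how one might parse such strings when sifting reported Flack parameters is shown here. The function names and the acceptance thresholds are assumptions for illustration only (the thresholds are simply one reading of the ~≤ 0.10(10) figure quoted above), not an established crystallographic criterion.

```python
import re

# value(su) notation, e.g. "0.10(10)" or "-0.01(12)"
_SU_PATTERN = re.compile(r"\s*([+-]?\d+)(?:\.(\d+))?\((\d+)\)\s*")


def parse_su(text: str) -> tuple[float, float]:
    """Split 'value(su)' into (value, standard uncertainty)."""
    match = _SU_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not in value(su) notation: {text!r}")
    int_part, frac_part, su_digits = match.group(1), match.group(2) or "", match.group(3)
    value = float(int_part + ("." + frac_part if frac_part else ""))
    # The uncertainty digits are scaled to the last quoted decimal place.
    su = int(su_digits) / 10 ** len(frac_part)
    return value, su


def looks_meaningful(text: str, max_value: float = 0.10, max_su: float = 0.10) -> bool:
    """Hypothetical filter: both the value and its uncertainty must be small."""
    value, su = parse_su(text)
    return value <= max_value and su <= max_su


print(parse_su("0.02(3)"))           # (0.02, 0.03)
print(looks_meaningful("0.02(3)"))   # True
print(looks_meaningful("0.45(30)"))  # False
```

A filter of this kind would only count entries; it says nothing about whether the molecule is non-trivially scalemic in solution.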

Let me declare one personal interest. Over the last four years or so, we have been asked to confirm the absolute configuration of around eight scalemic molecules. After a detailed study, we concluded three were mis-assigned. Now this in no way implies anything about what the answer to question 2 above might be! But it does make one think!

Tags:Cambridge, chemical substances, chiral, chiroptical, disorder, dissymetric, low energy, scalemic molecules, semantic web, synthetic product
Posted in Chemical IT | No Comments »

Molecular illusions and deceptions. Ascending and Descending Penrose stairs.

Wednesday, June 15th, 2011

It is not often that an article on the topic of illusion and deception makes it into a chemical journal. Such is addressed (DOI: 10.1002/anie.201102210) in no less an eminent journal than Angewandte Chemie. The illusion (or deception if you will) actually goes to the heart of how we represent three-dimensional molecules in two dimensions, and the meanings that may be subverted by doing so. As it happens, it is also a recurring theme of this particular blog, which is the need to present chemistry with data for all three dimensions fully intact (hence the Click for 3D captions which often appear profusely here).

Molecular Penrose stair. Click for 3D.

The molecule above has been synthesized and a crystal structure obtained (if you click above, you will get the 3D coordinates; the above is pruned of some sidechains which are irrelevant here). The authors assert it as an example of a Penrose stair, or perhaps the better known lithograph by M. C. Escher known as Ascending and Descending. This is a visual paradox, the point of which is to show how the eye can be easily deceived if the brain is asked to fill in a third dimension given only two. The molecule above has been drawn with the illusion of depth, using (or mis-using) the embolded bonds akin to those proposed by Maehr (who meant something quite different as it happens). It should be worrying to any chemist who cares about stereochemistry to think that use of these time honoured conventions could actually result in a paradox! So perhaps this accounts for why an article on this very topic has made it onto the pages of such an eminent journal.

What might be of (chemical) interest about this molecule, other than its illusory aspects? Well, could you for example work out, given ONLY the representation above, that the molecule has D₂ symmetry, and is therefore chiral? I wondered if it might also be Möbius (with perhaps two half twists), although in fact the π-system in this case has a linking number Lk of zero (not 2). It also has some interesting rather close H…H contacts in the middle, and the inner periphery appears to be a 14-annulene and the outer 26, both conforming to the Hückel 4n+2 rule. Despite this, the central C-C bond is actually quite long, and the conjugation is hence significantly interrupted. There is more chemistry in the original article.
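The 4n+2 claim for the two peripheries is easy to check by arithmetic. The sketch below assumes one π electron per ring atom, so that the 14- and 26-membered peripheries carry 14 and 26 π electrons respectively; the function name is hypothetical.

```python
from typing import Optional


def huckel_n(pi_electrons: int) -> Optional[int]:
    """Return n if pi_electrons equals 4n + 2 for an integer n >= 0, else None."""
    if pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0:
        return (pi_electrons - 2) // 4
    return None


for count in (14, 26):
    print(count, huckel_n(count))  # 14 -> n = 3, 26 -> n = 6
```

A count such as 16 would return None, i.e. a 4n system instead.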

But I want to close here on the point that to overcome the deception and illusion, you need to get the three dimensional data in chemistry. You can do this by clicking above. Or, by going to the original article and striving to do so there. I think you will find the latter route the greater challenge! Then, ask yourself why an eminent journal, in publishing an article on the topic of deception and illusion, makes it so relatively difficult to overcome that illusion. Certainly more difficult than I hope it proves to be on this blog!

Tags:chemical, chemical journal, chemist, M. C. Escher, Tutorial material
Posted in Chemical IT, Interesting chemistry | 2 Comments »

Hafnium and Niels Bohr

Sunday, June 5th, 2011

In 1923, Coster and von Hevesy (DOI: 10.1038/111182a0) claimed discovery of the element Hafnium, atomic number 72 (Latin Hafnia, meaning Copenhagen, where the authors worked) on the basis of six lines in its X-ray spectrum. The debate had long raged as to whether (undiscovered) element 72 belonged to the rare-earth group 3 of the periodic table below yttrium, or whether it should be placed in group 4 below zirconium. Establishing its chemical properties finally placed it in group 4. Why is this apparently arcane and obscure re-assignment historically significant? Because, in June 1922, in Göttingen, Niels Bohr had given a famous series of lectures now known as the Bohr Festspiele on the topic of his electron shell theory of the atom. Prior to giving these lectures he had submitted his collected thoughts in January 1922[1].

Like Mendeleev before, who had predicted ekasilicon, ekaaluminium and ekaboron (eventually discovered as germanium, gallium and scandium), Bohr had used his electron shell theory to (correctly) predict the properties of element 72. In modern terms, he had concluded that its electron shell structure must be 2.8.18.32.10.2 or [Xe].4f¹⁴.5d².6s². Classification as a rare earth would have resulted in the 4f shell having 15 electrons, impossible in Bohr’s theory. Coster and von Hevesy note in their article that Bohr’s striking prediction was now verified.
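The electron counting behind this argument can be verified in a few lines. The sketch below checks that both notations for element 72 add up to 72 electrons, and uses the standard modern 2n² shell-capacity rule (an addition here, not something stated in Bohr’s article) to show why a fifteenth 4f electron has nowhere to go.

```python
def shell_capacity(n: int) -> int:
    """Maximum number of electrons in shell n (the modern 2n^2 rule)."""
    return 2 * n * n


bohr_shells = [2, 8, 18, 32, 10, 2]      # 2.8.18.32.10.2
xenon_core = 54                          # electrons in [Xe]
outer = {"4f": 14, "5d": 2, "6s": 2}     # [Xe].4f14.5d2.6s2

total_from_shells = sum(bohr_shells)
total_from_configuration = xenon_core + sum(outer.values())
print(total_from_shells, total_from_configuration)  # 72 72

# The n = 4 shell already holds 4s2 4p6 4d10, leaving room for 14 f electrons.
room_for_4f = shell_capacity(4) - (2 + 6 + 10)
print(room_for_4f)  # 14
```

With the 4f subshell full at 14, the remaining electrons of element 72 must enter 5d, which is what places it below zirconium.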

Why am I writing all of this? For various reasons:

Unlike Mendeleev, Bohr’s prediction of the properties of a (then uncharacterized) element, whilst famous at the time, is nowadays largely forgotten by chemists. It is one of the great achievements of the then new quantum theory.
Reading the 67 pages of Bohr’s article on the topic reveals no discussion of element 72 (articles of this era are nowadays only available as scanned images, not full text, and one must rely on a human visual scan of all 67 pages, which of course may not be reliable) but its (absence) in the table below is striking. Here VI means the 6th row of the periodic table.

Niels Bohr’s Periodic table, 1922.
Notice the only other missing elements, Technetium (43), Promethium (61), Astatine (85), Francium (87) and Rhenium (75, the only non-radioactive one remaining to be discovered).
I must presume that Bohr introduced his discussion of element 72 into his June lectures to make an impact with his audience! One might have hoped that tracking down what happened between January 1922, when Bohr fails to make much of the missing element 72, and June in the same year would be possible from Coster and von Hevesy’s citation of Bohr in 1923. But it was the practice of the time to rarely cite one’s sources. Thus they give no published citation to Bohr, and one might conclude that they might instead be quoting Bohr from his lectures rather than his writings (who, I wonder, was poor old Bury, now forgotten!).

Coster and Hevesy’s allusion to Bohr’s theory.
Bohr’s own 1922 article on the topic is also visually striking. It contains in its 67 pages:
1. 13 (short) equations
2. Two figures (the second a variation on the first)
3. One table (above).
4. and lots of text (in German).
5. No citations at the end, not even one, although many people are acknowledged in the text itself.
6. No explicit statement of shell structures as e.g. 2.8.18.32.10.2 or [Xe].4f¹⁴.5d².6s².
Given that Bohr’s article can be regarded as one of the most influential of the 20th century (even prior to its being placed on a firm theoretical footing by solution of the Schrödinger equation for the hydrogen atom), I find it interesting how quickly it achieved this status (Bohr won the Nobel prize in 1922 as well). One might conclude that reputations were made as much via verbal presentations as by the immediate visual impact of the associated publications.

Finally, I note the striking contrast between Bohr’s article and Langmuir’s, written about a year earlier in 1921. Here, Langmuir sets out some postulates, the first of which is shown below.

Langmuir’s 1921 postulate.

The filled electron shells are clearly set out here (much more clearly than in Bohr’s 1922 article). But yet again, we remain baffled as to how Langmuir arrived at this postulate. Although he (very briefly) mentions Bohr in his own paper, it is only in the context of speculating about what prevents the electrons from falling into the nucleus, and few citations are again given (a notable exception is to Pease for suggesting the triple bond). We may only suspect that Langmuir had heard Bohr talking about his theory, and had extended G. N. Lewis’ concept (also not directly cited) of (filled) valence shells for his own theory of chemical bonding.

Well, in a little less than 90 years, we have progressed from finding almost no sources cited in some of the most influential papers of the 20th century, to the DOI (or URL) embedded in everything. I think that when the history of the present era is written, the introduction of the DOI/URL will take its place in the pantheon of great scientific events. It’s the connections that matter, stupid!

Postscript. Hevesy in this review written in 1925 sets out a good history of Hafnium. This article contains (on p7) a clear statement of the electron shell structure of Hafnium as 2.8.18.32.8.2.2, which is cited as Bohr’s result. Hevesy quotes Bohr via reference 12, which is in fact to a book Bohr published in 1924. There is no mention of Langmuir in Hevesy’s review.

Postscript1: Hafnium (as its oxide) is now an essential element to the ever smaller fabrication of silicon chips (32nm and smaller). It is one of 14 elements considered essential to the future green technologies (six of which, but not including Hafnium, are considered in critical risk of supply disruption by 2015).

References

N. Bohr, "Der Bau der Atome und die physikalischen und chemischen Eigenschaften der Elemente", Zeitschrift für Physik, vol. 9, pp. 1-67, 1922. https://doi.org/10.1007/bf01326955

Tags:Bohr, Bury, chemical bonding, chemical properties, Copenhagen, green technologies, Hafnium, Historical, Langmuir, Niels Bohr, silicon chips, Technetium, X-ray
Posted in Chemical IT, General | 5 Comments »

Blogs, Twitter, Wikis and other on-line tools: the movie!

Friday, May 27th, 2011

Libraries (and librarians) are evolving rapidly. Thus a week or so ago one of our dynamic librarians here approached some PhD students and academics to ask them how they used “Web 2.0” (thanks Jenny!). The result was edited (thanks John!) and uploaded, where you can see it below (embedded in this post, I might add, using HTML5). No doubt there is more of this genre to come. Libraries nowadays, it seems, are not just about books and journals, but about the full digital experience (not to mention sustenance; ours is now one of the more popular places for students to eat!).

In another initiative, several of our research lectures will shortly be recorded, with slides, audio and video interleaved and the result expressed via our iTunesU site (in fact, I also tried a project along those lines in 1999, and the lectures are still visible here). Lecture podcasts are on the increase (inject directly into iTunes here to see/hear talks I gave on the topic of Wikipedia and iPads) and I have previously noted on this blog my thoughts about the future of (e)Books. A common theme of all this digital content is to maintain a balance between purely visual entertainment, whilst trying to also create re-usable and semantically-rich components. The movie above, informative as it might be, is largely meant to be entertaining (or engaging; I leave you to judge whether it succeeds in either endeavour). These blog posts (until this one), have concentrated more on the content than the style (although do note that I have been assiduous in running this blog with a mobile-device plugin so that it can be at least in part viewed in such a manner), delivering the former via Jmol models (and perhaps more of HTML5 in the future), with data-oriented information supplied via links to digital repositories.

I am struck by the ever increasing contrast between “chalk-n-talk” (the photo below pertains to my office blackboard, and as you can see I do still love my chalk, thanks Greg!) and the (probably bewildering) variety of additional digital outlets we now have. How on earth does one cope?

Office blackboard, with chalk!

Tags:Chalk, iPads, on-line tools, Twitter
Posted in Chemical IT, General | 1 Comment »

What is the future of books?

Friday, April 29th, 2011

At a recent conference, I talked about what books might look like in the near future, with the focus on mobile devices such as the iPad. I ended by asserting that it is a very exciting time to be an aspiring book author, with one’s hands on (what matters), the content. Ways of expressing that content are currently undergoing an explosion of new metaphors, and we might even expect some of them to succeed! But content is king, as they say.

Here I list only some innovative solutions which have emerged in the last year or so, but which also raise important issues which we ignore at our peril.

TouchPress were one of the first publishers to get off the mark with their living books. Their first offering was The Elements, deriving from an earlier interactive display of the periodic table (an example of which can be seen in the entrance to the chemistry building at Imperial College). It is a programmed book, in the sense that the content is expressed using code written by the publisher (very much in the manner of interactive games).
Next to appear were Inkling, who describe their offering as interactive. Their approach is described in a blog written by their founder, Matt Macinnis. There he talks about The Art of Content Engineering, which again makes it sound as if authoring a book is in effect programming it! (I know what he means; if you follow the link to the talk I allude to above, you may spot that it too is, at least in part, programmed, and not simply written). Inkling also promote the book as part of a social network, with readers able to annotate the content, and share that annotation with others.
The latest company to change the way books are both read and authored is Pushpoppress, the heart of which is also an interactive app.
Then there is the epub3 format. This is a free and open standard for e-books. This third revision in particular is meant to enhance interactivity.

Something of a common theme so far. Books are going to be interactive! But what about these issues?

Each of the first three (commercial) publishers above has adopted their own programming format. Although HTML5 may be at the heart of some of this, programming may also mean control (in the sense that the creative industries must put control of their content at the heart of what they do). Each of the first three above sounds like a closed system, and extracting re-usable content is, I argue, an essential part of doing science. I am just a tad worried that the approaches exemplified above may not allow this to happen.
Suppose you manage to acquire a chemistry textbook in any of the four approaches listed above. Will they inter-operate, in the sense of being able to extract data from one and perhaps inject it into another? Or will each be a data- or information silo, rigidly controlled by the creative content generator (whoever that is)?
What might an aspiring author, intent on creating interactive content do? Should they go closed/proprietary or open? They will clearly need to retrain themselves. We have indeed come a long way along the road: hand-written manuscript → typed manuscript → word-processed manuscript → interactive app! Like computer games, is the day of the single-authored book rapidly fading, to be replaced by a large team, each with their own tasks to perform?

I end with this question. Is the era of books, just like the Web itself, going to be the app? And who will be able to (find the time to) participate?

Tags:aspiring author, aspiring book author, e-books, Imperial College, intent on creating interactive content do, iPad, King, Matt Macinnis, mobile devices, social network, Tutorial material
Posted in Chemical IT, General | 8 Comments »

Chemicalizing a blog.

Wednesday, March 30th, 2011

I am at the ACS meeting, attending a session on chemistry and the Internet. This post was inspired by Chemicalize, a service offered by ChemAxon, which scans a post like this one, and identifies molecules named. I had previously used generic post taggers, which frankly did not work well in identifying chemical content. So this is by way of an experiment. I list below some of the substances about which I have blogged, to see how the chemicalizer works.

Mauveine
Copper phthalocyanine
Lapis Lazuli (this is a difficult one, since the active ingredient is actually trisulfide radical anion or S₃•⁻; let’s see if any of that is picked up!)
Cyanohydrin (a generic term, but more specifically HCN + Formaldehyde)
diberyllium
Calixarene (another generic term)
1,3-dimethylcyclobutadiene and carbon dioxide
Z-DNA and Z-d(CGCG)₂
Cyclohexane, cyclohexene, cyclohexadiene and benzene (the third of these is ambiguous as I have written it)
CH₃NO (a formula, with many isomers of course)
Dicarbon or C₂ and a cyonium cation or CN⁺

That should suffice to see how such a list can be chemicalized.

Tags:ACS, chemical content
Posted in Chemical IT | 7 Comments »

Embedding molecules in blogs: ChemDoodle, WebGL and SVG

Friday, December 24th, 2010

If you get a small rotatable molecule below, then ChemDoodle/HTML5/WebGL is working. Why might this be important? Well, the future is mobile, in other words, devices that rely on batteries or other sources of built-in power. This means the power guzzling GPU cards of the past (some reach ~400 Watts!) cannot be used. Rather than using e.g. a full power OpenGL library, one will use Web-based graphics libraries, which (to quote Wikipedia) extends the capability of the JavaScript programming language to allow it to generate interactive 3D graphics within any compatible web browser. A typical target device might be for example Apple’s iPad (for which the redoubtable Jmol, which is based on Java, is unlikely to ever work).

To find out if your device and its browser can support this type of graphical display, go to either this test page or this more general one (which at the time of writing actually gets the WebGL test wrong!).

I have deployed an earlier graphical methodology in other posts (SVG), which many browsers now support. This combination of HTML5, SVG and WebGL is the future! For its use on another blog, see here.

Tags:3D graphics, Apple, GPU, HTML5, iPad, Java, JavaScript, OpenGL library, SVG, typical target device, Web browser, Web-based graphics libraries, WebGL
Posted in Chemical IT | 3 Comments »

(re)Use of data from chemical journals.

Wednesday, December 22nd, 2010

If you visit this blog you will see a scientific discourse in action. One of the commentators there notes how they would like to access some data made available in a journal article via the (still quite rare) format of an interactive table, but they are not familiar with how to handle that kind of data (file). The topic in question deals with various kinds of (chemical) data, including crystallographic information, computational modelling, and spectroscopic parameters. It could potentially deal with much more. It is indeed difficult for any one chemist to be familiar with how data is handled in such diverse areas. So I thought I would put up a short tutorial/illustration in this post of how one might go about extracting and re-using data from this one particular source.

Interactive Journal table

The above is a snapshot of part of the table in question, with a box in the middle set aside for a Jmol applet to appear. What might be both less obvious, and less familiar to many who might have seen such a display is the very rich environment available for manipulating the data. To expose some of this, proceed as follows:

Firstly, load a molecule into the Jmol window by clicking on e.g. the hyperlink shown below.

Loading a molecule
The display shown below will appear, in this case a set of coordinates used to present a 3D model of a molecule, which can be rotated, zoomed, etc. It also has been labelled with various selected bond lengths etc.

Interactive table with molecule loaded
To extract data, right-click anywhere in the molecule area. Navigate through the menus which appear as shown below. In this case, the data is present in the form of a Gaussian log file. This can contain the history of the particular calculation performed (e.g. a geometry optimisation) or as in this case, all 3N-6 calculated normal vibrational modes. The one of interest here is number 318, being an O=C=O stretching mode.

An Interactive table in a chemistry journal.
This mode can now be manipulated visually by selecting various parameters:

Manipulating a vibrational mode
Jmol has a scintillating display of other options, and more are being added all the time, so the above display is by no means the limit of what one can do.
Now to the most important bit. Invoke the menu as shown below, whereupon a copy of the relevant file (gzipped in this case to reduce its size) will be downloaded to your local system. You will now need to use a program on your own computer capable of reading and processing such a file (after unzipping).

Downloading a data file.
There may be a bewildering variety of programs and toolkits which may perform the operation you wish on such a file. Some are commercial, some are open source. To help people get going, I link to one of the latter type here. You might also want to visit the Quixote project for ideas.
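As an illustration of what processing such a file can look like, here is a minimal sketch that reads the gzipped log directly and collects the calculated vibrational wavenumbers. It assumes the log reports them on lines beginning with "Frequencies --", as Gaussian output normally does; the function name and the command-line usage are hypothetical, and a dedicated toolkit would handle far more of the file than this.

```python
import gzip
import re
import sys

# Lines such as " Frequencies --   667.1234   1350.5000   2400.7500"
FREQ_LINE = re.compile(r"^\s*Frequencies\s+--\s+(.*)$")


def read_frequencies(path: str) -> list[float]:
    """Return every wavenumber found on 'Frequencies --' lines of a gzipped log."""
    freqs: list[float] = []
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            match = FREQ_LINE.match(line)
            if match:
                freqs.extend(float(token) for token in match.group(1).split())
    return freqs


if __name__ == "__main__" and len(sys.argv) > 1:
    modes = read_frequencies(sys.argv[1])
    print(f"{len(modes)} modes found")
    if len(modes) >= 318:
        # Modes are numbered from 1, so mode 318 is index 317.
        print("mode 318:", modes[317])
```

Reading the compressed file in place avoids the separate unzipping step, although unzipping first works just as well.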
We are not quite finished yet. Perhaps a Gaussian log file does not suit your purpose. Well, now try clicking on this link.

Link to a digital repository
This produces a page such as below, which contains more files. In this example, several molecular identifiers are present (InChI and InChI key) to help identify the uniqueness of the system, the molecular coordinates are available as a .cml file which itself can be processed by a variety of software tools, the original file used to run the calculation can be inspected (if you want to, e.g., repeat it) as input.gjf, the logfile we have seen above, and a checkpoint file, which is most useful when using either the Gaussian program system or a visualiser (Gaussview, ChemBio3D etc, both commercial programs). A SMILES string is also offered, and sometimes (not in this example) a so-called wavefunction file which can be used by some programs to analyse the wavefunction, and perform e.g. QTAIM, ELF, NCI analyses.

A digital repository page.

It is now up to the user to identify suitable processing programs on their computer which fit their purpose.
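Because CML is XML, even a general-purpose XML parser can pull the coordinates out of the .cml file. The sketch below uses only the Python standard library and assumes the file uses the usual CML namespace with atom elements carrying elementType, x3, y3 and z3 attributes. The sample molecule and its coordinates are invented for illustration and are not taken from the repository entry; real files may differ (for example by carrying 2D coordinates only).

```python
import xml.etree.ElementTree as ET

CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Illustrative stand-in for a downloaded .cml file (coordinates are made up).
SAMPLE_CML = """<molecule xmlns="http://www.xml-cml.org/schema" id="co2">
  <atomArray>
    <atom id="a1" elementType="C" x3="0.000" y3="0.000" z3="0.000"/>
    <atom id="a2" elementType="O" x3="0.000" y3="0.000" z3="1.160"/>
    <atom id="a3" elementType="O" x3="0.000" y3="0.000" z3="-1.160"/>
  </atomArray>
</molecule>"""


def read_atoms(cml_text: str) -> list[tuple[str, float, float, float]]:
    """Return (element, x, y, z) for every atom that has 3D coordinates."""
    root = ET.fromstring(cml_text)
    atoms = []
    for atom in root.iterfind(".//cml:atom", CML_NS):
        if atom.get("x3") is None:
            continue  # skip atoms without 3D coordinates
        atoms.append((
            atom.get("elementType"),
            float(atom.get("x3")),
            float(atom.get("y3")),
            float(atom.get("z3")),
        ))
    return atoms


for element, x, y, z in read_atoms(SAMPLE_CML):
    print(element, x, y, z)
```

The same few lines would work on the text of a file read from disk, which is the point: no chemistry-specific software is needed just to get at the numbers.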
There is one other file present which I have not yet explained, the mets.xml manifest. This is a metadata file, containing (along with much else) an RDF declaration of (some) of the properties of the molecule. In theory at least, this file could be automatically harvested for the RDF, which could be injected into a triple store, and queried semantically using e.g. SPARQL. That is part of the semantic web.
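The harvesting step might look something like the following sketch, which walks a METS-style document and lists the simple subject/predicate/object statements found inside any embedded rdf:RDF block. Everything about the sample is an assumption: the example.org vocabulary, the property names and the layout are invented, and the actual mets.xml may be organised quite differently. It also handles only the simplest RDF/XML pattern; a full RDF library (rdflib, for instance) would be the realistic choice before loading a triple store.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical manifest fragment; the ex: vocabulary is invented for illustration.
SAMPLE_METS = """<mets xmlns="http://www.loc.gov/METS/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:ex="http://example.org/chem#">
  <dmdSec ID="dmd1">
    <mdWrap MDTYPE="OTHER">
      <xmlData>
        <rdf:RDF>
          <rdf:Description rdf:about="http://example.org/molecule/1">
            <ex:formula>CO2</ex:formula>
            <ex:source rdf:resource="http://example.org/calculation/1"/>
          </rdf:Description>
        </rdf:RDF>
      </xmlData>
    </mdWrap>
  </dmdSec>
</mets>"""


def harvest_triples(xml_text: str) -> list[tuple[str, str, str]]:
    """Collect (subject, predicate, object) from rdf:Description elements."""
    root = ET.fromstring(xml_text)
    triples = []
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree writes tags as "{namespace}local"; join them into one URI.
            predicate = prop.tag.replace("{", "").replace("}", "")
            obj = prop.get(f"{{{RDF_NS}}}resource") or (prop.text or "").strip()
            triples.append((subject, predicate, obj))
    return triples


for triple in harvest_triples(SAMPLE_METS):
    print(triple)
```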

I hope some of the screenshots here make the process of extracting data from an interactive table article a little more obvious. I must declare that this way of doing it is just one of the ways being explored and also (much to my regret) is not yet particularly common. But hopefully you might capture a little of what some of us believe to be the future of scientific journals.

Tags:chemical, chemical journals, chemist, opendata, RDF, semantic web, software tools, suitable processing programs, XML
Posted in Chemical IT, Interesting chemistry | 7 Comments »

Data-round-tripping: wherein the future?

Tuesday, December 7th, 2010

Moving (chemical) data around in a manner which allows its (automated) use in whichever context it finds itself must be a holy grail for all scientists and chemists. I posted earlier on the fragile nature of molecular diagrams making the journey between the editing program used to create them (say ChemDraw) and the word processor used to place them into a context (say Microsoft Office), via an intermediate storage area known as the clipboard. The round trip between the Macintosh (OS X) versions of these programs had been broken a little while, but it is now fixed! A small victory. This blog reports what happened when such a Mac-created Word document is sent to someone using Microsoft Windows as an OS (or vice versa).

As you might have guessed, the molecular diagram arrives largely dead, and not re-usable. Opening the .docx archive (it is nothing more than a zip file) reveals only a JPEG file residing inside. Nothing that can be chemically repurposed. If the reverse process is undertaken, of creating a ChemDraw diagram, and pasting it into Word on Windows, one finds in the .docx two components; a bit-mapped image linked to an active object containing the data. Only the first of these is recognised if the file makes its way to a Macintosh; i.e. the same story, the data is again lost. So the bottom line is that Mac users and Windows users cannot, after all, exchange repurposable molecular diagrams via Word documents with this combination of programs. This is not good.
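Since a .docx is just a zip archive, its contents can be inspected without Word at all. A minimal sketch using the Python standard library follows; it assumes the conventional layout in which pictures sit under word/media/ and embedded objects under word/embeddings/. The function names are hypothetical.

```python
import zipfile


def list_docx_parts(docx) -> list[str]:
    """Return the names of all parts inside a .docx (path or file-like object)."""
    with zipfile.ZipFile(docx) as archive:
        return archive.namelist()


def embedded_parts(docx) -> dict[str, list[str]]:
    """Separate bit-mapped media from embedded (potentially data-bearing) objects."""
    names = list_docx_parts(docx)
    return {
        "media": [n for n in names if n.startswith("word/media/")],
        "embeddings": [n for n in names if n.startswith("word/embeddings/")],
    }
```

Called as embedded_parts("report.docx") (a hypothetical file name), a document that lists only something like word/media/image1.jpeg and an empty embeddings list is the "largely dead" case described above: there is a picture, but nothing from which the chemistry can be recovered.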

But let me remind you what happened around 1993. The word processor was joined by a program called the Web browser. In 1996, the underlying content carrier, HTML, became XHTML (an instance of XML). Right from day 1 almost, such XHTML could be, and frequently was, repurposed. A memorable example is that search engines could use it to index the Web. The XHTML easily survived trips to and from clipboards. In 1996, CML joined HTML as a way of carrying chemical information capable of round-tripping without loss (if need be). There are other chemical XML languages in use nowadays, including CDXML used by the ChemDraw program. Word itself now uses XML (the x in .docx). So, after 14 years, why am I still describing the difficulties above? I am frankly at a loss to explain why there is still a need to write this post.

All is not entirely lost. The CML4Word approach is designed to enable (chemical) data round tripping from the outset. I do not yet know, however, whether the CML created and stored in the Word document using this mechanism is recognised anywhere outside of Word 2007 on Windows. If anyone can let me know of examples where such a CML-enabled Word document can be used in other environments, I would be very grateful (but not on OS X, as I know already).

And as I might have mentioned in the previous post on this topic, things may not however be getting better in that other carrier of information and data, the mobile phone/iPad, as exemplified by operating systems such as iOS or Android. Watch this space, as they say.

Tags:Android, cellular telephone, chemical, chemical information, Chemical IT, content carrier, HTML, iPad, JPEG, Mac OS X, Macintosh, Microsoft, Microsoft Windows, opendata, operating systems, search engines, Web browser, word processor, XML
Posted in Chemical IT | No Comments »

Data-round-tripping: moving chemical data around.

Saturday, November 20th, 2010

For those of us who were around in 1985, an important chemical IT innovation occurred. We could acquire a computer which could be used to draw chemical structures in one application, and via a mysterious and mostly invisible entity called the clipboard, paste it into a word processor (it was called a Macintosh). Perchance even print the result on a laser printer. Most students of the present age have no idea what we used to do before this innovation! Perhaps not in 1985, but at some stage shortly thereafter, and in effect without most people noticing, the return journey also started working, the so-called round trip. It seemed natural that a chemical structure diagram subjected to this treatment could still be chemically edited, and that it could make the round trip repeatedly. Little did we realise how fragile this round trip might be. Years later, the computer and its clipboard, the chemistry software, and the word processor had all moved on many generations (it is important to flag that three different vendors were involved, all using proprietary formats to weave their magic). And (on a Mac at least) the round-tripping no longer worked. Upon its return to the chemistry software (ChemDraw in this instance), the diagram had been rendered inert, un-editable, and devoid of semantic meaning unless a human intervened. By the way, this process of data-loss is easily demonstrated even on this blog. The chemical diagrams you see here are similarly devoid of data, being merely bit-mapped JPG images. Which is why, on many of these posts, I put in the caption Click for 3D, which gives you access to the chemical data proper (in CML or other formats). And I throw in a digital repository identifier for good measure should you want a full dataset.

It is only now that we (more specifically, this user) understand what had happened under-the-hood to break this round-tripping. In 1984, when Apple produced the Mac, they also produced a most interesting data format called PICT. A human saw the PICT as a PICTure, but the computer saw more. It (could) see additional data embedded in the PICT. The clipboard supported the PICT format, which meant that both picture and data could be transferred between programs. And ChemDraw and Word also understood this. Hence the ability to round-trip noted above (it has to be said between specifically these programs).
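The mechanism can be caricatured in a few lines of Python. This is only a toy model of the idea, not how PICT or any real clipboard is implemented: the names `ClipboardItem`, `copy_preserving`, `copy_picture_only` and `round_trips` are all invented for illustration, and the "picture" and "data" are placeholder values.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ClipboardItem:
    """Hypothetical clipboard entry: a picture plus optional embedded data."""
    picture: bytes                # what the human sees
    embedded_data: Optional[str]  # what the originating program can re-read


def copy_preserving(item: ClipboardItem) -> ClipboardItem:
    """A hop (clipboard or program) that passes picture and data through."""
    return ClipboardItem(item.picture, item.embedded_data)


def copy_picture_only(item: ClipboardItem) -> ClipboardItem:
    """A hop that keeps the picture but silently drops the embedded data."""
    return ClipboardItem(item.picture, None)


def round_trips(item: ClipboardItem,
                *hops: Callable[[ClipboardItem], ClipboardItem]) -> bool:
    """True if the embedded data is unchanged after every hop is applied."""
    current = item
    for hop in hops:
        current = hop(current)
    return current.embedded_data == item.embedded_data


# Placeholder content standing in for a structure diagram and its chemistry.
structure = ClipboardItem(picture=b"<rendered pixels>",
                          embedded_data="<molecule id='m1'/>")

print(round_trips(structure, copy_preserving, copy_preserving))
print(round_trips(structure, copy_preserving, copy_picture_only))
```

The point of the sketch is that the picture looks identical after either journey; only a check on the embedded data reveals that one hop in the chain has discarded it.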

Times moved on and the limitations of PICT set in. Apple refocussed on the PDF format, which is related, notice, to the PostScript format that Adobe had introduced in order to allow high-quality laser printing. PICT support was abandoned, and the various components no longer carried recognisable data (specifically the clipboard, or the ability of Word to recognise the data). Round-tripping broke. Does this matter? Well, one colleague where I work had accumulated more than 1000 chemical diagrams, which he decided to store in PowerPoint (and yes, he threw the original ChemDraw files away). The day came when he wanted to round-trip one of them. And of course he could not. He was rather upset, I have to say!

PDF was not really a format designed to carry data (see DOI: 10.1021/ci9003688). But, bless their hearts, the three vendors involved in this story all agreed to support data embedded in the PDF hamburger (and Adobe to tolerate it), and now once again a structure diagram can move into an Office program (on Mac) and out again and retain its chemical integrity. What lessons can be learnt?

Firstly, out of sight, out of mind. The clipboard is truly mostly out of sight, and it was not really designed from the outset to preserve data properly. Nowadays I wonder whether clipboards in general recognise XML (and hence CML) and preserve it. I truly do not know. But they should.
Secondly, any system which relies on three or four commercial vendors, who at least in the past, devised proprietary formats which they could change without warning, is bound to be fragile.
We have learnt that data is valuable. More so than the representation of it (i.e. a 2D or 3D structure diagram). But when it is lost, the users should care! And tell the vendors.
Peter Murray-Rust and his team have produced CML4Word (or as Microsoft call it, Chemistry add-in for Word). At its heart is data integrity. Fantastic! But I wonder if it survives on Microsoft’s clipboard (I know it does not on Apple’s, since CML4Word is not available on that OS, and is unlikely ever to become so).
And I can see history about to repeat itself. The same seems about to happen on new devices such as the Apple iPad. It too has copy/paste via a clipboard. I bet this will not round trip chemistry (or much other) data! Want to bet that the lessons of this story have not yet been learnt?

Oh, for those who wish to round-trip chemistry on a Mac, you will have to acquire ChemDraw 12.0.2 and Word 2011 (version 14.01), as well as OS X 10.6 for it to work.

Tags:Adobe, Apple, Apple iPad, ChemDraw 12, chemical data, chemical diagrams, chemical integrity, Chemical IT, chemical structure diagram, chemical structures, chemistry software, iPad, Mac, Mac OS X, Macintosh, Microsoft, opendata, PDF, Peter Murray-Rust, Postscript, word processor, XML
Posted in Chemical IT | 5 Comments »
