Posts Tagged ‘search engines’

PIDapalooza 2018. A conference like no other!

Tuesday, January 23rd, 2018

Another occasional conference report (day 1). So why is one about “persistent identifiers” important, and particularly to the chemistry domain?

The PID most familiar to most chemists is the DOI (digital object identifier). In fact there are many; some 60 types have been collected by ORCID (themselves purveyors of researcher identifiers). They sometimes even have different names; in life sciences they tend to be known instead as accession numbers. One theme common to many (probably not all) is that they represent sources of metadata about the object being identified. Further information if which allows you (or a machine) to decide if acquiring the full object is worthwhile. So in no particular order, here are some of the things I learnt today.

  1. Mark Hahnel noted the recent launch of the Dimensions resource which links research data with other research activities; I have not yet had a chance to learn its capabilities, but it seems an interesting alternative to other stalwarts such as eg Google Scholar etc.

    You can try this example: https://app.dimensions.ai/discover/publication?search_text=10.6084&search_type=kws&full_search=true which retrieves articles in which the data repository with prefix 10.6084 (Figshare) is cited. Try also the prefix 10.14469 which is the Imperial College repository.

  2. Andy Mabbett talked about the deployment and use of persistent identifiers (the Q numbers) in Wikidata, which increasingly underpin the basis for the various flavours of Wikipedia. He also noted their use of some 50 different identifiers.
  3. Johanna McEntyre noted some 5M published articles in life sciences which reference 1M+ ORCID identifiers, easily the domain with the fastest uptake of this type. Also noted was the new FREYA project; aiming to connect open identifiers for discovery, access and use of research resources.
  4. Tom Gillespie talked about RRID, or Research Resource Identifiers. Included in this are hardware, including instruments and with around 6000 RRIDs systematized so far. They argue this area promotes both the A and I of FAIR (accessible and inter-operable). Of course A and I mean many things to many people.
  5. Several other presentations talked about the finer detail of metadata, such as sub-classifications into e.g. descriptive/admin/technical, but I did rather miss demos showing how search queries of such fine-grained metadata could be constructed.

Apart from the presentations themselves, PIDapalooza is unusual for some other activities. Thus you could go get your PIDnails done, with a selection of 8 or so tasteful logos to choose from. There will be tattoos tomorrow (this is a conference for younger people after all). I may grab a photo or two to provide evidence!

 

Two stories about Open Peer Review (OPR), the next stage in Open Access (OA).

Thursday, October 5th, 2017

We have heard a lot about OA or Open Access (of journal articles) in the last five years, often in association with the APC (Article Processing Charge) model of funding such OA availability. Rather less discussed is how the model of the peer review of these articles might also evolve into an Open environment. Here I muse about two experiences I had recently.

Organising the peer review of journal articles is often now seen as the single most important activity a journal publisher can undertake on behalf of the scientific community; the very reputation of the journal depends on this process being conducted responsibly, thoroughly and with integrity by the selected reviewers. Reviewers conduct this process voluntarily, mostly anonymously, without remuneration or recognition and often with short deadlines for completion. After one such process, I recently received an interesting follow-up email from the journal, suggesting I register my activity with Publons.com, a site set up to register and give non-anonymous credit for reviewing activities. I should say that Publons is a commercial company, set up in 2012 to to “address the static state of peer-reviewing practices in scholarly communication, with a view to encourage collaboration and speed up scientific development”. Worthy aims, but like many a .com company nowadays, one might ask what the back-story might be. Thus many of the Internet giants, Google, Facebook, Twitter etc, do have back-stories, which often underpin their business models, but which may only emerge years after their founding. With only a hazy idea of what Publons’ back-story might be, I went ahead and registered my reviewing activity.

After doing so, I then accessed my entry. You only learn that I have reviewed for a particular journal, but nothing about the actual process itself. I did not really think that this experiment had done much to encourage collaboration and speed up scientific development. It might be useful for early career researchers to get their name exposed however.

I can almost understand why the review itself might not be publicly displayed, but as a result you learn nothing about the factual basis of the review and whether it might have been conducted responsibly, thoroughly and with integrity. Instead, I now suspect that the presence of my name on this site might merely encourage other publishers to deluge me with requests for further (freely donated) refereeing.

Discussing this at lunch, a colleague (thanks Ed!) reminded me of a veritable journal called Organic Syntheses. Here, authors submit a synthetic procedure and open identified “checkers” are invited to repeat the procedure and comment on it. The two roles are kept separate (i.e. the checkers do not become co-authors), but they could get credit for their activity. Thus if you view a typical recent entry[1] you will see a full biography and affiliation of the checkers given at the end, with footnotes often describing their own observations if they differ from those of the authors. 

This set me thinking whether an open peer review process might also contain such an element of checking, as well as informed comment, nay opinion, about the article itself and the conclusions it makes. The opportunity arose when I was contacted by an author who was about to submit a computational article to a journal. This journal allowed open peer review. If I agreed to review, my name would be attached to the article if accepted for publication. I undertook this on the basis that I would use this review to conduct some limited checking of the computations and other assumptions underpinning the conclusions in the submitted article. I also wanted this open process to include the data on which my review was based. Most importantly if anyone wished to replicate my replication, the barriers to doing so should be as low as is possible. Shortly thereafter, I received a formal invitation from the journal and I set about my task. Crucially, all my own calculations supporting the review were archived in a data repository, albeit under embargo. In my cover letter I included the DOI for my data and the embargo access code, so that the authors (and the editor of the journal if they so wished) could inspect the data against which I wrote my review.

Then followed standard procedures, whereby the authors took my comments into consideration, revised the article and the final version was indeed accepted and published.[2] You will find the two referees/checkers listed, although unlike Organic Syntheses,  there is no bibliographic information about them or their affiliation. I did ask the journal if they could at least link my ORCID identifier to my name, but that request was refused. If my name had been a common one, then disambiguating it into a unique identity could be a challenge. There was also no mechanism to associate my identity on the journal with any data on which I had based my review. Really, the only open aspect of this process was just my (potentially ambiguous) name, nothing else. No follow-up was received from the journal to add the review to Publons. 

The next stage was to contact the author who had originally set the process under way to ask them if they would mind my releasing the data on which my review had been based. They agreed, as also they did to my telling this story. The overall outcome is thus a published article with the reviewers (if not their reviews or any supporting evidence for their review) openly named. In this specific case, there is also an open dataset with a formal link back to the article in the form of a DOI (10.14469/hpc/2640, although I suspect this aspect is unique, even precedent setting), but one driven by the reviewer and not the journal. It would be nice to have bidirectional links between both article and the review data, but I do not know any publishers currently operating such a mechanism (if anyone knows such, please tell).

Now to the broader questions about the process described above. I think that the aspiration to encourage collaboration and speed up scientific development may indeed have been promoted by this association between article and the data assembled by the reviewer. Whether the final article was improved as a result of the processes described here I will leave the authors to comment if they wish. As with the checkers employed by Organic Syntheses, such a review process takes not just time, but resources. Resources that currently have to be freely donated by the reviewers and their host institution and which clearly cannot become expensive, time-consuming or onerous. That was not the case as it happens here; my contributions were facilitated by my having sufficient expertise to perform the tasks I undertook really quite quickly.

I will raise one more issue; that of whether to add my review to the dataset which is now openly available. In fact it is not included, in part because it related to the initially submitted version of the MS. The final MS version has been revised and so many of the comments in my review may only make sense if you have the first version to hand. It would be perhaps unreasonable to make the first drafts of manuscripts routinely available (although historians of science would probably love that!) alongside the reviews of that first draft. But I could also see a case for doing so if the community agreed to it. One to discuss for the future I think. There is also the associated issue of what should happen to any dataset associated with a review in the event that the final article is rejected and not accepted. Should the data remain permanently under embargo and the reviewer’s identity permanently anonymous? Perhaps opening up even such datasets might nevertheless  encourage collaboration and speed up scientific development, but I fancy some would consider that a step too far!

References

  1. J. Zhu, "Preparation of N-Trifluoromethylthiosaccharin: A Shelf-Stable Electrophilic Reagent for Trifluoromethylthiolation", Organic Syntheses, vol. 94, pp. 217-233, 2017. https://doi.org/10.15227/orgsyn.094.0217
  2. L. Li, M. Lei, Y. Xie, H.F. Schaefer, B. Chen, and R. Hoffmann, "Stabilizing a different cyclooctatetraene stereoisomer", Proceedings of the National Academy of Sciences, vol. 114, pp. 9803-9808, 2017. https://doi.org/10.1073/pnas.1709586114

How to search data repositories for FAIR chemical content and data: SubjectScheme

Thursday, June 8th, 2017

As data repositories start to flourish, it is reasonable to ask questions such as what sort of chemistry can be found there and how can I find it? Here I give an updated[1] worked example of a digital repository search for chemical content and also pose an important issue for the chemistry domain.

Firstly, I should say this search is restricted just to those data repositories that submit indexing terms (metadata) to DataCite, which is the agency that will be used to conduct the searches. Each type of metadata is defined by a prefix or operator field (much in the same way that an advanced Google search can be prefixed with an operator, e.g. author:). I will use just two such DataCite field prefixes here as exemplars (there are many more).

  1. media: This specifies the media type for the data being searched. For restriction to chemistry one takes advantage of the chemical/x- media type, as described previously.[2]
  2. SubjectScheme: This is a new declaration, as specified in the DataCite V4 metadata schema.[3] The subject scheme in effect declares a subject-specific term, and is designed to be used by domains such as chemistry.

This latter is best illustrated by one specific example of a search which I will dissect here:
https://search.datacite.org/works?query=media:chemical\/x\-gaussian*+SubjectScheme:inchikey+subject:XZYDALXOGPZGNV-UHFFFAOYSA-M+media:chemical\/x\-mnpub*

  1. https://search.datacite.org/works?query= queries the DataCite MDS (metadata store).
  2. media:chemical\/x\-gaussian* defines a media type which contains the string chemical/x-gaussian, with the * being a wild-card which allows any characters to follow this string. This now is specifying any data repository where Gaussian files have been deposited and assigned this media type.
  3. + represents a Boolean AND operator.
  4. SubjectScheme:inchikey restricts a subject search to a subjectScheme having the value inchikey, whilst
  5. subject:XZYDALXOGPZGNV-UHFFFAOYSA-M defines the value of the subject itself.
  6. media:chemical/x-mnpub completes the search definition, this relating to the mandatory additional presence of an Mpublish[4] file indicating (spectroscopic, probably NMR) data readable by the MestreNova program.

One hit with these restrictions has doi: 10.14469/HPC/2635 and clicking the button on the landing page for this object labelled metadata resolves to e.g.
https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/2635,
and downloads the metadata record for this object. Part of this record looks a bit like:

This brings me to the important issue for the chemistry domain, which is to agree upon a core set of SubjectSchemes for implementation in data repositories with domain-specific chemical content. The two subjects above, the InChI and the InChIKey seem obvious candidates for inclusion. But how the list is extended and how the SubjectScheme is specified are now matters for the community to discuss. Perhaps the IUPAC GoldBook is one starting point for the SubjectScheme URIs. Watch this space.


The \ syntax indicates an “escaped” character. Thus in chemicalx\-gaussian a \ ensured that the following / is treated as part of the search string, and not as part of the search syntax. Likewise \- ensures the minus character is part of the string and not a syntactic negation. The current list of characters requiring escaping is + - & | ! ( ) { } [ ] ^ " ~ * ? : \ /

The documentation lists common fields, but there are far more specified in V4 of their schema. The ones you see used here are not (yet?) documented at https://search.datacite.org/help.html

This Google page has a rich plethora of powerful searches, which I suggest almost no-one knows about!


References

  1. H.S. Rzepa, A. Mclean, and M.J. Harvey, "InChI As a Research Data Management Tool", Chemistry International, vol. 38, pp. 24-26, 2016. https://doi.org/10.1515/ci-2016-3-408
  2. H.S. Rzepa, P. Murray-Rust, and B.J. Whitaker, "The Application of Chemical Multipurpose Internet Mail Extensions (Chemical MIME) Internet Standards to Electronic Mail and World Wide Web Information Exchange", Journal of Chemical Information and Computer Sciences, vol. 38, pp. 976-982, 1998. https://doi.org/10.1021/ci9803233
  3. DataCite Metadata Working Group., "DataCite Metadata Schema Documentation for the Publication and Citation of Research Data v4.0", DataCite e.V., 2016. https://doi.org/10.5438/0012
  4. M.J. Harvey, A. McLean, and H.S. Rzepa, "A metadata-driven approach to data repository design", Journal of Cheminformatics, vol. 9, 2017. https://doi.org/10.1186/s13321-017-0190-6

Supporting information: chemical graveyard or invaluable resource for chemical structures.

Friday, March 31st, 2017

Nowadays, data supporting most publications relating to the synthesis of organic compounds is more likely than not to be found in associated “supporting information” rather than the (often page limited) article itself. For example, this article[1] has an SI which is paginated at 907; almost a mini-database in its own right! Here I ponder whether such dissemination of data is FAIR (Findable, accessible, interoperable and re-usable).[2]

I am going to use this article as my starting point.[3] One of the compounds discussed there is shown below; it is not explicitly discussed in the main body of the article. So how findable is it?

  1. A search of Scifinder (Chemical abstracts) using the structure above reveals one hit, the source being the expected one.[3]
  2. A search of Reaxys (used to be Beilstein) reveals no hits in their own database, but one hit is noted in …
  3. Pubchem, where it occurs as substance 163835830. The source is again cited correctly[3]. One of the properties reported is the InChI key: JSLVVAICXSKSEQ-UHFFFAOYSA-N. This is the same key generated from the structure drawing programs Chemdraw or ChemDoodle.
  4. Google on the other hand finds nothing for JSLVVAICXSKSEQ-UHFFFAOYSA-N.[4]
  5. I also tried Google Scholar but again with no luck.

So supporting information does appear to be indexed by both Chemical Abstracts and Pubchem; it is thankfully not a graveyard![5] The chemical databases do return valuable additional information about the molecule, such as e.g. its InChI key and much else besides. Given that presumably the open PubChem resource IS indexed by Google, it must be a policy somewhere that prevents e.g. JSLVVAICXSKSEQ-UHFFFAOYSA-N from being found.

I suppose the next question might be Supporting information: chemical graveyard or invaluable resource for chemical spectra? I confess here that this post was in fact inspired by a previous one on the topic of the provenance of NMR spectra. And perhaps also with some input from the concept of sonification of spectra, in which an instrumental spectrum is converted into a sound signature to allow blind people access to such information. I wonder whether a sonified unique digital signature could be used to search for spectra, somewhat in the manner that InChI helped in tracking down (or not) the molecule above? I think it would be reasonable to say that e.g. NMR spectra as embedded in say a 907 page supporting information document are likely to be very much less FAIR[2]. The solution there of course is better provenance and better metadata, as I previously mulled.


I cannot help but wonder what a carbonyl group sounds like!

References

  1. J.M. Lopchuk, K. Fjelbye, Y. Kawamata, L.R. Malins, C. Pan, R. Gianatassio, J. Wang, L. Prieto, J. Bradow, T.A. Brandt, M.R. Collins, J. Elleraas, J. Ewanicki, W. Farrell, O.O. Fadeyi, G.M. Gallego, J.J. Mousseau, R. Oliver, N.W. Sach, J.K. Smith, J.E. Spangler, H. Zhu, J. Zhu, and P.S. Baran, "Strain-Release Heteroatom Functionalization: Development, Scope, and Stereospecificity", Journal of the American Chemical Society, vol. 139, pp. 3209-3226, 2017. https://doi.org/10.1021/jacs.6b13229
  2. M.D. Wilkinson, M. Dumontier, I.J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J. Boiten, L.B. da Silva Santos, P.E. Bourne, J. Bouwman, A.J. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C.T. Evelo, R. Finkers, A. Gonzalez-Beltran, A.J. Gray, P. Groth, C. Goble, J.S. Grethe, J. Heringa, P.A. ’t Hoen, R. Hooft, T. Kuhn, R. Kok, J. Kok, S.J. Lusher, M.E. Martone, A. Mons, A.L. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van Schaik, S. Sansone, E. Schultes, T. Sengstag, T. Slater, G. Strawn, M.A. Swertz, M. Thompson, J. van der Lei, E. van Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wolstencroft, J. Zhao, and B. Mons, "The FAIR Guiding Principles for scientific data management and stewardship", Scientific Data, vol. 3, 2016. https://doi.org/10.1038/sdata.2016.18
  3. G.M.S. Yip, Z. Chen, C.J. Edge, E.H. Smith, R. Dickinson, E. Hohenester, R.R. Townsend, K. Fuchs, W. Sieghart, A.S. Evers, and N.P. Franks, "A propofol binding site on mammalian GABAA receptors identified by photolabeling", Nature Chemical Biology, vol. 9, pp. 715-720, 2013. https://doi.org/10.1038/nchembio.1340
  4. S.J. Coles, N.E. Day, P. Murray-Rust, H.S. Rzepa, and Y. Zhang, "Enhancement of the chemical semantic web through the use of InChI identifiers", Organic & Biomolecular Chemistry, vol. 3, pp. 1832, 2005. https://doi.org/10.1039/b502828k
  5. M. Karthikeyan, and R. Vyas, "ChemEngine: harvesting 3D chemical structures of supplementary data from PDF files", Journal of Cheminformatics, vol. 8, 2016. https://doi.org/10.1186/s13321-016-0175-x

500 chemical twists: a (chalk and cheese) comparison of the impacts of blog posts and journal articles.

Friday, June 3rd, 2016

The title might give it away; this is my 500th blog post, the first having come some seven years ago. Very little online activity nowadays is excluded from measurement and so it is no surprise that this blog and another of my “other” scholarly endeavours, viz publishing in traditional journals, attract such “metrics” or statistics. The h-index is a well-known but somewhat controversial measure of the impact of journal articles; here I thought I might instead take a look at three less familiar ones – one relating to blogging, one specific to journal publishing and one to research data.

First, an update on the accumulated outreach of this blog over this seven-year period. The total number of country domains measured is 190. The African continent still has quite a few areas with zero hits (as does Svalbard, with a population of only 2600 for a land mass area 61,000 km2 or 23 km2 per person). Given the low blog readership density on the African continent, it would be interesting to find out whether journal readership is any better.

Next, I look at the temporal distribution for individual posts. The first has attracted the highest total; in five years it has had 19,262 views (the diagram below shows the number of views per day). Four others exceed 10,000 and 80 exceed 1000 views.

Of these five, the next is the oldest, going back to 2009. I was very surprised to find such longevity, with the number of views increasing rather than decreasing with the passage of time.

So time now to compare these statistics with the journals. And of course its chalk and cheese. A “view” for a post means someone (or something) accessing the post URL, which is then recorded in the server log. Resolving the URL does at least load the entire content of the post; whether its read or not is of course not recorded. Importantly, if you want to view the content at some later stage, a new “view” has to be made (although some browsers do save a web page and allow offline viewing at a later stage, but I suspect this usage is low). With electronic journal access, it’s rather different. Access to an article is now predominantly via two mechanisms:

  1. From the table of contents (this is somewhat analogous to browsing a blog)
  2. From the article DOI.

Statistics for these two methods are gathered differently. The new CrossRef resource chronograph.labs.crossref.org (CrossRef allocate all journal DOIs) can be used to measure what they call DOI “resolutions”. A DOI resolution however leads one only to what is called the “landing page”, where the interested reader can view the title, the graphical abstract and some other metadata. It does not mean of course that they go on to actually view the article (as HTML, equivalent to the blog above, or probably more often by downloading a PDF file). Here are a few results using this method:

What about the other main journal article access method, not via a DOI but from a table of contents page journal page? A Google search revealed this site: jusp.mimas.ac.uk (JUSP stands for Journal usage statistics portal, which sounded promising). This site collects “COUNTER compliant usage data”. COUNTER (Counting Online Usage of Networked Electronic Resources) is an initiative supported by many journal publishers and it sounds an interesting way of measuring “usage” (as opposed to “views” or “resolutions”; it’s that chalk and cheese again!). I would love to be able to show you some statistics using this resource, but the “small print” caught me out: “JUSP gives librarians a simple way of analysing the value and impact of their electronic journals”. Put simply, I am a researcher, not a librarian. As a researcher I do not have direct access; JUSP is a closed, restricted access (albeit taxpayer-funded) resource. I am discussing this with our head of information resources (who is a librarian) and hope to report back here on the outcome.

Finally research data. This is almost too new to be able to measure, but this resource stats.datacite.org is starting to collect statistics on data resolutions (similar to DOI resolutions).

  1. You can see from the below for Imperial College (in fact this represents the two data repositories that we operate and which I cite here extensively on these blogs) that the resolution at running up to about 200 a month per dataset (more typically ~25 a month), with a total of 5065 resolutions for all items in March 2016 (the blog has ~12,000 views per month).

  2. Figshare is another data repository we have made use of:

So to the summary.

  1. Firstly, we see that I have shown three forms of impact, views, resolutions and usage. If one had statistics on all three, one might then try to see if they are correlated in any way. Even then, normalisation might be a challenge.
  2. Over ~7 years, five posts on this blog have attracted >10,000 views.
  3. Many of the blog posts have a long “finish” (to use a wine tasting term); the views continue regularly and often increase over time.
  4. My analysis of the three journal articles above (and about 15 others) shows that between 50-300 resolutions over a few years is fairly typical (for this researcher at least; I am sure most better known researchers attract far far more).
  5. The temporal distribution for article resolutions and blog views show both can have continuing impact over an extended period. None of the 18 articles I looked at show a significantly increasing impact with time but many of the blog posts do. This tends to suggest that the audiences for each are quite different; researchers for articles and a fair proportion of inquisitive students for the blog?
  6. I may speculate whether a correlation between my article resolutions and my h-index probably might be found, but the article resolution has a fine-grained temporal resolution (allowing a derivative wrt time to be obtained) that is perhaps potentially more valuable than just the coarse h-index integration (an article can of course be cited for both positive and negative reasons!).
  7. Initial analysis for data shows resolutions running at a similar rate to article resolutions. It is not yet possible to correlate data resolutions with article resolutions in which that data is discussed.

References

  1. S.M. Rappaport, and H.S. Rzepa, "Intrinsically Chiral Aromaticity. Rules Incorporating Linking Number, Twist, and Writhe for Higher-Twist Möbius Annulenes", Journal of the American Chemical Society, vol. 130, pp. 7613-7619, 2008. https://doi.org/10.1021/ja710438j
  2. A.E. Aliev, J.R.T. Arendorf, I. Pavlakos, R.B. Moreno, M.J. Porter, H.S. Rzepa, and W.B. Motherwell, "Surfing π Clouds for Noncovalent Interactions: Arenes versus Alkenes", Angewandte Chemie International Edition, vol. 54, pp. 551-555, 2014. https://doi.org/10.1002/anie.201409672
  3. K. Abersfelder, A.J.P. White, H.S. Rzepa, and D. Scheschkewitz, "A Tricyclic Aromatic Isomer of Hexasilabenzene", Science, vol. 327, pp. 564-566, 2010. https://doi.org/10.1126/science.1181771

Data-round-tripping: wherein the future?

Tuesday, December 7th, 2010

Moving (chemical) data around in a manner which allows its (automated) use in whichever context it finds itself must be a holy grail for all scientists and chemists. I posted earlier on the fragile nature of molecular diagrams making the journey between the editing program used to create them (say ChemDraw) and the Word processor used to place them into a context (say Microsoft office), via an intermediate storage area known as the clipboard. The round trip between the Macintosh (OS X) versions of these programs had been broken a little while, but it is now fixed! A small victory. This blog reports what happened when such a Mac-created Word document is sent to someone using Microsoft Windows as an OS (or vice versa).

As you might have guessed, the molecular diagram arrives largely dead, and not re-usable. Opening the .docx archive (it is nothing more than a zip file) reveals only a JPEG file residing inside. Nothing that can be chemically repurposed. If the reverse process is undertaken, of creating a chemdraw diagram, and pasting it into Word on Windows, one finds in the .docx two components; a bit-mapped image linked to an active object containing the data. Only the first of these is recognised if the file makes its way to a Macintosh; i.e. the same story, the data is again lost. So the bottom line is that Mac users and Windows users cannot, after all, exchange repurposable molecular diagrams using Word documents using this combination of programs. This is not good.

But let me remind what happened around 1993. The word processor was joined by a program called the Web browser. In 1996, the underlying content carrier, HTML, became XHTML (an instance of XML). Right from day 1 almost, such XHTML could, and frequently was repurposed. A memorable example is that search engines could use it to index the Web. The XHTML easily survived trips to and from clipboards. In 1996, CML joined HTML as a way of carrying chemical information capable of round-tripping without loss (if need be). There are other chemical XML languages in use nowadays, including CDXML used by the ChemDraw program. Word itself now uses XML (the x in .docx). So, after 14 years, why am I still describing the difficulties above? I am frankly at a loss to explain why there is still a need to write this post.

All is not entirely lost. The CML4Word approach is designed to enable (chemical) data round tripping from the outset. Although I do not yet know if the CML created and stored in the Word document using this mechanism is recognised anywhere outside of Word 2007 on Windows?  If anyone can let me know of examples where such a CML-enabled Word document can be used in other environments, I would be very grateful (but not on  OS X, as I know already).

And as I might have mentioned in the previous post on this topic, things may not however be getting better in that other carrier of information and data, the mobile phone/iPad, as exemplified by operating systems such as iOS or Android. Watch this space, as they say.