Posts Tagged ‘Peter Murray-Rust’
Tuesday, October 4th, 2016
Peter Murray-Rust and I are delighted to announce that the 2016 award of the Bradley-Mason prize for open chemistry goes to Jan Szopinski (UG) and Clyde Fare (PG).
Jan’s open chemistry derives from a final year project looking at why atom charges derived from quantum chemical calculation of the electronic density represent chemical information well, but the electrostatic potential (ESP) generated from these charges is very poor and conversely charges derived from the computed electrostatic potential are incommensurate with chemical information (such as the electronegativity of atoms). He has developed a Python program called ‘repESP’ in which ‘compromise’ charges are generated which attempt to reconcile the physical world-view (fitting the ESP) with chemical insight provided by NPA (Natural Population Analysis). Jan was the main driver to making his code open source, “opening his supervisor’s eyes” to the various flavours of open source licences. To ensure that all subsequent improvements to the program remain available to anyone, the source code has been released under a ‘copyleft’ licence (GPL v3) and is maintained by Jan on GitHub, where Jan looks forward to helping new users and collaborating with contributors.
Clyde has made various contributions to opensource chemistry over the period of his PhD, with the focus mainly on utilities to improve quantum chemical research and the enhancement of a popular machine learning library with a method that has been successful in chemometrics, creation of an opensource channel for teaching chemists programming and data analysis and creation of a tool to help encourage open sourcing software development. Cclib is the most popular library for parsing quantum chemical data from output files and Clyde has contributed patches for the Atomic simulation environment which enables control of quantum chemical codes from a unified python interface. He was responsible for the construction of a computational chemistry electronic notebook published to github and which is now under active development by others as well. This aims to encapsulate computation chemical research projects, both for the sake of reproducibility and for the sake of organising and keeping track of quantum chemical research. Alongside this platform he created an enhanced Gaussian calculator for the Atomic Simulation Environment that enables automatic construction of ONIOM input files, also now under active development. He also made contributions to scikit learn, the most popular python machine learning framework, implementing a kernel for Kernel Ridge Regression that has become the most successful kernel for regression over molecular properties. He was part of the team that won the 2014 sustainable software conference prize for creation of the opensource healthchecker software as part of Sustain. He has argued for opensource as a platform for teaching resources and created the Imperial Chemistry github user account, which is now run by the department. Materials for the Imperial Chemistry Data Analysis and Programming workshops implemented as Python Notebooks are now available through this account and continue under active development.
Criteria for the award will include judging the submission on its immediate accessibility via public web sites, what is visible and re-usable in this way and of evidence of either community formation/engagement or re-use of materials by people other than the proposer.
Tags:Analytical chemistry, chemical information, chemical insight, Cheminformatics, Chemistry, Chemometrics, Clyde Fare, Company: GitHub, computation chemical research projects, computational chemistry, computing, Cross-platform software, driver, GitHub, Jan Szopinski, machine learning, open sourcing software development, opensource healthchecker software, Peter Murray-Rust, public web sites, Python, quantum chemical calculation, quantum chemical codes, quantum chemical data, quantum chemical research, Quotation, Server & Database Software, simulation, Software, supervisor, sustainable software conference prize, Technology/Internet
Posted in Bradley-Mason Prize for Open Chemistry | No Comments »
Tuesday, August 16th, 2016
This week the ACS announced its intention to establish a “ChemRxiv preprint server to promote early research sharing“. This was first tried quite a few years ago, following the example of especially the physicists. As I recollect the experiment lasted about a year, attracted few submissions and even fewer of high quality. Will the concept succeed this time, in particular as promoted by a commercial publisher rather than a community of scientists (as was the original physicists model)?
The RSC (itself a highly successful commercial publisher) has picked up on this and run its own commentary. You will find quotes from yours truly there, along with Peter Murray-Rust, a long time ardent promoter of community driven open science. One interesting aspect is that the ACS runs around 50 journals, and the decision on whether each will accept preprints for publication will (shortly = next few weeks) be made by the individual editors. I wonder if the eventual list of those supporting the project will bring any surprises (bets on J. Am. Chem. Soc. preprints anyone)?
But I want to pick up on the declared aspiration “to promote early research sharing“. Here I couple research sharing with data sharing. If you share your research, you should also share the data resulting from that research. We are now entering a new era of data sharing (in part as a result of mandation by various funding bodies) and so one has to ask whether a pre-print server will encourage people to create and share FAIR data (data which is findable, accessible, inter-operable and re-usable) as a model to replace the current one of “supporting information” held in enormous PDF files (mostly unFAIR on at least three counts). This question is indeed posed in the RSC commentary. What I would like to see happen are projects such as that described here, which create what were described as “first class research objects”, and which I think amply fulfil the criteria of being FAIR. So, will ChemRxiv preprint servers help promote such FAIR data sharing as part of early research sharing? We will find out soon.
The ACS supports OA (Open Access) sharing of articles, provided the authors pay (or arrange payment of) the appropriate APC or article processing charge. These charges are complex, being subject to various discounts (for example if you as an author are an ACS member or not) but are generally not insignificant (> $1000). I wondered whether preprints might be subject to an APC, and so I asked the ACS. The response was “we don’t anticipate any submission or usages fees at this time“. I think that means free at point of submission, and free at point of readership “at this time“.
Finally, let me now summarise as I understand the current family of “research publications”:
- The preprint
- The final author version as submitted to a journal
- The “version of record” (VoR) as published by the journal
- Any FAIR published data associated with the article
All four of these are attempts at “research sharing”. Each may be located in a different location, and each may have its own DOI. And of course we cannot easily know how much overlap there is between each of them. Thus, how might 1-3 differ in terms of the story or “narrative” of scientific claims? Does 4 agree or support 1-3? Does 4 agree with perhaps data subsets contained in 1-3? If keeping abreast of the current research literature is a challenge, imagine having to cope with/reconcile up to four versions of each “publication”!
Lots of food for thought here. We have not heard the last of these themes.
Tags:Academia, Academic publishing, article processing charge, author, Data publishing, Data sharing, food, Grey literature, Open access, Open science, PDF, Peter Murray-Rust, pre-print server, Preprint, preprint server, Public sphere, Publishing, Scholarly communication, Technology/Internet
Posted in Chemical IT | 1 Comment »
Friday, June 26th, 2015
Open principles in the sciences in general and chemistry in particular are increasingly nowadays preached from funding councils down, but it can be more of a challenge to find innovative practitioners. Part of the problem perhaps is that many of the current reward systems for scientists do not always help promote openness. Jean-Claude Bradley was a young scientist who was passionately committed to practising open chemistry, even though when he started he could not have anticipated any honours for doing so. A year ago a one day meeting at Cambridge was held to celebrate his achievements, followed up with a special issue of the Journal of Cheminformatics. Peter Murray-Rust and I both contributed and following the meeting we decided to help promote Open Chemistry via an annual award to be called the Bradley-Mason prize. This would celebrate both “JC” himself and Nick Mason, who also made outstanding contributions to the cause whilst studying at Imperial College. The prize was initially to be given to an undergraduate student at Imperial, but was also extended to postgraduate students who have promoted and showcased open chemistry in their PhD researches.
Peter and I are delighted to announce the inaugural winners of this prize.
The postgraduate winner is Tom Phillips for his open blog describing his experiences as a PhD student and for leading by example. He has published his instrumental codes on Github (and now Zenodo[1]) and data and codes for reproducing the graphs in his work on the “lab on a chip” in Figshare[2] and through his blog has encouraged other research students to do the same. Tom has worked assiduously to ensure that all the articles describing his PhD work are or will be open access.[3]
The undergraduate winner is Tom Arrow for his “spare time” involvement with WikiMedia (the foundation that underpins the open Wikipedia), including participating in a Wikimedia EU hackathon in Lyon France, and feeding his experiences and skills back into his undergraduate environment as well as enhancing the teaching Wiki used by his fellow students. Tom took the lead in introducing us to Wikidata[4] for storing chemical data in an open Wikibase data repository and in promoting its use for enriching Wikipedia chemistry pages and showcasing open data in undergraduate teaching environments.
References
- T. Phillips, and S. Macbeth, "pumpy: Zenodo release", 2015. https://doi.org/10.5281/zenodo.19033
- T. Phillips, J.H. Bannock, and J.D. Mello, "Data for microscale extraction and phase separation using a porous capillary", 2015. https://doi.org/10.6084/m9.figshare.1447208
- T.W. Phillips, J.H. Bannock, and J.C. deMello, "Microscale extraction and phase separation using a porous capillary", Lab on a Chip, vol. 15, pp. 2960-2967, 2015. https://doi.org/10.1039/c5lc00430f
- D. Vrandečić, and M. Krötzsch, "Wikidata", Communications of the ACM, vol. 57, pp. 78-85, 2014. https://doi.org/10.1145/2629489
Tags:Cambridge, chemical data, Chemistry Central, Collective intelligence, Crowdsourcing, Doctor of Philosophy, Education, European Union, France, GITHUB INC., Imperial College, Jean Claude Bradley, lab on a chip, Lyon, Nick Mason, Nonprofit technology, Open content, Peter Murray-Rust, reward systems, Technology/Internet, Tom Arrow, Tom Phillips, Wikimedia Foundation, wikipedia, World Wide Web, young scientist
Posted in Bradley-Mason Prize for Open Chemistry, Chemical IT | 1 Comment »
Friday, October 11th, 2013
In 1993-1994, when the Web (synonymous in most minds now with the Internet) was still young, the pace of progress was so rapid that some wag worked out that one “web-year” was like a dog-year, worth about 7 years of normal human time. So in this respect, 1994 is now some 133 web-years ago. Long enough for an archaeological excavation.
And so it was that I came across two Web-pages that have suddenly acquired a topical significance:
- http://www.ariadne.ac.uk/issue1/clic
- http://doi.org/10.1080/13614579509516846[1]
Their topicality in part arises from e.g. http://www.rsc.org/AboutUs/News/PressReleases/2013/RSC-announces-chemical-sciences-repository.asp where the RSC seeks community support to help curate the data we as scientists produce.
Some of my recent posts (this one on dual-publisher models and this one on publishing procedures) also pertain to this and Peter Murray-Rust is constantly blogging on the topic (see this for the latest).
Perhaps 2013 will indeed be the year of data!
References
- D. James, B.J. Whitaker, C. Hildyard, H.S. Rzepa, O. Casher, J.M. Goodman, D. Riddick, and P. Murray‐Rust, "The case for content integrity in electronic chemistry journals: The CLIC project", New Review of Information Networking, vol. 1, pp. 61-69, 1995. https://doi.org/10.1080/13614579509516846
Tags:chemical, Peter Murray-Rust, Web-pages, web-years
Posted in Chemical IT | 1 Comment »
Sunday, September 15th, 2013
I do go on rather a lot about enabling or hyper-activating[1] data. So do others[2]. Why is sharing data important?
- Reproducibility is a cornerstone in science,
- To achieve this, it is important that scientific research be open and transparent.
- Openly available research data is central to achieving this. It is estimated that less than 20% of the data collected in chemistry is made available in any open manner.
- RCUK (the UK research councils) wish increased transparency of publicly funded research and availability of its outputs‡
But it’s not all hot air, honestly. Peter Murray-Rust and I had started out on a journey to improve reproducibility, openness and transparency in (inter alia) scientific publishing in 1994. In 2001 we published an example of a data-rich article[3] based on CML, and by 2004 the concept had evolved into something Peter termed a datument[4]. Some forty such have now been crafted.[5]
In 2009, the journal Nature Chemistry was starting up, and I approached them with the idea of an interactive data exploratorium on the premise that a new journal might be receptive to new ways of presenting science. It was accepted and published[6] and was followed in 2010 by a second variation.[7] In both cases, these activated-figures were sent to the journal as part of the submission process, and hosted by them (they still are). You can even access them without a subscription to the journal!
Move on to 2012, when David Scheschkewitz had some very exciting silicon chemistry to report, we collaborated on some computational modelling, and sent the resulting article to Nature Chemistry for publication. This included the usual interactive table reporting the modelling and its data. However, it transpired that the production workflows for Nature Chemistry had been streamlined and I was informed that interactive tables could no longer be accepted. This time, we (i.e. the authors) would have to solve the issue of how to host and present the data ourselves.
I was very keen that this table be treated with equal weight to the article itself (citable in its own right) and that it not be downgraded to supporting information (ESI). My objection to ESI is that it is often poorly structured by authors, i.e. it is not prepared in a form which allows the data to be re-used, either by a perceptive human, or a logical machine. As a result it is often given little attention by referees (although bloggers seem to do a far better job) and furthermore can end up being lost behind a pay wall (the two Nature Chem interactive objects noted above can be openly accessed, but only if you know that they exist). So I determined that:
- The table should be immediately accessible by non-experts, but not through any convoluted processes of downloading a file, expanding it and finding the correct document within the resulting fileset to view in the correct program, which is how normal ESI is handled.
- The table and the data it contained within should be capable of acting as a scientific tool, forming what could be the starting point for a new investigation if appropriate.
To solve this issue, some lateral and quick thinking was needed. The solution was a two-component model in which the original article is treated as a “narrative“, intertwingled with a second, but nevertheless distinct component, the “data“. This data would follow the principles of the Amsterdam Manifesto; it would itself be citable. The two components would become symbiotes (a datument). The narrative[8] could cite this data and the data could back-link to the narrative. The data would inherit trust (i.e. peer review) from that applied to the narrative and the latter would inherit a date stamp and integrity from the data host (in this case Figshare[9]).*
The data itself can have two layers, presentation [9]¶ using a combination of software (Jmol or JSmol for chemistry) which are used to invoke the “raw” data. That data itself is citable[10] (this is just a single example, resident as it happens on a different repository). The reader can choose use just the presentation layer or the underlying data.
The data object can be embedded in other pages; here it is below. The data sources for this table are themselves citable[11].
What are the advantages of such an approach? (the “what’s in it for me” question often asked by research students and their supervisors)
- Each of the components is held in an environment optimised for it and so can be presented to full advantage.
- The conventional narrative publisher does not necessarily also have to develop their own infrastructures for handling the data. They can choose to devolve that task to a “data publisher”.
- The data publisher (Figshare in this case) makes the data open. One does not need an institutional subscription to access it.
- “Added value” for each component can be done separately. Thus most narrative publishers would not necessarily wish to develop infrastructures for validating it or subsequently mining such “big data”. Indeed data mining of journals is prohibited by many publishers; it simply is either not possible or rendered so administratively difficult as to be impractical.
- Whilst a narrative article must clearly exist as a single instance (otherwise the authors would be accused of plagiarism), data can have multiple instances. Indeed, there exist protocols (SWORD) for moving data from one repository to another as the need arises. Publishing the same data in two or more locations is not currently considered plagiarism!
- The data component can be published as part of an article or say as part of a PhD thesis. This way, the creator of the data gets the advantages not of a date stamp associated with a narrative citation but of a much earlier stamp associated more closely with the actual creation of the data. That could easily and usefully resolve many disputes about who discovered what first, leaving the other issue of who interpreted what first to the narrative. I should mention that it is perfectly possible to “embargo” the data deposition so that it only becomes public when the narrative does (although you may choose not to do this).
- A data deposition cannot be modified, but a new version (which bidirectionally links back to the old one) can be published if say more data is collected at a future date.
- A whole infrastructure devoted just to enhancing the cited data can evolve; one that is unlikely to do so if the narrative publishers are the only stakeholders. For example, synthetic procedural data can be tagged using the excellent chemical tagger.
- It is relatively simple (=cheap) to build a pre-processor for publishing data, which for a research student can act as an electronic laboratory notebook, holding meta-data about the deposited/published data and the handles (doi) associated with each deposition. I have been using such an environment now for about seven years as the e-notebook for this blog for example. Thus the task of preparing figures and tables for a publication (or a blog post) is greatly facilitated. The same system is also used by research students and undergraduates for their lab work.
- I have noted previously how e.g. Google Scholar identifies data citations along with article citations in constructing an individual research profile. A researcher could become known for their published data as well as their published narratives. Indeed, it seems likely that the person who acquires and publishes the data, i.e. the research student, would then get accolades directly rather them all accruing to their supervisor.
But what can you, gentle reader of this blog, do to help? Well, ask if your institution already has, or plans to create a data repository. It can be local (we use DSpace) or “in-the-cloud” (e.g. Figshare). If not, ask why not! And if you are planning to submit an article for publication in the near future, ponder how you might better share its data.
‡ As first circulated on 28 April, 2011. See
http://www.epsrc.ac.uk/about/standards/researchdata/Pages/policyframework.aspx
†The example given at the start of this post[8] contains only one table processed in this manner; the actual synthetic procedures are still held in more conventional SI.
*This blog uses the excellent Kcite plugin to manage citations.
¶The good folks at Figshare were extremely helpful in converting this deposition into an interactive presentation. Thanks guys!
References
- O. Casher, G.K. Chandramohan, M.J. Hargreaves, C. Leach, P. Murray-Rust, H.S. Rzepa, R. Sayle, and B.J. Whitaker, "Hyperactive molecules and the World-Wide-Web information system", Journal of the Chemical Society, Perkin Transactions 2, pp. 7, 1995. https://doi.org/10.1039/p29950000007
- R. Van Noorden, "Data-sharing: Everything on display", Nature, vol. 500, pp. 243-245, 2013. https://doi.org/10.1038/nj7461-243a
- P. Murray-Rust, H.S. Rzepa, and M. Wright, "Development of chemical markup language (CML) as a system for handling complex chemical content", New Journal of Chemistry, vol. 25, pp. 618-634, 2001. https://doi.org/10.1039/b008780g
- H.S. Rzepa, "Chemical datuments as scientific enablers", Journal of Cheminformatics, vol. 5, 2013. https://doi.org/10.1186/1758-2946-5-6
- H.S. Rzepa, "Transclusions of data into articles", 2013. https://doi.org/10.6084/m9.figshare.797481
- H.S. Rzepa, "The importance of being bonded", Nature Chemistry, vol. 1, pp. 510-512, 2009. https://doi.org/10.1038/nchem.373
- H.S. Rzepa, "The rational design of helium bonds", Nature Chemistry, vol. 2, pp. 390-393, 2010. https://doi.org/10.1038/nchem.596
- M.J. Cowley, V. Huch, H.S. Rzepa, and D. Scheschkewitz, "Equilibrium between a cyclotrisilene and an isolable base adduct of a disilenyl silylene", Nature Chemistry, vol. 5, pp. 876-879, 2013. https://doi.org/10.1038/nchem.1751
- D. Scheschkewitz, M.J. Cowley, V. Huch, and H.S. Rzepa, "The Vinylcarbene – Cyclopropene Equilibrium of Silicon: an Isolable Disilenyl Silylene", 2013. https://doi.org/10.6084/m9.figshare.744825
- H.S. Rzepa, "Gaussian Job Archive for C60H92Si3", 2012. https://doi.org/10.6084/m9.figshare.96410
Tags:chemical tagger, data mining, datument, David Scheschkewitz, e-notebook, Google, opendata, Peter Murray-Rust, pre-processor, researcher, scientific tool, supervisor, United Kingdom
Posted in Chemical IT, Interesting chemistry | 9 Comments »
Friday, July 5th, 2013
The title of this post summarises the contents of a new molecular database: www.molecularspace.org[1] and I picked up on it by following the post by Jan Jensen at www.compchemhighlights.org (a wonderful overlay journal that tracks recent interesting articles). The molecularspace project more formally is called “The Harvard Clean Energy Project: Large-scale computational screening and design of organic photovoltaics on the world community grid“. It reminds of a 2005 project by Peter Murray-Rust et al at the same sort of concept[2] (the World-Wide-Molecular-Matrix, or WWMM[3]), although the new scale is certainly impressive. Here I report my initial experiences looking through molecularspace.org
The 150,000,000 calculations are released under the the CC-BY license, which is an encouraging (open) start. One does need however to login to the site, which I was able to do using my Google credentials. Shown below is a screenshot of a typical result in a search (of Power conversion efficiency in my case).

It comes in two parts, the first being the structure (given as a SMILES and 2D layout) with the principle predicted energy levels and predicted photovoltaic performance listed below that. This is then followed by what might be called an annotation with further computed/predicted properties using the algorithms applied by Chemicalize.org. This idea that a data set could accrete via semantically powerful annotations using other tools was also very much part of the concept of the WWMM (the matrix had at its heart a molecule in one dimension and a property, measured or computed in the other. The matrix is of course very sparse, which is why it needs annotation!).
It was at this point however that I started to wonder how I might add other annotations, based perhaps on other types of calculations. But thus far at least, I have not found any trace of something which I could immediately use for my own calculation; 3D coordinates specifically. Thus, the HOMO-LUMO energy gap is the key property which makes molecularspace unique and valuable (to someone working in the field of photovoltaics). But HOMO/LUMO gaps can be calculated in many different ways, and it can always be valuable to calibrate/validate the reported values against other methods. Perhaps if I continue to look, I might find these 3D coordinates (which, for 2,300,000 molecules would be a very valuable resource). Certainly for example, should I wish to do so, I could not at the moment readily replicate the calculation for any specific entry on the molecularspace site (which can be regarded as an essential component of scientific validation). When I use the first person, I mean of course either myself as a human or a software agent acting on my behalf (the latter having the endurance to repeat its procedures millions of times if necessary).
The reader of this blog may have noticed that whenever I report a calculation here, I like to cite its doi (more formally its handle), which links to a digital repository. In my case, the repository certainly carries the 3D coordinates, and also the full wavefunction provided if the reader wishes other properties to be derived from it.‡ Now if molecularspace is able to provide that in the fullness of time, it truly would be an impressive resource.
But the important take-home message from molecularspace is that archiving (under a CC-BY license) the “big” data from any given research in a manner which makes it readily re-usable by others (perhaps from quite different fields of science) is now an essential requisite of doing science. And it is really nice to see good examples of this in practice!
‡ Generally, the calculations I perform for this blog are published in a DSpace repository (the original one, started in 2006[4]), and more recently in Chempound (a project by Peter Murray-Rust and colleagues which emerged out of the WWMM experiments) as well as Figshare[5]. The first and the third assign unique handles (i.e. a doi) to the data; chempound does not (and neither does molecularspace).
References
- J. Hachmann, R. Olivares-Amaya, S. Atahan-Evrenk, C. Amador-Bedolla, R.S. Sánchez-Carrera, A. Gold-Parker, L. Vogt, A.M. Brockway, and A. Aspuru-Guzik, "The Harvard Clean Energy Project: Large-Scale Computational Screening and Design of Organic Photovoltaics on the World Community Grid", The Journal of Physical Chemistry Letters, vol. 2, pp. 2241-2251, 2011. https://doi.org/10.1021/jz200866s
- P. Murray-Rust, H.S. Rzepa, J.J.P. Stewart, and Y. Zhang, "A global resource for computational chemistry", Journal of Molecular Modeling, vol. 11, pp. 532-541, 2005. https://doi.org/10.1007/s00894-005-0278-1
- P. Murray-Rust, S.E. Adams, J. Downing, J.A. Townsend, and Y. Zhang, "The semantic architecture of the World-Wide Molecular Matrix (WWMM)", Journal of Cheminformatics, vol. 3, 2011. https://doi.org/10.1186/1758-2946-3-42
- J. Downing, P. Murray-Rust, A.P. Tonge, P. Morgan, H.S. Rzepa, F. Cotterill, N. Day, and M.J. Harvey, "SPECTRa: The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories", Journal of Chemical Information and Modeling, vol. 48, pp. 1571-1581, 2008. https://doi.org/10.1021/ci7004737
- H.S. Rzepa, "Gaussian Job Archive for CLi6", 2013. https://doi.org/10.6084/m9.figshare.739310
Tags:energy gap, energy levels, Google, Harvard, Jan Jensen, molecularspace site, opendata, Peter Murray-Rust, software agent acting, www.compchemhighlights.org, www.molecularspace.org
Posted in Chemical IT | 7 Comments »
Monday, October 31st, 2011
Most of the chemical structure diagrams in this blog originate from Chemdraw, which seems to have been around since the dawn of personal computers! I have tended to use this program to produce JPG bitmaps for the blog, writing them out in 4x magnification, so that they can be scaled down for display whilst retaining some measure of higher resolution if needed for other purposes. These other purposes might be for e.g. the production of e-books (using Calibre), the interesting Blog(e)book format offered as a service by Feedfabrik, or display on mobile tablets where the touch-zoom metaphor to magnify works particularly well. But bitmap images are not really well future proofed for such new uses. Here I explore one solution to this issue.

I have previously mentioned scalable vector graphics (SVG) as an alternative, and fortunately the production of such has become routine.3 The diagram above2 is indeed SVG (and if you cannot see it, then try a modern SVG-capable browser1). It was produced thus:
- Drawn in Chemdraw
- Exported as Encapsulated postscript
- Imported into Scribus, an Open Source desktop publishing program (where it can be annotated/edited if need be)
- This program will also need Ghostscript installed to handle the EPS
- and exported from Scribus to SVG.
- Notice how the diagram above automatically scales to fill the width of the page. If you click on it, you get the diagram on its own. If you zoom the browser window, it should scale perfectly.
- I note that these SVG diagrams work well in e-books or blogbooks.
There seem to be many other (open) programs out there which support SVG, so the above combination is not necessary the only one, or indeed the best. There is one other aspect which might be mentioned. The old GIF or JPG bitmap formats do have good meta-data support, such as
EXIF,
GPS or
XMP. These invisible data have often been used to embed a molecular connection table into a GIF or JPG file, such that the original molecular data can be reconstituted from the image file. Unfortunately, there are no real standards for doing this, and so round-tripping the data is probably a closed process within a specific software environment. However, because SVG is an XML format, it can be readily made to carry such information in a fully inter-operable manner. For example, one could easily embed a CML description of the molecule into its own container (namespace) in the SVG file. For the purposes of rendering an on-screen image, this extra information is of course ignored.
1 I notice that Internet Explorer 9 (both 32- and 64-bit versions) will display (and save) the above diagram if you click on it, but it cannot (yet) be inlined into the post, although the documentation implies it should.
2 The version below is the conventional JPG form (click on it to see the original 4x version).

Diagram displayed using JPG.
3. Historical note. Peter Murray-Rust and I have been promoting SVG for use in chemistry for 11+ years now. For one ancient page, see here. The syntax has decayed somewhat, but some of the diagrams still work!
Tags:chemical diagrams, chemical structure diagrams, desktop publishing, e-books, Feedfabrik, GIF, GPS, Internet Explorer, Peter Murray-Rust, software environment, Vector Graphics
Posted in Chemical IT, General | 13 Comments »
Monday, October 24th, 2011
I have for perhaps the last 25 years been urging publishers to recognise how science publishing could and should change. My latest thoughts are published in an article entitled “The past, present and future of Scientific discourse” (DOI: 10.1186/1758-2946-3-46). Here I take two articles, one published 58 years ago and one published last year, and attempt to reinvent some aspects. You can see the result for yourself (since this journal is laudably open access, and you will not need a subscription). The article is part of a special issue, arising from a one day symposium held in January 2011 entitled “Visions of a Semantic Molecular Future” in celebration of Peter Murray-Rust’s contributions over that period (go read all 15 articles on that theme in fact!).
Here I want to note just two features, which I have also striven to incorporate into many of the posts this blog (which in one small regard I have attempted to formulate as an experimental test-bed for publishing innovations). Scalable-Vector-Graphics (SVG) emerged around the turn of the millennium as a sort of HTML for images. To my knowledge, no science publisher has yet made it an intrinsic part of their publishing process (although gratifyingly all modern browsers support at least a sub-set of the format). Until now (perhaps). Thus 10.1186/1758-2946-3-46 contains diagrams in SVG, but you will need to avoid the Acrobat version, and go straight to the HTML version to see them. However, what sparked my noting all of this here was the recent announcement by Amazon that they are adopting a new format for their e-books, which they call Kindle Format 8 or KF8 (the successor to their Mobi7 format). To quote: “Technical and engineering books are created more efficiently with Cascading Style Sheet 3 formatting, nested tables, boxed elements and Scalable Vector Graphics“. This is wrapped in HTML5 to be able to provide (inter alia) a rich interactive experience for the reader. In fairness, there is also the more open epub3 which strives for the same. Other features of HTML5 include embedded chemistry using WebGL and the same mechanisms are being used for the construction of modern chemical structure drawing packages.
It remains to be seen how much of all of this will be adopted by mainstream chemistry publishers. Here, we do get into something of a cyclic argument. I suspect the publishers will argue that few of the authors that contribute to their journals will send them copy in any of these new formats and that it would be too expensive for them to re-engineer these articles with little or no help from such authors. The chemistry researchers who do the writing (perhaps composition might be a better word?) might argue there is little point in adopting innovative formats if the publishers do not accept them (I will point out that my injection of SVG into the above article did have some teething problems). For example, you will not find SVG noted in any of the “instructions for authors” in most “high impact journals” (or, come to that, HTML5).
If one looks at the 25 year old period, in 1986 all chemistry journals were distributed exclusively on paper. My office shelves still show the scars of bearing the weight of all that paper. Move on 25 years, and all journals almost without exception are now distributed electronically. I suspect the outcome in many a reader’s hands is simply that they (rather than the publisher) now bear the printing costs themselves (despite or perhaps because of the introduction of electronic binders such as Mendeley). But it will only be when the article itself grows out of its printable constraints, and hops onto mobile devices such as Kindles and iPads in the promised (scientifically) interactive and data-rich form, that the true revolution will start taking place.
A final observation: you will not readily obtain the interactive features of 10.1186/1758-2946-3-46 on e.g. an iPad or Kindle because the Java-based Jmol is not supported on either. But Jmol has now been ported to Android, and its certainly one to watch.
Tags:Acrobat, Amazon, Android, chemical structure drawing packages, e-books, HTML, HTML5, iPad, iPads, Java, KF8, Kindle, mobile devices, opendata, Peter Murray-Rust, printing costs, SVG, Vector Graphics
Posted in Chemical IT, General | 3 Comments »
Sunday, October 16th, 2011
I asked a while back whether blogs could be considered a serious form of scholarly scientific communication (and so has Peter Murray-Rust more recently). A case for doing so might be my post of about a year ago, addressing why borane reduces a carboxylic acid, but not its ester, where I suggested a possible mechanism. Well, colleagues have raised some interesting questions, both on the blog itself and more silently by email to me. As a result, I have tried to address some of these questions, and accordingly my original scheme needs some revision! This sort of iterative process of getting to the truth with the help of the community (a kind of crowd-sourced chemistry) is where I feel blogs do have a genuine role to play.

The reduction of a carboxylic acid by borane
TS1 in this scheme is modified from before to include an extra borane coordinating to the oxygen of the O-R group. I will include here the intrinsic reaction coordinate [computed at ωB97XD/6-311G(d,p)], since it shows some fascinating features.

One notes that the barrier for extrusion (R=H) is lower than before, due to the effect of the extra coordinated BH3 group. But notice the “bump” at an IRC value of ~4.0. If one inspects the gradients along the IRC, they reveal that the ejecting H-H molecule is tempted to coordinate to the boron to form a 5-coordinate species (a “hidden intermediate”) before abruptly changing direction and flying off into space!

You can see an animation by invoking this link or below:

What happens if R=Me (an ester)? Well, the activation energy is now closer to 40 kcal/mol, which means the rate of the reaction would be very slow. Notice the bump corresponding to 5-coordinate boron has now vanished!

Again, a link for IRC animation of the reaction (it is rather nice, even if a say so myself). QED? Well, not quite. One still has to show that TS2 – TS4 do not control things! The IRC for TS2 (the first addition of a hydrogen to the carbon) is shown below, again with fascinating bumps along the way. The TS2 animation is here. The free energy of TS2 is 6.9 kcal/mol lower than TS1 (even though the actual activation barrier is higher), which makes the latter the rate determining step. Note the bumps at IRC = -8 and +5. These are due to rotations setting up the reaction.

TS3, a ring closing reaction (animation) shows an unexpected feature which I leave you to discover for yourself. TS4 is the second and final addition of a hydrogen to the carbon, with animation and resembling an SN2 inversion. The reaction is completed by hydrolysis.
The relative free energies of TS1, 2, 3 and 4 are respectively 0.0, -6.9, -35.0 and -19.4 kcal/mol, which makes the overall rate limiting step TS1. If that is the case, then this explains why borane reduces only a carboxylic acid and not an ester.
Now all I have to do is explain all of this to my tutorial group! Mind you, this is a deceptively complex mechanism, and who knows if it may yet spring surprises.
Tags:activation energy, animation, carboxylic acid, carboxylic ester, diborane, free energy, pericyclic, Peter Murray-Rust, Reaction Mechanism, reduction, Tutorial material
Posted in Interesting chemistry | 8 Comments »
Thursday, July 7th, 2011
Computers and I go back a while (44 years to be precise), and it struck me (with some horror) that I have been around them for ~62% of the modern computing era (Babbage notwithstanding, ~1940 is normally taken as the start of the modern computing era). So indulge me whilst I record this perspective from the viewpoint of the computers I have used over this 62% of the computing era.
- 1967: I encountered (but that term has to be qualified) my first computer, suggested to me as an alternative to running quarter marathons on Wimbledon common at school by an obviously enlightened teacher! I wrote a program (in Algol) on paper tape, put the tape in an envelope, and sent it off to Imperial College (by van) to run, on an IBM 7094. A week later, printed output showed you had made a mistake on line 1 of the program. As I recollect, after about eight weeks of this, I got the program to run (and calculated π to 5 decimal places).
- 1970: By now I was a student (again at Imperial College), and was introduced to Fortran, then a radical new innovation to a chemistry degree. The delightfully named pufft compiler combined with the 7094 again, but this time with punched Holerith cards as input and line printer output. I cannot remember what we were asked to program. I do remember that the punched cards were produced by a pool of punch card operators, working from code pages written by the programmer. Some students (not me!) thought it great fun to give their Fortran variables naughty names (which the punch card operators then refused to punch, thus causing the student to fail the course!).
- 1971: I really liked this programming lark, so when instant-turnaround was introduced that year, I decided to do a proper program. It was called NLADAD (yes, I was no good at names, even then), which stood for non-linear-analysis of donor-acceptor complexes. The idea was to take recorded NMR chemical shifts, and fit them to an equilibrium A+B ⇔ AB+B ⇔ AB2 using non-linear regression analysis. It must have been all of 200 lines of code (OK, I did not write the matrix inversion routine myself)! Instant turnaround was also great, you got to punch your own cards this time, and had the great excitement of feeding them into a card reader yourself. You then walked about 5 yards to the line printer and waited agog. No waiting one week, this was less than a minute. Or it would have been if the line printer did not paper-wreck every two minutes! (I might add that I have a dim recollection of a member of the computer centre staff standing by to recover these paper wrecks. He, by the way, is now the director of the ICT division here!).
- 1972: I am now doing a PhD (yes, boringly, yet again at Imperial College). I had found the one and only teletypewriter in the chemistry department. The crystallographers had secreted it away in their empire, but were very dismayed to find me occupying it constantly. Instant was now even more instant. I was now connecting to a time-sharing CDC 6400 computer, at the dazzling speed of 110 baud (or bytes per second). These were small bytes by the way, since the CDC used 6 bits per byte. The result was that one did everything in UPPER CASE, since a 6-bit byte only allows 64 characters! My (still Fortran) programs reached probably 1000 lines of code now, and I was engrossed in deriving non-linear analyses of steady state chemical kinetics (about four different kinds of rate equation as I recollect). Ah, the joys of covariance analysis, and propagation of errors (I was in a kinetics lab, and all the other students plotted graphs on graph paper, and if pressed, plotted gradients of graphs, the so-called Guggenheim plots. I thought this the dark ages, but no-one volunteered to join me in this single teletypewriter room. Not even the attractive girls in the group. I was the geek of my time, no doubt about that. My kinetic analysis did however have one upside. Its how I meet my wife to be a few years later!).
- 1974: PhD completed, I was now ready to go to Texas, where everything is bigger (and in terms of computers, slightly better, a CDC 6600 now and a 300 baud teletypewriter!). I had been computing now for seven years, and finally I actually got to SEE the device for the very first time. My mentor, Michael Dewar, had a sort of special relationship with the university. His students (and possibly only his students) were allowed to go into the depths of the machine room, where behind plate glass you could see the CDC 6600. I soon learnt how to get even closer. It was not particularly exciting however. I was more entranced with the CALCOMP flatbed plotter, which was located next to the 6600. Pictures at last (you probably do not want to know that to convert my kinetics in 4 above to pictures, I got quite expert in using a french curve. Look it up before you jump to conclusions). Part of the pact I negotiated was that I was only allowed into the inner sanctum at 03:00 in the morning (sic!). Still a geek then! Oddly, I was one of the few students in Dewar’s group using the CALCOMP, but at least we now had pictures of the molecules I was now calculating (using MINDO/3). To put the computing power into context, in 1975, Paul Weiner, another group member, announced that he had completed a full geometry optimisation of LSD, this having taken about 4 days to do on that over-worked 6600. The entire group went out to celebrate. Many pitchers of beer were drunk that nite.

Computer graphics from 1976.
- 1977: Back to Imperial, where we might have also now had a CDC 6600. And a Tektronix terminal running at the dizzying (hardwired end-to-end) speed of 9600 baud. I learnt to Word process on this device (using a word processor, written in Fortran, although not by me) and I wrote three review articles by this means, using a fancy phototypesetter as the printer. My next program, STEK, probably ran to about 5000 lines of code, and it persuaded the Tektronix to plot all sorts of things, ball&stick diagrams, isometric potential surfaces, molecular orbitals, and the like (and jumping ahead, my experience with this program eventually led to CML, and Peter Murray-Rust, but that is indeed jumping ahead). I think I also managed to gain access to the Imperial machine room, that inner sanctum, yet again. But for reasons I will not go into, it was not as interesting as the Texan machine room.

Chemistry Computer graphics, circa 1977-85.
- 1979: I encountered a Cray 1 computer, and probably also 8-bit bytes (and yes, lower case printer outputs) for the first time at the University of London Computing Centre.
- 1980: Remember that teletypewriter, encountered earlier. Well these were now running at 2400 baud and I started to organise the deployment of a chemistry department computer network to sprinkle several such terminals around the department. The controller was a PAD, and in that year, we introduced STN ONLINE using this network. It was the first time we could search CAS online ourselves (previously, it was a service offered by the library). Literature searching has not been the same since.
- 1980: I finally again encountered a real computer, which one could happily listen to without creeping into machine rooms in the middle of the night. It was the data system on a Bruker Spectrospin 250 MHz superconducting NMR spectrometer. I had many adventures on this system. It was installed, by the way, on more or less the same day as the birth of my first daughter Joana. It had a hard drive (5 Mbytes as I recollect, and cost an absolute fortune, around £10,000 if I remember correctly).

Combining Quantum mechanics and NMR.

Computer graphics 1982, from NMR spectrometer.
- 1982: More networks, this time a curious computer known as the Corvus Concept, using a networked hard drive (possibly as big as 20 Mbytes by now), and a large screen.
- 1985: Enter the Mac (OK, the IBM PC came a little earlier, but it was not entrancing). Now one really had a tactile computer that made noises (not always nice), produced smoke signals occasionally, and ejected its floppy disk incessantly. Yet another revolution to cope with. As I type this, I look down on that Mac, which is still underneath my desk. Wonder if its worth anything on ebay?
Well, a second consecutive blog, with (almost) no pictures or molecules. And I have only gotten to the half way stage of my story. Better break off then.
Tags:chemical shifts, chemistry department computer network, controller, director, fancy phototypesetter, Fortran, GBP, Guggenheim, Historical, IBM, ICT, Imperial College, Joana, London Computing Centre, Michael Dewar, obviously enlightened teacher, Paul Weiner, Peter Murray-Rust, programmer, steady state chemical kinetics, Tektronix, Texas, University of London, University of London Computing Centre, Wimbledon, word processor
Posted in Chemical IT | 5 Comments »