PDF « Henry Rzepa's blog

Posts Tagged ‘PDF’

Research data: Managing spectroscopy-NMR.

Wednesday, March 16th, 2016

At the ACS conference, I have attended many talks these last four days, but one made some “connections” which intrigued me. I tell its story (or a part of it) here.

But to start, try the following experiment.

Find a Word document of .docx type on your hard drive
Remove the .docx suffix and replace it with a .zip suffix.
Expand as if it is an archive (it is!).
A folder is created and this itself contains four further folders. These all contain XML files, and in the sub-folder actually called word you will find something called document.xml. That file contains the visible content of the document; all the others are support documents, including styles etc.
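The same experiment can be done without renaming anything. The following is a minimal sketch, assuming Python and its standard zipfile module; the function names docx_members and docx_text are hypothetical and chosen just for this illustration. It lists the members of the archive and pulls the visible text out of word/document.xml.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used for WordprocessingML elements inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_members(source):
    """Return the names of all files packed inside a .docx (it is a zip archive)."""
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


def docx_text(source):
    """Return the visible text of a .docx, one string per paragraph."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph is split into runs; the text lives in w:t elements
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return paragraphs
```

Calling docx_members("report.docx") on any Word file of your own (the filename here is only a placeholder) should show the word/document.xml entry alongside the supporting files.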

The reason this is important was made clear in Santi Dominguez’ talk. Most of it was concerned with introducing Mbook, an ELN (electronic laboratory notebook) but the relevance to the above comes from his introduction of Mpublish, a forthcoming product targeting the area of research data management. What is the connection? Well, NMR spectrometers produce raw outputs as collections of files, much in the manner of the exploded Word document above. Some files contain the raw FID, others contain the acquisition parameters, etc. These files are then turned into the traditional spectra by suitable processing software such as Mestrenova (part of the same ecosystem as Mpublish). Most users of such programs then squirt the spectra into a PDF file and it is this last document that is preserved as “research data” – almost invariably this is the version sent off to journals as the supporting information or SI for the article. SI is called information for a good reason; in such a container it is very often not easily usable data, and functions just visually.

So what is the problem? Well, the conversion of the NMR fileset (and quite possibly many other forms of spectroscopy) into a PDF file is a lossy process. It cannot be reversed; information has been lost. And really only a human can easily retrieve and interpret such a visual presentation.

Santi described how Mpublish can assemble all the files associated with the instrumental outputs, optionally add chemical structure and other information, collect suitable metadata describing the contents and create a .zip archive. As we saw with Word however, the suffix does not even need to be .zip. It was suggested that it be this information-complete archive that should really be used as SI to accompany an article in which NMR data is invoked to support the narrative. In the reverse process, anyone downloading this zip archive could themselves potentially acquire full access, without information loss, to the original NMR data. There is a little further magic that needs to be included to make the process work which I do not include here. When Mpublish becomes available to play with, I will complete that story here.
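To make the idea of an information-complete archive concrete, here is a rough sketch in Python. It is emphatically not the Mpublish format, which is not described here; the function bundle_dataset, the manifest.json layout and the metadata keys are all assumptions made for illustration. It packs every file of an instrument output folder into a zip archive together with a small manifest carrying metadata and checksums.

```python
import hashlib
import json
import zipfile
from pathlib import Path


def bundle_dataset(folder, archive_path, metadata):
    """Pack all files under `folder` plus a manifest into a zip archive.

    Hypothetical layout: the original files keep their relative paths and a
    manifest.json (an invented name) records metadata and SHA-256 checksums.
    Write the archive outside `folder` so it is not swept up into itself.
    """
    folder = Path(folder)
    manifest = {"metadata": metadata, "files": {}}
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                relative = path.relative_to(folder).as_posix()
                data = path.read_bytes()
                # The checksum lets a downloader verify nothing was altered
                manifest["files"][relative] = hashlib.sha256(data).hexdigest()
                archive.writestr(relative, data)
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
    return manifest
```

Because the raw files are stored byte for byte, unpacking such an archive recovers the original fileset without loss, which is exactly what the PDF route cannot offer.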

It is good to report that software is starting to appear which enhances the management and reporting of research data as part of the publication process. The “rules” and “best practice” of this game are still being written however. In this regard, I feel that it is the researchers themselves that must play a vital role in defining the rules. Let us not cede that role just to publishers.

Tags:Archive formats, chemical structure, ELN, Nuclear magnetic resonance, PDF, research data management, spectroscopy, suitable processing software, XML, Zip
Posted in Chemical IT

Global initiatives in research data management and discovery: searching metadata.

Monday, March 7th, 2016

The upcoming ACS national meeting in San Diego has a CINF (chemical information division) session entitled "Global initiatives in research data management and discovery". I have highlighted here just one slide from my contribution to this session, which addresses the discovery aspect of the session.

Data, if you think about it, is rarely discoverable other than by intimate association with a narrative or journal article. Even then, the standard procedure is to identify the article itself as being of interest, and then to dig out the "supporting information", which normally takes the form of a single paginated PDF document. If you are truly lucky, you might also get a CIF file (for crystal structures). But such data has little life of its own outside of its parent, the article. Put another way, it has no metadata it can call its own (metadata is data about an object, in this case research data). An alternative is to try to find the data by searching conventional databases such as CAS, Beilstein/Reaxys or CSD, and there of course the searches can be very precise. But (someone) has to pay the bills for such accessibility.

We are now starting to see quite different solutions to finding data (the F in FAIR data, the other letters representing accessibility, interoperability and re-usability). These solutions depend on metadata being a part of the solution from the outset, rather than any afterthought produced as a commercial solution. The collection of metadata is part of the overall process called RDM, or research data management, perhaps even the most important part of it. In exchange for identifying metadata about one's data, one gets back a "receipt" in the form of a persistent identifier for the data, more commonly known as a DOI. The agency that issues the DOI also undertakes to look after the donated metadata, and to make it searchable. The table below shows eight searches of such metadata, one example of how to acquire statistics relating to the usage of the data and one search of how to find repositories containing the data.

Search queries enabled by the use of metadata in data publication
#	Search query^*	Instances retrieved:
1	http://search.datacite.org/ui?q=alternateIdentifier:InChIKey:*	InChI key
2	http://search.datacite.org/ui?q=alternateIdentifier:InChI:*	InChI identifier
3	http://search.datacite.org/ui?q=alternateIdentifier:InChIKey:CULPUXIDFLIQBT-UHFFFAOYSA-N	InChI key CULPUXIDFLIQBT-UHFFFAOYSA-N
4	http://search.datacite.org/ui?q=ORCID:0000-0002-8635-8390+alternateIdentifier:InChIKey:*	ORCID 0000-0002-8635-8390 AND (boolean) InChI key.
5	http://search.datacite.org/ui?q=ORCID:0000-0002-8635-8390+alternateIdentifier:InChI:InChI=1S/C9H11N5O3*	ORCID 0000-0002-8635-8390 AND (boolean) + InChI string 1S/C9H11N5O3 with the * wild.
6	http://search.datacite.org/ui?q=has_media:true&fq=prefix:10.14469	Has content media^‡ for Publisher 10.14469 (Imperial College)
7	http://search.datacite.org/ui?q=format:chemical/x-*	Data format type chemical/x-*
8	http://search.datacite.org/api?&q=prefix:10.14469& fq=alternateIdentifier:InChIKey:& fl=doi,title,alternateIdentifier& wt=json&rows=15 http://api.labs.datacite.org/works?q=prefix:10.14469+AND+alternateIdentifier:InChIKey:	First 15 hits in JSON format, batch query mode
9	http://stats.datacite.org/?fq=datacentre_facet:"BL.IMPERIAL – Imperial College London"	resolution statistics for publisher 10.14469 (Imperial College) per month
10	http://service.re3data.org/search?query=&subjects[]=31 Chemistry	Research data repository search for Chemistry (135 hits)

^‡In this instance the three MIME media types are chemical/x-wavefunction, chemical/x-gaussian-checkpoint and chemical/x-gaussian-log. See[1] for chemical MIME (multipurpose internet mail extensions).

Anyone familiar with the standard ways of finding data (CAS, CSD, Reaxys) will appreciate that the above does not yet have the finesse to find e.g. sub-structures of chemical structures, synthetic procedures or molecular properties. My including it here is primarily to show some of the potential such systems have, and to remark particularly that the batch query capability of this infrastructure could indeed be used in the future to construct much more sophisticated systems. Oh, and to the end-user at least, the searches shown above do not require institutional licenses to use. Both the data and its metadata are free, mostly with a CC0 or CC BY 3.0 license for re-use (the R of FAIR).
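The batch query in row 8 of the table is just a URL assembled from a handful of parameters, so it is easy to generate programmatically. The sketch below, in Python, only builds the query string and does not contact any server. The endpoint is the one quoted in the table and may well have changed since; the function name batch_query and the default wildcard pattern (borrowed from rows 1 and 2) are assumptions of this illustration.

```python
from urllib.parse import urlencode

# Endpoint exactly as quoted in the table above; it may no longer be live
BASE = "http://search.datacite.org/api"


def batch_query(prefix, field, pattern="*",
                fields=("doi", "title", "alternateIdentifier"), rows=15):
    """Build a batch-mode metadata query URL returning JSON.

    `prefix` is the DOI prefix of the publisher, `field` the metadata field
    to filter on. The wildcard default for `pattern` is an assumption.
    """
    params = [
        ("q", f"prefix:{prefix}"),
        ("fq", f"{field}:{pattern}"),
        ("fl", ",".join(fields)),
        ("wt", "json"),
        ("rows", rows),
    ]
    # Keep ':', '*', ',' and '/' readable instead of percent-encoding them
    return BASE + "?" + urlencode(params, safe=":*,/")


url = batch_query("10.14469", "alternateIdentifier:InChIKey")
```

Looping such a function over a list of InChI keys or ORCID identifiers is one way the more sophisticated systems mentioned above might be bootstrapped.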

If more of interest related to this topic emerges at the ACS session, I will report back here.

References

H.S. Rzepa, P. Murray-Rust, and B.J. Whitaker, "The Application of Chemical Multipurpose Internet Mail Extensions (Chemical MIME) Internet Standards to Electronic Mail and World Wide Web Information Exchange", Journal of Chemical Information and Computer Sciences, vol. 38, pp. 976-982, 1998. https://doi.org/10.1021/ci9803233

Tags:Academic publishing, chemical, chemical information division, Chemical nomenclature, chemical structures, Chemical substance, chemical/x-wavefunction, Cheminformatics, City: San Diego, content media, data repository search, format type chemical/x-*, Identifiers, Imperial College, Imperial College London, International Chemical Identifier, JSON, media types, multipurpose internet media extensions, ORCiD, PDF, potential such systems, research data management, Search queries, Technical communication, Technology/Internet
Posted in Chemical IT

Reproducibility in science: calculated kinetic isotope effects for cyclopropyl carbinyl radical.

Saturday, July 11th, 2015

Previously on the kinetic isotope effects for the Baeyer-Villiger reaction, I was discussing whether a realistic computed model could be constructed for the mechanism. The measured KIE or kinetic isotope effects (along with the approximate rate of the reaction) were to be our reality check. I had used ΔΔG energy differences and then HRR (harmonic rate ratios) to compute[1] the KIE, and Dan Singleton asked if I had included heavy atom tunnelling corrections in the calculation, which I had not. His group has shown these are not negligible for low-barrier reactions such as ring opening of cyclopropyl carbinyl radical.[2] As a prelude to configuring his suggested programs for computing tunnelling (GAUSSRATE and POLYRATE), it was important I learnt how to reproduce his KIE values.[2] Hence the title of this post. Now, read on.

I felt I could contribute to the cause by extending the published results in two respects:

The reported[2] calculations are for the model B3LYP/6-31G(d) but the article does not report the tolerance to e.g. basis set variation (6-31G(d), a modest basis set by 2015 standards),
or to the quantum model used (B3LYP, a veritable DFT method).

These two model chemistries can both be tested by “increasing” their accuracy. The Def2-QZVPP basis set is nearing the CBS, or complete basis set limit. The coupled-cluster CCSD(T) method is regarded as the gold standard for single reference calculations. The CASSCF method tests the response to a multi-reference wave function. Each is applied separately to ensure only one variable is being changed at a time.

Method	Expt. KIE[2]^‡	Pred. KIE (my result)	Pred. ΔG₂₉₈^‡	Pred. KIE[2]	KIE + Tunnelling correction[2]
B3LYP/6-31G(d)[3],[4]	1.079₂₉₅	1.0582	8.0	1.058	1.073
B3LYP/6-31G(d)[3],[4]	1.163₁₇₃	1.1067	8.0	1.106	1.169
B3LYP/Def2-QZVPP[5],[6]	1.079₂₉₅	1.0563	6.6	1.058	1.073
B3LYP/Def2-QZVPP[5],[6]	1.163₁₇₃	1.1031	6.6	1.106	1.169
CASSCF(5,5)/6-31G(d)[7],[8]	1.079₂₉₅	1.0572	8.2	1.058	1.073
CASSCF(5,5)/6-31G(d)[7],[8]	1.163₁₇₃	1.1050	8.2	1.106	1.169
CASSCF(5,5)/Def2-TZVPP[9],[10]	1.079₂₉₅	1.0561	7.9	1.058	1.073
CASSCF(5,5)/Def2-TZVPP[9],[10]	1.163₁₇₃	1.1028	7.9	1.106	1.169
CCSD(T)/6-31G(d)[11],[12]	1.079₂₉₅	1.0597^†	9.7	1.058	1.073
CCSD(T)/6-31G(d)[11],[12]	1.163₁₇₃	1.1099	9.7	1.106	1.169

†Actually separate ratios of ¹³C/¹²C(C-4)/¹³C/¹²C(C-3) since C-3 and C-4 are not equivalent in the reactant species because of the methylene group pyramidalisation. The KIE calculation input and outputs are archived.[13]
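The ΔΔG route to a KIE mentioned at the start of this post can be sketched in a few lines of Python. This is not the KINISOT code[1]; it is a simplified illustration that assumes the isotopic free energy difference is independent of temperature, which the full normal coordinate treatment does not assume. The function names are hypothetical.

```python
import math

R = 8.314462618  # gas constant, J/(mol K)


def kie_from_ddg(ddg_j_per_mol, temperature):
    """KIE from the difference in activation free energy between isotopologues."""
    return math.exp(ddg_j_per_mol / (R * temperature))


def ddg_from_kie(kie, temperature):
    """Inverse relation: free energy difference (J/mol) implied by a KIE."""
    return R * temperature * math.log(kie)


# Free energy difference implied by the predicted KIE of 1.0582 at 295 K
ddg = ddg_from_kie(1.0582, 295.0)

# Re-use the same difference at 173 K (the simplifying assumption)
kie_173 = kie_from_ddg(ddg, 173.0)
```

A KIE of 1.0582 at 295 K corresponds to a difference of only about 0.14 kJ/mol, and carrying that unchanged to 173 K gives a KIE of about 1.10, close to but not identical with the 1.1067 in the table, as expected for so crude an assumption.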

The first two rows of the table are my attempt at an exact replication of the literature. The start point of such a project would be the supporting information or SI[2] which contains coordinates for the program GAUSSRATE and defines key structures in the form of a double-column, page thrown (broken might be a better word) PDF file. It was going to be a bit of a struggle to reconstitute this format into the structure required for a Gaussian calculation, so I simply constructed the models from scratch and optimised to the ring-opening transition state[4] and reactant.[3] I used a more recent version of the Gaussian program (G09/D.01 rather than G03/D.02) to do this, and tightened up some of the criteria to modern cutoff standards. A continuum solvent model could have been specified (the solvent used in the experiments was 1,2-dichlorobenzene) but since no mention was made of solvent, I assumed a gas phase calculation had originally been done. The starting geometry of the reactant deliberately had no symmetry, but during optimisation it converged to having a plane of symmetry using the B3LYP/6-31G(d) level of theory (the SI does not note this symmetry, it is implicit). I then used my code[1] to compute the isotope effects. The KIE program used in the original literature calculation was not directly mentioned in the supporting information but is presumed to be Quiver. Dan Singleton has recently sent me these codes, but they still need to be compiled and tested at my end. I ended up with splendid agreement for the KIE as you can see above (top two lines). It's reproducible! Hence the various assumptions I made in achieving this appear justified.

Returning to the geometry of the cyclopropyl carbinyl radical as having a plane of symmetry, two of the other methods, CCSD(T)/6-31G(d) and CASSCF(5,5)/6-31G(d), as well as CASSCF(5,5) at the better Def2-TZVPP basis all predicted that the methylene radical is twisted by about 20° with respect to the Cs plane of the ring.

It is useful to check whether this twisting has any impact on the predicted KIE. The answer is clear (Table). ALL the methods predict similar KIE to ± 0.003,^† which is about as accurate as can be measured experimentally at the 1σ level of confidence. This is a remarkable result; few other computed molecular properties turn out to be so insensitive to the quantum procedure used. The next stage will be to check if the tunnelling corrections required to bring the calculation into congruence with the measured values are similarly insensitive.

^‡ The “barrier height” is quoted as 7 kcal/mol[2]. This is probably NOT the activation free energy.

References

H.S. Rzepa, "KINISOT. A basic program to calculate kinetic isotope effects using normal coordinate analysis of transition state and reactants.", 2015. https://doi.org/10.5281/zenodo.19272
O.M. Gonzalez-James, X. Zhang, A. Datta, D.A. Hrovat, W.T. Borden, and D.A. Singleton, "Experimental Evidence for Heavy-Atom Tunneling in the Ring-Opening of Cyclopropylcarbinyl Radical from Intramolecular <sup>12</sup>C/<sup>13</sup>C Kinetic Isotope Effects", Journal of the American Chemical Society, vol. 132, pp. 12548-12549, 2010. https://doi.org/10.1021/ja1055593
H.S. Rzepa, "C4H7(2)", 2015. https://doi.org/10.14469/ch/191357
H.S. Rzepa, "C4H7(2)", 2015. https://doi.org/10.14469/ch/191358
H.S. Rzepa, "C 4 H 7", 2015. https://doi.org/10.14469/ch/191353
H.S. Rzepa, "C 4 H 7", 2015. https://doi.org/10.14469/ch/191352
H.S. Rzepa, "C 4 H 7", 2015. https://doi.org/10.14469/ch/191361
H.S. Rzepa, "C 4 H 7", 2015. https://doi.org/10.14469/ch/191364
H.S. Rzepa, "C 4 H 7", 2015. https://doi.org/10.14469/ch/191363
H.S. Rzepa, "C 4 H 7", 2015. https://doi.org/10.14469/ch/191362
H.S. Rzepa, "C 4 H 7", 2015. https://doi.org/10.14469/ch/191367
H.S. Rzepa, "C 4 H 7", 2015. https://doi.org/10.14469/ch/191356
H.S. Rzepa, "Reproducibility In Science: Calculated Kinetic Isotope Effects For Cyclopropyl Carbonyl Radical.", 2015. https://doi.org/10.5281/zenodo.19949

Tags:activation free energy, Basis set, Dan Singleton, energy differences, gas phase calculation, Kinetic isotope effect, PDF, Physical organic chemistry
Posted in reaction mechanism

Personal web pages on digital repositories.

Saturday, June 20th, 2015

The university sector in the UK has quality inspections of its research outputs conducted every seven years, going by the name of REF or Research Excellence Framework. The next one is due around 2020, and already preparations are under way! Here I describe how I have interpreted one of its strictures: that all UK funded research outputs (i.e. research publications in international journals) must be made available in open unrestricted form within three months of the article being accepted for publication, or they will not be eligible for consideration in 2020.

At the outset, I should say that one infrastructure to help researchers adhere to the guidelines is being implemented in the form of the Symplectic system. This allows a researcher to upload the final accepted version of a manuscript. At Imperial College, a digital repository called Spiral serves this purpose and also acts as the front end for collecting informative metadata to enhance discoverability. The final accepted version is then converted by the publisher into a version-of-record. This contains styling unique to the publisher and the content is subjected to further scrutiny by the authors as proof corrections. In an ideal world, these latter changes should also be faithfully propagated back to the final accepted version, as would all the supporting information associated with the article. Since most authors do not exactly enjoy the delights of proof corrections, this final reconciliation of the two versions may not always be assiduously undertaken.

I became concerned about the existence of two versions of any given scientific report and that the task of ensuring total fidelity in the content of both versions may negatively impact on the author’s time. Much better if the publisher could grant permission for the author to archive the version-of-record into a digital repository.

Some experiments were needed, and I decided to start them in reverse, by archiving my oldest publications. Since Symplectic now provides a system to do this, I began by using it. Symplectic identifies each publisher’s policies for archival, of which the most liberal are known as ROMEO GREEN. To quote from the definition, this colour allows the author to “archive pre-print and post-print or publisher’s version/PDF”. In an afternoon I had processed most of my ROMEO GREEN articles. You know how it is sometimes, you do not read the fine print! And so the library soon informed me that archival of ROMEO GREEN was in fact only permitted on the author’s “personal web page”. Spiral, as an institutional repository, does not apparently constitute a personal web page for me and so none of my Symplectic submissions could be accepted for archival there.

Time to rethink the experiment. Firstly, I very much wanted the reprints to be held by a proper digital repository rather than a conventional web page. Why? I wanted my reprints to adhere as much as possible to FAIR: findable, accessible, interoperable and re-usable. Well, at least the first two of those (the last two relate more to data). A repository is designed to hold metadata in a formal and standards-based manner and metadata helps achieve FAIR. So I asked the Royal Society of Chemistry (as a ROMEO GREEN publisher) whether a personal web page hosted on a digital repository would qualify. I was soon informed that I had proposed a neat solution here, and they couldn’t see an issue.

Now, all I had to do was find a repository where I could create such a personal web page. The chemistry department at Imperial College has for ten years hosted a DSpace repository called SPECTRa[1] which already has the functionality for individuals to create personal collections. I had also picked up on the increasing attention being given to Zenodo, like the World-Wide Web itself an offshoot of CERN (of Large Hadron Collider fame) and born from the need for researchers to more permanently archive the outputs of their researches. These outputs include software, videos, images, presentations, posters, publications and (most obviously for CERN) datasets. I thought I would include them in my experiment as well. The results are summarised below.

	DSpace-SPECTRa	Zenodo
Community	Henry Rzepa personal web page reprint collection	Rzepa personal computational chemistry data and reprint page
Collection	Royal Society of Chemistry reprints
Publication	10042/195577	10.5281/zenodo.18758[2]
Thesis	10044/1/20860[3]	10.5281/zenodo.18777[4]
Dataset	10.14469/ch/191342[5]	10.5281/zenodo.18632[6]
Harvesting	OAI-ORE	OAI-PMH

The last line of this table includes a link to another design feature of a repository, facilitating the ability to harvest the content. The ContentMine project (“The right to read is the right to mine!”) has shown how such harvesting of facts from the literature can be automated on a vast scale, and (IMHO) represents an example of those disruptive innovations that have the power to change the world forever. It also enshrines the idea that scientific facts funded by the public purse should be capable of being openly liberated from their containers. A harvestable repository seems an ideal container for achieving this.
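For readers unfamiliar with harvesting, the OAI-PMH protocol listed in the table amounts to fetching XML from a URL carrying a verb and a metadata format, then reading out the Dublin Core fields. The following Python sketch shows both halves without touching the network. The base URL in the usage note is a placeholder and the function names are hypothetical; a real harvester would also need to follow resumption tokens, which are omitted here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # restrict the harvest to one collection
    return base_url + "?" + urlencode(params)


def titles_and_identifiers(xml_text):
    """Extract (title, [identifiers]) for each record in an OAI-PMH response."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iterfind(".//oai:record", NS):
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        identifiers = [node.text for node in record.iterfind(".//dc:identifier", NS)]
        results.append((title, identifiers))
    return results
```

For example, list_records_url("https://repository.example.org/oai") yields the request URL for a hypothetical repository, and the XML it returns can be passed straight to titles_and_identifiers.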

My experiment is part of what might be seen as the increasingly subtle interplay between:

scientific authors, whose creative endeavour research is and without whom scientific publishers would not exist
publishers who create a business model from the content freely given them by authors but also (especially if a commercial publisher) need to be accountable to their shareholders.
the funding councils, many of whom now wish the outcomes of the research they fund to be openly available to all
the local libraries/administrators who have to adhere to/enforce all the rules contractually handed down to them by publishers whose direct customers they are, but who also need to serve their community of readers and authors.
researchers who would rather do research than fret about the above, and who would rather spend limited resources doing that research rather than diverting an increasing amount of their attention into the above system.
readers, who need unimpeded access to the research endeavours of others, but often have little influence on the policies and actions of all the other stakeholders, since they are NOT considered customers (of the publishers).
etc. etc.

My experiment was in part designed to explore these rules, their interpretations and their boundaries. For the time being at least I seem to have found an arrangement that allows me to distribute versions-of-record of my own work, thanks to a generous and far-sighted learned society publisher. Watch this space!

References

J. Downing, P. Murray-Rust, A.P. Tonge, P. Morgan, H.S. Rzepa, F. Cotterill, N. Day, and M.J. Harvey, "SPECTRa: The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories", Journal of Chemical Information and Modeling, vol. 48, pp. 1571-1581, 2008. https://doi.org/10.1021/ci7004737
H.S. Rzepa, and B.C. Challis, "The Mechanism Of Diazo-Coupling To Indoles And The Effect Of Steric Hindrance On The Rate Limiting Step", Zenodo, 1975. https://doi.org/10.5281/zenodo.18758
H.S. Rzepa, "Hydrogen transfer reactions of indoles", 1974. http://doi.org/10044/1/20860
H.S. Rzepa, "Hydrogen transfer reactions of Indoles", 1974. https://doi.org/10.5281/zenodo.18777
H.S. Rzepa, "C 25 H 34 Cl 1 N 3 O 1", 2015. https://doi.org/10.14469/ch/191342
H.S. Rzepa, A. Lobo, M.S. Andrade, V.S. Silva, and A.M. Lourenco, "Chiroptical properties of streptorubin B – the synergy between theory and experiment.", 2015. https://doi.org/10.5281/zenodo.18632

Tags:Academia, Academic publishing, Archival science, author, Data management, Digital library, EPrints, Institutional repository, Knowledge, Knowledge representation, Library science, metadata, Open access, PDF, personal web page, Preprint, Publishing, Repository, researcher, ROMEO GREEN, Science, Technology/Internet, United Kingdom, web server
Posted in Chemical IT

A convincing example of the need for data repositories. FAIR Data.

Thursday, January 15th, 2015

Derek Lowe in his In the Pipeline blog is famed for spotting unusual claims in the literature and subjecting them to analysis. This one is entitled Odd Structures, Subjected to Powerful Computations. He looks at this image below, and finds the structures represented there might be a mistake, based on his considerable experience of these kinds of molecules. I expect he had a gut feeling within seconds of seeing the diagram.

Indeed so; you will now find that the authors have apparently acknowledged a mistake[1]. My interest piqued, I went to the article, and immediately tracked down the supplementary information. Surely, if these molecules had been subjected to powerful computation, this supporting information should contain coordinates of some kind that would allow a correlation with the 2D structural representation shown above. I have just returned from FORCE2015, a three-day event in Oxford. From the detailed agenda, you can see that a lot of the conference centered around what is called FAIR Data. FAIR stands for:

Findable
Accessible
Interoperable
Re-usable

So I then set out to find if the supplementary information WAS FAIR. Well, check for yourself (unlike the narrative article, the data should be accessible outside of the paywall, i.e. you should not need a subscription to access it). It is certainly big, running out to 45 pages, in the form of a paginated PDF file (the norm). The table of contents does not refer to data as such, but it does quote 25 figures, from which you might just be able to extract some data. But no molecules as such! So:

No data is findable, although the PDF which might contain it is reasonably so.
The data is not easily accessible,
let alone interoperable (thus many of the charts were probably created using spreadsheet software, but the source files for these are not available),
and not re-usable (certainly not without loss and possible error in any attempt at capture).

I think it fair to say that the data for these powerful computations are not FAIR. Had we had at least some coordinates (the computations involved molecular mechanics based dynamics simulations, which certainly involve manipulating atom coordinates in some form) then the structures shown in the figure above could be checked, and perhaps even the apparent error would have been quickly spotted.

Derek does not make the point about FAIR data (to be fair, he was not at FORCE2015) and so I will make the case. If you are reporting a computational model or simulation, there is no excuse for not supplying FAIR data to accompany it. If the data is FAIR it will be interoperable and re-usable. And this will instantly allow anyone to check e.g. the structures above. You would not need to have Derek’s vast experience and instinct (although having it also helps). And of course we might presume that there were 2-3 referees that also looked at the article, and presumably none of them requested FAIR data.
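As an illustration of what re-usable coordinates make possible, here is a small Python sketch that reads atoms from text in the common XYZ layout (atom count, comment line, then one atom per line) and measures an interatomic distance. The water geometry in the usage note is an invented example, not data from the article[1], and the function names are hypothetical.

```python
import math


def parse_xyz(text):
    """Parse XYZ-format text into a list of (symbol, x, y, z) tuples."""
    lines = text.strip().splitlines()
    count = int(lines[0])          # first line: number of atoms
    atoms = []
    for line in lines[2:2 + count]:  # second line is a free-form comment
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    return atoms


def distance(atom_a, atom_b):
    """Distance between two atoms, in the units of the file (usually Å)."""
    return math.dist(atom_a[1:], atom_b[1:])


# Invented test geometry for water, coordinates in Å
WATER = """3
illustrative water molecule
O  0.000  0.000  0.000
H  0.757  0.586  0.000
H -0.757  0.586  0.000
"""
atoms = parse_xyz(WATER)
oh_bond = distance(atoms[0], atoms[1])
```

With coordinates supplied in this kind of form, checking a suspicious bond length or connectivity in a published structure becomes a matter of seconds rather than of expert intuition.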

Oh, if you are interested in my take on FAIR data, I gave a talk about that at FORCE2015, which you are welcome to view; I hope it constitutes a FAIR talk!

References

K.J. Kohlhoff, D. Shukla, M. Lawrenz, G.R. Bowman, D.E. Konerding, D. Belov, R.B. Altman, and V.S. Pande, "Cloud-based simulations on Google Exacycle reveal ligand modulation of GPCR activation pathways", Nature Chemistry, vol. 6, pp. 15-21, 2013. https://doi.org/10.1038/nchem.1821

Tags:created using spreadsheet software, Derek Lowe, Oxford, PDF, simulation
Posted in Chemical IT, General

A computed mechanistic pathway for the formation of an amide from an acid and an amine in non-polar solution.

Wednesday, November 12th, 2014

In London, one has the pleasure of attending occasional one day meetings at Burlington House, home of the Royal Society of Chemistry. On November 5th this year, there was an excellent meeting on the topic of Challenges in Catalysis, and you can see the speakers and (some of) their slides here. One talk on the topic of Direct amide formation – the issues, the art, the industrial application by Dave Jackson caught my interest. He asked whether an amide could be formed directly from a carboxylic acid and an amine without the intervention of an explicit catalyst. The answer involved noting that the carboxylic acid was itself a catalyst in the process, and a full mechanistic exploration of this aspect can be found in an article published in collaboration with Andy Whiting’s group at Durham.[1] My after-thoughts in the pub centered around the recollection that I had written some blog posts about the reaction between hydroxylamine and propanone. Might there be any similarity between the two mechanisms?

That mechanism can be represented as above, which (as per the hydroxylamine mechanism) comprises three transition states and two intermediates. The original study[1] reported just the one TS1. Editing out the starting coordinates from the PDF-based supporting information (the process is not always easy) enabled an IRC (intrinsic reaction coordinate) for TS1 to be easily computed.[2]

This reveals that TS1 is not the complete story; there is still much of the reaction left to complete. The energy profile is charted (using the ωB97XD/6-311G(d,p)/SCRF=p-xylene method) according to the scheme above as reactants ⇒ TS1 ⇒ Intermediate 1 ⇒ TS2 ⇒ Tetrahedral intermediate ⇒ TS3 ⇒ products. Computed properties for this more detailed pathway are transcluded here from the digital repository[3] and appear at the end of this post.

TS1 yields what might be called a zwitterionic intermediate. However, this has a relatively small dipole moment (5.7D). Thus, against accepted wisdom, such apparently ionic intermediates CAN be involved in reactions occurring in non-polar solvents!
TS2 is rather unexpected, involving synchronous proton transfer coupled to anomerically related C-OH bond rotation. This rotation changes the anomeric interactions with the adjacent substituents; in my experience I have never before seen a reaction mode quite like this one!
TS3 collapses the tetrahedral intermediate by synchronous proton transfer and C-O bond cleavage, and is (in this model) the rate determining step. The free energy barrier corresponds to a half-life at 298K of about half an hour.
The product is calculated as exoenergic with respect to reactants; the reaction does drive to form an amide (and any catalysis of course will not influence that final outcome, only its kinetics).
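The link between a free energy barrier and a half-life quoted for TS3 follows from the Eyring equation, k = (k_B T/h) exp(-ΔG‡/RT), with t½ = ln 2/k. The Python sketch below works in both directions. It assumes a transmission coefficient of 1 and treats the rate-limiting step as effectively first order, neither of which is stated in the post, and the function names are hypothetical.

```python
import math

K_B = 1.380649e-23    # Boltzmann constant, J/K
H = 6.62607015e-34    # Planck constant, J s
R = 8.314462618       # gas constant, J/(mol K)
J_PER_KCAL = 4184.0


def rate_constant(dg_kcal, temperature=298.0):
    """Eyring rate constant in s^-1 (transmission coefficient assumed to be 1)."""
    return (K_B * temperature / H) * math.exp(-dg_kcal * J_PER_KCAL / (R * temperature))


def half_life(dg_kcal, temperature=298.0):
    """First-order half-life in seconds for a given barrier in kcal/mol."""
    return math.log(2) / rate_constant(dg_kcal, temperature)


def barrier_from_half_life(t_half_s, temperature=298.0):
    """Barrier in kcal/mol implied by a first-order half-life in seconds."""
    k = math.log(2) / t_half_s
    return R * temperature * math.log(K_B * temperature / (H * k)) / J_PER_KCAL


# Half an hour at 298 K, as quoted for TS3 above
barrier = barrier_from_half_life(30 * 60)
```

Under these assumptions a half-life of half an hour at 298 K corresponds to a barrier of roughly 22 kcal/mol, and each additional 1.4 kcal/mol or so lengthens the half-life about tenfold.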

If you read the original article[1] you will realise the above only scratches the surface of the many fascinating properties of this apparently very simple reaction. Thus, not addressed above is why amides are only formed in certain solvents (xylene for example) but not others. The solvent may have a specific role to play which is not modelled simply by its continuum dielectric or its boiling point. There is much else that could be said.

References

H. Charville, D.A. Jackson, G. Hodges, A. Whiting, and M.R. Wilson, "The Uncatalyzed Direct Amide Formation Reaction – Mechanism Studies and the Key Role of Carboxylic Acid H‐Bonding", European Journal of Organic Chemistry, vol. 2011, pp. 5981-5990, 2011. https://doi.org/10.1002/ejoc.201100714
H.S. Rzepa, "C21H21NO4", 2014. https://doi.org/10.14469/ch/74636
H.S. Rzepa, "A computed mechanistic pathway for the formation of an amide from an acid and an amine in non-polar solution.", 2014. https://doi.org/10.6084/m9.figshare.1235300

Tags:Andy Whiting, Dave Jackson, dielectric, Durham, energy profile, free energy barrier, London, non-polar solution, PDF, Royal Society of Chemistry
Posted in reaction mechanism

Electronic notebooks: a peek into the future?

Tuesday, September 16th, 2014

ELNs (electronic laboratory notebooks) have been around for a long time in chemistry, largely of course due to the needs of the pharmaceutical industries. We did our first extensive evaluation probably at least 15 years ago, and nowadays there are many on the commercial market, with a few more coming from opensource communities. Here I thought I would bring to your attention the potential of an interesting new entrant from the open community.

My very first post on this blog six years ago related to incorporation of the Jmol molecular viewer into posts, and it has been a feature of many since. A little more than two years ago, Jmol was recast into JSmol. This had become possible because JavaScript engines built into modern web browsers were finally getting the sort of performance needed to display molecules (years and years ago, let's say ~1990, such display required very fancy hardware kit such as Silicon Graphics workstations). Around the same time, another well-established Java-based molecule sketcher, JME (Java molecular editor), also became JavaScript based. My own interest in this sort of Web-based behaviour actually crystallised last December, when I decided to refactor my own lecture notes into a tablet-friendly format using JSmol, with some questions directed at the formidably excellent Jmol discussion list. One of these related to how students might annotate such lecture notes with chemical sketches and store the results for future study or revision. Otis Rothenberger started exploring various mechanisms for such local storage (using Web browsers), and in the last month or so has found a way of exploiting something called HTML5 local storage, which allows the sort of capacity needed. These three technologies have now come together on Otis’ site, which you can now view as CheMagic Notebook (this might be a .com site, but I believe the concept is very much open).

Together with the Virtual model kit (VMK, itself now part of JSmol) this combination is starting to resemble a very interesting mechanism for creating an immersive lecture note environment, almost you might say a lecture note ecosystem. I would argue that for the first 30 years of the digital document era, most people preparing lecture notes became mesmerised (distracted?) by the need to print the outcomes with complete fidelity. It is only recently that the focus has turned to “beyond the PDF” (or beyond the PPT) and much richer mechanisms. So now we have lecture notes morphing into an ecosystem where:

the objects themselves can be interactive (3D models, spectra, animations etc)
or reference further models and associated data held in digital repositories
or built from scratch in response to stimulation from peers, tutorials, workshops or lectures (using eg VMK or JME)
and such annotations in effect themselves can be spliced into the student’s own copy of these notes,
with the whole being regarded as a running notebook created from the initial seed of a lecturer’s materials augmented by the student’s own annotations.

I have focused here on where I started, i.e. refactoring my own lecture notes. But the above concepts could easily morph into eg a research project notebook, a rebundling into smaller segments which are themselves published into digital repositories (and there assigned their own persistent digital object identifiers) and ultimately further morphing into scholarly articles submitted to say a journal. These could represent a continuum, not discrete (and non-communicating) objects.

So will “lecture notes” actually start to change from their conventional (printable) form into something related to the above? Well, I have not addressed the largest hurdle preventing this: giving the content creators (i.e. the lecturers) the training, skills and most importantly the motivation to start to venture down this pathway. Otis has shown it should be technically possible. Come back and revisit this post in ten years' time to see what actually did happen!

Tags:.com, chemical sketches, Java, JavaScript, lecturer, molecular editor, PDF, pharmaceutical industries, Silicon Graphics, three technologies, web browsers, Web-based behaviour
Posted in Chemical IT | No Comments »

Data nightmares: B40 and counting its π-electrons

Saturday, July 19th, 2014

Whilst clusters of carbon atoms are well-known, my eye was caught by a recent article describing the detection of a cluster of boron atoms, B₄₀ to be specific.[1] My interest was in how the σ and π-electrons were partitioned. In a C₄₀, one can reliably predict that each carbon would contribute precisely one π-electron. But boron, being more electropositive, does not always play like that. Having one electron less per atom, one might imagine that a fullerene-like boron cluster would have no π-electrons. But the element has a propensity[2] to promote its σ-electrons into the π-manifold, leaving a σ-hole. So how many π-electrons does B₄₀ have? These sorts of clusters are difficult to build using regular structure editors, and so coordinates are essential. The starting point for a set of coordinates with which to compute a wavefunction was the supporting information. Here is the relevant page: The coordinates are certainly there (that is not always the case), but you have to know a few tricks to make them usable.

Open Adobe Reader, select the coordinates and copy
Paste into any application which recognises text. I used an old stalwart on the Mac, BBedit. It is reliable!
But no, it produces a row of skull&crossbones characters (the authors of the program clearly have a sense of humour)
Thinking that BBedit might have let me down (for the first time), I tried Word. A little less humour, but the same result.
There are lots of web sites out there that claim to convert PDF files directly to Word files. Again, no luck, the coordinates are now entirely missing!
Right, time for the big guns. Adobe Acrobat XI converts .PDF to .DOC, and (if you jump through a lot of hoops to register etc) they even give you a 30 day trial. Well, at least it gives numbers. But notice that the line breaks are missing, and all the numbers flow from one line to another.
Another copy/paste from Word to BBedit, and now I have all the numbers, and adding 40 line breaks is all that is needed (there is sometimes some skill in knowing where to add them by the way).^‡ The time taken from step 1 to step 7 was about 90 minutes (including a necessary cup of tea to recover from steps 1-5, and the realisation that the time was not wasted, since I could blog the experience!).
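
The final step above, restoring the lost line breaks, can be sketched in a few lines of Python. This is only an illustration under a stated assumption: that each atom is recorded as an element symbol followed by three Cartesian coordinates, which is a common layout but not one confirmed here for this supporting information. The function name is hypothetical.

```python
def reflow_coordinates(flowed_text):
    """Regroup a run of whitespace-separated tokens into one atom per line.

    Assumes four fields per atom: element symbol, then x, y and z.
    """
    tokens = flowed_text.split()
    if len(tokens) % 4 != 0:
        raise ValueError("token count is not a multiple of 4; check the pasted text")

    lines = []
    for start in range(0, len(tokens), 4):
        symbol, x, y, z = tokens[start:start + 4]
        # float() fails loudly if a coordinate field is not numeric,
        # which flags a misaligned paste rather than silently accepting it.
        lines.append(f"{symbol:<2} {float(x):12.6f} {float(y):12.6f} {float(z):12.6f}")
    return "\n".join(lines)


# Made-up numbers, purely to show the regrouping.
print(reflow_coordinates("B 0.0 1.0 2.0 B 1.5 -0.5 0.25"))
```

The "skill in knowing where to add them" shows up here as the two checks: a token count that is not a multiple of four, or a non-numeric coordinate, usually means the copy/paste dropped or merged something.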

Well, I am sure you know what is coming next; my usual rant about how little most chemists truly value data and particularly its integrity and its semantics. And how little almost all journals understand data. Notice that the original article was published in Nature Chemistry. Note also a new journal from that stable, Scientific Data. The journal clearly thinks there is mileage in receiving scholarly articles about scientific data, and what they call data descriptors (they even got me to write a data descriptor a year or so back). It's a shame then that the same publisher allowed the decimation of the core data related to an article about B₄₀.

They have a widely read blog, perhaps they can comment?

One more point to make about data: a phrase has recently been coined: deposition with recognition. Here, I show how my own data has been recognised:

There are various other ways as well, and perhaps I will leave this to another post. To return to the chemistry (where we should have been at the start), I ran the calculation (B3LYP+D3/TZVP) and published the newly enhanced data, citing it in the usual way.[3],[4]^† To answer my question, for the D_2d geometry, B₄₀ has 24 π-electrons (there is some ambiguity, it could be 26). On average, the boron retains only ~0.65s, balanced by ~2.35p electrons. The most stable π-pair is shown below. At the centre of the ring is a strongly diatropic ring current (NICS = -42 ppm)[5] suggesting aromaticity (26 electrons = 4n+2).^¶

B40-29

I conclude by pondering whether the properties of any such boron cluster may in time prove to be directly related to the number of σ-to-π promotions.

^‡ Sadly, line breaks in lists of atom coordinates date back to an era of about 50 years ago when text files were first treated differently from binary files. Three different “standards” emerged for specifying a line break (DOS, Mac and Unix) in a text file and much confusion has there been ever since when moving these text files across operating systems. The modern way of doing it is to make line breaks redundant by instead marking up the file. The standard chemical markup, invented in 1996, and formally published in 1999[6], is CML. You will find such CML coordinates in the deposited data from this calculation.[3] You will not have any problems with line breaks!
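
For plain text files, the three conventions mentioned in this footnote are CR+LF (DOS), CR alone (classic Mac) and LF alone (Unix). A minimal sketch of normalising all of them to the Unix form follows; the function name is an illustrative choice, and this is a workaround for text files rather than a substitute for marked-up coordinates.

```python
def normalise_line_breaks(text):
    """Convert DOS (CR+LF) and classic Mac (CR) line breaks to Unix (LF)."""
    # Replace CR+LF first, so that a DOS line break is not turned into two breaks.
    return text.replace("\r\n", "\n").replace("\r", "\n")


mixed = "B 0.0 0.0 0.0\r\nB 1.0 0.0 0.0\rB 0.0 1.0 0.0\n"
print(normalise_line_breaks(mixed).splitlines())
```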

^†Publication assigns a DataCite DOI. This takes about 48 hours to propagate to CrossRef, which is here used by the KCite WordPress plugin to retrieve the metadata and compose a citation. If KCite queries CrossRef before the metadata has propagated, it does not generate a citation. If you are reading this and see no citation, please revisit after 48 hours have elapsed.

^¶The diatropicity is inverted to paratropicity (NICS = +28 ppm) when two electrons are removed to create the dication.[7] This inversion is normally a good test of aromaticity/antiaromaticity.
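
The electron counts quoted in the post and in this footnote can be checked against the simple 4n+2 / 4n counting rule. The sketch below does only that counting. It assumes the two electrons removed to form the dication come from a count of 26 π-electrons, which the text implies but does not state, and the function name is illustrative.

```python
def huckel_class(pi_electrons):
    """Classify an electron count as '4n+2', '4n' or 'neither' (odd counts)."""
    if pi_electrons % 4 == 2:
        return "4n+2"
    if pi_electrons % 4 == 0:
        return "4n"
    return "neither"


for label, count in [("neutral, upper estimate", 26),
                     ("neutral, lower estimate", 24),
                     ("dication, assuming 26 - 2", 24)]:
    print(f"{label}: {count} electrons -> {huckel_class(count)}")
```

On this counting alone, 26 electrons fit 4n+2 (n = 6) and 24 fit 4n (n = 6), which is in line with the diatropic and paratropic NICS values reported for the neutral cluster and the dication respectively.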

References

H. Zhai, Y. Zhao, W. Li, Q. Chen, H. Bai, H. Hu, Z.A. Piazza, W. Tian, H. Lu, Y. Wu, Y. Mu, G. Wei, Z. Liu, J. Li, S. Li, and L. Wang, "Observation of an all-boron fullerene", Nature Chemistry, vol. 6, pp. 727-731, 2014. https://doi.org/10.1038/nchem.1999
H.S. Rzepa, "The distortivity of π-electrons in conjugated boron rings", Physical Chemistry Chemical Physics, vol. 11, pp. 10042, 2009. https://doi.org/10.1039/b911817a
H.S. Rzepa, "Gaussian Job Archive for B40", 2014. https://doi.org/10.6084/m9.figshare.1111454
H.S. Rzepa, "B 40", 2014. https://doi.org/10.14469/ch/24884
H.S. Rzepa, "Gaussian Job Archive for B40", 2014. https://doi.org/10.6084/m9.figshare.1111518
P. Murray-Rust, and H.S. Rzepa, "Chemical Markup, XML, and the Worldwide Web. 1. Basic Principles", Journal of Chemical Information and Computer Sciences, vol. 39, pp. 928-942, 1999. https://doi.org/10.1021/ci990052b
H.S. Rzepa, "Gaussian Job Archive for B40(2+)", 2014. https://doi.org/10.6084/m9.figshare.1111534

Tags:Acrobat, Adobe, chemical markup, DOS, operating systems, PDF, pence, Unix
Posted in Chemical IT, Interesting chemistry | 2 Comments »

“Text” Books in a (higher) education environment.

Friday, May 18th, 2012

Text books (is this a misnomer, much like “papers” are in journals?) in a higher-educational chemistry environment, I feel, are at a cross-roads. What happens next?

Faced with the ever-increasing costs of course texts, the department where I teach introduced a book-bundle about five years ago. The bundle included all the recommended texts for an appreciable discount over individual purchase. In their first week at Uni, students were encouraged to acquire the bundle. As it happened, I met them for a tutorial shortly after this acquisition. The bundle weighed some 9 kg, and came shrink-wrapped into a strapless plastic sheath, posing a rather slippery and weighty challenge for the student to get back to their residency. A few months later, I asked the students how they were getting on reading their newly acquired texts. You must appreciate that it does take a few months for students to start getting “street-wise” about their uni experience. One savvy student recounted they had learnt that if one did not remove the plastic outer layer from the bundle, it would retain much of its resale value to the next generation of incoming students.

Now, I will not mention the publisher of this particular bundle, but its cost is in the region of £50 per text. And for some students, the 1500 or so pages of each volume remain largely unread. Rarely if ever do I see these texts brought into tutorials, and I expect the margins remain blank, un-annotated with any questions or notes (it affects the resale value if you do that). Which is a stark contrast to how the students nowadays annotate their lecture note hand-outs often (but not invariably) issued to them at the start of a lecture. I also observe that increasingly my tutorials are effectively annotated by the students attending (2-4 pages of notes can be taken during a 50 minute discussion. The unit can be declared as pages, since this is also done on paper).

Despite these trends, pedagogic usage of tablet devices such as Kindles and iPads remains relatively low. It is a chicken-and-egg situation. The aforementioned book bundle is not available for these devices, and if it were, then in the current business model, it would be DRM (digital-rights-management) protected to prevent resale, and would also probably retain (if not exceed) the cost of the printed version. Hardly attractive to a student. The lecture notes we distribute (as printed handouts) do indeed come as PDF versions which can be placed on a mobile tablet, but this advantage alone has not sufficed to promote rapid uptake of tablets here. Few materials are specifically optimised to take advantage of the unique features of a tablet, and so the printed lecture notes are considered acceptable. Perhaps this comes to the core of what such tablets are supposed to be. Are they devices for “content consumption”, or should we also expect them to be capable of “content creation”? Lecture (and tutorial) annotation is of course content creation (or perhaps augmentation).

I might also take a look at the situation from the point of view of the textbook author. Unless you are a big name, you might expect to redeem about 10% royalties from one of the traditional publishers of academic texts. It might take you a year or so to write it, and you would expect to issue a further edition five years down the line if the book is successful. Two generations ago, every academic might be expected to write at least one book. I suspect that aspect has reduced nowadays; authors can hardly be encouraged to write if they think there is a prospect that the shrink-wrapping might not even be removed! If you are intending to write a text about, let's say, stereochemistry, you also have to accept the 2D limitations of a printed book, or the inability to, say, animate a reaction path.

Where are these thoughts leading? Well, I do have to give an explicit example; Steve Jobs’ vision of the educational text-book, re-invented along the lines of what he famously introduced for music distribution. There, he recognised that the (presumed illegal) sharing of music via download sites that preceded the iTunes store was not a sustainable model. The $0.99 download was conspicuously cheaper than the price of a physical music CD (excepting classical music, which did become absurdly cheap in this form), and as a compromise, sharing was permitted only between devices owned by you rather than more widely amongst your friends. The same model was introduced for the iBook store. Here, the author of an eBook (I am no longer calling it a textbook) can if they wish retain 70% of whatever income it generates (it can also be free of course). The unit price was a fraction of the traditional paper-based book, low enough that the DRM-imposed inability to resell it was less of an issue.

What are the downsides of moving on from paper?

Well, unlike a paper book which is instantly useable, the reader has to purchase a device. This device can cost more than the book bundle referred to above, although at its cheapest, the device is actually only about half the cost of the book bundle. And one might expect that device to last only 2-4 years before it becomes obsolete.
It can be lost or damaged, although unlike a paper book, the online content can be readily restored at zero cost.
If you purchase an eBook for one (proprietary) device, you cannot transfer it to another such device (say Kindle to iPad or vice versa), although if the content is free, that would not matter.
Authors of such texts will have to retrain themselves to produce ebooks; it is not just a matter of using a standard word processor any more. I suspect writing/imaging/styling/scripting/widgeting (a verb for this collective process is needed; how about to flow?) an ebook takes a lot longer than word processing a text-book.
You might have to consider the ongoing cost of using an ebook. By this I mean the data-plan that you might need in place to download components which are not actually part of the book (see below).

The upsides? Well, rather than my producing a list at this point, you might want to take a look at the first two examples below, both created by Bob Hanson, and think about how such inclusion in an ebook might enhance it:

A device-sensitive page for display (try this out on an iPad or Android tablet; the Kindle might be more of a challenge).
A page for building and minimising a molecular model
This example is included, since it belongs to a chemistry text book, but actually would exist on a mobile device in functional form, if not actually a component of an ebook.

So an ebook becomes an environment where you can download a model from public databases, and annotate it with properties etc. Or you could use your ebook to build a model from scratch, then minimise its (molecular mechanics) energy, to say explore conformational analysis in the context of a chapter on the topic.

Well, at the start I posed the question what happens next? The two above examples give possible answers. An equally interesting question might then be who makes it happen? Will that be the evolutionary role of the traditional publishing houses? Will a new generation of skilful author capable of “flowing” an ebook emerge? Will students instead favour retaining their dependency on paper? Watch this space.

Tags:author, Bob Hanson, energy, GBP, iPads, PDF, skilful author, Skolnik, Steve Job, Steve Jobs, tablet devices, textbook author, Tutorial material, USD
Posted in General | 3 Comments »