Chemical IT « Henry Rzepa's blog

Archive for the ‘Chemical IT’ Category

A visualisation of the effects of conjugation; dienes and biaryls.

Tuesday, August 25th, 2015

Here is another exploration of simple chemical concepts using crystal structures. Consider a simple diene: how does the central C-C bond length respond to the torsion angle between the two C=C bonds?

[Figure]

The search of the CSD (Cambridge Structural Database) is constrained to R < 5%, no errors and no disorder, and the central C-C bond is specified to be acyclic.^‡

[Figure: central C-C bond length versus C=C-C=C torsion angle for acyclic dienes]

Note first that the hotspot occurs for a torsion angle of 180°, a trans diene.
There is just a hint that the C-C distance for a cis-diene might be a little shorter than the trans diene, but this might not be significant.
There is a gentle curve illustrating that the C-C distance is indeed a maximum at 90°.
The C-C bond extends from ~1.445Å when the two double bonds are coplanar (fully conjugated) to ~1.48Å when orthogonal. Not much of a change, but statistically highly significant.
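The torsion angle used as the search variable above can be computed from the coordinates of the four carbon atoms of the C=C-C=C unit. The following is only a minimal sketch of the standard dihedral-angle formula; the function name and the coordinates in the usage note are illustrative assumptions and are not taken from any CSD entry or CSD software.

```python
import math


def torsion_angle(p1, p2, p3, p4):
    """Return the torsion angle p1-p2-p3-p4 in degrees, between -180 and 180."""

    def sub(a, b):
        return tuple(x - y for x, y in zip(a, b))

    def cross(a, b):
        return (
            a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0],
        )

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    b1 = sub(p2, p1)  # first double bond
    b2 = sub(p3, p2)  # central C-C bond
    b3 = sub(p4, p3)  # second double bond
    n1 = cross(b1, b2)  # normal to the plane of atoms 1, 2, 3
    n2 = cross(b2, b3)  # normal to the plane of atoms 2, 3, 4
    b2_length = math.sqrt(dot(b2, b2))
    # atan2 keeps the sign of the angle and is stable near 0 and 180 degrees
    y = dot(cross(n1, n2), b2) / b2_length
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

With idealised (made-up) planar coordinates such as `(-0.7, 1.1, 0)`, `(0, 0, 0)`, `(1.46, 0, 0)` and `(2.16, -1.1, 0)`, the function describes a trans diene; moving the last atom out of the plane changes the torsion accordingly.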

Here is another search, this time of the C=C-C=C motif embedded into a biaryl, of which there are far more examples. This time, the (red) hotspot is actually at 90°, with local (green) hotspots at 0 and 180° but also at 45 and 135°. Again, you can easily spot the maximum in C-C bond length at 90° but notice how much smaller the bond lengthening is (~0.01Å). This lengthening is inhibited by retention of the aromaticity of the two aryl rings; again the statistical effect is highly significant. Perhaps also significant is that the C-C bond at torsions of 0 or 180° appears to be no shorter than the values at 45 and 135°.

[Figure: central C-C bond length versus torsion angle for biaryls]

Both these searches took about 5 minutes each, and serve to illustrate just how many basic chemical concepts can be teased out of a statistical analysis of crystal structures.

^‡The analogous diagram for O=C-C=C is shown below;

[Figure: the analogous diagram for O=C-C=C]

That for O=C-C=O is different however;

[Figure: the analogous diagram for O=C-C=O]


A (light) introductory tutorial on Research Data Management (in chemistry).

Thursday, August 20th, 2015

Management of research (data) outputs is a hot topic in the UK at the moment, although the topic has been rumbling for five years or more. Most research-active higher educational establishments have published, or are about to publish, general guidelines, which predominantly take the form of aspirational targets rather than actionable examples or use-cases.^‡ Because the concepts remain somewhat abstract, one can encounter questions from researchers such as “how should I go about achieving such RDM (research data management)?” I thought it might be useful to summarise here some key features in the form of an FAQ that can help answer that question. I will concentrate purely on the sub-set of chemistry about which I know most.

I will start by exploring the acronym FAIR data.

F is findable. This means that metadata is a key part of the process, since it is this information that allows the research data to be more easily found, not only by other humans but by software engines which specialise in such activity.
A is accessible. And easily so. Which means a standard identifier to get to the research data, with no paywalls, account registrations or other obstructions. It should ideally be possible to access data anonymously, without necessarily revealing personal information.
I is inter-operable. This is harder to define exactly, but the essence is that it should be possible to re-use the data in a context different from the original, and perhaps even outside the subject domain where it was created. For example, if data was collected using one specific instrument, it should be possible to use it without necessarily having access to either an identical instrument or to the software associated with that instrument.
R is reusable. There should be sufficient information about the data and its parameters to repeat its collection independently of the original if necessary, or to re-use it to start a new data collection. Reusable also means by software, and not just by a human.

The first two properties are easily achieved, since standard procedures can be used. The last two properties are potentially more difficult, since they require more intervention or thought by both the depositor and the re-user. So I will concentrate really on the first two, since by and large they will satisfy most of the general guidelines issued by funders and universities, but note that we must not in the medium to longer term forget the last two.
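To make the F of FAIR a little more concrete, here is a minimal sketch of what a metadata record for a data package might contain, together with a check that nothing essential is missing. The field names are loosely modelled on the kind of mandatory properties a DOI registry asks for, but both they and the helper `missing_fields` are illustrative assumptions rather than an actual schema. The identifier, creator, title and year are those of reference [8]; the publisher and resource type are assumed values.

```python
# Fields assumed (for illustration only) to be the minimum needed for findability.
REQUIRED_FIELDS = (
    "identifier",
    "creators",
    "titles",
    "publisher",
    "publicationYear",
    "resourceType",
)


def missing_fields(record):
    """Return the required fields that are absent or empty in a metadata record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


example_record = {
    "identifier": "10.14469/ch/191378",   # the DOI of the data package
    "creators": ["H.S. Rzepa"],
    "titles": ["C 8 H 8 B 2"],
    "publisher": "Imperial College London",  # assumed value, not from the text
    "publicationYear": 2015,
    "resourceType": "Dataset",               # assumed value, not from the text
}
```

Calling `missing_fields(example_record)` on a complete record gives an empty list, whereas a record consisting only of an identifier would be flagged as lacking everything that allows a human or a search engine to find it.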

I will now list some typical types of data that I have personal experience of. As the community increasingly participates in such RDM, this list will expand by “crowd-sourcing”; if your type of data is not listed, do not give up!

Data generated by software without instrumental inputs, a good example of which are the outputs of computational chemistry. I have the most personal experience in this area, having been at it for ten years or more[1],[2] and examples are scattered throughout this blog (and in many of our recent research publications).
Software developed as part of the data collection process and which might be required by others to re-use the data. An example of such was described in a previous post, and has been RDMed here.[3]
Data generated by software associated with instrumental outputs. In chemistry this means spectrometers and other instruments, most of which now have computers which handle the data outputs. Specific examples might be crystal structures, NMR, IR, MS and optical (including chiroptical) spectra.
- Crystal structures are the gold standard in RDM, since they fulfil all the requirements of FAIR and so merit a special mention here. In the last year, the Cambridge Structural Database (CSD) has implemented a standard access mechanism based on a digital object identifier (DOI).[4]
- The end point of many other instrumental outputs are PDF files. These do not easily achieve the IR of FAIR (see my comment above), but we will admit the PDF format as a temporary expedient until the use of semantically richer formats increases (the gold example here being the CIF format for crystal structures). You can see an example of PDF files here as a fileset[5] describing ¹H, ¹³C NMR, Mass spectrometry, ECD (electronic circular dichroism) and VCD (vibrational circular dichroism). Perhaps a better format for expressing many types of spectra is the Excel spreadsheet, which achieves a reasonable proportion of the IR aspirations of FAIR. Both expressions can be included in the collection.
- As a postscript to this list, I should mention that instrumental data is often found as:
  - raw (unreduced or unprocessed) data, which can be very large (e.g. Free induction decay time-domain data in NMR).
  - A version which has already been subjected to processing (Fourier transformed frequency-domain data in NMR, i.e. a spectrum). This is probably more suitable for archiving, but it's a fine judgement.
  - As a rough rule of thumb, chemistry data intended for archival should be less than about 1 Gb.
Synthetic methodologies that describe the preparation and characterisation of molecules. You can see an example of such data here.[6]
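The distinction drawn above between raw time-domain data (the free induction decay) and the processed frequency-domain spectrum can be illustrated with a toy example. This is only a sketch under stated assumptions: the "FID" is a synthetic, noise-free, exponentially damped complex oscillation, the transform is a plain discrete Fourier transform written out in full rather than an optimised FFT, and none of the names or parameter values comes from real spectrometer software.

```python
import cmath
import math


def synthetic_fid(n_points, freq_bin, decay):
    """A damped complex oscillation standing in for raw time-domain data."""
    return [
        cmath.exp(2j * math.pi * freq_bin * n / n_points) * math.exp(-n / decay)
        for n in range(n_points)
    ]


def dft(signal):
    """Plain discrete Fourier transform: time domain to frequency domain."""
    n_points = len(signal)
    return [
        sum(
            signal[n] * cmath.exp(-2j * math.pi * k * n / n_points)
            for n in range(n_points)
        )
        for k in range(n_points)
    ]


# Hypothetical parameters: 64 points, a single resonance in frequency bin 5.
fid = synthetic_fid(n_points=64, freq_bin=5, decay=16.0)
spectrum = [abs(value) for value in dft(fid)]
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

The processed spectrum has the same number of points as the raw data here, but it is the form a chemist actually reads, which is one reason it may be the more suitable candidate for archiving.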

Now I come to how the (molecular) data is packaged, and this is best described in terms of its granularity. There are perhaps four classes:

All the data is packaged into a single compressed (ZIP) archive. An example can be found here[7] containing coordinates for 134,000 molecules. If your interest is in just one of these molecules, then you could argue that this data does not fully conform to the F of FAIR, since it contains no information (metadata) about individual molecules.
The next packaging is (in chemistry) for a specific molecule (or perhaps reaction). An example is again[5], which contains data about a specific molecule, and that molecule is itself defined by the inclusion of e.g. a Chemdraw file. Another example[6] relates to reaction information, and also includes spectroscopic data in the form of a JCAMP-DX file, which is semantically preferable to e.g. an Excel spreadsheet or just a PDF file. Most of the examples on this blog are in this category, relating to quantum chemical computations of a specific molecule.[8] I will concentrate here just on this second type of packaging.
The most finely-grained packaging is at the molecular property level. To illustrate this, go visit e.g. the Wikipedia page for aspirin, where you will find a ChemBox containing property data. In the future, these ChemBox properties will be interactively populated from a data repository known as WikiData. This type of RDM is still developing, and I include it here as a placeholder and to counterbalance the first category above!
This category is a little different from the previous three; it relates to a collection of packages, where the granularity of class 2 above is retained, but boxed up into a project collection.[9]

And now to look at the life cycle of some data.

The data starts off as live. This is some sort of holding store which members of the group can access/contribute to. It can be a local sharepoint or a cloud-based resource such as DropBox, but it could still be a simple DVD or USB storage device.
- We have for some ten years now used a locally built live data store (which is itself archived at Zenodo as software[10]) and which serves to track a user’s experiments, including initiation and completion dates and times, to serve as a simple interface for archival, to record published experiments and to flag requested data embargoes (see below) and to provide a search interface for all of this. Pretty much the description of an electronic (laboratory) notebook. We created our own[2] because few commercial products (either ten years ago, or even now) offer the ability to seamlessly incorporate a Publish workflow which automates all the required actions of RDM as described here, and because it is something we might want to do 5-20 times a day. If your requirement is much less, such automation may not be needed.
When the data is stable and edited down to that which needs to be associated with an article (the narrative), it now needs archiving in a manner that will ensure its persistence for at least a decade or even longer.
Associated metadata describing the data also now needs to be assembled and this combined package is now sent to a data archive. These archives have special characteristics, one of which is that they can issue a persistent identifier we know as the DOI. This itself is issued by a registry, which for data is usefully done by an organisation known as DataCite. If desired, two or more of these packages can be associated with a collection, and the collection itself can also be given a DOI.[9]
A copy of the metadata is sent to DataCite when the DOI is issued. The search engine that indexes this information is also at DataCite.
Now all that needs doing is that the Data DOIs are all cited in the article to be published, or you can (also or instead) cite the DOI for a collection. An accepted article is itself issued in due course with a DOI (this time by an agency known as CrossRef on behalf of the publisher).
To complete the virtuous cycle, the article DOIs can be retrospectively added to the metadata for each data package (or the collection of packages), ensuring that the data references the narrative, and that the narrative references the data.
You will note from the virtuous cycle in item 5, that timing becomes important. You have to archive the data and mint a DOI in order to cite it in an article. This sounds like publishing the data before the article has been accepted, which would have the advantage that referees could access it as part of their QA process for the article. However, it may be more suitable to simply reserve a DOI for the data for inclusion in an article, but not make it public until that article has itself been accepted and published. This process is called embargoing; I will defer discussion of this, because this tends to vary according to repository and its implementation is still evolving.
The final action might be to register this activity on any institutional software that monitors and aggregates research outputs. We use Symplectic to achieve this, it having the ability to record both a research publication and increasingly properties of the data itself.
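Because a DOI is a standard identifier, the steps above lend themselves to automation. The sketch below builds, but deliberately does not send, an HTTP request that asks the DOI resolver for machine-readable metadata instead of the human-readable landing page. The function names are hypothetical, and the media type is an assumption about what the registry's content negotiation service accepts, so it should be checked against current documentation. The DOI used is that of reference [3].

```python
from urllib.parse import quote
from urllib.request import Request


def doi_url(doi):
    """Turn a bare DOI into its resolver URL."""
    return "https://doi.org/" + quote(doi, safe="/")


def metadata_request(doi, media_type="application/vnd.datacite.datacite+json"):
    """Build a request for metadata about a DOI (media type is an assumption)."""
    return Request(doi_url(doi), headers={"Accept": media_type})


# Reference [3] of this post, archived at Zenodo.
request = metadata_request("10.5281/zenodo.19949")
```

Passing `request` to `urllib.request.urlopen` would perform the actual retrieval; that step needs network access and is left out here.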

By now you might be asking where you could explore further, and perchance even try things out.

zenodo.org/features is one good place to start; it will cost nothing; there is (within reason) no limitation to how much data can be archived. Zenodo also allows data to be retrieved from DropBox and Github (for code) for archival.
figshare.com allows you to sign up for free, but with limitations to the total data storage unless you upgrade to an institutional or paid account.
www.datadryad.org/pages/faq which charges $80-90 per deposition.
Institutional data repositories. The notes above were written based on the experiences we have had for almost nine years now with a local data repository we call SPECTRa,[1] where some 230,000 individual data packages are now archived. This one[11] dates from 2007 to illustrate its longevity. Unfortunately, only members of Imperial College can make use of it.

I realise now that I have written this all down that it is somewhat longer than I was expecting, and that this very length may well put some researchers off. Apart from RDM now being mandatory in the UK, it is also reasonable for researchers to ask “what is in it for me?” as a reward for persisting. I can only answer that one from my personal experiences:

The live data store (or uportal as we call it) has proved invaluable for recording our (computational) experiments. I often use it to track down calculations from years ago. As a laboratory notebook, it is minimalist, as is the learning curve and hence does not overwhelm. If more information is needed, one simply goes to the DOI recorded there for each experiment if archived, or the original inputs and outputs if not.
Assigning a DOI to a data package makes it really easy to share this with both collaborators and other researchers who express interest (the data is often too large to send by email).
Sometimes I use e.g. search.labs.datacite.org/help/examples to search the metadata created during the process in order to find (F) and access (A) old data, which is then very quickly amenable to re-use (R). OK, SciFinder or Reaxys it is not (yet!), but it is getting there.
One can get access statistics for the data. If you click on the link, you can see some datasets have been accessed more than 200 times. Someone must be finding them valuable! If you want to find out how much (UK) data is searchable in this manner, click here. Perhaps such statistics may even help get you promoted one day!
Having data available in this way enables one to construct more interesting tables or figures. This “figable” (yes, it's both a table and a figure) comes from a recent publication of ours.[12] It retrieves the data purely by its DOI and inserts it into display software (JSmol) to construct an instant molecular model. One can also use this approach for lecture notes and labs,[13] for blogs as here, and (if you are very brave) for research presentations.
Google Scholar detects data and citations to it equally with journal articles. This is part of my profile there, and there you can see both articles AND data. If you are keen-eyed, you will however note that the data does not contribute to my h-index (but arguably, it is more valuable to have some data sets accessed 200+ times rather than to be cited!).
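Searching the registered metadata, as mentioned in the list above, can also be scripted. The sketch below only constructs a search URL; the endpoint and parameter names are my assumption about the present-day DataCite REST API (not the search address quoted above), the function name is hypothetical, and the query terms are arbitrary.

```python
from urllib.parse import urlencode


def datacite_search_url(query, page_size=10):
    """Build a metadata search URL (endpoint and parameters are assumptions)."""
    params = {"query": query, "page[size]": page_size}
    return "https://api.datacite.org/dois?" + urlencode(params)


url = datacite_search_url("Rzepa kinetic isotope", page_size=5)
```

Fetching `url` with any HTTP client should return a list of matching records, each carrying the DOI needed to access (A) the data itself.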

^‡Some selected use-case examples can be viewed,[14] along with one specific to computational chemistry[15].

References

J. Downing, P. Murray-Rust, A.P. Tonge, P. Morgan, H.S. Rzepa, F. Cotterill, N. Day, and M.J. Harvey, "SPECTRa: The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories", Journal of Chemical Information and Modeling, vol. 48, pp. 1571-1581, 2008. https://doi.org/10.1021/ci7004737
M.J. Harvey, N.J. Mason, and H.S. Rzepa, "Digital Data Repositories in Chemistry and Their Integration with Journals and Electronic Notebooks", Journal of Chemical Information and Modeling, vol. 54, pp. 2627-2635, 2014. https://doi.org/10.1021/ci500302p
H.S. Rzepa, "Reproducibility In Science: Calculated Kinetic Isotope Effects For Cyclopropyl Carbonyl Radical.", 2015. https://doi.org/10.5281/zenodo.19949
A. Jana, V. Huch, H.S. Rzepa, and D. Scheschkewitz, "CCDC 977840: Experimental Crystal Structure Determination", 2014. https://doi.org/10.5517/cc11tj7m
H.S. Rzepa, F.L. Cherblanc, W.A. Herrebout, P. Bultinck, M.J. Fuchter, and Y.-P. Lo, "Mechanistic and chiroptical studies on the desulfurization of epidithiodioxopiperazines reveal universal retention of configuration at the bridgehead carbon atoms.", 2013. https://doi.org/10.6084/m9.figshare.777773
S. Gülten, "Bis dihydropyrimidine", ChemSpider Synthetic Pages, 2011. https://doi.org/10.1039/sp501
R. Ramakrishnan, P.O. Dral, M. Rupp, and O.A. von Lilienfeld, "Quantum chemistry structures and properties of 134 kilo molecules", 2014. https://doi.org/10.6084/m9.figshare.978904
H.S. Rzepa, "C 8 H 8 B 2", 2015. https://doi.org/10.14469/ch/191378
Y. Zhang, H.S. Rzepa, J.J.P. Stewart, P. Murray-Rust, M.J. Harvey, N. Mason, A. McLean, and Imperial College High Performance Computing Service, "Revised Cambridge NCI database", 2014. https://doi.org/10.14469/ch/2
S. Clifford, and M.J. Harvey, "hpc-portal: Public release", 2015. https://doi.org/10.5281/zenodo.19174
H.S. Rzepa, "C 7 H 10 Br 1 1", 2007. https://doi.org/10.14469/ch/46
H.S. Rzepa, A.V. Shernyukov, G.E. Salnikov, V.G. Shubin, and A.M. Genaev, "Noncatalytic Bromination of Benzene: A Combined Computational and Experimental Study", 2015. https://doi.org/10.6084/m9.figshare.1299202
K.K. Hii, H.S. Rzepa, and E.H. Smith, "Asymmetric Epoxidation: A Twinned Laboratory and Molecular Modeling Experiment for Upper-Level Organic Chemistry Students", Journal of Chemical Education, vol. 92, pp. 1385-1389, 2015. https://doi.org/10.1021/ed500398e
M. Addis, "RDM workflows and integrations for HEIs using hosted services", figshare, 2015. https://doi.org/10.6084/m9.figshare.1476832
M. Addis, and H.S. Rzepa, "Use of DOIs in data publishing in Computational Chemistry at Imperial College London", 2015. https://doi.org/10.6084/m9.figshare.1477994


Single Figure (nano)publications, reddit AMAs and other new approaches to research reporting

Wednesday, August 5th, 2015

I recently received two emails each with a subject line new approaches to research reporting. The traditional 350 year-old model of the (scientific) journal is undergoing upheavals at the moment with the introduction of APCs (article processing charges), a refereeing crisis and much more. Some argue that brand new thinking is now required. Here are two such innovations (and I leave you to judge whether that last word should have an appended ?).

To set the scene for the first, I will quote the abstract: “The single figure publication is a novel, efficient format by which to communicate scholarly advances. It will serve as a forerunner of the nano-publication, a modular unit of information critical for machine-driven data aggregation and knowledge integration.”[1] The kernel of this suggestion is (again I quote) “We offer the idea of the micro-publication unit, the single figure publication (SFP), to provide scholars with a real-world, manageable method to inform research.” I was struck by the overlap between this suggestion and the one you may find on many of the posts on this blog, where what I refer to as FAIR Data is assigned a digital object identifier (DOI) and included in the citation lists at the end of the post. The key phrase in the above abstract is machine-driven data aggregation and knowledge, although the article does not really go into any mechanisms for easily achieving this. It is my argument that the act of assigning a DOI carries with it the association that there is machine searchable metadata which can be retrieved and used for the aggregation and knowledge mining. The authors of this article, Do and Mobley, advocate adoption of nanopublications defined by inclusion of just a single figure (notably, not a table of results!) and some accompanying context which they claim would reduce the unit of publication to a more tractable size. This does raise the question of whether science needs more publications (in chemistry alone there are said to be more than a million published each year) or whether we should instead be concentrating our efforts on improving the data side of things by increasing its semantic content and formalising its structures, its preservation and curation. I certainly argue that far too little effort has been poured into these latter activities.
You only have to look at the typical SI (supporting information) associated with many chemistry articles to realise that in many cases they are still hardly fit for purpose. There is one concept introduced by Do and Mobley that also deserves mention. Their nanopublications are structured to be read by machines, not people. They will therefore not be refereed by people (my inference). They do not really discuss how else the quality will be assessed, but of course if you treat their nanopublication as essentially FAIR data, then it does become possible to develop methods of machine refereeing.

The second email alerted me to an article[2] in the Winnower, a forum that offers a bridge between “traditional scholarly publishing tools to traditional and non-traditional scholarly outputs—because scholarly communication doesn’t just happen in scholarly journals“. Here, the concept of scholarly communication is extended to the New Reddit Journal of Science and introduces the concept pioneered by reddit of the AMA, or “ask me anything” environment. I occasionally publish some of the posts on this blog to the Winnower, receiving in return the increasingly ubiquitous DOI. I have also occasionally quoted these DOIs in articles submitted to conventional chemistry journals. What we see now is the propagation of a Winnower DOI on to e.g. https://www.reddit.com/r/science/ where anyone^† can post a question related to the original research reporting. I must state that I do have some reservations about this. Whilst it is likely that the majority of traditional scholarly reporting is likely to receive no AMAs (just as a very high proportion of research articles attract few if any citations in other articles over a period of decades), it is also likely that the quality of posted AMAs may turn out to be very low. At which point the original researcher has to make a judgement as to whether to devote any of their increasingly precious and fragmented time to answering them. And if few if any answers are posted in response to an AMA, the system seems unlikely to flourish.

But what we see here are two serious attempts to develop new approaches to research reporting, and no doubt others will emerge. To quote Yogi Berra, the future is not what it used to be.

^†Anyone can also post to this blog to ask similar questions. But note that associating an ORCID with such comments is highly recommended. I do not think that reddit currently supports ORCID, but I would argue if the intent is serious, it certainly should.

References

L. Do, and W. Mobley, "Single Figure Publications: Towards a novel alternative format for scholarly communication", F1000Research, vol. 4, pp. 268, 2015. https://doi.org/10.12688/f1000research.6742.1
RobustTempComparison, and r/Science, "Science AMA Series: Climate models are more accurate than previous evaluations suggest. We are a bunch of scientists and graduate students who recently published a paper demonstrating this, Ask Us Anything!", The Winnower. https://doi.org/10.15200/winn.143871.12809


Intermolecular atom-atom bonds in crystals? The O…O case.

Saturday, July 25th, 2015

I recently followed this blogger's trail; link1 → link2 to arrive at this delightful short commentary on atom-atom bonds in crystals[1] by Jack Dunitz. Here he discusses that age-old question (to chemists), what is a bond? Even almost 100 years after Gilbert Lewis’ famous analysis,[2] we continue to ponder this question. Indeed, quite a debate on this topic broke out in a recent post here. My eye was caught by one example in Jack’s article: “The close stacking of planar anions, as occurs in salts of croconic acid …far from producing a lowering of the crystal energy, this stacking interaction in itself leads to an increase by several thousand kJ mol⁻¹ arising from Coulombic repulsion between the doubly negatively charged anions.” I thought I might explore this point a bit further in this post.

A search query of the Cambridge Structural Database was defined as below. Two non-bonded oxygen atoms are each attached to one carbon, each oxygen was defined as having one bonded atom (to carbon) and each assigned one negative charge. Addition of the usual constraints of R < 0.05, no errors, no disorder and specifying an intermolecular search produced 103 hits with the distance distribution shown below.

Firstly, you should be aware that the van der Waals radius for oxygen is ~1.5Å, and so any contacts less than 3.0Å become interesting. What becomes particularly exciting is the distinct cluster at ~2.5Å. Could these be ~30 examples of close encounters of the type noted by Dunitz? Well, a control search has to be done, this time for O-H-O motifs, with each OH distance plotted as below:

The hot-spot occurs when both OH distances are equal at ~1.22Å, or an O…O separation close to 2.45Å. Time to quote Dunitz again: “This large destabilization is, of course, more than compensated in the overall energy balance by the large stabilization arising from Coulombic interactions of the croconate anions with the surrounding cations.” In this case of course, the cation is a proton, residing at the half way point between the two oxygens. So two oxygens can indeed approach ~0.5Å closer than the sum of the vdw radii if a proton sits in-between them.
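The arithmetic behind these numbers is easily checked. The sketch below assumes, purely for illustration, a perfectly linear O-H-O arrangement (the angle is a parameter, so a bent geometry can be tried too) and uses the law of cosines to obtain the O…O separation from the two O-H distances; the function names are hypothetical.

```python
import math

VDW_RADIUS_O = 1.5  # Å, the approximate van der Waals radius quoted in the text


def oo_separation(d_oh_1, d_oh_2, angle_deg=180.0):
    """O...O distance from two O-H distances and the O-H-O angle (law of cosines)."""
    theta = math.radians(angle_deg)
    return math.sqrt(
        d_oh_1 ** 2 + d_oh_2 ** 2 - 2.0 * d_oh_1 * d_oh_2 * math.cos(theta)
    )


def shortfall_vs_vdw(distance, radius=VDW_RADIUS_O):
    """How far a contact falls below the sum of two van der Waals radii."""
    return 2.0 * radius - distance


separation = oo_separation(1.22, 1.22)      # symmetric, linear O-H-O
shortfall = shortfall_vs_vdw(separation)
```

For the symmetric linear case this gives a separation of about 2.44Å and a shortfall of about 0.56Å relative to the 3.0Å sum of the radii, in line with the ~2.45Å and ~0.5Å quoted above.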

What do we learn? Well, firstly that one should always have a reality check of the results of any crystal structure search. The search did specify that the oxygens be non-bonded but also that they should both carry a negative charge and that both should only have one bonded atom. That should in theory at least have excluded any C-O-H-O-C structures, so why were about 30 such examples found? I can only speculate here, but recollect that 50 years ago when the CSD was founded, hydrogen atoms were rarely identified from the electron density. They were instead placed or “idealised” to where they might be expected. Nowadays any contentious hydrogens are almost always located rather than idealised, but clearly their status as bona-fide atoms is not quite so strong as the rest of the periodic table. So in at least some of these 30 examples with short O…O contacts, we might expect there to lurk a (possibly unrecognised) proton. But one never knows, there may be some real examples of O…O contacts with no such proton intervening. Now these really would be interesting.

Postscript. F is isoelectronic with O(-); below is the same search as defined above, but for non-bonded CF…FC approaches.

The vdw radius of F is 1.45Å, hence any non-bonded contact <2.9Å is worth taking a look at. But notice the small cluster of about 10 compounds for which the value is ~2.15Å. The F-H-F plot shows a hot spot at ~2.3Å for the F…F separation, but there are zero hits for CF-H-FC. So these ten hits are indeed tantalising.

References

J.D. Dunitz, "Intermolecular atom–atom bonds in crystals?", IUCrJ, vol. 2, pp. 157-158, 2015. https://doi.org/10.1107/s2052252515002006
G.N. Lewis, "THE ATOM AND THE MOLECULE.", Journal of the American Chemical Society, vol. 38, pp. 762-785, 1916. https://doi.org/10.1021/ja02261a002


The 2015 Bradley-Mason prize for open chemistry.

Friday, June 26th, 2015

Open principles in the sciences in general and chemistry in particular are increasingly nowadays preached from funding councils down, but it can be more of a challenge to find innovative practitioners. Part of the problem perhaps is that many of the current reward systems for scientists do not always help promote openness. Jean-Claude Bradley was a young scientist who was passionately committed to practising open chemistry, even though when he started he could not have anticipated any honours for doing so. A year ago a one day meeting at Cambridge was held to celebrate his achievements, followed up with a special issue of the Journal of Cheminformatics. Peter Murray-Rust and I both contributed and following the meeting we decided to help promote Open Chemistry via an annual award to be called the Bradley-Mason prize. This would celebrate both “JC” himself and Nick Mason, who also made outstanding contributions to the cause whilst studying at Imperial College. The prize was initially to be given to an undergraduate student at Imperial, but was also extended to postgraduate students who have promoted and showcased open chemistry in their PhD researches.

Peter and I are delighted to announce the inaugural winners of this prize.

The postgraduate winner is Tom Phillips for his open blog describing his experiences as a PhD student and for leading by example. He has published his instrumental codes on Github (and now Zenodo[1]) and data and codes for reproducing the graphs in his work on the “lab on a chip” in Figshare[2] and through his blog has encouraged other research students to do the same. Tom has worked assiduously to ensure that all the articles describing his PhD work are or will be open access.[3]

The undergraduate winner is Tom Arrow for his “spare time” involvement with WikiMedia (the foundation that underpins the open Wikipedia), including participating in a Wikimedia EU hackathon in Lyon, France, and feeding his experiences and skills back into his undergraduate environment as well as enhancing the teaching Wiki used by his fellow students. Tom took the lead in introducing us to Wikidata[4] for storing chemical data in an open Wikibase data repository and in promoting its use for enriching Wikipedia chemistry pages and showcasing open data in undergraduate teaching environments.

References

T. Phillips, and S. Macbeth, "pumpy: Zenodo release", 2015. https://doi.org/10.5281/zenodo.19033
T. Phillips, J.H. Bannock, and J.D. Mello, "Data for microscale extraction and phase separation using a porous capillary", 2015. https://doi.org/10.6084/m9.figshare.1447208
T.W. Phillips, J.H. Bannock, and J.C. deMello, "Microscale extraction and phase separation using a porous capillary", Lab on a Chip, vol. 15, pp. 2960-2967, 2015. https://doi.org/10.1039/c5lc00430f
D. Vrandečić, and M. Krötzsch, "Wikidata", Communications of the ACM, vol. 57, pp. 78-85, 2014. https://doi.org/10.1145/2629489


Personal web pages on digital repositories.

Saturday, June 20th, 2015

The university sector in the UK has quality inspections of its research outputs conducted every seven years, going by the name of REF or Research Excellence Framework. The next one is due around 2020, and already preparations are under way! Here I describe how I have interpreted one of its strictures: that all UK funded research outputs (i.e. research publications in international journals) must be made available in open unrestricted form within three months of the article being accepted for publication, or they will not be eligible for consideration in 2020.

At the outset, I should say that one infrastructure to help researchers adhere to the guidelines is being implemented in the form of the Symplectic system. This allows a researcher to upload the final accepted version of a manuscript. At Imperial College, a digital repository called Spiral serves this purpose and also acts as the front end for collecting informative metadata to enhance discoverability. The final accepted version is then converted by the publisher into a version-of-record. This contains styling unique to the publisher and the content is subjected to further scrutiny by the authors as proof corrections. In an ideal world, these latter changes should also be faithfully propagated back to the final accepted version, as would all the supporting information associated with the article. Since most authors do not exactly enjoy the delights of proof corrections, this final reconciliation of the two versions may not always be assiduously undertaken.

I became concerned about the existence of two versions of any given scientific report and that the task of ensuring total fidelity in the content of both versions may negatively impact on the author’s time. Much better if the publisher could grant permission for the author to archive the version-of-record into a digital repository.

Some experiments were needed, and I decided to start them in reverse, by archiving my oldest publications. Since Symplectic now provides a system to do this, I began by using it. Symplectic identifies each publisher’s policies for archival, of which the most liberal are known as ROMEO GREEN. To quote from the definition, this colour allows the author to “archive pre-print and post-print or publisher’s version/PDF“. In an afternoon I had processed most of my ROMEO green articles. You know how it is sometimes, you do not read the fine print! And so the library soon informed me that archival of ROMEO GREEN was in fact only permitted on the author’s “personal web page”. Spiral, as an institutional repository, does not apparently constitute a personal web page for me and so none of my Symplectic submissions could be accepted for archival there.

Time to rethink the experiment. Firstly, I very much wanted the reprints to be held by a proper digital repository rather than a conventional web page. Why? I wanted my reprints to adhere as much as possible to FAIR: findable, accessible, interoperable and re-usable. Well, at least the first two of those (the last two relate more to data). A repository is designed to hold metadata in a formal and standards-based manner and metadata helps achieve FAIR. So I asked the Royal Society of Chemistry (as a ROMEO GREEN publisher) whether a personal web page hosted on a digital repository would qualify. I was soon informed that I had proposed a neat solution here, and they couldn’t see an issue.

Now, all I had to do was find a repository where I could create such a personal web page. The chemistry department at Imperial College has for ten years hosted a DSpace repository called SPECTRa[1] which already has the functionality for individuals to create personal collections. I had also picked up on the increasing attention being given to Zenodo, like the World-Wide Web itself an offshoot of CERN (of Large Hadron Collider fame) and born from the need for researchers to more permanently archive the outputs of their researches. These outputs include software, videos, images, presentations, posters, publications and (most obviously for CERN) datasets. I thought I would include them in my experiment as well. The results are summarised below.

	DSpace-SPECTRa	Zenodo
Community	Henry Rzepa personal web page reprint collection	Rzepa personal computational chemistry data and reprint page
Collection	Royal Society of Chemistry reprints
Publication	10042/195577	10.5281/zenodo.18758[2]
Thesis	10044/1/20860[3]	10.5281/zenodo.18777[4]
Dataset	10.14469/ch/191342[5]	10.5281/zenodo.18632[6]
Harvesting	OAI-ORE	OAI-PMH

The last line of this table includes a link to another design feature of a repository, facilitating the ability to harvest the content. The ContentMine project (“The right to read is the right to mine!“) has shown how such harvesting of facts from the literature can be automated on a vast scale, and (IMHO) represents an example of those disruptive innovations that have the power to change the world forever. It also enshrines the idea that scientific facts funded by the public purse should be capable of being openly liberated from their containers. A harvestable repository seems an ideal container for achieving this.
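To make the idea of harvesting a little more concrete, here is a minimal sketch of what an OAI-PMH harvester does: it sends a `ListRecords` request and reads the Dublin Core metadata out of the XML that comes back. The base URL `https://repository.example.org/oai` and the sample response are hypothetical placeholders (the real endpoint of any given repository should be taken from its own documentation), and the sketch parses a hand-written response rather than contacting a server.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by the OAI-PMH 2.0 protocol and by Dublin Core
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional: restrict the harvest to one set
    return base_url + "?" + urlencode(params)


def titles_and_identifiers(xml_text):
    """Return (title, identifier) for each record in a ListRecords response."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iter(OAI_NS + "record"):
        title = record.find(".//" + DC_NS + "title")
        ident = record.find(".//" + DC_NS + "identifier")
        results.append((
            title.text if title is not None else None,
            ident.text if ident is not None else None,
        ))
    return results


# Hand-written, illustrative response; NOT copied from a real server.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Hydrogen Transfer Reactions Of Indoles</dc:title>
          <dc:identifier>https://doi.org/10.5281/zenodo.18777</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://repository.example.org/oai"))
print(titles_and_identifiers(SAMPLE_RESPONSE))
```

A real harvester would fetch the URL, follow any resumption tokens and handle errors; the point here is only that the metadata is exposed in a form a program can read without any human intervention.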

My experiment is part of what might be seen as the increasingly subtle interplay between:

scientific authors, whose creative endeavour research is and without whom scientific publishers would not exist
publishers who create a business model from the content freely given them by authors but also (especially if a commercial publisher) need to be accountable to their shareholders.
the funding councils, many of whom now wish the outcomes of the research they fund to be openly available to all
the local libraries/administrators who have to adhere to/enforce all the rules contractually handed down to them by publishers whose direct customers they are, but who also need to serve their community of readers and authors.
researchers who would rather do research than fret about the above, and who would rather spend limited resources doing that research rather than diverting an increasing amount of their attention into the above system.
readers, who need unimpeded access to the research endeavours of others, but often have little influence on the policies and actions of all the other stakeholders, since they are NOT considered customers (of the publishers).
etc. etc.

My experiment was in part designed to explore these rules, their interpretations and their boundaries. For the time being at least I seem to have found an arrangement that allows me to distribute versions-of-record of my own work, thanks to a generous and far-sighted learned society publisher. Watch this space!

References

J. Downing, P. Murray-Rust, A.P. Tonge, P. Morgan, H.S. Rzepa, F. Cotterill, N. Day, and M.J. Harvey, "SPECTRa: The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories", Journal of Chemical Information and Modeling, vol. 48, pp. 1571-1581, 2008. https://doi.org/10.1021/ci7004737
H.S. Rzepa, and B.C. Challis, "The Mechanism Of Diazo-Coupling To Indoles And The Effect Of Steric Hindrance On The Rate Limiting Step", Zenodo, 1975. https://doi.org/10.5281/zenodo.18758
H.S. Rzepa, "Hydrogen transfer reactions of indoles", 1974. http://doi.org/10044/1/20860
H.S. Rzepa, "Hydrogen Transfer Reactions Of Indoles", Zenodo, 1974. https://doi.org/10.5281/zenodo.18777
H.S. Rzepa, "C 25 H 34 Cl 1 N 3 O 1", 2015. https://doi.org/10.14469/ch/191342
H.S. Rzepa, A. Lobo, M.S. Andrade, V.S. Silva, and A.M. Lourenco, "Chiroptical properties of streptorubin B – the synergy between theory and experiment.", 2015. https://doi.org/10.5281/zenodo.18632

Tags:Academia, Academic publishing, Archival science, author, Data management, Digital library, EPrints, Institutional repository, Knowledge, Knowledge representation, Library science, metadata, Open access, PDF, personal web page, Preprint, Publishing, Repository, researcher, ROMEO GREEN, Science, Technology/Internet, United Kingdom, web server
Posted in Chemical IT | No Comments »

Discovering chemical concepts from crystal structure statistics: The Jahn-Teller effect

Saturday, May 30th, 2015

I am on a mission to persuade my colleagues that the statistical analysis of crystal structures is a useful teaching tool. One colleague asked for a demonstration and suggested exploring the classical Jahn-Teller effect (thanks Milo!). This is a geometrical distortion associated with certain molecular electronic configurations, of which the best example is illustrated by octahedral copper complexes which have a d⁹ electronic configuration. The e_g level shown below is occupied by three electrons, and the complex can therefore distort in one of two ways to eliminate the e_g degeneracy by placing the odd electron into either an x²-y² or a z² orbital. Here I explore how this effect can be teased out of crystal structures.

The search is set up with Cu specified as precisely 6-coordinate, and X=oxygen. The six X-Cu distances are defined as DIST1-DIST6. The R-factor is specified as < 0.05 (no disorder, no errors). The problem now is how to plot what is in effect a six-dimensional set of data, from which we are exploring whether four of the distances are different from the other two, and whether those four are the longer or the shorter. This requires analysis beyond the capability (as far as I know) of the Conquest program, and so here I will show sets of plots showing just the relationship between any two distances at a time. Of the 15 possible combinations of two distances, only four are shown below.

Some obvious patterns can already be spotted in the 400 or so compounds which satisfy the search criteria.

The largest clustering occurs at ~1.95Å, with two clusters each of fewer hits at ~2.5Å. The Wikipedia page notes that for Cu(OH₂)₆ the Jahn-Teller distortion favours four short bonds at ~1.95Å and two long ones at ~2.38Å, which agrees approximately with the positions and sizes of the centroids of these clusters.^†
Plots 1 and 2 show very little along the diagonals, where the two plotted distances have the same value. This probably means that one of the distances relates to an equatorial ligand and the other to an axial ligand.
Plots 3 and 4 show a strong diagonal trend, and so these distances both relate to either axial or equatorial, but not one of each.
All four plots show a hot spot at ~1.95Å, which hints that the Jahn-Teller distortion is four short bonds/two long.
Plot 4 also shows a green spot at ~2.5Å which is a tantalising suggestion of examples of four long bonds/two short.^‡

Clearly this analysis can be followed up by a visual inspection of individual molecules in each cluster (as well as the outliers which appear to follow no pattern!), together with a more bespoke analysis of the six distances. Unfortunately, the spin state of the complexes cannot be quickly checked (are they all doublets?) since the database does not record this. But the basic search described above takes only a few minutes to do, and it is surprising how quickly the Jahn-Teller effect can be statistically tested with real experimental data obtained for ~400 molecules. Of course, here I have only explored X=O but this can easily be extended to X=N or X=Cl, to other metals or to alternative coordination numbers such as e.g. 4 where the Jahn-Teller effect can also in principle operate.
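One possible form such a bespoke analysis of the six distances might take is sketched below. It is not the analysis used for the plots above, merely an illustration: the six distances are sorted and the position of the largest gap decides whether the pattern is four short/two long (elongated) or two short/four long (compressed). The function name `classify_distortion` and the 0.1Å `tolerance` are my own assumptions, and the threshold in particular would need tuning against real data.

```python
def classify_distortion(distances, tolerance=0.1):
    """Classify six metal-ligand distances (in Å) by where the largest gap falls.

    tolerance is an assumed threshold: if no gap between consecutive sorted
    distances reaches it, the octahedron is treated as "regular".
    """
    if len(distances) != 6:
        raise ValueError("expected exactly six distances")
    d = sorted(distances)
    # gaps[i] is the step from the i-th to the (i+1)-th shortest distance
    gaps = [d[i + 1] - d[i] for i in range(5)]
    largest = max(gaps)
    if largest < tolerance:
        return "regular"
    split = gaps.index(largest)
    if split == 3:
        return "elongated"   # four short + two long
    if split == 1:
        return "compressed"  # two short + four long
    return "other"           # any other pattern deserves individual inspection


# Values chosen within the ranges quoted in the footnote for the
# dimethyl formamide complex (four at 1.97-1.98Å, two at 2.315Å)
print(classify_distortion([1.97, 1.98, 1.97, 1.98, 2.315, 2.315]))  # elongated
# Made-up values, purely to exercise the other branches
print(classify_distortion([1.90, 1.90, 2.30, 2.30, 2.30, 2.30]))    # compressed
print(classify_distortion([2.08] * 6))                              # regular
```

Sorting first sidesteps the problem that DIST1-DIST6 carry no axial or equatorial label in the search results, which is what forces the pairwise plots to be interpreted one at a time.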

^‡ One genuine example of this type, also called compressed octahedral coordination, was reported for the species CuFAsF₆ and CsCuAlF₆[1]

^† The measured geometry of Cu(H₂O)₆ may in fact manifest with six equal Cu-O bond lengths due to the dynamic Jahn-Teller effect, because the kinetic barrier separating one Jahn-Teller distorted form and another (equivalent) isomer is small and hence averaged atom positions are measured which mask the effect. Thus the Jahn-Teller effects shown in the plots above may be under-estimated because of this dynamic masking. Reducing the temperature of the sample at which data was collected would reduce this dynamic effect. Indeed, Cu(D₂O)₆ collected at 93K shows a very clear Jahn-Teller distortion[2] with four short bonds ranging from 1.97-1.99Å and two long bonds 2.37-2.39Å.[3] Another example measured at 89K with dimethyl formamide replacing water and coordinated via oxygen[4] shows four short (1.97-1.98Å) and two long (2.315Å) bonds. This latter example is also noteworthy because this analysis is as yet unpublished in a journal, but the data itself has a DOI via which it can be acquired. A nice example of modern research data management!

References

Z. Mazej, I. Arčon, P. Benkič, A. Kodre, and A. Tressaud, "Compressed Octahedral Coordination in Chain Compounds Containing Divalent Copper: Structure and Magnetic Properties of CuFAsF₆ and CsCuAlF₆", Chemistry – A European Journal, vol. 10, pp. 5052-5058, 2004. https://doi.org/10.1002/chem.200400397
W. Zhang, L. Chen, R. Xiong, T. Nakamura, and S.D. Huang, "New Ferroelectrics Based on Divalent Metal Ion Alum", Journal of the American Chemical Society, vol. 131, pp. 12544-12545, 2009. https://doi.org/10.1021/ja905399x
W. Zhang, L. Chen, R. Xiong, T. Nakamura, and S.D. Huang, "CCDC 755150: Experimental Crystal Structure Determination", 2010. https://doi.org/10.5517/cctbspl
M.M. Olmstead, D.S. Marlin, and P.K. Mascharak, "CCDC 1053817: Experimental Crystal Structure Determination", 2015. https://doi.org/10.5517/cc14cl36

Tags:basic search, Chemical bond, chemical bonding, Chemistry, classical Jahn-Teller, clear Jahn-Teller, Coordination chemistry, Coordination complex, Copper(II) nitrate, dynamic Jahn-Teller, Edward Teller, Inorganic chemistry, Jahn-Teller, Jahn–Teller effect, Metal ions in aqueous solution, search criteria, Technology/Internet, Transition metals
Posted in Chemical IT, crystal_structure_mining | 1 Comment »

R-X≡X-R: G. N. Lewis’ 100 year old idea.

Friday, May 22nd, 2015

As I have noted elsewhere, Gilbert N. Lewis wrote a famous paper entitled “the atom and the molecule“, the centenary of which is coming up.[1] In a short and rarely commented upon remark, he speculates about the shared electron pair structure of acetylene, R-X≡X-R (R=H, X=C). It could, he suggests, take up three forms: H-C:::C-H and two more, which I show as he drew them. The first of these two would now be called a bis-carbene and the second a biradical.

In 1916, it was too early for Lewis to speculate what the geometries of such species might be, and in particular the C…C (or generalising, X…X) distance, and the two angles, one for each X. Well, we do not need to speculate, we can perform a search of the crystal structure database. Here it is (R < 0.05, no errors, no disorder):

A little more explanation of this 4-dimensional plot is needed:

The two angles are plotted as X and Y.
The X…X distance is plotted as colour, with red representing the longest distances and blue the shortest
The size of each “bin” is represented by the radius of the circle; small circles represent few examples, larger circles represent more examples in each “bin” defined by a regular range of angles.
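The binning behind such a plot can be sketched in a few lines. This is only an illustration of the idea, not how the search software does it: the function name `bin_angles`, the 10° bin width and the choice of the mean distance as the quantity mapped to colour are all my assumptions, and the three hits at the end are made-up numbers.

```python
from collections import defaultdict


def bin_angles(hits, bin_width=10.0):
    """Group (angle1, angle2, distance) hits into regular two-angle bins.

    Returns {(bin1_start, bin2_start): (count, mean_distance)}; count would
    set the circle radius and mean_distance the colour.
    """
    sums = defaultdict(lambda: [0, 0.0])  # key -> [count, summed distance]
    for angle1, angle2, distance in hits:
        key = (
            int(angle1 // bin_width) * bin_width,
            int(angle2 // bin_width) * bin_width,
        )
        sums[key][0] += 1
        sums[key][1] += distance
    return {key: (n, total / n) for key, (n, total) in sums.items()}


# Made-up hits: two near-linear examples and one with both angles near 109°
hits = [(178.0, 179.5, 1.20), (175.0, 171.0, 1.22), (109.0, 110.0, 2.40)]
for key, (count, mean_distance) in sorted(bin_angles(hits).items()):
    print(key, count, round(mean_distance, 2))
```

Each key is the lower edge of a bin on each angle axis, so the two near-linear hits fall into the same (170°, 170°) bin and would appear as one larger circle.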

There are one or two off-diagonal “outliers”, each of which probably deserves individual inspection. But dealing just with the obvious clusters, the overwhelmingly largest is for both angles of ~180°, and these are the triple bonds we know and love. As far as I know, Lewis was the first to propose a triple bond between two atoms, but if anyone reading this blog knows of an antecedent, do let me know. The next cluster is for angles of ~109° and these are clearly bis-carbenes. These all occur when X ≠ C. There are two small clusters worthy of note; one ~130° and one ~90°. The latter are mostly Pb-Pb and Sn-Sn, where the bonding is unhybridised pure p.

One of the limitations of searching for crystal structures is that the spin state of each molecule is never given. The biradical structure given by Lewis could well have a triplet ground state, and perhaps that might have very characteristic angles (~130° ?). It would be great to identify a genuine example of this biradical form!

As usual, the search itself took around 10 minutes, and it provides much interesting food for thought; not bad for a 100-year-old idea!

References

G.N. Lewis, "THE ATOM AND THE MOLECULE.", Journal of the American Chemical Society, vol. 38, pp. 762-785, 1916. https://doi.org/10.1021/ja02261a002

Tags:Carbene, Carbenes, Chemistry, Cluster chemistry, food, Functional groups, Gilbert N. Lewis, Non-Kekulé molecule, Organic chemistry, Organic compounds
Posted in Chemical IT, Historical, Interesting chemistry | 1 Comment »

Impact factors, journals and blogs: a modern distortion.

Thursday, May 21st, 2015

A lunchtime conversation with a colleague had us both bemoaning the distorting influence on chemistry of bibliometrics, h-indices and journal impact factors, all very much a modern phenomenon of scientific publishing. Young academics on a promotion fast-track for example are apparently advised not to publish in a well-known journal devoted to organic chemistry because of its apparently “low” impact factor. Chris suggested that the real reason the impact factor was “low” is that this particular journal concentrates on full articles, which for a subject area such as organic chemistry can take years to assemble. It then takes others years to assimilate them and report their own results, and only at that point is a citation to the first article created. So this slow but steady evolution of citations in a long time frame apparently shows such a journal up as having less (short-term) impact than the fast-publishing notes-type variety where the impact is immediate but possibly less long-lived. That would be no reason of itself not to publish there of course!

Most would describe a blog as an ultimate medium for short-term publishing (shortened only by e.g. Twitter). I began to wonder what the statistics for this particular blog would show. So I looked at the time lines for the five most read posts, ranked below in terms of total hits. The oldest (and second most popular) is exactly six years old and so represents a reasonably long evolutionary time frame. The graphs below show the daily hits (red bars are annual). The immediate impact of each lasts less than a week, but the long-term analysis shows that each accumulated its total not by such immediate impact but by long-term accretion. For most, the first derivatives are still on the increase. This might all come as a surprise to those who tend to regard scientific blogs as having only short-term impact. But it would also be true to say that chemistry operates not on a time scale of days or years but of centuries, and so it will take a little while longer to assess impacts on that scale.

What of course is not measured by simply integrating total views over time is what purpose if any each viewing serves. This is as true of journal articles as it is of blogs. And a viewing is not quite the same as a citation (although the latter does not always imply a viewing!). But it is tempting to conclude that we have all become far too fixated on short-term impacts and the bibliometrics that provide this information.

Tags:Academia, Academic publishing, Bibliometrics, Impact, Impact factor, Knowledge, Publishing, TWITTER INC.
Posted in Chemical IT | 1 Comment »
