Chemical IT « Henry Rzepa's blog

Archive for the ‘Chemical IT’ Category

Data-free research data management? Not an oxymoron.

Tuesday, May 24th, 2016

I occasionally post about "RDM" (research data management), an activity that has recently become a formalised and essential part of the research process. I say recently formalised, since researchers have of course kept research notebooks recording their activities and their data since the dawn of science, but not always in an open and transparent manner. The desirability of doing so was revealed by the 2009 "Climategate" events. In the UK, Climategate was apparently the catalyst which persuaded the funding councils (such as the EPSRC, the Royal Society, etc.) to formulate policies which required all their funded researchers to adopt the principles of RDM by May 2015 and in their future research. An early career researcher here, anxious to conform to the funding body instructions, sent me an email a few days ago asking about one aspect of RDM which got me thinking.

The question related to the divide between data as a separate research object (and which therefore has to be managed), and data as an inseparable part of the article narrative, which is of course ostensibly managed by the journal publication processes. Such data may often be the description of a process rather than simply tables of numbers or graphs. In chemistry it may include chemical names and chemical terms as part of an experimental procedure. For one nice illustration of such embedded data, go look at the chemical tagger page. Here the data is blending with the semantics, and the two are not easily separated. So, when such separation is not easily achieved, should the specific processes required by RDM as illustrated in the five bullet points below actually be followed?

1. Specify a data management plan to be followed, as for example points 2-5 below.
2. Decide upon a location for your data, separated into one for "live" or working data (the purpose simply being to ensure it is properly backed up) and the other for a sub-set of formally "published data" which has to be available for at least ten years after its publication.
3. Use 2 to gather metadata (see 6-14 below) and in return get a DOI representing the location of the combined metadata + data, from a suitable registration authority such as DataCite.
4. Quote this DOI(s) in the article describing the results of analysing the data and presenting hypotheses, and conversely once the article itself is allocated its own DOI from a registration authority such as CrossRef, update the metadata in item 3 so as to achieve a bidirectional link between the data and its narrative (and we assume that DataCite and CrossRef will also increasingly exchange the metadata they each hold about the items).
5. Add both the data and the article DOIs to any institutional CRIS or current research information system (parenthetically, I regard this last stipulation as rather redundant if items 3 and 4 are working effectively, but it's a good interim measure whilst the overall system matures).

So, should step 2 be included if the data itself is inextricably intertwined with the narrative and cannot be separated? The slightly surprising advice I would suggest is yes! And the answer is that it IS possible to generate metadata (data about the, possibly entwined, data) which CAN be processed in such a step. What forms would such metadata take?

6. Identification of the researcher(s) involved. This would nowadays take the form of an ORCID (Open Researcher and Contributor ID).
7. Identification of the hosting institution where the data has been produced. There is currently no equivalent to an ORCID for institutions, but it is very likely to come in the future.
8. A date stamp formalising when the (meta)data is actually deposited.
9. A title for the project being described. Here we see a blurring between the narrative/article and the data; a title is the shortest possible description of the narrative/article, and it may also apply to the data object(s) or it could have its own title.
10. A slightly fuller abstract of the project being described. Here we see further blurring between the narrative and data objects.
11. One can include "related identifiers", in particular the DOIs of any other relevant articles that might have been published which may expand the context of the data, and also the DOIs of any other relevant datasets which may have been allocated in step 2 above.
12. It is also beneficial to include "chemical identifiers". These can take the form of InChI strings and InChI keys, which allow discretely defined molecular objects which were the object of the research to be tracked and which relate to both the narrative and any other data objects.
13. If specific software has been used to analyse data, it too can be included as a "related identifier" (e.g. [1]).
14. Potentially at least, if a well-defined instrument has been involved, it too could be included with its own "related identifier". With both 13 and 14, other issues may need addressing, such as versioning etc., but this no doubt will be sorted in due course.
15. etc.

So items such as 6-14 can be collected and sent to e.g. DataCite with a DOI received in return as part of item 2 of the RDM processes. No "pure" data need be involved, only metadata. Nonetheless such metadata can only increase the visibility and discoverability of the research, as illustrated in how such metadata can be searched for the components described above.
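As a rough illustration of items 6-14, the sketch below assembles a metadata-only record as a Python dictionary. The field names are loosely modelled on the DataCite metadata schema, but the exact layout, the placeholder DOI, ORCID, names and title, and the choice of carrying InChI identifiers as "subjects" are all assumptions made for illustration; only the software DOI (10.5281/zenodo.19272) comes from this post.

```python
import json


def build_metadata_record(doi, creator_name, orcid, institution, deposited,
                          title, abstract, related, inchi, inchikey):
    """Assemble a metadata-only record; no "pure" data is included.

    `related` is a list of (doi, relation_type) pairs covering related
    articles, datasets, software or instruments (items 11, 13 and 14).
    """
    return {
        "identifier": {"identifierType": "DOI", "identifier": doi},
        # Item 6: the researcher, identified by ORCID.
        "creators": [{
            "creatorName": creator_name,
            "nameIdentifier": {"nameIdentifierScheme": "ORCID",
                               "nameIdentifier": orcid},
        }],
        # Item 7: the hosting institution (free text, no identifier yet).
        "publisher": institution,
        # Item 8: date stamp of the deposition.
        "dates": [{"dateType": "Submitted", "date": deposited}],
        # Items 9 and 10: title and abstract.
        "titles": [{"title": title}],
        "descriptions": [{"descriptionType": "Abstract",
                          "description": abstract}],
        "relatedIdentifiers": [
            {"relatedIdentifierType": "DOI", "relationType": relation,
             "relatedIdentifier": related_doi}
            for related_doi, relation in related
        ],
        # Item 12 (assumption): chemical identifiers carried as subjects.
        "subjects": [
            {"subjectScheme": "inchi", "subject": inchi},
            {"subjectScheme": "inchikey", "subject": inchikey},
        ],
    }


record = build_metadata_record(
    doi="10.5072/example-1",              # hypothetical placeholder DOI
    creator_name="Researcher, A.",        # hypothetical
    orcid="0000-0000-0000-0000",          # hypothetical placeholder ORCID
    institution="Example University",     # hypothetical
    deposited="2016-05-24",
    title="Example project title",        # hypothetical
    abstract="A slightly fuller description of the project.",
    related=[("10.5281/zenodo.19272", "References")],  # software cited in this post
    inchi="InChI=1S/H2O/h1H2",            # water, used purely as an example
    inchikey="XLYOFNOQVPJJNP-UHFFFAOYSA-N",
)
print(json.dumps(record, indent=2))
```

Nothing in such a record is "pure" data, yet every field in it could be aggregated and searched.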

References

1. H.S. Rzepa, "KINISOT. A basic program to calculate kinetic isotope effects using normal coordinate analysis of transition state and reactants.", 2015. https://doi.org/10.5281/zenodo.19272


What is the approach trajectory of enhanced (super?) nucleophiles towards a carbonyl group?

Wednesday, May 11th, 2016

I have previously commented on the Bürgi–Dunitz angle, this being the preferred approach trajectory of a nucleophile towards the electrophilic carbon of a carbonyl group. Some special types of nucleophile such as hydrazines (R₂N-NR₂) are supposed to have enhanced reactivity[1] due to what might be described as buttressing of adjacent lone pairs. Here I focus on how this might manifest itself by performing searches of the Cambridge structural database for intermolecular (non-bonded) interactions between X-Y nucleophiles (X,Y= N,O,S) and carbonyl compounds OC(NM)₂.

The search query[2] is shown above and involves plotting the distance from the nucleophilic atom (N above) to the carbon of the carbonyl group. The carbon is defined as 3-coordinate, with one attachment being the carbonyl oxygen (O=C) and the other two being non-metals (NM). The torsion is constrained to values of |70-110|° to ensure that the approach of the nucleophile is approximately perpendicular to the plane of the carbonyl in order to overlap with the π*-orbital as electrophile. The pairwise sums of van der Waals radii are NC, 3.25; OC, 3.22; and SC, 3.5 Å, and the plots show all contacts shorter than these. The results of the searches are shown below.
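To make the geometric quantities concrete, here is a minimal sketch of how a single nucleophile...carbonyl contact might be measured from Cartesian coordinates: the Nu...C distance, the Nu...C=O approach angle, and a test against the van der Waals sums quoted above. The function names and the example coordinates are invented for illustration, and the torsion constraint is left out because the four atoms defining it are not spelled out in the text.

```python
import math

# Pairwise van der Waals sums quoted in the text (in Å), keyed by the
# element of the nucleophilic atom approaching the carbonyl carbon.
VDW_SUM_TO_CARBON = {"N": 3.25, "O": 3.22, "S": 3.5}


def angle_deg(a, b, c):
    """Angle a-b-c in degrees, with b at the vertex."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    cos_value = dot / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_value))))


def describe_contact(element, nucleophile, carbon, oxygen):
    """Return distance, approach angle and whether the contact is 'short'."""
    d = math.dist(nucleophile, carbon)
    return {
        "distance": d,
        "approach_angle": angle_deg(nucleophile, carbon, oxygen),
        "short_contact": d < VDW_SUM_TO_CARBON[element],
    }


# Invented example: C at the origin, O along +x, and a nitrogen 3.0 Å away
# approaching at 100° to the C=O bond.
carbon = (0.0, 0.0, 0.0)
oxygen = (1.22, 0.0, 0.0)
theta = math.radians(100.0)
nitrogen = (3.0 * math.cos(theta), 0.0, 3.0 * math.sin(theta))

contact = describe_contact("N", nitrogen, carbon, oxygen)
print(contact)
```

A real analysis would of course take its coordinates from the crystal structure search rather than from a constructed example like this one.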

The general observation is that the red hotspots do tend to come at trajectory angles of <100° and many are <90° such as the X=Y=N or X=Y=S examples. Given that the original Bürgi–Dunitz hypothesis (actually based on a small number of molecules synthesized for the purpose) proposed rather larger angles (105±5°) corresponding to optimum alignment of the nucleophile with the carbonyl π*-orbital, we might speculate whether the use of enhanced nucleophiles is the reason for the apparent decrease in the angle and, if so, what the underlying reasons would be.

I also cannot help but observe that the term supernucleophile is quite rare in the literature; SciFinder gives only 45 hits, but most are about neither hydrazines nor peroxides. There are also some unusual nucleophile varieties such as Cob(I)alamin[3], of which there are probably insufficient examples to reflect in the crystal structure statistics shown above. Given the interest in superbases, the relative lack of examples of unusual supernucleophiles seems surprising.

References

1. G. Klopman, K. Tsuda, J. Louis, and R. Davis, "Supernucleophiles—I", Tetrahedron, vol. 26, pp. 4549-4554, 1970. https://doi.org/10.1016/s0040-4020(01)93101-1
2. H. Rzepa, "Crystal structure search using enhanced nucleophiles", 2016. https://doi.org/10.14469/hpc/487
3. K.P. Jensen, "Electronic Structure of Cob(I)alamin: The Story of an Unusual Nucleophile", The Journal of Physical Chemistry B, vol. 109, pp. 10505-10512, 2005. https://doi.org/10.1021/jp050802m


Collaborative FAIR data sharing.

Sunday, April 17th, 2016

I want to describe a recent attempt by a group of collaborators to share the research data associated with their just published article.[1]

I am here introducing things in a hierarchical form (i.e. not necessarily the serial order in which actions were taken).

1. The data repository selected for the data sharing is described by (m3data) doi: 10.17616/R3K64N[2]
2. A collaborative project collection was established on this repository (doi: 10.14469/hpc/244[3]). This data collection has some of the following attributes:
  - Its metadata is sent here: https://search.datacite.org/ui?&q=10.14469/hpc/244 where it can be queried for other details.
  - The project collaborators are all identified by their ORCID, which can be used to obtain further individual information about the researchers. This information is also propagated to the metadata sent to DataCite.
  - In the section labelled associated DOIs there is a link to the recently published peer-reviewed article, which itself cites the data via doi: 10.14469/hpc/244 and which thus establishes a bidirectional link between the article and its data.
  - Also in the associated DOIs section are other DOIs (to two figures and two tables) held in a separate location. One example (doi: 10.14469/hpc/332[4]) illustrates the original type of data sharing we started about 10 years ago. This form has been variously called a "WEO" or Web-enhanced object (by the ACS) or interactivity boxes (RSC, etc). In such WEOs, we wrap the data into an interactive visual appearance using Jmol or JSmol software. The data itself is directly available to the reader using the Jmol export functions (right mouse click in the visual window).
    - In this specific example the WEO has been assigned its DOI using the repository noted above.[2]
    - We have in the past also used Figshare[5] for this purpose, see e.g. 10.6084/m9.figshare.1181739^‡
    - The WEO can itself reference a more complete set of data used to create the visual appearance, for example data that allows the wavefunction of the molecule to be computed, doi: 10.6084/m9.figshare.2581987.v1[6] In this instance this is held on the Figshare[5] repository.
  - The collection has another section labelled Members. These are individual datasets associated with the collection and held on the SAME repository as the collection itself. In this case, there are five such members, two of which are listed below:
    1. 10.14469/hpc/281[7] contains a variety of other data such as outputs from an IRC (intrinsic reaction coordinate), energy profile diagrams and ZIP archives of other calculations.
    2. 10.14469/hpc/272[8] itself contains five members, one of which is e.g.
      - 10.14469/hpc/267[9] which contains a ZIP archive with NMR data (see here for how this might be packaged in the future) and a file for a GPC (chromatography) instrument.
      - This last item also contains a new section labelled Metadata, which includes e.g. the InChI key and InChI string for the molecule whose properties are reported.
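The hierarchy just described can be pictured as a small tree of DOIs. The sketch below is only a toy model of that structure, not the repository's actual data model: the `Entry` class and `walk` function are names invented here, and only the members explicitly named in this post are included.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One repository entry: a DOI, its member datasets and associated DOIs."""
    doi: str
    note: str = ""
    members: list = field(default_factory=list)      # held on the same repository
    associated: list = field(default_factory=list)   # DOIs held elsewhere


def walk(entry, depth=0):
    """Yield (depth, doi) for an entry and, recursively, all its members."""
    yield depth, entry.doi
    for member in entry.members:
        yield from walk(member, depth + 1)


collection = Entry(
    doi="10.14469/hpc/244",
    note="collaborative project collection",
    associated=["10.1021/jacs.5b13070", "10.14469/hpc/332"],
    members=[
        Entry("10.14469/hpc/281", "IRC outputs, energy profiles, ZIP archives"),
        Entry("10.14469/hpc/272", "Table 1", members=[
            Entry("10.14469/hpc/267", "NMR and GPC data, InChI metadata"),
        ]),
    ],
)

for depth, doi in walk(collection):
    print("  " * depth + doi)
```

Each node in such a tree carries its own DOI, so every level can be cited, date-stamped and attributed separately.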

If this mode of presenting data seems a little more complex than a single monolithic PDF file, it's because it's designed for:

collaboration between scientists, potentially at different locations and institutions.
attribution of provenance/credit for the individual items (via ORCID).
separate date stamping by the various contributors.
providing bi-directional links between data and publications.
holding what we call FAIR (findable, accessible, interoperable and reusable) data, rather than just data encapsulated in a PDF file.
collecting, storing and sending metadata for aggregation in a formal way, i.e. to DataCite using a formal schema to render the metadata properly searchable.

Thus 10.14469/hpc/244 represents our most complex attempt yet at such collaborative FAIR data sharing with multiple contributors. The tools for packaging many of the datasets are still quite limited (see again here) and the design is still being optimised (call it α). When the repository[2] has been more extensively tested, we intend to make it available as open source for others to experiment with. And of course, when this happens the source code too will have its own DOI!

^‡A refactoring of the Figshare site in December 2015 meant that the DOI no longer points directly to the WEO, and you have to follow a manually inserted link on that page to see it.

References

1. C. Romain, Y. Zhu, P. Dingwall, S. Paul, H.S. Rzepa, A. Buchard, and C.K. Williams, "Chemoselective Polymerizations from Mixtures of Epoxide, Lactone, Anhydride, and Carbon Dioxide", Journal of the American Chemical Society, vol. 138, pp. 4120-4131, 2016. https://doi.org/10.1021/jacs.5b13070
2. Re3data.Org., "Imperial College Research Computing Service Data Repository", 2016. https://doi.org/10.17616/r3k64n
3. C. Romain, "Chemo-Selective Polymerizations Using Mixtures of Epoxide, Lactone, Anhydride and CO2", 2016. https://doi.org/10.14469/hpc/244
4. H. Rzepa, "Table S8: Comparison of two different basis sets for selected intermediates for CHO/PA ROCOP.", 2016. https://doi.org/10.14469/hpc/332
5. Re3data.Org., "figshare", 2012. https://doi.org/10.17616/r3pk5r
6. P. Dingwall, "Gaussian Job Archive for C6H10O", 2016. https://doi.org/10.6084/m9.figshare.2581987.v1
7. C. Romain, "Figure 9, Figure S18, Figure S19: ROCOP of PA/CHO + IRC", 2016. https://doi.org/10.14469/hpc/281
8. C. Romain, "Table 1 : Polymerizations Using Lactone, Epoxide, and CO2", 2016. https://doi.org/10.14469/hpc/272
9. C. Romain, "Table 1, entry 1 : Polymerizations Using Lactone, Epoxide, and CO2", 2016. https://doi.org/10.14469/hpc/267


Metametadata: data about data about (chemical) data.

Saturday, April 16th, 2016

Scientists are familiar with the term data, at least in a scientific or chemical context, but appreciating metadata (the prefix meta meaning "after" or "beyond") is slightly more subtle, in the sense of using it to mean data about data. The challenge lies in clarifying where the boundary between data and its metadata lies and in specifying and controlling the vocabulary used for these metadata descriptions. Items in a chemical metadata dictionary might include e.g. subject classifications such as Organic Molecular Chemistry or identifiers such as the InChI key. But what could metametadata be? Here I briefly show some examples by way of illustration.

Let me start by defining a data repository as a store of both data and the metadata describing it. The metadata is to be exposed in a standard manner which allows it to be aggregated by other agencies. Nowadays, it is becoming common to identify such a data object together with its metadata using a persistent identifier, or DOI. But to decide if any particular repository and the data objects contained therein is generally useful to you, you need information about the metadata itself. Technically, this is defined using a schema[1] describing the metadata (which might e.g. identify any dictionaries used); hence metametadata. Now you need to store the metametadata and so I introduce the concept of a registry which does this. This metametadata object is itself assigned a DOI^‡ and here I list these DOIs for a personal selection of some chemically oriented examples, in this case deriving from the largest registry of research data repositories re3data.org. You can search for your own entry at their site: http://service.re3data.org/search.

Data repository	Repository metametadata DOI^♣
Figshare	10.17616/R3PK5R[2]
Zenodo	10.17616/R3QP53[3]
Cambridge Structural Database	10.17616/R36011[4]
Crystallography Open Database	10.17616/R37S31[5]
Oxford University Research Archive	10.17616/R3Q056[6]
Open Notebook Science	10.17616/R3859D[7]
UsefulChem	10.17616/R3Z89N[8]
Chemotion	10.17616/R34P5T[9]
ChemSpider	10.17616/R38P4P[10]
Chemical Database Service	10.17616/R36P42[11]
Imperial College HPC data repository	r3d100011965[12],[13]
Imperial College SPECTRa repository[14]	10.17616/R30316[15]

Not all of the repositories listed in the table above assign formal DOIs to their data collections, meaning that the metadata for their entries cannot be aggregated in a searchable manner using e.g. search.datacite.org/ui (or search.datacite.org/api for the machine version). Currently, the metametadata does not fully carry this information, an aspect which I gather will be rectified in a future revision of the re3data schema.[1]

Importantly, both metadata and (repository) metametadata can be searched using APIs (application programming interfaces), ensuring that the entire flow of meta information can be subject to automated software analysis rather than just visual inspection by a human. This should allow a rich and open infrastructure for handling research objects or data to be built up using hierarchical metadata. The examples above indeed show that the chemical space is already the largest component of the Natural Sciences space.
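As a hedged sketch of what such automated access might look like, the code below builds query URLs for a DataCite-style REST API and extracts titles from a response. The base URL `https://api.datacite.org` is an assumption on my part (it is not the `search.datacite.org/api` address mentioned above), the helper names are invented, and the sample response is hand-written to illustrate the assumed shape rather than copied from a real reply. No network request is made.

```python
from urllib.parse import quote, urlencode

API_BASE = "https://api.datacite.org"  # assumption: endpoint not named in this post


def doi_record_url(doi):
    """URL at which the metadata for a single DOI would be requested."""
    return f"{API_BASE}/dois/{quote(doi, safe='/')}"


def search_url(query, page_size=10):
    """URL for a free-text search over aggregated metadata."""
    return f"{API_BASE}/dois?" + urlencode({"query": query,
                                            "page[size]": page_size})


def titles_from_response(payload):
    """Collect every title found in a JSON search response (assumed layout)."""
    return [
        title["title"]
        for item in payload.get("data", [])
        for title in item.get("attributes", {}).get("titles", [])
    ]


# Hand-written sample in the assumed response layout, for illustration only.
sample_response = {
    "data": [
        {"id": "10.17616/r3pk5r",
         "attributes": {"doi": "10.17616/r3pk5r",
                        "titles": [{"title": "figshare"}]}},
        {"id": "10.17616/r3qp53",
         "attributes": {"doi": "10.17616/r3qp53",
                        "titles": [{"title": "Zenodo"}]}},
    ]
}

print(doi_record_url("10.17616/r3pk5r"))
print(search_url("10.14469/hpc/244"))
print(titles_from_response(sample_response))
```

The point is simply that each step (forming the query, retrieving the record, picking out a field) can be done by software with no human reading a web page.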

Although the edifice is still largely in its infancy, already I think we can start to see an alternative open approach emerging to "Googling" for data, or the even older traditional bespoke (i.e. non-open) services offered by commercial human-based abstractors of chemical metadata.

^‡This DOI is information about the metametadata, and hence it is metametametadata, or m3data. Sorry!

^♣The citations at the foot of this post are generated entirely automatically (by a WordPress plugin called Kcite) from the m3data associated with each entry, i.e. the DOI listed. Were the persistent identifier for the entry ever to be changed, this would propagate automatically to the citation, unlike the static entries in the table.

References

1. J. Rücknagel, P. Vierkant, R. Ulrich, G. Kloska, E. Schnepf, D. Fichtmüller, E. Reuter, A. Semrau, M. Kindling, H. Pampel, M. Witt, F. Fritze, S. Van De Sandt, J. Klump, H. Goebelbecker, M. Skarupianski, R. Bertelmann, P. Schirmbacher, F. Scholze, C. Kramer, C. Fuchs, S. Spier, and A. Kirchhoff, "Metadata Schema for the Description of Research Data Repositories", 2015. https://doi.org/10.2312/re3.008
2. Re3data.Org., "figshare", 2012. https://doi.org/10.17616/r3pk5r
3. Re3data.Org., "Zenodo", 2013. https://doi.org/10.17616/r3qp53
4. Re3data.Org., "The Cambridge Structural Database", 2013. https://doi.org/10.17616/r36011
5. Re3data.Org., "Crystallography Open Database", 2013. https://doi.org/10.17616/r37s31
6. Re3data.Org., "Oxford University Research Archive", 2014. https://doi.org/10.17616/r3q056
7. Re3data.Org., "ONSchallenge", 2013. https://doi.org/10.17616/r3859d
8. Re3data.Org., "UsefulChem", 2014. https://doi.org/10.17616/r3z89n
9. Re3data.Org., "chemotion", 2013. https://doi.org/10.17616/r34p5t
10. Re3data.Org., "ChemSpider", 2013. https://doi.org/10.17616/r38p4p
11. Re3data.Org., "Chemical Database Service", 2012. https://doi.org/10.17616/r36p42
12. https://doi.org/
13. H. Rzepa, "Imperial College High Performance Computing Service Data Repository Metadata Schema", 2016. https://doi.org/10.14469/hpc/382
14. J. Downing, P. Murray-Rust, A.P. Tonge, P. Morgan, H.S. Rzepa, F. Cotterill, N. Day, and M.J. Harvey, "SPECTRa: The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories", Journal of Chemical Information and Modeling, vol. 48, pp. 1571-1581, 2008. https://doi.org/10.1021/ci7004737
15. Re3data.Org., "SPECTRa Project", 2013. https://doi.org/10.17616/r30316


Publishing embargoes.

Wednesday, April 13th, 2016

Publishing embargoes seem a relatively new phenomenon, probably starting in areas of science when the data produced for a scientific article was considered more valuable than the narrative of that article. However, the concept of the embargo seems to be spreading to cover other aspects of publishing, and I came across one recently which appears to take such embargoes into new and uncharted territory.

One example (there are many others) of embargoes continuing to operate in the era of open science and open data relates to crystallographically derived coordinates for macromolecules. Biomolecular structures are allowed to be embargoed for a maximum of one year before becoming openly available or “released” (considered a friendlier term than embargo). A more recent phenomenon is of embargoes on press releases which may be prepared by authors and/or publishers to accompany the appearance of any article considered especially newsworthy. The publisher will then request that the press release is only released to coincide with the actual publication time and date of the article itself. Both of these types of embargo are more or less accepted by both parties. But in the last five years or so, new types of embargo have been introduced and it is these I want to discuss here.

The self-archive or “green open access” version of an article, in the form of the last author version of an accepted manuscript prior to copy-editing and other operations by a publisher. Such Green OA versions are now a mandatory requirement from funders (in the UK), arising from the need to conduct a “REF” or research excellence framework assessment of all (UK) universities every seven years or so. In order to allow assessors and funding councils unencumbered access to these research outputs, the authors must self-archive their publications in a suitable institutional repository. In general therefore, there should always exist two versions of any scientific paper authored within these guidelines, the AV (author version) and VoR (Version of Record, held by the publisher, and carrying the guarantee of peer review). Publishers now embargo author versions until the VoR version has been published, and sometimes even up to 18 months beyond this period.
The “supporting information” or SI embargo. This is closely related to the crystallographic data embargo noted above, but it applies in general to most other data and information associated with an article. Until very recently, most SI was in fact handled by the publisher themselves, and so it was released at the same time as the article. Since it is becoming more common to deposit data and SI in a separate repository, some publishers mandate that the release dates of this material must not precede the article itself. Deposition of such data has also become a mandatory requirement from (UK) funders since May 2015, and I have blogged about such “research data management” often here. In effect, both the scientific article and the data supporting it achieve their own DOIs or persistent digital identifiers, allowing easy and independent access to either the article OR its data. In fact, assigning such a DOI has a more subtle effect; creating a DOI means that metadata describing the object is also created and then aggregated by the agency issuing the DOI such as CrossRef and DataCite. Importantly, one should note that SI which is handled purely by the publisher will not have its own separate DOI and it will not have its own metadata. The data metadata for example can include the DOI for the article, and vice versa. I have shown examples of the utility of such metadata for data in an earlier post.
So now we come to the most recent embargo, which has surfaced since around May 2015, as increasingly data has become a first class object in its own right with its own DOI and importantly its own metadata. There is now evidence that some publishers are requesting that this very metadata about data is also subjected to an embargo, not to be released before the article which makes use of that data is itself released. So data can be deposited in “dark form” prior to a publication, but the metadata (which carries the date stamp and provenance for the deposition) may have to be “dark” or embargoed. Actually, this is not yet very common; for example I asked the Royal Society of Chemistry what their policy was, with the reply “the Royal Society of Chemistry wouldn’t require metadata about the data files to be embargoed”.

We live in an era where the very careers of researchers can be determined by their claim to priority about scientific discoveries. The date stamps for priority continue to be largely controlled and issued by publishers and some may decide that it will be in their business interests to extend their control to data. Perhaps they may even wish to control all aspects of publication including the data and its metadata, acting as self-proclaimed research facilitators.

At this moment, this has not happened; both data and its metadata can remain open and FAIR. Which is where I think we should go in the future in the interests of open science itself.


Celebrating Paul Schleyer: searching for hidden treasures in the structures of metallocene complexes.

Saturday, April 2nd, 2016

A celebration of the life and work of the great chemist Paul von R. Schleyer was held this week in Erlangen, Germany. There were many fantastic talks given by some great chemists describing fascinating chemistry. Here I highlight the presentation given by Andy Streitwieser on the topic of organolithium chemistry, also a great interest of Schleyer's over the years. I single this talk out since I hope it illustrates why people still get together in person to talk about science.

The presentation focused on the structure of the simplest possible metallocene, lithium cyclopentadienyl and why the calculated structure showed that the hydrogen atoms attached to the cyclopentadienyl ring pointed slightly away from the metal rather than towards it (by ~1-2°).^† Various explanations had been put forward, some had waxed and then waned. It was still basically an open problem. Now, the title of the symposium was Theory and Experiment: A Meeting at the Interface; Streitwieser had given the theory and whilst listening, I realised I might be able to help relate this to known experiments, i.e. crystal structure data. I could do so by analysing the known crystal structures of metallocenes.[1] So here is the basic search query, and I will go through it thus:

A general ring is defined (sizes 4,5,6,7,8) and the ring and metal-C bonds are all specified as of type "any" (it is difficult to know how such bonds might be classified, i.e. delocalised, aromatic, etc., so best not to constrain things) and a metal is attached.
4M is basically any metal; again the search is unconstrained, but one could focus on certain columns of the periodic table if one wished.
A ring centroid is computed.
ANG1 is defined as the angle H-C-centroid, the angle of interest in Andy's talk. The limits were constrained to lie between 140° and 179°. I did this because when the angle becomes 180°, the torsion becomes mathematically undefined and I did not want to risk this happening.
TOR1 is defined as the torsion H-C-centroid-metal. Values of 180° would indicate that the hydrogen was pointing away from the metal; values of 0° would indicate it was pointing towards the metal. The absolute value of the torsion is taken to avoid confusion induced by its sign.
ANG2 is one test whether the ring is planar. For an even membered ring, it is the angle subtended at the centroid to opposing carbon atoms. For odd membered rings it is the angle at the centroid involving one carbon and a centroid defined by an opposing pair of atoms (see below).
The quality of the crystal structure determination is controlled by specifying that the R value be < 5%, no errors, no disorder. Also, the terminal H-positions are normalised (to correct known errors in H distances deriving from X-ray diffraction). I would point out that in the early days, the actual positions of the hydrogen were often not actually determined, but "idealised". In this case this would mean that the H-C-centroid angle would probably be set to 180°. For perhaps the last 20 years or so however, the positions of hydrogen atoms have been routinely refined. Unfortunately, I know of no search query that can separate the two cases, and so we will have to live with the mixture and see what we get.
We define another constraint separately, which is that the temperature of the data collection sample is <140K. This ensures that the data will be less affected by vibrational/thermal noise and so should be rather more accurate.
Finally, a note on the topic of "research data management" or RDM. I have deposited the files defining the search query in a repository and have assigned DOIs both to the overall search collection[2] and to each individual search definition, the DOIs for which are shown below.
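For anyone wanting to reproduce the two key geometric parameters outside the search software, here is a minimal sketch of ANG1 (the H-C-centroid angle) and TOR1 (the absolute H-C-centroid-metal torsion) computed from Cartesian coordinates. The helper names and the idealised model (a regular five-membered ring of radius 1.2 Å, a C-H length of 1.08 Å, a metal 1.9 Å below the ring, a 5° out-of-plane bend) are all invented for illustration and are not taken from any crystal structure.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a):
    return math.sqrt(dot(a, a))


def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def angle_deg(a, b, c):
    """Angle a-b-c in degrees, with b at the vertex."""
    v1, v2 = sub(a, b), sub(c, b)
    cos_value = dot(v1, v2) / (norm(v1) * norm(v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_value))))


def abs_torsion_deg(p0, p1, p2, p3):
    """Absolute torsion p0-p1-p2-p3 in degrees (sign deliberately discarded)."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    unit_b2 = tuple(x / norm(b2) for x in b2)
    m1 = cross(n1, unit_b2)
    return abs(math.degrees(math.atan2(dot(m1, n2), dot(n1, n2))))


# Idealised model: planar five-membered ring in the xy plane, metal below it.
ring = [(1.2 * math.cos(2 * math.pi * k / 5),
         1.2 * math.sin(2 * math.pi * k / 5), 0.0) for k in range(5)]
ring_centroid = centroid(ring)
metal = (0.0, 0.0, -1.9)
carbon = ring[0]


def hydrogen(bend_deg):
    """H on ring[0]; positive bend tilts it away from the metal (towards +z)."""
    bend = math.radians(bend_deg)
    return (carbon[0] + 1.08 * math.cos(bend), 0.0, 1.08 * math.sin(bend))


ang1 = angle_deg(hydrogen(5.0), carbon, ring_centroid)
tor_away = abs_torsion_deg(hydrogen(5.0), carbon, ring_centroid, metal)
tor_towards = abs_torsion_deg(hydrogen(-5.0), carbon, ring_centroid, metal)
in_search_window = 140.0 < ang1 < 179.0

print(ang1, tor_away, tor_towards, in_search_window)
```

In this model a 5° bend gives ANG1 of 175°, with TOR1 of 180° when the hydrogen tilts away from the metal and 0° when it tilts towards it, which is the convention used in the plots.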

The 4-ring case.[3] Here the temperature constraint was relaxed, since there are few entries. The two red "hot-spots" occur at torsion angles of ~180° (hydrogen pointing away from metal) at bond angle values of between 173° and 176°.

The 5-ring case.[4] This includes the classic ferrocene example, the first metallocene for which the structure was correctly identified. There are many more examples, and this search is now constrained to <140K. The two hot spots occur at bond angles of very close to 180°, at which values the torsion itself becomes undetermined. That the hot spots actually occur at 0° and 180° and are not spread evenly across the right hand side axis is remarkable given this. There is a significant tail for the 180° torsion (H pointing away from metal) down to H-C-centroid angles of about 170°, but there is no evidence of this tail for torsions of 0°.

One more test must be applied to see if the 5-ring is planar or not. The deviation from planarity is only 2-3°, and there seems to be no correlation between lower values of the H-C-centroid bond angle and non-planarity.
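The planarity test (ANG2) described earlier can be sketched in the same spirit. For an even-membered ring the angle is taken at the ring centroid between a carbon and its opposite carbon; for an odd-membered ring the opposite point is the midpoint of the opposing pair of atoms. The function names and the model rings (radius 1.2 Å, one atom lifted by 0.3 Å to mimic puckering) are assumptions made for illustration only.

```python
import math


def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def angle_deg(a, b, c):
    """Angle a-b-c in degrees, with b at the vertex."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    cos_value = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_value))))


def planarity_angle(ring, i=0):
    """ANG2-style test: angle at the ring centroid between atom i and the
    opposite atom (even ring) or the midpoint of the opposite pair (odd ring)."""
    n = len(ring)
    ring_centroid = centroid(ring)
    if n % 2 == 0:
        opposite = ring[(i + n // 2) % n]
    else:
        opposite = centroid([ring[(i + n // 2) % n],
                             ring[(i + n // 2 + 1) % n]])
    return angle_deg(ring[i], ring_centroid, opposite)


def model_ring(n, radius=1.2, lift_first_atom=0.0):
    """Regular n-membered ring in the xy plane; atom 0 optionally lifted in z."""
    ring = [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n), 0.0) for k in range(n)]
    x, y, _ = ring[0]
    ring[0] = (x, y, lift_first_atom)
    return ring


planar_five = planarity_angle(model_ring(5))
planar_six = planarity_angle(model_ring(6))
puckered_five = planarity_angle(model_ring(5, lift_first_atom=0.3))

print(planar_five, planar_six, puckered_five)
```

A planar ring gives 180° in both the odd and even cases, and any puckering shows up as a smaller angle.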

The 6-ring case.[5] There are again numerous examples of data <140K for such rings. There is now a very distinct hotspot at angles of ~170° for the case/torsion where the hydrogen is pointing towards the metal.

This feature persists when the ring planarity is tested, and it occurs specifically for rings where the angle subtended at the centroid is ~180° and H-C-centroid angles of ~170°. So this is a clear-cut effect which demands explanation #1.

The 7-ring case[6] again shows a strong hot spot at ~172° for a torsion corresponding to the hydrogens pointing towards the metal. This hot spot is matched by angles subtended at the ring centroid that are close to 180° (i.e. planar). This is a clear-cut effect which demands explanation #2.

The 8-ring case[7] also shows a hot spot for hydrogens pointing towards the metal, and to a strikingly large degree (an angle of ~157°); this feature is associated with a linear C-centroid-C angle. This is a clear-cut effect which demands explanation (#3).

The 9- and 10-ring cases. There are no examples! Time to make some?

To summarise.

The above was done during a conference in response to a point made by one of the speakers. In fact, it proved possible to show the speaker the diagrams above <18 hours after he gave the talk.^‡
An immediate question that arose from this discussion was whether the hot-spots were artefacts of non-planar rings. So the ANG2 test was added to the plots the next day (today) as part of this dissemination.
Also discussed (yesterday) was how these conference insights might be shared. I suggested the forum here and Professor Streitwieser heartily agreed. Another alternative was to write it up as a regular journal article. But we both agreed that what you see here is just a statistical analysis. The next stage would be to individually inspect all the molecules which make up these statistics. It might just be that every molecule contributing to a "hot-spot" cluster has special circumstances which conspire to make it look as if there is an interesting chemical effect going on. It is unlikely that such coincidences could accrue in such a manner, but the possibility does have to be considered.
I think we both felt that a better way was to expose the basic effects here, as a sort of open science research project, and anyone interested could then (a) try to replicate these plots, which is why you will find the DOIs of datasets containing the definition files to assist in any such replication and (b) tunnel down to any specific hot spot to identify the precise chemical characteristics that might give rise to the geometrical effect.
This could then be followed up by computational analysis of the electronic properties which might give rise to the effect. This would in effect complete the cycle, since this was the starting point for Streitwieser's original talk. Remember, the theme of the celebration was the interplay between theory and experiment, a particular favourite of Schleyer's.
Regarding the chemical insights, a distinct trend over the ring sizes 4-8 can be seen. The 4-ring shows the hydrogens pointing away from the metal, the 5-ring could be said to be largely agnostic (remember the error in crystallographic angles is probably in the region 1-3°) whilst there is an indication that for the 6-8 rings the ring hydrogens tend to point towards the metal. I have summarised three key points illustrating this as #1-3 above.
It is tempting to conclude that a fairly general chemical effect is operating here over #1-3, although of course it could be a number of effects specific to each ring which merely look like a general trend.

So the chemical interpretation of this project is unfinished, a general feature of much of science of course. But my aim here was to give a flavour of how a scientific meeting at its best can bring together like (or often unlike) minds which can tease out new connections and lead perchance to new discoveries.

^‡ These hours were productively employed by sharing a Franconian banquet together, and a modicum of sleep, as well as the searches described above. And in case you see no citations at the bottom of this post, they too take about 48 hours to propagate through the CrossRef and DataCite systems. Be patient and they will appear.

^† In my original representation, I showed the Hs pointing towards the metal. In fact Prof Streitwieser has just contacted me reversing this orientation and correcting my recollection of his lecture.

References

H.S. Rzepa, "Discovering More Chemical Concepts from 3D Chemical Information Searches of Crystal Structure Databases", Journal of Chemical Education, vol. 93, pp. 550-554, 2015. https://doi.org/10.1021/acs.jchemed.5b00346
H. Rzepa, "Crystallographic searches of metallocene type complexes.", 2016. https://doi.org/10.14469/hpc/346
H. Rzepa, "4-Ring metallocene search query", 2016. https://doi.org/10.14469/hpc/347
H. Rzepa, "The 5-ring case.", 2016. https://doi.org/10.14469/hpc/348
H. Rzepa, "6-ring metallocene search queries", 2016. https://doi.org/10.14469/hpc/349
H. Rzepa, "7-ring metallocene search queries", 2016. https://doi.org/10.14469/hpc/350
H. Rzepa, "8-ring metallocene search queries", 2016. https://doi.org/10.14469/hpc/351


Does combining molecules with augmented reality have a future?

Monday, March 28th, 2016

Augmented reality, a superset if you like of virtual reality (VR), has really been hitting the headlines recently. Like 3D TV, it's been a long time coming! Since ~1994 or earlier, there have been explorations of how molecular models can be transferred from actual reality to virtual reality using conventional computers (as opposed to highly specialised ones). It was around then that a combination of software (Rasmol) and hardware (Silicon Graphics, and then soon after standard personal computers with standard graphics cards) became capable of such manipulations. VRML (virtual reality modelling language) also proved something of a false start.^‡ So have things changed?

Many of the posts on this blog have some element of such VR in the form of the Jmol or JSmol software (the natural successor to Rasmol) that allows a 2D projection of a 3D model to be manipulated in "real-time", allowing the geometrical features to be inspected and even reactions to be animated. Google cardboard is a (minor?) variation on the VR theme, allowing a 3D object to be viewed through a simple cardboard headset containing a mounted phone, but controlled by head movements acting on the accelerometers in the phone rather than a mouse or trackpad. But the full-blown experience is something else, and watching this TED video really brought it home to me. The virtual object, such as say a molecule, is superimposed upon one's view of the real world (AR) and this object can now be controlled with hands as well as eyes. Again, this is not new; so-called haptic control of virtual objects has been around for a decade or more, in which you can e.g. probe how "hard" an object is using a haptic or hands-on device such as a joystick. All of this quickly convinces one that the secret of successful use of VR and now AR to augment chemistry is going to be the software!

We now need inspired programmers to create the Rasmol/Jmol of augmented reality. But beyond mere software, chemistry with AR needs to be placed into the appropriate environment or context. One might presume this will include the stereoscopic video inputs from other AR headsets (the research team, the collaborators, etc) but what else? The pages of a blog? Or a journal article? Could indeed one recast the journal article itself into an AR scene, with the various components floating in space, with molecules conjured out of a table (or a synthetic procedure) to float in full 3D glory to be played with by the participants? I rather suspect this might be quite a few steps too many for most! Think how little ~22 years of the Web (and perhaps ~36 years of the Internet itself) has actually changed the construction (I do not mean the delivery) of the average scientific article. Even now, tables in which molecules can be treated interactively are extremely rare. Most of this is because authoring tools such as Microsoft Word have not yet made the production of such documents viable. So perhaps the augmented-reality scientific or chemical article may not be quite around the corner. Perhaps the AR hype will end in the same way that 3D TV appears to have. But unless we experiment, we will never know the answer. So if any reader of this blog knows of interesting work in chemistry AR, do drop me a line.

^‡ Virtual Reality Modelling Language (VRML) in Chemistry, O. Casher, C. Leach, C. S. Page and H. S. Rzepa, Chem. in Brit., 1998, 34(9), 26. But VRML has made a come-back as the language of choice for 3D printing!


Research data: Managing spectroscopy-NMR.

Wednesday, March 16th, 2016

At the ACS conference, I have attended many talks these last four days, but one made some “connections” which intrigued me. I tell its story (or a part of it) here.

But to start, try the following experiment.

Find a Word document of .docx type on your hard drive.
Remove the .docx suffix and replace it with a .zip suffix.
Expand as if it is an archive (it is!).
A folder is created and this itself contains four further folders. These all contain XML files, and in the sub-folder actually called word you will find something called document.xml. That file contains the visible content of the document; all the others are support documents, including styles etc.
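The same experiment can be tried without renaming anything, since a zip reader does not care about the suffix. The following is a minimal sketch using Python's standard zipfile module. To keep it runnable anywhere it builds a tiny stand-in archive in memory rather than reading a real Word file; the stand-in is not a valid .docx, and the function names are simply my own labels.

```python
import io
import zipfile


def list_parts(docx):
    """Return the names of all files inside a .docx (path or file-like object)."""
    with zipfile.ZipFile(docx) as archive:
        return archive.namelist()


def read_document_xml(docx):
    """Return the main document part, word/document.xml, as text."""
    with zipfile.ZipFile(docx) as archive:
        return archive.read("word/document.xml").decode("utf-8")


def make_stand_in():
    """Build a tiny stand-in archive in memory (NOT a valid Word file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<document>Hello</document>")
        archive.writestr("word/styles.xml", "<styles/>")
    buffer.seek(0)
    return buffer


sample = make_stand_in()
print(list_parts(sample))         # the files held in the archive
sample.seek(0)
print(read_document_xml(sample))  # the XML carrying the visible content
```

To try it on a real document, pass the path of any .docx file to list_parts in place of the stand-in.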

The reason this is important was made clear in Santi Dominguez’ talk. Most of it was concerned with introducing Mbook, an ELN (electronic laboratory notebook) but the relevance to the above comes from his introduction of Mpublish, a forthcoming product targeting the area of research data management. What is the connection? Well, NMR spectrometers produce raw outputs as collections of files, much in the manner of the exploded Word document above. Some files contain the raw FID, others contain the acquisition parameters, etc. These files are then turned into the traditional spectra by suitable processing software such as Mestrenova (part of the same ecosystem as Mpublish). Most users of such programs then squirt the spectra into a PDF file and it is this last document that is preserved as “research data”; almost invariably this is the version sent off to journals as the supporting information or SI for the article. SI is called information for a good reason; in such a container it is very often not easily usable data, and functions just visually.

So what is the problem? Well, the conversion of the NMR fileset (and quite possibly many other forms of spectroscopy) into a PDF file is a lossy process. It cannot be reversed; information has been lost. And it is really only a human who can easily retrieve and interpret such a visual presentation.

Santi described how Mpublish can assemble all the files associated with the instrumental outputs, optionally add chemical structure and other information, collect suitable metadata describing the contents and create a .zip archive. As we saw with Word however, the suffix does not even need to be .zip. It was suggested that it be this information-complete archive that should really be used as SI to accompany an article in which NMR data is invoked to support the narrative. In the reverse process, anyone downloading this zip archive could themselves potentially acquire full access, without information loss, to the original NMR data. There is a little further magic that needs to be included to make the process work which I do not include here. When Mpublish becomes available to play with, I will complete that story here.
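To make the idea of an information-complete archive concrete, here is a rough sketch of the general pattern: instrument files plus a metadata record packed into one zip container and recovered again byte for byte. This is emphatically not the Mpublish format, which I have not seen. The file names, the metadata fields and the name metadata.json are all placeholders of my own.

```python
import io
import json
import zipfile


def bundle(files, metadata):
    """Pack raw instrument files plus a JSON metadata record into one archive.

    files:    mapping of archive path -> raw bytes
    metadata: JSON-serialisable description of the contents
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in files.items():
            archive.writestr(name, payload)
        archive.writestr("metadata.json", json.dumps(metadata, indent=2))
    return buffer.getvalue()


def unbundle(blob):
    """Recover the files and the metadata record from an archive."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        metadata = json.loads(archive.read("metadata.json"))
        files = {
            name: archive.read(name)
            for name in archive.namelist()
            if name != "metadata.json"
        }
    return files, metadata


# Placeholder contents standing in for a real NMR fileset.
raw = {
    "fid": bytes(range(16)),          # stand-in for the binary FID
    "acqus": b"##$SFO1= 400.13\n",    # stand-in for acquisition parameters
}
info = {"technique": "NMR", "nucleus": "1H"}

blob = bundle(raw, info)
files, metadata = unbundle(blob)
print(files == raw, metadata == info)  # prints: True True
```

The point of the round trip is that nothing is lost, in contrast to the PDF route where the numbers behind the picture cannot be recovered.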

It is good to report that software is starting to appear which enhances the management and reporting of research data as part of the publication process. The “rules” and “best practice” of this game are still being written however. In this regard, I feel that it is the researchers themselves that must play a vital role in defining the rules. Let us not cede that role just to publishers.


Global initiatives in research data management and discovery: searching metadata.

Monday, March 7th, 2016

The upcoming ACS national meeting in San Diego has a CINF (chemical information division) session entitled "Global initiatives in research data management and discovery". I have highlighted here just one slide from my contribution to this session, which addresses the discovery aspect of the session.

Data, if you think about it, is rarely discoverable other than by intimate association with a narrative or journal article. Even then, the standard procedure is to identify the article itself as being of interest, and then to dig out the "supporting information", which normally takes the form of a single paginated PDF document. If you are truly lucky, you might also get a CIF file (for crystal structures). But such data has little life of its own outside of its parent, the article. Put another way, it has no metadata it can call its own (metadata is data about an object, in this case research data). An alternative is to try to find the data by searching conventional databases such as CAS, Beilstein/Reaxys or CSD, and there of course the searches can be very precise. But someone has to pay the bills for such accessibility.

We are now starting to see quite different solutions to finding data (the F in FAIR data, the other letters representing accessibility, interoperability and re-usability). These solutions depend on metadata being a part of the solution from the outset, rather than any afterthought produced as a commercial solution. The collection of metadata is part of the overall process called RDM, or research data management, perhaps even the most important part of it. In exchange for identifying metadata about one's data, one gets back a "receipt" in the form of a persistent identifier for the data, more commonly known as a DOI. The agency that issues the DOI also undertakes to look after the donated metadata, and to make it searchable. The table below shows eight searches of such metadata, one example of how to acquire statistics relating to the usage of the data and one search of how to find repositories containing the data.

Search queries enabled by the use of metadata in data publication
#	Search query^*	Instances retrieved:
1	http://search.datacite.org/ui?q=alternateIdentifier:InChIKey:*	InChI key
2	http://search.datacite.org/ui?q=alternateIdentifier:InChI:*	InChI identifier
3	http://search.datacite.org/ui?q=alternateIdentifier:InChIKey:CULPUXIDFLIQBT-UHFFFAOYSA-N	InChI key CULPUXIDFLIQBT-UHFFFAOYSA-N
4	http://search.datacite.org/ui?q=ORCID:0000-0002-8635-8390+alternateIdentifier:InChIKey:*	ORCID 0000-0002-8635-8390 AND (boolean) InChI key.
5	http://search.datacite.org/ui?q=ORCID:0000-0002-8635-8390+alternateIdentifier:InChI:InChI=1S/C9H11N5O3*	ORCID 0000-0002-8635-8390 AND (boolean) + InChI string 1S/C9H11N5O3 with the * wild.
6	http://search.datacite.org/ui?q=has_media:true&fq=prefix:10.14469	Has content media^‡ for Publisher 10.14469 (Imperial College)
7	http://search.datacite.org/ui?q=format:chemical/x-*	Data format type chemical/x-*
8	http://search.datacite.org/api?&q=prefix:10.14469& fq=alternateIdentifier:InChIKey:& fl=doi,title,alternateIdentifier& wt=json&rows=15 http://api.labs.datacite.org/works?q=prefix:10.14469+AND+alternateIdentifier:InChIKey:	First 15 hits in JSON format, batch query mode
9	http://stats.datacite.org/?fq=datacentre_facet:"BL.IMPERIAL – Imperial College London"	resolution statistics for publisher 10.14469 (Imperial College) per month
10	http://service.re3data.org/search?query=&subjects[]=31 Chemistry	Research data repository search for Chemistry (135 hits)

^‡ In this instance the three MIME media types are chemical/x-wavefunction, chemical/x-gaussian-checkpoint and chemical/x-gaussian-log. See[1] for chemical MIME (multipurpose internet mail extensions).

Anyone familiar with the standard ways of finding data (CAS, CSD, Reaxys) will appreciate that the above does not yet have the finesse to find e.g. sub-structures of chemical structures, synthetic procedures or molecular properties. My including it here is primarily to show some of the potential such systems have, and to remark particularly that the batch query capability of this infrastructure could indeed be used in the future to construct much more sophisticated systems. Oh, and to the end-user at least, the searches shown above do not require institutional licenses to use. Both the data and its metadata are free, mostly with a CC0 or CC BY 3.0 license for re-use (the R of FAIR).
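As a hint of how the batch query mode might be scripted, the sketch below assembles a query string of the kind shown in row 8 of the table. It only builds the URL and makes no network request, since the endpoint is quoted from the table as it stood at the time and may well have moved since. The function name is my own, and the trailing * wildcard on the identifier filter is an assumption copied from the pattern of rows 1 and 2.

```python
from urllib.parse import urlencode

# Endpoint as quoted in row 8 of the table; it may no longer resolve.
BASE = "http://search.datacite.org/api"


def batch_query_url(prefix, identifier_type="InChIKey", rows=15):
    """Build a batch query URL for one DOI prefix, returning JSON.

    The wildcard after the identifier type is an assumption modelled
    on rows 1 and 2 of the table.
    """
    params = {
        "q": f"prefix:{prefix}",
        "fq": f"alternateIdentifier:{identifier_type}:*",
        "fl": "doi,title,alternateIdentifier",
        "wt": "json",
        "rows": rows,
    }
    # Keep ':', ',' and '*' readable rather than percent-encoded.
    return f"{BASE}?{urlencode(params, safe=':,*')}"


print(batch_query_url("10.14469"))
```

Looping such a function over a list of prefixes or identifier types is the sort of thing I have in mind by constructing more sophisticated systems on top of this infrastructure.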

If more of interest related to this topic emerges at the ACS session, I will report back here.

References

H.S. Rzepa, P. Murray-Rust, and B.J. Whitaker, "The Application of Chemical Multipurpose Internet Mail Extensions (Chemical MIME) Internet Standards to Electronic Mail and World Wide Web Information Exchange", Journal of Chemical Information and Computer Sciences, vol. 38, pp. 976-982, 1998. https://doi.org/10.1021/ci9803233


LEARN Workshop: Embedding Research Data as part of the research cycle

Monday, February 1st, 2016

I attended the first (of a proposed five) workshops organised by LEARN (an EU-funded project that aims to ...Raise awareness in research data management (RDM) issues & research policy) on Friday. Here I give some quick bullet points relating to things that caught my attention and/or interest. The program (and Twitter feed) can be found at https://learnrdm.wordpress.com where others' comments can also be seen.

Henry Oldenburg, founder member and first secretary of the Royal Society, was the first Open Scientist.
About 100 people attended the workshop. Of these ~3-5 identified themselves as researchers creating data, and the rest comprised research data managers, administrators, librarians, publishers (but see below) etc. Many were new to their posts.
Not publishing scientific data should become recognised as scientific malpractice.
Central libraries should pro-actively disperse their knowledge to data scientists in departments.
If a scientist is concerned that openly publishing their data might give advantage to their competitors, they are urged to counteract this by "being cleverer than the others".
The three great bastions of open science are (a) Open Data, (b) Open access articles and (c) doing science openly. Examples of this third category include open notebook science (ONS), a form notably pioneered by Jean-Claude Bradley. One attribute of ONS was noted as no insider knowledge.
Learned societies should endow medals for Open Science.
(Some) publishers are reinventing themselves as Research Facilitators.

The plenaries are all well worth dipping into (certainly the video and in some cases all the slides are scheduled to appear).

If you are a researcher (undergraduate students, PGs, PDRAs, early career researchers and academics) you should immediately track down your local evangelist/expert in RDM and ask what the local infrastructures are (or will be shortly built).

