Deviations from tetrahedral four-coordinate carbon: a statistical exploration.

September 6th, 2015

An article entitled “Four Decades of the Chemistry of Planar Hypercoordinate Compounds”[1] was recently reviewed by Steven Bachrach on his blog, where you can also see comments. Given the recent crystallographic themes here, I thought I might try a search of the CSD (Cambridge Structural Database) to see whether anything interesting might emerge for tetracoordinate carbon.

The search definition is shown below using a simple carbon with four ligands, the ligands themselves also being tetracoordinate carbon. The search is restricted to data collected at temperatures below 140K, as well as R-factor <5%, no errors and no disorder. Cyclic species are allowed and a statistically reasonable 2773 hits emerged from the search.
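The quality filters used throughout these searches can be sketched in a few lines of Python. This is only an illustration of the logic: the `Entry` class, its field names and the example records are all made up for the purpose, and nothing here uses Conquest or the real CSD schema.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical summary of one crystal structure (not the CSD schema)."""
    refcode: str
    r_factor_percent: float
    temperature_k: float
    has_errors: bool
    has_disorder: bool


def passes_filters(entry, max_r=5.0, max_temp=140.0):
    """Apply the filters quoted in the text: R < 5%, T < 140 K, no errors, no disorder."""
    return (
        entry.r_factor_percent < max_r
        and entry.temperature_k < max_temp
        and not entry.has_errors
        and not entry.has_disorder
    )


# Invented records, purely to exercise the filter.
entries = [
    Entry("EXAMPLE01", 3.2, 100.0, False, False),
    Entry("EXAMPLE02", 6.1, 100.0, False, False),  # R-factor too high
    Entry("EXAMPLE03", 4.0, 293.0, False, False),  # collected at room temperature
    Entry("EXAMPLE04", 2.5, 120.0, False, True),   # disordered
]

kept = [e.refcode for e in entries if passes_filters(e)]
print(kept)
```

Dropping the `max_temp` condition gives the looser filter used in most of the other searches on this blog.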

Scheme

Recollect that the idealised angle subtended at the centre is 109.47°. I show below three separate heat plots of the search results. Why three? The way the search software (Conquest) works is that one could define four C-C distances and six angles, and then plot any combination of one distance and one angle. I show just three combinations here, but could have included many more.

There appear to be four distinct clusters of values for this angle that emerge from the three plots shown below (the “bin size” is 100, and the frequency colour code indicates how many hits there are in each bin).

  1. The hotspot is unsurprisingly ~109° with a corresponding C-C distance of ~1.54Å.
  2. There may be two clusters at angles of ~60° (cyclopropane), with C-C values ranging from ~1.47 to ~1.55Å.
  3. A collection at ~90° (mostly cyclobutane?), with C-C values up to 1.6Å.
  4. A collection at ~140° (again small rings), now with much shorter C-C values of ~1.46Å. This reminds one of the approximation that the hybridisation in e.g. cyclopropane is a combination of sp5 and sp3.

Scheme

Scheme

Scheme

Ideally, what one might want to plot would be sums of four angles; for a pure tetrahedral carbon the sum would always be 438° (4*109.47°) but for a pure planar carbon it could be as low as 360° (4*90°). One could then see how closely the distribution approaches the latter and hence reveal whether there are any true planar tetracoordinate carbon species known. Although the Conquest software cannot analyse in such terms, a Python-based API has recently been released that should allow this to be done, although I should state that this requires a commercial license and it is not open access code. If we manage to get it working, I will report!
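Independently of any particular API, the angle-sum idea can be sketched with plain coordinates. The sketch below makes one assumption that is mine rather than anything defined above: of the six ligand-C-ligand angles it sums the four smallest, which gives 4*90° for an ideal square-planar centre and 4*109.47° for an ideal tetrahedron. The function names and the two idealised geometries are illustrative only.

```python
import math
from itertools import combinations


def angle_deg(centre, a, b):
    """Angle a-centre-b in degrees from Cartesian coordinates."""
    va = [p - c for p, c in zip(a, centre)]
    vb = [p - c for p, c in zip(b, centre)]
    dot = sum(x * y for x, y in zip(va, vb))
    norm_a = math.sqrt(sum(x * x for x in va))
    norm_b = math.sqrt(sum(x * x for x in vb))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_value = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_value))


def four_angle_sum(centre, ligands):
    """Sum of the four smallest of the six ligand-centre-ligand angles (an assumed convention)."""
    angles = sorted(angle_deg(centre, a, b) for a, b in combinations(ligands, 2))
    return sum(angles[:4])


origin = (0.0, 0.0, 0.0)
# Idealised geometries, not real crystal structures.
tetrahedral = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
square_planar = [(1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0)]

print(round(four_angle_sum(origin, tetrahedral), 2))
print(round(four_angle_sum(origin, square_planar), 2))
```

Applied to every hit from a search, a histogram of this sum would show directly how far the distribution extends from ~438° towards 360°.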


As a teaser I also include a plot of six-coordinate carbon, in which the ligands can be any non-metal. Note the clusters at angles of 60, ~112 and ~120-130°. It is worth pointing out that the definition of the connection between a carbon and a ligand as a “bond” becomes increasingly arbitrary as the coordination becomes “hyper”. Because crystallography does not measure electron densities in “bonds”, we know nothing of its topology in this region. It is therefore quite possible that the appearance of the heat plot below might be related just as much to whatever convention is being used in creating the entry in the CSD as it would be to a quantum analysis of the bonding.

Scheme

References

  1. L. Yang, E. Ganz, Z. Chen, Z. Wang, and P.V.R. Schleyer, "Four Decades of the Chemistry of Planar Hypercoordinate Compounds", Angewandte Chemie International Edition, vol. 54, pp. 9468-9501, 2015. https://doi.org/10.1002/anie.201410407

π-Resonance in thioamides: a crystallographic "diff" with amides.

September 5th, 2015

The previous post explored the structural features of amides. Here I compare the analysis with that for the closely related thioamides.

Scheme

Here is the torsional analysis around the C-N bond. The “diff” (difference) is that almost all the hits are concentrated into angles of 0° or 180°; the twist about the C-N bond from co-planarity is much less if S is present. This is normally explained in terms of Spπ-Cpπ overlaps being less favourable than Opπ-Cpπ ones owing to the mismatch in the size of the atomic orbital for S and C. Hence the resonance which reduces the C=S double bond character in favour of greater C=N character is enhanced compared to O.

Scheme

A consequence is that the nitrogen atom is less easily deformed from planarity in a thioamide. Notice also that at the hotspot, the C=N distance is ~1.32Å compared to 1.34Å for a regular amide.

Scheme

This emerges from the plot below as well; the range of values for the C-N bond is reduced compared to amides, but the diagonal trend that as the C=N bond gets longer so the C-S gets shorter is still seen.

Scheme

All these trends are described qualitatively in most text books of organic chemistry, but one never sees statistical evidence for them. And it truly only takes 5-10 minutes to produce.

π-Resonance in amides: a crystallographic reality check.

September 5th, 2015

The π-resonance in amides famously helped Pauling to his proposal of a helical structure for proteins. Here I explore some geometric properties of amides related to the C-N bond and the torsions about it.

Scheme

The key aspect of amides is that a lone pair of electrons on the nitrogen can conjugate with the C=O carbonyl only if the lone pair orbital is parallel to the C-O π-system. We can define this with the O=C-N-R torsion angle (and equate 0 or 180° with the p-orbitals being parallel). In the above definition, each R can be either 4-coordinate C (to avoid alternative conjugations) or H and the C-N bond is specified as being cyclic. As usual the R-factor is < 5%, no errors, no disorder.

First, the C-N torsion, which adopts values of either 0 or 180°. Notice that whilst the anti R-group shows no more than about 20° deviation from 180°, it does have a small tail tending towards longer C-N distances of >1.4Å. The hotspot is for the syn R-group. Here there is a strong trend that as the dihedral deviates from 0° the C-N bond very clearly elongates. As the π-π overlap decreases, the bond elongates from the hotspot value of ~1.34Å to 1.41Å at 50°. The greater propensity of the syn-R to twist may be because it incurs more steric hindrance or perhaps because we have defined the C-N bond to be part of a cycle.

Scheme

Next, we plot the C-N distance against the torsion R-N-C-R’, which defines how planar the nitrogen is. A value of 180° is planar and the hotspot is here. But as the planarity decreases down to almost tetrahedral (110°) the C-N bond elongates to 1.41Å. Notice one rather intriguing aspect; from 180° to 160° or so, there is little response from the C-N bond, but the elongation really accelerates from 140° to 110°. A little twisting hardly affects the π-π overlap, but it really starts to matter for twists of >50°.
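The remark that a little twisting hardly matters can be made semi-quantitative with the usual textbook approximation that the overlap between two adjacent p-orbitals scales as the cosine of the angle between their axes. This is a simplification introduced here for illustration, not something derived from the crystal data, and the function name is hypothetical.

```python
import math


def relative_p_overlap(twist_deg):
    """Overlap of two p-orbitals twisted by twist_deg, relative to the parallel case (cosine model)."""
    return abs(math.cos(math.radians(twist_deg)))


# Tabulate the relative overlap for a few representative twists.
for twist in (0, 10, 20, 50, 70, 90):
    print(f"{twist:3d} deg  {relative_p_overlap(twist):.2f}")
```

In this model a 20° twist retains well over 90% of the overlap, whereas by 50° about a third has been lost, which is at least consistent with the flat region followed by an accelerating elongation seen in the plot.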

Scheme

Finally a plot of the C-N vs the C-O distances. As the C-N increases, the C-O contracts, this being a nice summary of the π resonance in amides. 

Scheme

We have not seen any surprises, but this statistical exploration of crystal structures at least puts some numbers on the changes in bond lengths as a result of conjugative resonance.

A sea-change in science citation? The Wikipedia Science conference.

September 3rd, 2015

The first conference devoted to scientific uses of Wikipedia has just finished; there was lots of fascinating stuff but here I concentrate on one report that I thought was especially interesting. To introduce it, I need first to introduce WikiData. This is part of the WikiMedia ecosystem, and one of the newest. The basic concept is really simple.

  1. It is a repository for data objects; 14,757,419 of them as I write this to be precise. These are called items, and each has an ID, prefaced with the letter Q. An example might be mauveine, which is Q421898.
  2. Any item can have one or more properties which can only be selected from a controlled list, of which there are around 3000. An example here might be an individual’s ORCID identifier, which is P496. ORCID is also an item, Q51044.
  3. As with Wikipedia itself, WikiData also has a set of community rules which contributors have to follow. One of the rules in Wikipedia (the COI or conflict of interest rule) discourages an interested party from making a significant contribution to any page. Items in Wikidata have a looser COI code; facts are basically facts rather than opinions, but their provenance still matters of course.

With the basic structure set out, I will now describe what I heard today.

  1. An item can be a citation (to the scientific published literature), of which there are around 76 million, although nothing like that number is currently in WikiData.
  2. One of the properties of a citation item is its DOI or digital object identifier (P356), which would nowadays be regarded as in effect mandatory. Citations can have other properties, which would be populated from CrossRef or DataCite such as metadata associated with e.g. the DOI itself; the journal, the authors, the date, etc. Citations from DataCite can in fact have far richer metadata than the usual; if you follow this link you can see an example of such data properties.
  3. But here is the new stuff. Citations as items can have more subtle properties. Thus a citation could be invoked with a property: A disagrees with B, where A and B are both items (or perhaps properties).

You can see from this that allowing a citation to have such properties can potentially revolutionise the way a scientific article can be constructed. When a citation is invoked, the context the authors wish that citation to have can be added. Contrast this with the context-free way in which articles currently cite other articles. And as with anything in WikiData, instances can be counted, the context in which instances occur can be identified and statistics accumulated.
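To make the item/property idea concrete, here is a minimal in-memory sketch. It is not the WikiData API; the classes, the item IDs, the placeholder DOIs and in particular the "disagrees with" property ID are all invented for illustration. Only P356 (DOI) is taken from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    property_id: str  # e.g. "P356" for a DOI
    value: str


@dataclass
class Item:
    item_id: str
    label: str
    statements: list = field(default_factory=list)

    def add(self, property_id, value):
        self.statements.append(Statement(property_id, value))

    def values(self, property_id):
        return [s.value for s in self.statements if s.property_id == property_id]


DOI = "P356"
# Hypothetical property ID: no such identifier is claimed to exist in WikiData.
DISAGREES_WITH = "P_HYPOTHETICAL_DISAGREES"

# Placeholder items and DOIs, purely illustrative.
citation_a = Item("Q_EXAMPLE_A", "citation A")
citation_b = Item("Q_EXAMPLE_B", "citation B")
citation_a.add(DOI, "10.0000/example-a")
citation_b.add(DOI, "10.0000/example-b")
citation_a.add(DISAGREES_WITH, citation_b.item_id)


def count_uses(items, property_id):
    """Count how often a property is asserted across a set of items."""
    return sum(len(item.values(property_id)) for item in items)


print(count_uses([citation_a, citation_b], DISAGREES_WITH))
```

The counting function is the point: once the context of a citation is itself a property, statistics on how citations are used become a simple query.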

The way it might work is not so much that any interested reader (a human) would browse through WikiData. Instead it is something that a machine (software) might invoke. In Wikipedia for example, one can transclude or subsume into the article an item from Wikidata. This could be a citation, which you could transclude with one or more associated properties. In chemistry at the moment, the most prominent objects that are constructed from such Wikidata transclusions are ChemBoxes, or tables of properties of molecules as items (Q52426). This is done dynamically at the time of reading the Wikipedia article and so you can imagine that such transclusions can respond as the values of properties are updated/corrected/extended. Unfortunately I do not (yet) know of a good example of all of this which can be linked to here. If any do come to light, I will try to remember to add them here.

As often happens, the concepts above are not entirely new; many were already present in a variation of the Wiki called the Semantic MediaWiki and experiments in chemistry were tried as early as 2007.‡ But WikiData is far easier to use and in symbiosis with a conventional Wiki it might just start to fly now.

The implications of all of this for the way in which a scientific article might work are deep from many different perspectives. I do wonder whether all this data-rich context in which a scientific article or narrative might be couched will be welcomed by either publishers or indeed authors. Perhaps the emotions that humans have but which machines do not will in fact dominate. But it does appear to have the potential for a sea-change in how scientists exchange information.


The number of itemised molecules recently reached 100 million, and there are a few thousand (>? <?) well defined properties that can be associated with molecules. So the whole of known molecular chemistry is actually not that different in scale from the current Wikidata.

‡ Semantic wiki as a model for an intelligent chemistry journal, Rzepa, Henry S. Abstracts of Papers, 233rd ACS National Meeting, Chicago, IL, United States, March 25-29, 2007, CINF-053. Abstract and talk.

A tourist trip around London Overground with a chemical theme.

August 29th, 2015

Most visitors to London use the famous underground trains (the “tube”) or a double-decker bus to see the city (one can also use rivers and canals). So I thought, during the tourism month of August, I would show you an alternative overground circumnavigation of the city using the metaphor of benzene.

Benzene you see is a ring, comprising three “HCCH” segments. The so-called Kekule vibration in benzene (the b2u mode for anyone interested) induces three pairs of carbon atoms to repeatedly travel towards each other and then reverse and travel away from each other. One can also travel in this manner using the London Overground train system. The first segment connects Clapham Junction (yes, more or less the same Clapham of Kekule’s omnibus) to Willesden Junction. A second segment goes from there to Highbury and Islington, and a third from there on to Clapham again to complete the cycle in the clockwise direction. Since trains travel in both directions on each of the three segments, one can (like a carbon atom) oscillate to and fro in any segment, or (like an electron) circulate all the way round (no doubt either diatropically or paratropically with respect to the earth’s magnetic field). Yes, the metaphors are rather contrived; sorry but it is August after all.

Here are some photos. The first is along the Clapham/Willesden Junctions section, showing the new chemistry building at Imperial College in the early stages of construction. This will be part of the new White City campus about 5km west of the original South Kensington one. The completed buildings on the right are residences, and the whole site used to be where BBC Enterprises first marketed its productions worldwide and not far from where the BBC television studios broadcast from until recently.

Scheme

This is at Clapham Junction itself, platform 1 of 18.

Scheme

This is also along this segment (Imperial’s very own station :-). Way out indeed!

Scheme

And the Thames finally, looking east. On the left is the very exclusive Chelsea Harbour apartment complex, some of the most expensive in London. Residents commute by boat rather than train. In the distance somewhere are London and Tower bridges.

Scheme

A visualization of the anomeric effect from crystal structures.

August 27th, 2015

The anomeric effect is best known in sugars, occurring in sub-structures such as RO-C-OR. Its origins relate to how the lone pairs on each oxygen atom align with the adjacent C-O bonds. When the alignment is 180°, one oxygen lone pair can donate into the C-O σ* empty orbital and a stabilisation occurs. Here I explore whether crystal structures reflect this effect.

Scheme

The torsion angles along each O-C bond are specified, along with the two C-O distances. All the bonds are declared acyclic, and the usual R < 5%, no disorder and no errors specified.

  1. You can see from the plot below that the hotspot occurs when both RO-CO torsions are ~65°. From this we will assume that the two (unseen) lone pairs at any one of the oxygens are distributed approximately tetrahedrally around each oxygen, and if this is true then one of them must by definition be oriented ~180° with respect to the same RO-CO bond (the other is therefore oriented -60°). This allows it to be antiperiplanar to the adjacent C-O bond and hence interact with its σ* empty orbital. So the hotspot corresponds to structures where BOTH oxygen atoms have lone pairs which interact with the adjacent O-C antibonding (σ*) orbital.
  2. There is a tiny cluster for which both RO-CO torsions are ~180° and hence neither oxygen has an antiperiplanar lone pair.
  3. Only slightly larger are clusters where one torsion is ~65° and the other ~180°, meaning that only one oxygen has an antiperiplanar lone pair.
  4. A plot of the two C-O lengths indeed shows an overall hotspot at ~1.40Å for both distances. If the search is filtered to include only torsions in the range 150-180°, the hotspot value increases to 1.415Å for both. If one torsion is restricted to 40-80° and the other to 150-180° the hotspot shows one C-O bond is about 0.012Å shorter than the other.
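The lone-pair argument in the list above is simple arithmetic and can be sketched as follows. The assumption, as in the text, is an idealised tetrahedral oxygen, so that in a Newman projection along O-C the two lone pairs sit at +120° and -120° from the R group; the function names and the 30° tolerance are my own choices.

```python
def wrap_deg(angle):
    """Wrap an angle into the range (-180, 180]."""
    wrapped = (angle + 180.0) % 360.0 - 180.0
    return 180.0 if wrapped == -180.0 else wrapped


def lone_pair_torsions(r_o_c_o_torsion):
    """Torsions of the two lone pairs about O-C, assuming idealised +/-120 degree offsets from R."""
    return tuple(wrap_deg(r_o_c_o_torsion + offset) for offset in (120.0, -120.0))


def has_antiperiplanar_lone_pair(r_o_c_o_torsion, tolerance=30.0):
    """True if either lone pair lies within `tolerance` of 180 degrees to the adjacent C-O bond."""
    return any(
        abs(abs(lp) - 180.0) <= tolerance
        for lp in lone_pair_torsions(r_o_c_o_torsion)
    )


for torsion in (65.0, 180.0):
    print(torsion, lone_pair_torsions(torsion), has_antiperiplanar_lone_pair(torsion))
```

A torsion of 65° places one lone pair within a few degrees of antiperiplanar, whereas a torsion of 180° places both lone pairs gauche (about ±60°), matching the classification of the clusters above.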

Scheme

Scheme

I also include a further constraint, that the diffraction data must be collected below 140K. The hotspot moves to ~ 55/60° indicating values free of some vibrational noise.

Scheme

Interestingly, replacing oxygen with nitrogen reveals relatively few examples of the effect (C(NR2)4 is an exception). Replacing O by divalent S produces only 13 hits, with the surprising result (below) that in all of them only one S sets up an anomeric interaction. Arguably, the number of examples is too low to draw any firm conclusions from this observation.

Scheme


Most diffractometers measure low angle scattering of X-rays by high density electrons. These are the core electrons associated with a nucleus rather than the valence electrons associated with lone pairs. Hence very few positions of valence lone pairs have ever been crystallographically measured.

Mesomeric resonance in substituted benzenes: a crystallographic reality check.

August 26th, 2015

Previously, I showed how conjugation in dienes and diaryls can be visualised by inspecting bond lengths as a function of torsions. Here is another illustration, this time of the mesomeric resonance on a benzene ring induced by an electron donating substituent (an amino group) or an electron withdrawing substituent (cyano).

Scheme

In both cases, you can see this resonance showing as a lengthening of the C(ipso)-C(ortho) and C(meta)-C(para) bonds, and a contracting of the C(ortho)-C(meta) bonds. Is this reflected in the measured structures? The usual search is applied (R < 5%, no disorder, no errors) and qualified with the following:

  1. The amino has three bonds, and can bear either H, or 4-bonded carbon only.
  2. R on the ring can be either H or C.
  3. Three distances are defined.

Scheme

The results of a search are shown below; the hotspot shows the C-C(ortho) distance is close to 1.40Å, whilst the corresponding value for C(ortho)-C(meta) is 1.38Å, a contraction of ~0.02Å. The contraction is smaller for phenols (~0.01Å).

Scheme

The C(ortho)-C(meta) vs C(meta)-C(para) amino plot shows a cluster of hotspots for which the former (1.38Å) is shorter than the latter (~1.39Å) but the effect is less clear cut as the distance from the substituent increases.

Scheme

For an electron withdrawing cyano substituent, C(ipso)-C(ortho) at 1.395Å is longer than C(ortho)-C(meta) at 1.385Å, although the difference seems smaller than for the amino substituent. The C(ortho)-C(meta) to C(meta)-C(para) comparison is similar.

Scheme

Scheme

These searches take but a few minutes to perform, and do serve as a reality check on the oft-seen mesomeric π-resonance shown in all organic text books.

A visualisation of the effects of conjugation; dienes and biaryls.

August 25th, 2015

Here is another exploration of simple chemical concepts using crystal structures. Consider a simple diene: how does the central C-C bond length respond to the torsion angle between the two C=C bonds?

arm1

The search of the CSD (Cambridge Structural Database) is constrained to R < 5%, no errors and no disorder and the central C-C bond is specified to be acyclic.

arm1

  1. Note first that the hotspot occurs for a torsion angle of 180°, a trans diene.
  2. There is just a hint that the C-C distance for a cis-diene might be a little shorter than the trans diene, but this might not be significant.
  3. There is a gentle curve illustrating that the C-C distance is indeed a maximum at 90°.
  4. The C-C bond extends from ~1.445Å when the two double bonds are coplanar (fully conjugated) to ~1.48Å when orthogonal. Not much of a change, but statistically highly significant.

Here is another search, this time of the C=C-C=C motif embedded into a biaryl, of which there are far more examples. This time, the (red) hotspot is actually at 90°, with local (green) hotspots at 0 and 180° but also at 45 and 135°. Again, you can easily spot the maximum in C-C bond length at 90° but notice how much smaller the bond lengthening is (~0.01Å). This lengthening is inhibited by retention of the aromaticity of the two aryl rings; again the statistical effect is highly significant. Perhaps also significant is that the C-C bond at torsions of 0 or 180° appears to be no shorter than the values at 45 and 135°.

arm1

arm1

These searches took about 5 minutes each, and serve to illustrate just how many basic chemical concepts can be teased out of a statistical analysis of crystal structures.


The analogous diagram for O=C-C=C is shown below;

arm1

That for  O=C-C=O is different however;

arm1

A (light) introductory tutorial on Research Data Management (in chemistry).

August 20th, 2015

Management of research (data) outputs is a hot topic in the UK at the moment, although the topic has been rumbling for five years or more. Most research-active higher educational establishments have or are about to publish general guidelines, which predominantly take the form of aspirational targets rather than actionable examples or use-cases. Because the concepts remain somewhat abstract, one can encounter questions from researchers such as “how should I go about achieving such RDM (research data management)?” I thought it might be useful for me to summarise here some key features in the form of an FAQ that can help answer that question. I will concentrate purely on the sub-set of chemistry about which I know most.


I will start by exploring the acronym FAIR data.

  • F is findable. This means that metadata is a key part of the process, since it is this information that allows the research data to be more easily found, not only by other humans but by software engines which specialise in such activity.
  • A is accessible. And easily so. Which means a standard identifier to get to the research data, with no paywalls, account registrations or other obstructions. It should ideally be possible to access data anonymously, without necessarily revealing personal information.
  • I is inter-operable. This is harder to define exactly, but the essence is that it should be possible to re-use the data in a context different from the original, and perhaps even outside the subject domain where it was created. For example, if data was collected using one specific instrument, it should be possible to use it without necessarily having access to either an identical instrument or to the software associated with that instrument.
  • R is reusable. There should be sufficient information about the data and its parameters to if necessary repeat its collection independently of the original, or to re-use it to start a new data collection. Reusable also means by software, and not just by a human.

The first two properties are easily achieved, since standard procedures can be used. The last two properties are potentially more difficult, since they require more intervention or thought by both the depositor and the re-user. So I will concentrate really on the first two, since by and large they will satisfy most of the general guidelines issued by funders and universities, but note that we must not in the medium to longer term forget the last two.


I will now list some typical types of data that I have personal experience of. As the community increasingly participates in such RDM, this list will expand by “crowd-sourcing”; if your type of data is not listed, do not give up! 

  1. Data generated by software without instrumental inputs, a good example of which are the outputs of computational chemistry. I have the most personal experience in this area, having been at it for ten years or more[1],[2] and examples are scattered throughout this blog (and in many of our recent research publications).
  2. Software developed as part of the data collection process and which might be required by others to re-use the data. An example of such was described in a previous post, and has been RDMed here.[3]
  3. Data generated by software associated with instrumental outputs. In chemistry this means spectrometers and other instruments, most of which now have computers which handle the data outputs. Specific examples might be crystal structures, NMR, IR, MS and optical (including chiroptical) spectra.
    • Crystal structures are the gold standard in RDM, since they fulfil all the requirements of FAIR and so merit a special mention here. In the last year, the Cambridge Structural Database (CSD) has implemented a standard access mechanism based on a digital object identifier (DOI).[4]
    • The end point of many other instrumental outputs are PDF files. These do not easily achieve the IR of FAIR (see my comment above), but we will admit the PDF format as a temporary expedient until the use of semantically richer formats increases (the gold example here being the CIF format for crystal structures). You can see an example of PDF files here as a fileset[5] describing 1H, 13C NMR, Mass spectrometry, ECD (electronic circular dichroism) and VCD (vibrational circular dichroism). Perhaps a better format for expressing many types of spectra is the Excel spreadsheet, which achieves a reasonable proportion of the IR aspirations of FAIR. Both expressions can be included in the collection. 
    • As a postscript to this list, I should mention that instrumental data is often found as:
      • raw (unreduced or unprocessed) data, which can be very large (e.g. Free induction decay time-domain data in NMR).
      • A version which has already been subjected to processing (Fourier transformed frequency-domain data in NMR, i.e. a spectrum). This is probably more suitable for archiving, but it’s a fine judgement.
      • As a rough rule of thumb, chemistry data intended for archival should be ~ < 1 Gb.
  4. Synthetic methodologies that describe the preparation and characterisation of molecules. You can see an example of such data here.[6] 

Now I come to how the (molecular) data is packaged, and this is best described in terms of its granularity. There are perhaps four classes:

  1. All the data is packaged into a single compressed (ZIP) archive. An example can be found here[7] containing coordinates for 134,000 molecules. If your interest is in just one of these molecules, then you could argue that this data does not fully conform to the F of FAIR, since it contains no information (metadata) about individual molecules.
  2. The next packaging is (in chemistry) for a specific molecule (or perhaps reaction). An example is again[5], which contains data about a specific molecule, and that molecule is itself defined by the inclusion of e.g. a Chemdraw file. Another example[6] relates to reaction information, and also includes spectroscopic data in the form of a JCAMP-DX file, which is semantically preferable to e.g. an Excel spreadsheet or just a PDF file. Most of the examples on this blog are in this category, relating to quantum chemical computations of a specific molecule.[8] I will concentrate here just on this second type of packaging.
  3. The most finely-grained packaging is at the molecular property level. To illustrate this, go visit e.g. the Wikipedia page for aspirin, where you will find a ChemBox containing property data. In the future, these ChemBox properties will be interactively populated from a data repository known as WikiData. This type of RDM is still developing, and I include it here as a placeholder and to counterbalance the first category above!
  4. This category is a little different from the previous three; it relates to a collection of packages, where the granularity of class 2 above is retained, but boxed up into a project collection.[9]

And now to look at the life cycle of some data.

  1. The data starts off as live. This is some sort of holding store which members of the group can access/contribute to. It can be a local sharepoint or a cloud-based resource such as DropBox, but it could still be a simple DVD or USB storage device.
    • We have for some ten years now used a locally built live data store (which is itself archived at Zenodo as software[10]) and which serves to track a user’s experiments, including initiation and completion dates and times, to serve as a simple interface for archival, to record published experiments and to flag requested data embargoes (see below) and to provide a search interface for all of this. Pretty much the description of an electronic (laboratory) notebook. We created our own[2] because few commercial products (either ten years ago, or even now) offer the ability to seamlessly incorporate a Publish workflow which automates all the required actions of RDM as described here, and because it is something we might want to do 5-20 times a day. If your requirement is much less, such automation may not be needed.
  2. When the data is stable and edited down to that which needs to be associated with an article (the narrative), it now needs archiving in a manner that will ensure its persistence for at least a decade or even longer.
  3. Associated metadata describing the data also now needs to be assembled and this combined package is now sent to a data archive. These archives have special characteristics, one of which is that they can issue a persistent identifier we know as the DOI. This itself is issued by a registry, which for data is usefully done by an organisation known as DataCite. If desired, two or more of these packages can be associated with a collection, and the collection itself can also be given a DOI.[9]
  4. A copy of the metadata is sent to DataCite when the DOI is issued. The search engine that indexes this information is also at DataCite.
  5. Now all that needs doing is that the Data DOIs are all cited in the article to be published, or you can (also or instead) cite the DOI for a collection. An accepted article is itself issued in due course with a DOI (this time by an agency known as CrossRef on behalf of the publisher). 
  6. To complete the virtuous cycle, the article DOIs can be retrospectively added to the metadata for each data package (or the collection of packages), ensuring that the data references the narrative, and that the narrative references the data. 
  7. You will note from the virtuous cycle in item 5, that timing becomes important. You have to archive the data and mint a DOI in order to cite it in an article. This sounds like publishing the data before the article has been accepted, which would have the advantage that referees could access it as part of their QA process for the article. However, it may be more suitable to simply reserve a DOI for the data for inclusion in an article, but not make it public until that article has itself been accepted and published. This process is called embargoing; I will defer discussion of this, because this tends to vary according to repository and its implementation is still evolving.
  8. The final action might be to register this activity on any institutional software that monitors and aggregates research outputs. We use Symplectic to achieve this; it can record both a research publication and, increasingly, properties of the data itself.
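
Steps 3 to 6 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names follow the JSON form of the DataCite metadata schema as I understand it, both DOIs and the creator name are hypothetical, and the helper names make_data_record and link_to_article are invented here and are not part of any repository's software.

```python
from copy import deepcopy


def make_data_record(doi, title, creators, year, publisher):
    """Assemble a minimal metadata record for one data package (step 3)."""
    return {
        "doi": doi,
        "titles": [{"title": title}],
        "creators": [{"name": name} for name in creators],
        "publicationYear": year,
        "publisher": publisher,
        "types": {"resourceTypeGeneral": "Dataset"},
        "relatedIdentifiers": [],  # filled in once the article has a DOI
    }


def link_to_article(record, article_doi, relation="IsSupplementTo"):
    """Return a copy of the record that also points at the article (step 6)."""
    updated = deepcopy(record)  # leave the original record untouched
    link = {
        "relatedIdentifier": article_doi,
        "relatedIdentifierType": "DOI",
        "relationType": relation,
    }
    # Adding the same link twice should not create a duplicate entry.
    if link not in updated["relatedIdentifiers"]:
        updated["relatedIdentifiers"].append(link)
    return updated


# Hypothetical values throughout; none of these identifiers is real.
record = make_data_record(
    doi="10.5072/example-data",
    title="Example computed structure",
    creators=["A. Researcher"],
    year=2015,
    publisher="Example data repository",
)
linked = link_to_article(record, "10.5072/example-article")
```

The data record is created first, with an empty list of related identifiers, because the article DOI does not yet exist at the time the data is archived. The link is added retrospectively, which is the timing issue raised in item 7.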

By now you might be asking where you could explore further, and perchance even try things out.

  1. zenodo.org/features is one good place to start; it costs nothing and there is (within reason) no limit on how much data can be archived. Zenodo also allows data to be retrieved from Dropbox and GitHub (for code) for archival.
  2. figshare.com allows you to sign up for free, but with limitations to the total data storage unless you upgrade to an institutional or paid account.
  3. www.datadryad.org/pages/faq is another option; Dryad charges $80-90 per deposition.
  4. Institutional data repositories. The notes above are based on almost nine years of experience with a local data repository we call SPECTRa,[1] where some 230,000 individual data packages are now archived. This one[11] dates from 2007, which illustrates its longevity. Unfortunately, only members of Imperial College can make use of it.

Having written all this down, I realise that it is somewhat longer than I was expecting, and that this very length may well put some researchers off. Apart from RDM now being mandatory in the UK, it is also reasonable for researchers to ask “what is in it for me?” as a reward for persisting. I can only answer that one from my personal experiences:

  • The live data store (or uportal as we call it) has proved invaluable for recording our (computational) experiments. I often use it to track down calculations from years ago. As a laboratory notebook it is minimalist, as is the learning curve, and hence it does not overwhelm. If more information is needed, one simply goes to the DOI recorded there for each experiment if archived, or the original inputs and outputs if not.
  • Assigning a DOI to a data package makes it really easy to share this with both collaborators and other researchers who express interest (the data is often too large to send by email).
  • Sometimes I use e.g. search.labs.datacite.org/help/examples to search the metadata created during the process in order to find (F) and access (A) old data, which is then very quickly amenable to re-use (R). OK, SciFinder or Reaxys it is not (yet!), but it is getting there.
  • One can get access statistics for the data. If you click on the link, you can see some datasets have been accessed more than 200 times. Someone must be finding them valuable! If you want to find out how much (UK) data is searchable in this manner, click here. Perhaps such statistics may even help get you promoted one day!
  • Having data available in this way enables one to construct more interesting tables or figures. This “figable” (yes, it's both a table and a figure) comes from a recent publication of ours.[12] It retrieves the data purely by its DOI and inserts it into display software (JSmol) to construct an instant molecular model. One can also use this approach for lecture notes and labs,[13] for blogs as here, and (if you are very brave) for research presentations.
  • Google Scholar detects data and citations to it equally with journal articles. This is part of my profile there, and there you can see both articles AND data. If you are keen-eyed, you will however note that the data does not contribute to my h-index (but arguably, it is more valuable to have some data sets accessed 200+ times rather than to be cited!).
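
Retrieving information purely from a DOI, as several of the points above rely on, can be sketched as follows. This is a hedged illustration rather than the mechanism our own pages use: it assumes the doi.org resolver honours content negotiation for the CSL JSON media type, the function names are invented here, and the sample record at the end is hand-written for illustration and not real retrieved metadata.

```python
import json
import urllib.parse
import urllib.request

# Media type for citation metadata (assumed to be supported by the resolver).
CSL_JSON = "application/vnd.citationstyles.csl+json"


def build_request(doi):
    """Prepare a request asking the DOI resolver for metadata, not a web page."""
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": CSL_JSON})


def fetch_metadata(doi, timeout=30):
    """Resolve the DOI over the network and decode the JSON reply."""
    with urllib.request.urlopen(build_request(doi), timeout=timeout) as reply:
        return json.load(reply)


def summarise(meta):
    """Reduce a CSL JSON record to a one-line 'authors (year): title' string."""
    names = [a.get("family") or a.get("literal", "?")
             for a in meta.get("author", [])]
    parts = meta.get("issued", {}).get("date-parts", [[None]])
    year = parts[0][0] if parts and parts[0] else None
    return f"{', '.join(names)} ({year}): {meta.get('title')}"


# Hand-written sample record (illustrative only, not fetched from anywhere).
sample = {
    "author": [{"family": "Rzepa", "given": "Henry S."}],
    "issued": {"date-parts": [[2015]]},
    "title": "Example dataset",
}
print(summarise(sample))
```

To try it against a live record, one would call something like summarise(fetch_metadata("10.5281/zenodo.19949")), using the DOI from reference 3; this needs network access and I have not shown its output here.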

Some selected use-case examples can be viewed,[14] along with one specific to computational chemistry[15].

References

  1. J. Downing, P. Murray-Rust, A.P. Tonge, P. Morgan, H.S. Rzepa, F. Cotterill, N. Day, and M.J. Harvey, "SPECTRa: The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories", Journal of Chemical Information and Modeling, vol. 48, pp. 1571-1581, 2008. https://doi.org/10.1021/ci7004737
  2. M.J. Harvey, N.J. Mason, and H.S. Rzepa, "Digital Data Repositories in Chemistry and Their Integration with Journals and Electronic Notebooks", Journal of Chemical Information and Modeling, vol. 54, pp. 2627-2635, 2014. https://doi.org/10.1021/ci500302p
  3. H.S. Rzepa, "Reproducibility In Science: Calculated Kinetic Isotope Effects For Cyclopropyl Carbonyl Radical", 2015. https://doi.org/10.5281/zenodo.19949
  4. A. Jana, V. Huch, H.S. Rzepa, and D. Scheschkewitz, "CCDC 977840: Experimental Crystal Structure Determination", 2014. https://doi.org/10.5517/cc11tj7m
  5. H.S. Rzepa, F.L. Cherblanc, W.A. Herrebout, P. Bultinck, M.J. Fuchter, and Y.-P. Lo, "Mechanistic and chiroptical studies on the desulfurization of epidithiodioxopiperazines reveal universal retention of configuration at the bridgehead carbon atoms", 2013. https://doi.org/10.6084/m9.figshare.777773
  6. S. Gülten, "Bis dihydropyrimidine", ChemSpider Synthetic Pages, 2011. https://doi.org/10.1039/sp501
  7. R. Ramakrishnan, P.O. Dral, M. Rupp, and O.A. von Lilienfeld, "Quantum chemistry structures and properties of 134 kilo molecules", 2014. https://doi.org/10.6084/m9.figshare.978904
  8. H.S. Rzepa, "C 8 H 8 B 2", 2015. https://doi.org/10.14469/ch/191378
  9. Y. Zhang, H.S. Rzepa, J.J.P. Stewart, P. Murray-Rust, M.J. Harvey, N. Mason, A. McLean, and Imperial College High Performance Computing Service, "Revised Cambridge NCI database", 2014. https://doi.org/10.14469/ch/2
  10. S. Clifford and M.J. Harvey, "hpc-portal: Public release", 2015. https://doi.org/10.5281/zenodo.19174
  11. H.S. Rzepa, "C 7 H 10 Br 1 1", 2007. https://doi.org/10.14469/ch/46
  12. H.S. Rzepa, A.V. Shernyukov, G.E. Salnikov, V.G. Shubin, and A.M. Genaev, "Noncatalytic Bromination of Benzene: A Combined Computational and Experimental Study", 2015. https://doi.org/10.6084/m9.figshare.1299202
  13. K.K. Hii, H.S. Rzepa, and E.H. Smith, "Asymmetric Epoxidation: A Twinned Laboratory and Molecular Modeling Experiment for Upper-Level Organic Chemistry Students", Journal of Chemical Education, vol. 92, pp. 1385-1389, 2015. https://doi.org/10.1021/ed500398e
  14. M. Addis, "RDM workflows and integrations for HEIs using hosted services", figshare, 2015. https://doi.org/10.6084/m9.figshare.1476832
  15. M. Addis, and H.S. Rzepa, "Use of DOIs in data publishing in Computational Chemistry at Imperial College London", 2015. https://doi.org/10.6084/m9.figshare.1477994