data mining « Henry Rzepa's blog

Posts Tagged ‘data mining’

How does an OH or NH group approach an aromatic ring to hydrogen bond with its π-face?

Wednesday, June 22nd, 2016

I previously used data mining of crystal structures to explore the directing influence of substituents on aromatic and heteroaromatic rings. Here I explore, quite literally, a different angle to the hydrogen bonding interactions between a benzene ring and OH or NH groups.

aromatic-pi-query

I start by defining a benzene ring with a centroid. The distance is from that centroid to the H atom of an OH or NH group and the angle is C-centroid-H. To limit the search to an approach of the OH or NH group more or less orthogonal to the ring, the absolute value of the torsion between the centroid-H vector and the ring C-C vector is constrained to lie between 70° and 100° (the other constraints being no disorder, no errors, T < 140 K and R < 0.05).[1]
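
As a rough illustration of the three geometric quantities in this query, here is a minimal Python sketch. It is not the search software itself. The helper names, the idealised planar ring (C-C 1.39Å) and the H position are all assumptions made for the example, as is the reading of the torsion as H-centroid-C-C.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def centroid(points):
    """Unweighted mean position of the ring atoms."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def angle(a, b, c):
    """Angle a-b-c in degrees, with b at the vertex."""
    u, v = sub(a, b), sub(c, b)
    cos_theta = dot(u, v) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

def torsion(p0, p1, p2, p3):
    """Signed torsion p0-p1-p2-p3 in degrees (only its magnitude is used below)."""
    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)
    b1_unit = tuple(x / norm(b1) for x in b1)
    m1 = cross(n1, b1_unit)
    return math.degrees(math.atan2(dot(m1, n2), dot(n1, n2)))

def measure(ring, h):
    """Distance, C-centroid-H angle and |torsion| for one ring/H pair.

    Assumption: the torsion is taken over H-centroid-C-C using the first
    two ring atoms; the query in the post may pick the atoms differently.
    """
    c = centroid(ring)
    return {
        "distance": norm(sub(h, c)),
        "angle": angle(ring[0], c, h),
        "abs_torsion": abs(torsion(h, c, ring[0], ring[1])),
    }

# Idealised planar hexagon and an H atom placed directly above its centroid
# (illustrative coordinates, not taken from any crystal structure).
ring = [(1.39 * math.cos(math.radians(60 * k)),
         1.39 * math.sin(math.radians(60 * k)), 0.0) for k in range(6)]
h_atom = (0.0, 0.0, 2.5)

m = measure(ring, h_atom)
in_window = 70.0 <= m["abs_torsion"] <= 100.0  # the torsion window quoted above
```

For this idealised geometry the H atom sits on the ring normal, so both the angle and the torsion come out at 90° and the contact falls inside the 70-100° window.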

aromatic-pi-HN-140

The above shows the results for NH groups interacting with the aromatic ring. The maximum distance of 2.8Å is more or less the van der Waals contact distance between a hydrogen and a carbon and, as you can see, the contacts "funnel down" to the centroid at < 2.1Å. The shortest distance[2] is for ammonium tetraphenylborate, which you can view in e.g. spacefill mode here.[3]


The other interesting close contact derives from a protonated pyridine[4], which can in turn be viewed here.[5] The main message from the distribution shown above is that as the distances between the HN and the centroid get shorter, the "trajectory" of approach remains orthogonal to the ring (the angle defined above remains ~90°) and heads towards the centroid of the π-cloud. The hotspot itself (red, ~2.6Å) also lies along this trajectory.

Recollect that when I used such hydrogen bonding to see if crystal structures discriminate between the ortho or meta positions of a ring carrying an electron donating substituent, it was the distance from a HO to the carbon that was measured as the discriminator. So it's a faint surprise to find that with HN, and without the necessary perturbation of an electron donating substituent, the intrinsic preference seems to be for the ring centroid and not any specific carbon atom of the ring.

So how about the OH group? There are in fact rather fewer examples, and so the statistics are a bit less clear-cut. But there is a tantalising suggestion that this time, the trajectory is not ~90° but rather less, implying that the destination is no longer the centroid of the π-cloud but one of the carbon atoms of the ring itself. For those who like to "read between the lines" and spot things that are absent rather than present, you may have asked yourself why I did not use NH probes in my earlier post. Well, it appears that the NH group is less effective at e.g. o/p discrimination than is an OH group.

aromatic-pi-OH-140

I can only speculate as to the origins (real or not) of the difference in behaviour between OH and NH groups towards a phenyl π-face. Perhaps it is simply bias in the CSD database? Or might there be electronic origins? Time to end with that phrase "watch this space".

References

H. Rzepa, "How does an OH or NH group approach an aromatic ring to hydrogen bond with its π-face?", 2016. https://doi.org/10.14469/hpc/673
T. Steiner, and S.A. Mason, "Short N⁺—H...Ph hydrogen bonds in ammonium tetraphenylborate characterized by neutron diffraction", Acta Crystallographica Section B Structural Science, vol. 56, pp. 254-260, 2000. https://doi.org/10.1107/s0108768199012318
T. Steiner, and S.A. Mason, "CCDC 144361: Experimental Crystal Structure Determination", 2000. https://doi.org/10.5517/cc4v6tz
O. Danylyuk, B. Leśniewska, K. Suwinska, N. Matoussi, and A.W. Coleman, "Structural Diversity in the Crystalline Complexes of para-Sulfonato-calix[4]arene with Bipyridinium Derivatives", Crystal Growth & Design, vol. 10, pp. 4542-4549, 2010. https://doi.org/10.1021/cg100831c
O. Danylyuk, B. Lesniewska, K. Suwinska, N. Matoussi, and A.W. Coleman, "CCDC 819118: Experimental Crystal Structure Determination", 2011. https://doi.org/10.5517/ccwhc5w

Tags:10.1021, 10.1107, 10.5517, aromaticity, benzene, Centroid, chemical bonding, data mining, Functional groups, Hydrogen bond, Physical organic chemistry, Pyridine, Simple aromatic rings, Supramolecular chemistry
Posted in Chemical IT, crystal_structure_mining | 3 Comments »

A wider look at π-complex metal-alkene (and alkyne) compounds.

Monday, June 13th, 2016

Previously, I looked at the historic origins of the so-called π-complex theory of metal-alkene complexes. Here I follow this up with some data mining of the crystal structure database for such structures.

Alkene-metal "π-complexes" have what might be called a representational problem; they do not happily fit into the standard Lewis model of using lines connecting atoms to represent electron pairs. Structure 1 was the original representation used by Dewar, intended to convey partial back donation from a filled metal orbital to the empty π* of the alkene. At the other extreme these compounds can be called metallacyclopropanes (2) in which only single bonds feature (these can be thought of as representing full back bonding from metal to alkene and full forward bonding from alkene to metal). Representations 3 and 4 are a more fuzzy blend of these, implying some sort of partial bond order for the metal-carbon bonds. Taken together, they imply that the formal bond order of the C-C bond might vary between single and double. Structures 1 and 2 in particular imply that there might be two distinct ways of arranging the bonding and that π-complexes and metallacyclopropanes might therefore be distinct valence-bond isomers, each potentially capable of separate existence.

Why do these representations matter? Well, I am going to mine the crystal structure database for these species to try to see if there is any evidence for a bimodal distribution in the C-C lengths, perhaps indicating evidence of the isomerism suggested above. Such a structural database is indexed against atom-pair connectivity in the first instance and then bond type; one can specify the following types of bond connecting any two atoms: single, double, triple, quadruple, polymeric, delocalised, pi and any. It is not entirely obvious which if any of these types apply to structure 1 (it is not possible to draw a bond ending at the mid-point of another bond using the Conquest structure editor); the dashed lines in structures 3 and 4 could be classed as delocalised, pi, or most generally any. The search query can be constructed thus, where the two carbons carry R which can be either H or C and all four C-R bonds are specified as acyclic (to try to avoid complications by excluding compounds such as cyclic metallocenes). Because representation 1 cannot be constructed in the editor, I am going to specify that each carbon carries four bonds of any type in the first instance. The torsion specified is defined as R-C-C-M and the full queries can be found deposited here.[1]

If the metallacyclopropane representation 2 is defined with explicit single bonds, one gets only 22 hits (no errors, no disorder, R < 0.1). The distribution of C-C bond lengths is shown below. Already one sees a representational problem emerging. A true metallacyclopropane might be expected to show a C-C single bond length, say > ~1.5Å. But only one or two of these examples actually have this value, the most probable value being ~1.4Å.

Using representation 3, one gets 1861 hits, but as before one sees a maximum at ~1.4Å with a tail reaching to both single and double bond values for the C-C distance.^‡

If the C-C bond is also specified as "any", the hits increase to 3948, but the bond length distribution is still very similar, with no sign of any bimodal distribution.

Such a distribution is however found if the torsions between the R-C bond vector and the C-M bond vector are plotted (for all types of bond). A large number of the complexes have a torsion <90°, which suggests that in fact the substituent R is probably interacting with the metal (even though this would lead to formal cyclicity, specifying R-C as acyclic does not detect this interaction). Could this be masking a bimodal distribution in the C-C lengths?

If the previous search is repeated, but this time specifying that all four torsions must lie in the range 90-180° (the range expected for a "classical" alkene-metal complex, selecting only the top right hand side cluster in the plot above), a reduced set of 1051 hits is obtained, but the monomodal distribution remains.

For this last set, here is a plot of the two C-metal bond lengths, with colour indicating the C-C bond length. It shows that the two C-metal bond lengths are clearly linearly correlated.

One final variation; the atom on either C can only be H or a 4-coordinate (sp³) carbon; 645 hits. Again, a monomodal distribution centered at 1.4Å.
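
The "monomodal versus bimodal" judgement made from these histograms can be mimicked with a few lines of Python. This is only a sketch: the bin width, the `min_count` threshold and both lists of bond lengths are invented for illustration and are not values retrieved from the CSD.

```python
import math

def histogram(values, lo, hi, width):
    """Count values into equal-width bins covering [lo, hi)."""
    n_bins = int(round((hi - lo) / width))
    counts = [0] * n_bins
    for v in values:
        i = math.floor((v - lo) / width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

def peak_bins(counts, min_count=2):
    """Indices of bins strictly higher than both neighbours.

    A deliberately crude test: flat-topped peaks are not detected and
    min_count is a hypothetical noise threshold.
    """
    padded = [0] + list(counts) + [0]
    return [i - 1 for i in range(1, len(padded) - 1)
            if padded[i] >= min_count
            and padded[i] > padded[i - 1]
            and padded[i] > padded[i + 1]]

# Invented C-C distances in Angstrom, chosen only to show the two outcomes.
one_cluster = [1.33, 1.37, 1.38, 1.39, 1.41, 1.42, 1.42, 1.43, 1.44, 1.46, 1.52]
two_clusters = [1.21, 1.22, 1.23, 1.41, 1.42, 1.43]

mono = peak_bins(histogram(one_cluster, 1.10, 1.60, 0.05))
bi = peak_bins(histogram(two_clusters, 1.10, 1.60, 0.05))
```

With these made-up inputs the first list gives a single peak in the 1.40-1.45Å bin, while the second gives two separated peaks. A real analysis would need far more care over bin width and sample size.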

So this foray through metal alkene complexes suggests that there is a continuum between the formal metallacyclopropane with a C-C single bond and the only slightly perturbed alkene-metal complex with a C=C double bond. Whilst this would not prevent any one of these compounds existing as two distinctly different valence-bond isomers, it makes it very unlikely. I had noted in an earlier post that for molecules of the type RX≡XR (X=Si, Ge, Sn, Pb) there was indeed a clear bimodal distribution of the X-X lengths evident in the crystal structures (for a relatively small sample number). The structures 1-4 shown at the start of this post are all simply variations in a continuum and not distinct isomers.

POSTSCRIPT: I noted above the bimodal distribution in compounds involving formal triple bonds. So I repeated the search above for π-complex metal-alkyne complexes. Specifying an acyclic C-R bond, and any for the CC bond type, one gets the following.

There is now a tantalizing suggestion of two clusters, one at 1.3 and another at 1.4Å. The torsional distribution shows that the latter distance appears to be associated with much smaller torsions, whereas the top right cluster is associated with shorter lengths.

If the torsions are restricted to the range 90-180°, then the histogram loses the smaller cluster, and perhaps gains a second cluster at 1.22Å? As I said, all quite tantalizing!

^‡The tail in all the histograms extends into the 1.1-1.3Å region, which seems unreasonable for a carbon where four bonds are specified. This region probably represents errors in the crystallographic analysis or reporting. But who knows, perhaps some very unusual compounds are lurking there!

References

H. Rzepa, "A wider look at the π-complex theory of metal-alkene compounds.", 2016. https://doi.org/10.14469/hpc/642

Tags:alkene, alkene-metal complex, alkyne, Bond length, Carbon–carbon bond, Chemical bond, chemical bonding, Cluster chemistry, Conquest structure editor, Coordination complex, data mining, double bond, editor, filled metal orbital, metal, metal-alkene complexes, metal-alkyne complexes, metal-carbon bonds, Pi backbonding, search query, Structural formula, Transition metal alkyne complex
Posted in crystal_structure_mining | No Comments »

A wider look at chlorine trifluoride: crystal structures and data mining.

Friday, June 10th, 2016

A while ago, I explored how the 3-coordinate halogen compound ClF₃ is conventionally analyzed using VSEPR (valence shell electron pair repulsion theory). Here I (belatedly) look at other such tri-coordinate halogen compounds using known structures gleaned from the crystal structure database (CSD).

The search query specifies 7A as the central atom, defined with just three bonded (non-metallic) atoms. Initially, if no constraint on any cyclicity in the three 7A-NM bonds is made (and with R < 0.1, no errors, no disorder), the following result emerges.

I have plotted the three angle variables using the X/Y axes above and used colour to indicate the third angle (red = ~180°, blue = ~90°). The clusters show that two of the angles are ~90° and only one is ~180°. There is also a set of blue points (~90°) which show a linear correlation and which can be shown to derive from cyclicity, as the plot below reveals when acyclicity is specified for all three NM-7A bonds.

In this distribution, the two clusters for ANG1 or ANG2 of ~180° are small and compact, but the cluster where both ANG1 and ANG2 are ~90° is much more diffuse. Not all of the points in this cluster show as red (ANG3 ~180°); there are a few cyan or blue examples here too, indicating that all three angles are in the range 90-140°. This result does not arise from cyclic constraints.
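
To make the three angle variables concrete, the following Python sketch computes them for a 3-coordinate centre and separates the "two small, one large" pattern from the "all three small" one. The function names, the 160° cut-off and both sets of coordinates are assumptions chosen for illustration; they are idealised geometries, not entries from the CSD.

```python
import math
from itertools import combinations

def angle_at(center, a, b):
    """Angle a-center-b in degrees."""
    u = [x - c for x, c in zip(a, center)]
    v = [x - c for x, c in zip(b, center)]
    len_u = math.sqrt(sum(x * x for x in u))
    len_v = math.sqrt(sum(x * x for x in v))
    cos_theta = sum(p * q for p, q in zip(u, v)) / (len_u * len_v)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

def three_angles(center, ligands):
    """The three ligand-centre-ligand angles, sorted ascending."""
    return sorted(angle_at(center, a, b) for a, b in combinations(ligands, 2))

def classify(angles, linear_min=160.0):
    """Crude split on the largest angle; linear_min is a hypothetical cut-off."""
    return "one large angle" if max(angles) >= linear_min else "all three small"

origin = (0.0, 0.0, 0.0)

# Idealised T-shape: two ligands trans to each other, one perpendicular.
t_shape = three_angles(origin, [(1.7, 0.0, 0.0), (-1.7, 0.0, 0.0), (0.0, 1.6, 0.0)])

# Idealised pyramidal arrangement with mutually perpendicular ligands.
pyramid = three_angles(origin, [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)])

labels = (classify(t_shape), classify(pyramid))
```

The first geometry gives angles of 90°, 90° and 180°, the pattern associated with chlorine trifluoride. The second has no angle near 180° and so would fall among the outliers noted above.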

This wider look at 3-coordinate compounds in group 17 (the halogens) quickly reveals a class of such molecules where all three angles are relatively small. This suggests that a closer look at the bonding in these systems, especially in terms of VSEPR, might be rewarding!

I end with an equivalent search for group 18 (the noble gases). Although the number of examples is small, all show the two small/one large angle so characteristic of chlorine trifluoride itself.

The above is I think a good example of (big?) data mining, where one is searching for patterns, and if lucky spotting patterns that deviate from the norm to investigate the possibility of new chemical phenomena.[1] It is also interesting to speculate upon the origins of why two of the clusters shown above are small and compact and the third is much more diffuse.

References

H.S. Rzepa, "Discovering More Chemical Concepts from 3D Chemical Information Searches of Crystal Structure Databases", Journal of Chemical Education, vol. 93, pp. 550-554, 2015. https://doi.org/10.1021/acs.jchemed.5b00346

Tags:chemical phenomena, data mining, equivalent search, Halogen, search query specifies 7A
Posted in crystal_structure_mining | 1 Comment »

Single Figure (nano)publications, reddit AMAs and other new approaches to research reporting

Wednesday, August 5th, 2015

I recently received two emails each with a subject line new approaches to research reporting. The traditional 350 year-old model of the (scientific) journal is undergoing upheavals at the moment with the introduction of APCs (article processing charges), a refereeing crisis and much more. Some argue that brand new thinking is now required. Here are two such innovations (and I leave you to judge whether that last word should have an appended ?).

To set the scene for the first, I will quote the abstract: “The single figure publication is a novel, efficient format by which to communicate scholarly advances. It will serve as a forerunner of the nano-publication, a modular unit of information critical for machine-driven data aggregation and knowledge integration”.[1] The kernel of this suggestion is (again I quote) “We offer the idea of the micro-publication unit, the single figure publication (SFP), to provide scholars with a real-world, manageable method to inform research.” I was struck by the overlap between this suggestion and the one you may find on many of the posts on this blog, where what I refer to as FAIR Data is assigned a digital object identifier (DOI) and included in the citation lists at the end of the post. The key phrase in the above abstract is machine-driven data aggregation and knowledge, although the article does not really go into any mechanisms for easily achieving this. It is my argument that the act of assigning a DOI carries with it the association that there is machine searchable metadata which can be retrieved and used for the aggregation and knowledge mining. The authors of this article, Do and Mobley, advocate adoption of nanopublications defined by inclusion of just a single figure (notably, not a table of results!) and some accompanying context which they claim would reduce the unit of publication to a more tractable size. This does raise the question of whether science needs more publications (in chemistry alone there are said to be more than a million published each year) or whether we should instead be concentrating our efforts on improving the data side of things by increasing its semantic content and formalising its structures, its preservation and curation. I certainly argue that far too little effort has been poured into these latter activities.

You only have to look at the typical SI (supporting information) associated with many chemistry articles to realise that in many cases they are still hardly fit for purpose. There is one concept introduced by Do and Mobley that also deserves mention. Their nanopublications are structured to be read by machines, not people. They will therefore not be refereed by people (my inference). They do not really discuss how else the quality will be assessed, but of course if you treat their nanopublication as essentially FAIR data, then it does become possible to develop methods of machine refereeing.
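
One way a machine might retrieve the metadata associated with a DOI is HTTP content negotiation against the doi.org resolver. The sketch below only assembles such a request for the DOI of the article quoted above; the function names are hypothetical, and whether a useful record comes back depends on the registration agency behind each DOI, so treat it as an assumption-laden illustration rather than a description of how any particular aggregator works.

```python
import json
import urllib.parse
import urllib.request

DOI_RESOLVER = "https://doi.org/"
# Media type for CSL JSON, one of the formats offered via DOI content negotiation.
CSL_JSON = "application/vnd.citationstyles.csl+json"

def build_metadata_request(doi):
    """Prepare (but do not send) a request for machine-readable metadata."""
    url = DOI_RESOLVER + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": CSL_JSON})

def fetch_metadata(doi, timeout=10.0):
    """Send the request and parse the JSON reply (needs network access)."""
    with urllib.request.urlopen(build_metadata_request(doi), timeout=timeout) as response:
        return json.load(response)

# The DOI of the Do and Mobley article cited in this post.
request = build_metadata_request("10.12688/f1000research.6742.1")
```

Calling `fetch_metadata` with the same DOI would be expected to return a dictionary of fields such as title and authors, which is the raw material for the aggregation discussed above.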

The second email alerted me to an article[2] in the Winnower, a forum that offers a bridge between “traditional scholarly publishing tools to traditional and non-traditional scholarly outputs—because scholarly communication doesn’t just happen in scholarly journals”. Here, the concept of scholarly communication is extended to the New Reddit Journal of Science and introduces the concept pioneered by reddit of the AMA, or “ask me anything” environment. I occasionally publish some of the posts on this blog to the Winnower, receiving in return the increasingly ubiquitous DOI. I have also occasionally quoted these DOIs in articles submitted to conventional chemistry journals. What we see now is the propagation of a Winnower DOI on to e.g. https://www.reddit.com/r/science/ where anyone^† can post a question related to the original research reporting. I must state that I do have some reservations about this. Whilst the majority of traditional scholarly reporting is likely to receive no AMAs (just as a very high proportion of research articles attract few if any citations in other articles over a period of decades), it is also likely that the quality of posted AMAs may turn out to be very low. At which point the original researcher has to make a judgement as to whether to devote any of their increasingly precious and fragmented time to answering them. And if few if any answers are posted in response to an AMA, the system seems unlikely to flourish.

But what we see here are two serious attempts to develop new approaches to research reporting, and no doubt others will emerge. To quote Yogi Berra, the future is not what it used to be.

^†Anyone can also post to this blog to ask similar questions. But note that associating an ORCID with such comments is highly recommended. I do not think that reddit currently supports ORCID, but I would argue if the intent is serious, it certainly should.

References

L. Do, and W. Mobley, "Single Figure Publications: Towards a novel alternative format for scholarly communication", F1000Research, vol. 4, pp. 268, 2015. https://doi.org/10.12688/f1000research.6742.1
RobustTempComparison, and r/Science, "Science AMA Series: Climate models are more accurate than previous evaluations suggest. We are a bunch of scientists and graduate students who recently published a paper demonstrating this, Ask Us Anything!", The Winnower. https://doi.org/10.15200/winn.143871.12809

Tags:10.15200, 143871.12809, Academia, Academic publishing, advocate, Citation, data mining, Digital Object Identifier, Do, Knowledge, knowledge mining, Microattribution, Mobley, original researcher, Peer review, Publishing, scholarly publishing tools, Technology/Internet, the New Reddit Journal, Yogi Berra
Posted in Chemical IT, General | No Comments »

A two-publisher model for the scientific article: narrative+shared data.

Sunday, September 15th, 2013

I do go on rather a lot about enabling or hyper-activating[1] data. So do others[2]. Why is sharing data important?

Reproducibility is a cornerstone in science.
To achieve this, it is important that scientific research be open and transparent.
Openly available research data is central to achieving this. It is estimated that less than 20% of the data collected in chemistry is made available in any open manner.
RCUK (the UK research councils) wish to see increased transparency of publicly funded research and availability of its outputs.^‡

But it’s not all hot air, honestly. Peter Murray-Rust and I had started out on a journey to improve reproducibility, openness and transparency in (inter alia) scientific publishing in 1994. In 2001 we published an example of a data-rich article[3] based on CML, and by 2004 the concept had evolved into something Peter termed a datument[4]. Some forty such have now been crafted.[5]

In 2009, the journal Nature Chemistry was starting up, and I approached them with the idea of an interactive data exploratorium on the premise that a new journal might be receptive to new ways of presenting science. It was accepted and published[6] and was followed in 2010 by a second variation.[7] In both cases, these activated-figures were sent to the journal as part of the submission process, and hosted by them (they still are). You can even access them without a subscription to the journal!

Move on to 2012, when David Scheschkewitz had some very exciting silicon chemistry to report. We collaborated on some computational modelling and sent the resulting article to Nature Chemistry for publication. This included the usual interactive table reporting the modelling and its data. However, it transpired that the production workflows for Nature Chemistry had been streamlined and I was informed that interactive tables could no longer be accepted. This time, we (i.e. the authors) would have to solve the issue of how to host and present the data ourselves.

I was very keen that this table be treated with equal weight to the article itself (citable in its own right) and that it not be downgraded to supporting information (ESI). My objection to ESI is that it is often poorly structured by authors, i.e. it is not prepared in a form which allows the data to be re-used, either by a perceptive human, or a logical machine. As a result it is often given little attention by referees (although bloggers seem to do a far better job) and furthermore can end up being lost behind a pay wall (the two Nature Chem interactive objects noted above can be openly accessed, but only if you know that they exist). So I determined that:

The table should be immediately accessible by non-experts, but not through any convoluted processes of downloading a file, expanding it and finding the correct document within the resulting fileset to view in the correct program, which is how normal ESI is handled.
The table and the data contained within it should be capable of acting as a scientific tool, forming what could be the starting point for a new investigation if appropriate.

To solve this issue, some lateral and quick thinking was needed. The solution was a two-component model in which the original article is treated as a “narrative”, intertwingled with a second, but nevertheless distinct component, the “data”. This data would follow the principles of the Amsterdam Manifesto; it would itself be citable. The two components would become symbiotes (a datument). The narrative[8] could cite this data and the data could back-link to the narrative. The data would inherit trust (i.e. peer review) from that applied to the narrative and the latter would inherit a date stamp and integrity from the data host (in this case Figshare[9]).^*

The data itself can have two layers: a presentation layer[9]^¶, built from a combination of software (Jmol or JSmol for chemistry), which is used to invoke the “raw” data. That data itself is citable[10] (this is just a single example, resident as it happens on a different repository). The reader can choose to use just the presentation layer or the underlying data.

The data object can be embedded in other pages; here it is below. The data sources for this table are themselves citable[11].

What are the advantages of such an approach? (the “what’s in it for me” question often asked by research students and their supervisors)

Each of the components is held in an environment optimised for it and so can be presented to full advantage.
The conventional narrative publisher does not necessarily also have to develop their own infrastructures for handling the data. They can choose to devolve that task to a “data publisher”.
The data publisher (Figshare in this case) makes the data open. One does not need an institutional subscription to access it.
“Added value” for each component can be provided separately. Thus most narrative publishers would not necessarily wish to develop infrastructures for validating data or subsequently mining such “big data”. Indeed data mining of journals is prohibited by many publishers; it simply is either not possible or rendered so administratively difficult as to be impractical.
Whilst a narrative article must clearly exist as a single instance (otherwise the authors would be accused of plagiarism), data can have multiple instances. Indeed, there exist protocols (SWORD) for moving data from one repository to another as the need arises. Publishing the same data in two or more locations is not currently considered plagiarism!
The data component can be published as part of an article or say as part of a PhD thesis. This way, the creator of the data gets the advantages not of a date stamp associated with a narrative citation but of a much earlier stamp associated more closely with the actual creation of the data. That could easily and usefully resolve many disputes about who discovered what first, leaving the other issue of who interpreted what first to the narrative. I should mention that it is perfectly possible to “embargo” the data deposition so that it only becomes public when the narrative does (although you may choose not to do this).
A data deposition cannot be modified, but a new version (which bidirectionally links back to the old one) can be published if say more data is collected at a future date.
A whole infrastructure devoted just to enhancing the cited data can evolve; one that is unlikely to do so if the narrative publishers are the only stakeholders. For example, synthetic procedural data can be tagged using the excellent chemical tagger.
It is relatively simple (=cheap) to build a pre-processor for publishing data, which for a research student can act as an electronic laboratory notebook, holding meta-data about the deposited/published data and the handles (doi) associated with each deposition. I have been using such an environment now for about seven years as the e-notebook for this blog for example. Thus the task of preparing figures and tables for a publication (or a blog post) is greatly facilitated. The same system is also used by research students and undergraduates for their lab work.
I have noted previously how e.g. Google Scholar identifies data citations along with article citations in constructing an individual research profile. A researcher could become known for their published data as well as their published narratives. Indeed, it seems likely that the person who acquires and publishes the data, i.e. the research student, would then get accolades directly rather than them all accruing to their supervisor.
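
Several of the points above (a deposition that cannot be modified, new versions that link in both directions, a date stamp and integrity check from the data host, and back-links from data to narrative) can be pictured as a small data structure. The Python sketch below is purely illustrative: the class names, the `10.9999/example` identifier prefix and the narrative label are invented, and it does not describe how Figshare, DSpace or any other repository is actually implemented.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional

@dataclass(frozen=True)
class Deposit:
    identifier: str      # stands in for a DOI; the values used here are invented
    payload: bytes       # the deposited data, never changed after publication
    sha256: str          # integrity check recorded by the (hypothetical) host
    deposited: datetime  # date stamp recorded by the (hypothetical) host

class Repository:
    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self.deposits: Dict[str, Deposit] = {}
        self.previous: Dict[str, str] = {}        # new version -> old version
        self.next: Dict[str, str] = {}            # old version -> new version
        self.cited_by: Dict[str, List[str]] = {}  # data -> citing narratives

    def publish(self, payload: bytes, supersedes: Optional[str] = None) -> Deposit:
        """Mint a new deposit; existing deposits are never altered."""
        if supersedes is not None and supersedes not in self.deposits:
            raise KeyError(supersedes)
        identifier = f"{self.prefix}/{len(self.deposits) + 1}"
        deposit = Deposit(identifier, payload,
                          hashlib.sha256(payload).hexdigest(),
                          datetime.now(timezone.utc))
        self.deposits[identifier] = deposit
        if supersedes is not None:
            # Record the version link in both directions.
            self.previous[identifier] = supersedes
            self.next[supersedes] = identifier
        return deposit

    def cite(self, data_id: str, narrative_id: str) -> None:
        """Record the back-link from a data deposit to a narrative that cites it."""
        self.cited_by.setdefault(data_id, []).append(narrative_id)

repo = Repository("10.9999/example")
v1 = repo.publish(b"computed energies, first batch")
v2 = repo.publish(b"computed energies, first and second batch",
                  supersedes=v1.identifier)
repo.cite(v2.identifier, "narrative-article")
```

Keeping the version links outside the frozen `Deposit` is one possible design choice: the old payload stays untouched while the repository can still point readers from it to its successor.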

But what can you, gentle reader of this blog, do to help? Well, ask if your institution already has, or plans to create a data repository. It can be local (we use DSpace) or “in-the-cloud” (e.g. Figshare). If not, ask why not! And if you are planning to submit an article for publication in the near future, ponder how you might better share its data.

^‡As first circulated on 28 April, 2011. See http://www.epsrc.ac.uk/about/standards/researchdata/Pages/policyframework.aspx

^†The example given at the start of this post[8] contains only one table processed in this manner; the actual synthetic procedures are still held in more conventional SI.

^*This blog uses the excellent Kcite plugin to manage citations.

^¶The good folks at Figshare were extremely helpful in converting this deposition into an interactive presentation. Thanks guys!

References

O. Casher, G.K. Chandramohan, M.J. Hargreaves, C. Leach, P. Murray-Rust, H.S. Rzepa, R. Sayle, and B.J. Whitaker, "Hyperactive molecules and the World-Wide-Web information system", Journal of the Chemical Society, Perkin Transactions 2, pp. 7, 1995. https://doi.org/10.1039/p29950000007
R. Van Noorden, "Data-sharing: Everything on display", Nature, vol. 500, pp. 243-245, 2013. https://doi.org/10.1038/nj7461-243a
P. Murray-Rust, H.S. Rzepa, and M. Wright, "Development of chemical markup language (CML) as a system for handling complex chemical content", New Journal of Chemistry, vol. 25, pp. 618-634, 2001. https://doi.org/10.1039/b008780g
H.S. Rzepa, "Chemical datuments as scientific enablers", Journal of Cheminformatics, vol. 5, 2013. https://doi.org/10.1186/1758-2946-5-6
H.S. Rzepa, "Transclusions of data into articles", 2013. https://doi.org/10.6084/m9.figshare.797481
H.S. Rzepa, "The importance of being bonded", Nature Chemistry, vol. 1, pp. 510-512, 2009. https://doi.org/10.1038/nchem.373
H.S. Rzepa, "The rational design of helium bonds", Nature Chemistry, vol. 2, pp. 390-393, 2010. https://doi.org/10.1038/nchem.596
M.J. Cowley, V. Huch, H.S. Rzepa, and D. Scheschkewitz, "Equilibrium between a cyclotrisilene and an isolable base adduct of a disilenyl silylene", Nature Chemistry, vol. 5, pp. 876-879, 2013. https://doi.org/10.1038/nchem.1751
D. Scheschkewitz, M.J. Cowley, V. Huch, and H.S. Rzepa, "The Vinylcarbene – Cyclopropene Equilibrium of Silicon: an Isolable Disilenyl Silylene", 2013. https://doi.org/10.6084/m9.figshare.744825
H.S. Rzepa, "Gaussian Job Archive for C60H92Si3", 2012. https://doi.org/10.6084/m9.figshare.96410

Tags:chemical tagger, data mining, datument, David Scheschkewitz, e-notebook, Google, opendata, Peter Murray-Rust, pre-processor, researcher, scientific tool, supervisor, United Kingdom
Posted in Chemical IT, Interesting chemistry | 9 Comments »

Research data and the “h-index”.

Monday, June 24th, 2013

The blog post by Rich Apodaca entitled “The Horrifying Future of Scientific Communication” is very thought provoking and well worth reading. He takes us through disruptive innovation, and how it might impact upon how scientists communicate their knowledge. One solution floated for us to ponder is that “supporting Information, combined with data mining tools, could eliminate most of the need for manuscripts in the first place”. I am going to juxtapose that suggestion with something else I recently discovered.

Someone encouraged me to take a look at Google Scholar. It is one of those resources that, amongst other features, computes an individual’s h-index and i10-index (the former, having gone through its purple patch, is now apparently at the end of the road, at least for chemists). One reason perhaps why proper curation of research data is not high on most chemists’ list of priorities is that it does not contribute to one’s h-index, and particularly one’s prospects of a successful research career. Thus “supporting information (data)” is one of those things, like styling the citations in a research article, that most people probably prepare through gritted teeth (a rather annoying ritual without which a research article cannot be published). So when I inspected my own Google Scholar profile (you can do the same here) I was rather surprised to find, appended to all the regular research articles, a long list of data citations (sic!). Because I have placed much of my own data into a digital repository,^‡ this has opened it up to Google (where don’t they get to nowadays?) for listing (if not actually mining). These citations of themselves actually do not (currently?) contribute to e.g. the h-index, since currently these entries are not attracting citations by others. And that of course is because doing so is not yet an accepted part of the ritual of preparing a scientific article.
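
For readers unfamiliar with the two metrics, here is a short Python sketch using their standard definitions: the h-index is the largest h such that h items each have at least h citations, and the i10-index is the number of items with at least 10 citations. The citation counts are invented for illustration and are not taken from any real profile.

```python
def h_index(citations):
    """Largest h such that h items each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

def i10_index(citations):
    """Number of items cited at least 10 times."""
    return sum(1 for count in citations if count >= 10)

# Invented citation counts: five articles plus three as yet uncited data deposits.
articles = [25, 12, 10, 4, 3]
data_deposits = [0, 0, 0]

h_articles = h_index(articles)
h_with_data = h_index(articles + data_deposits)
i10_with_data = i10_index(articles + data_deposits)
```

Both h-index values are the same here, which is the point made above: listing data entries on a profile changes nothing until those entries start to attract citations of their own.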

Most scientists must now be pondering what the future holds in terms of how they can bring themselves to the attention of others (in a good way) and hence progress their careers. So I will take Rich’s suggestion one step further. Those scientists who create new data in a process called research should firstly curate this data properly (via e.g. a digital repository) and then expect to promote their activity by garnering not only citations for the published narratives (= articles) but also for associated published data. Their success as a researcher would be (in part) judged by both. Who knows, as well as famous published narratives, perhaps we will also rank famous published datasets!

^‡I do the same for the data I use to support many of the posts for this blog.

Tags:data mining, data mining tools, Google, opendata, researcher
Posted in Chemical IT | 2 Comments »
