Archive for the ‘Chemical IT’ Category

Electronic notebooks: a peek into the future?

Tuesday, September 16th, 2014

ELNs (electronic laboratory notebooks) have been around for a long time in chemistry, largely of course because of the needs of the pharmaceutical industry. We did our first extensive evaluation probably at least 15 years ago, and nowadays there are many on the commercial market, with a few more coming from open-source communities. Here I thought I would bring to your attention the potential of an interesting new entrant from the open community.

My very first post on this blog six years ago related to incorporating the Jmol molecular viewer into posts, and it has been a feature of many since. A little more than two years ago, Jmol was recast as JSmol. This became possible because the JavaScript engines built into modern web browsers were finally achieving the sort of performance needed to display molecules (years and years ago, let’s say ~1990, such display required very fancy hardware such as Silicon Graphics workstations). Around the same time, another well-established Java-based molecule sketcher, JME (Java Molecular Editor), also became JavaScript-based. My own interest in this sort of Web-based behaviour actually crystallised last December, when I decided to refactor my own lecture notes into a tablet-friendly format using JSmol, with some questions directed at the formidably excellent Jmol discussion list. One of these related to how students might annotate such lecture notes with chemical sketches and store the results for future study or revision. Otis Rothenberger started exploring various mechanisms for such local storage (using Web browsers), and in the last month or so has found a way of exploiting something called HTML5 local storage, which allows the sort of capacity needed. These three technologies have now come together on Otis’ site, which you can now view as CheMagic Notebook (this might be a .com site, but I believe the concept is very much open).

NB: Together with the Virtual model kit (VMK, itself now part of JSmol), this combination is starting to resemble a very interesting mechanism for creating an immersive lecture-note environment, almost, you might say, a lecture-note ecosystem. I would argue that for the first 30 years of the digital document era, most people preparing lecture notes became mesmerised (distracted?) by the need to print the outcomes with complete fidelity. It is only recently that the focus has turned to “beyond the PDF” (or beyond the PPT) and much richer mechanisms. So now we have lecture notes morphing into an ecosystem where:

  1. the objects themselves can be interactive (3D models, spectra, animations, etc.)
  2. or reference further models and associated data held in digital repositories
  3. or be built from scratch in response to stimulation from peers, tutorials, workshops or lectures (using e.g. VMK or JME)
  4. and such annotations can in effect themselves be spliced into the student’s own copy of these notes,
  5. with the whole being regarded as a running notebook, created from the initial seed of a lecturer’s materials and augmented by the student’s own annotations.

I have focused here on where I started, i.e. refactoring my own lecture notes. But the above concepts could easily morph into, e.g., a research-project notebook, a rebundling into smaller segments which are themselves published into digital repositories (and there assigned their own persistent digital object identifiers), and ultimately a further morphing into scholarly articles submitted to, say, a journal. These could represent a continuum, not discrete (and non-communicating) objects.

So will “lecture notes” actually start to change from their conventional (printable) form into something related to the above? Well, I have not addressed the largest hurdle preventing this: giving the content creators (i.e. the lecturers) the training, the skills and, most importantly, the motivation to start to venture down this pathway. Otis has shown it should be technically possible. Come back and revisit this post in ten years’ time to see what actually did happen!


One molecule, one identifier: Viewing molecular files from a digital repository using metadata standards.

Monday, September 8th, 2014

In the beginning (taken here as prior to ~1980), libraries held five-year printed consolidated indices of molecules, organised by formula or name (Chemical Abstracts); these could occupy about 2 m of shelf space for each five-year period, together with an equivalent set of printed volumes from the Beilstein collection. Those of us who needed to track down information about molecules prior to ~1980 spent many an afternoon (or indeed a whole day) in the libraries thumbing through these weighty volumes. Fast forward to the present, when (closed) commercial databases such as SciFinder, Reaxys and the CCDC offer information online for around 100 million molecules (CAS indicates it has 89,506,154 today, for example). These have been joined by many open databases (e.g. PubChem). All these sources of molecular information have their own way of accessing individual entries, and the wonderful program Jmol (nowadays JSmol) has several of these custom interfaces programmed in. Here I describe some work we have recently done[1] on how one might generalise access to an individual molecule held in what is now called a digital data repository.

Such repositories are gradually becoming more common. Unlike most (all?) of the bespoke molecular repositories noted above, metadata (XML) resource-map standards have been developed[2] for data repositories to enable rich and open searches and to help in the discoverability of individual entries (e.g. OAI-ORE). Each dataset is characterised by a DOI (digital object identifier), just like an individual article in a conventional journal. However, there is an issue in quoting just a conventional DOI to describe a dataset. The DOI points to what is called the article landing page in the journal, a page which, by and large, is meant to be navigated by a human. To get a flavour of how this works (or, more accurately, does not work) for data, visit this DOI[3] for an entry in the CCDC crystal database noted above (and about which I have previously blogged). In essence, a human is needed to complete the requested information in order to proceed to retrieving the data. Data, I contend here, should not need a landing page. It can benefit from being passed straight on to e.g. a visualising program such as JSmol. So a mechanism is needed to encapsulate any bespoke (and potentially changeable) access path to the data by expressing it instead in standard metadata form.
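A minimal sketch of what "no landing page" could look like in practice: DOI registration agencies support HTTP content negotiation, so a program can ask the resolver for machine-readable metadata rather than the human-oriented page. The endpoint and media type below reflect the standard doi.org content-negotiation service as I understand it today, not anything specified in the work cited here.

```python
# Sketch: resolve a DOI to machine-readable metadata rather than a landing page.
import json
import urllib.request

CSL_JSON = "application/vnd.citationstyles.csl+json"

def doi_request(doi):
    """Build a request asking the DOI resolver for citation metadata."""
    return urllib.request.Request("https://doi.org/" + doi,
                                  headers={"Accept": CSL_JSON})

def fetch_doi_metadata(doi):
    """Perform the live call (requires network access)."""
    with urllib.request.urlopen(doi_request(doi)) as resp:
        return json.load(resp)

# e.g. fetch_doi_metadata("10.5517/cc11h55w") for the CCDC entry cited above.
```

The same pattern would let a viewer such as JSmol be handed the data directly, with no human in the loop.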

In our first solution to this issue, and the one illustrated here, we used a standard known as 10320/loc[2]. A datafile need only be specified by its DOI (or more generically, its handle) to be recovered from the data repository; no landing page need be involved (and no human need ponder what next to do with the data).

  1. First, let me reference a molecule (as it happens the one described in the preceding post), using the normal invocation[4]. This will take you to a conventional landing page.
  2. The next example is the same dataset, but this time with the landing page replaced by a JavaScript/JSmol wrapping. This is achieved using a utility which is itself packaged up and placed on a repository (shortdoi: vjj)[5], and which is embedded here for you to try out. If you want the technical detail, read about it here.[1]
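To make the 10320/loc idea concrete, here is a sketch of how a client might consume such an entry via the Handle System's public JSON API (hdl.handle.net/api/handles/<handle>). The sample record is invented for illustration; a real 10320/loc value is a small XML fragment listing alternative locations for the data.

```python
# Sketch: prefer a 10320/loc data location over the ordinary landing-page URL.
import json
import urllib.request

def fetch_handle_record(handle):
    """Live call (requires network): fetch the JSON record for a handle/DOI."""
    url = "https://hdl.handle.net/api/handles/" + handle
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def locations_from_record(record):
    """Prefer 10320/loc values (direct data locations); fall back to the
    plain URL value, which typically points at a landing page."""
    values = record.get("values", [])
    locs = [v["data"]["value"] for v in values if v.get("type") == "10320/loc"]
    return locs or [v["data"]["value"] for v in values if v.get("type") == "URL"]

# Invented example record in the shape the API returns:
record = {"values": [
    {"type": "URL", "data": {"value": "https://example.org/landing-page"}},
    {"type": "10320/loc",
     "data": {"value": "<locations><location href='https://example.org/data.cif'/></locations>"}},
]}
```

A data-aware client would hand the 10320/loc location straight to a visualiser; a human-oriented client would fall back to the landing page.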

There is more to come. But you will have to wait for part 2!

References

  1. M.J. Harvey, N.J. Mason, and H.S. Rzepa, "Digital Data Repositories in Chemistry and Their Integration with Journals and Electronic Notebooks", Journal of Chemical Information and Modeling, vol. 54, pp. 2627-2635, 2014. https://doi.org/10.1021/ci500302p
  2. "DOI Name 10320/loc Values", http://doi.org/10320/loc
  3. A. Jana, I. Omlor, V. Huch, H.S. Rzepa, and D. Scheschkewitz, "CCDC 967887: Experimental Crystal Structure Determination", 2014. https://doi.org/10.5517/cc11h55w
  4. H.S. Rzepa, N. Mason, and M.J. Harvey, "Retrieval and display of Gaussian log files from a digital repository", 2014. https://doi.org/10.6084/m9.figshare.1164282

Data galore! 134 kilomolecules.

Wednesday, August 6th, 2014

I do go on a lot about the importance of having modern access to data. And so the appearance of this article[1] immediately struck me as important. It is, appropriately enough, in the new journal Scientific Data. The data contain computed properties at the B3LYP/6-31G(2df,p) level for 133,885 species with up to nine heavy atoms, and the entire data set has its own DOI[2]. The data were generated by subjecting a molecule set to a number of validation protocols, including obtaining relaxed (optimised) geometries at the B3LYP/6-31G(2df,p) level. It would be good to replicate this set with the inclusion of a functional that also includes dispersion, and of course making the coordinates all available in this manner greatly facilitates this. The collection also includes data for e.g. 6095 constitutional isomers of C7H10O2, which reminds me of an early, delightfully entitled, article adopting such an approach in quantum chemistry[3]. Such collections are an important part of the process of validating computational methods.[4] This way of publishing data does raise some interesting discussion points.

  1. In this case, we have coordinates for 134 kilo molecules, but the individual molecules in this collection do not have formalised metadata. The InChI key is an example of such metadata, and means that entries can be specifically searched. Where you have a monolithic collection of 134k molecules, no such structured exposed metadata exists for individual entries, and you will have to generate it yourself in order to search it.
  2. Each of the molecules in this collection is revealed (once you have downloaded the compressed archive as above and decompressed it into a 548 Mbyte folder) as separate XYZ files. This syntax has the merit of being very simple, and can easily be processed by a human. Computed molecular properties in the form of metadata are missing from this particular (relatively ancient) format. To recover them, you would have to repeat the calculation.
  3. In fact, the XYZ files in this example do seem to have some (unformalised) properties appended to the bottom of the file (the SMILES and InChI strings are recognizably there), as shown in the example below:
    27
    gdb 57483   2.68237 1.10148 0.98017 0.0557  94.95   -0.2958 0.073 ...
    C   -0.0805964233    1.5844710741    0.1983967506   -0.41097
    .........
    29.7376 87.1304 196.1576    216.856 ...
    CC(C)(C)C1CCCC1 CC(C)(C)C1CCCC1 
    InChI=1S/C9H18/c1-9(2,3)8-6-4-5-7-8/h8H,4-7H2,1-3H3 InChI=1S/C9H18/c1-9(2,3)8-6-4-5-7-8/h8H,4-7H2,1-3H3
    

    This of itself does raise some issues.

    1. The title line (starting gdb) has extra numbers, but it is not immediately obvious what these are.
    2. The XYZ file is no longer standard because extra information is appended, both to each atom line (the charge? shown above as -0.41097) and to the bottom. Much software will not recognise this non-standard XYZ file, and is likely to discard the additional information. Thus I tried wxMacMolPlt (a long-time reader of XYZ files) with no success. Human editing of the file was required to remove the additional information before a sensible molecule loaded. Only at this point could one progress to (re)compute the molecular properties.
    3. The extra information is not formally described. As a human I can recognise it as an atom coordinate list with appended charges (I think), to which is appended a list of normal-coordinate harmonic wavenumbers in units of cm-1, and a SMILES and an InChI as separate lines. That is really informed guesswork (a human is very good at such pattern recognition), but I cannot be absolutely certain, and a machine seeing this for the first time would certainly struggle.
    4. The last lines contain repetitions of the SMILES and InChI strings. I am guessing that these represent the connectivity determined before and after geometry optimisation (using quantum mechanics, bonds can indeed break or form during such a process), but I may be quite wrong about that. I have not tried to resolve this issue by actually reading the depths of the article, since the file itself really should carry such information.
    5. The XYZ file itself carries no provenance, such as who created the file, which software and version was used to create it, the date of creation etc.
  4. An alternative approach is the one adopted here on this blog. Each individual molecule is assigned a DOI and its own metadata and provenance. It is presented to the user in a variety of syntactical forms, each designed and adopted for a different purpose. Thus the syntax and semantics of a CML file are clearly defined by a schema, and this format can easily absorb additional information without “breaking the standard”. It too can be scaled to 134 kilo molecules[4], although this does require a suitable container (repository) to handle this scale (and I am not entirely sure that DataCite would approve of the generation of 134 kiloDOIs).
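The "human editing" complained about in point 2 above can of course be automated. Here is a minimal sketch that strips the extra per-atom column and the trailing property lines from one of these extended XYZ files, leaving a standard XYZ that ordinary viewers should accept; the column layout assumed is simply my reading of the sample shown above, and the tiny example file is invented in the same style.

```python
# Sketch: convert an extended XYZ file (extra charge column, trailing
# property lines) back into a standard XYZ file.

def to_standard_xyz(text):
    """Return (standard_xyz, extra_lines) from an extended XYZ string."""
    lines = text.strip().splitlines()
    natoms = int(lines[0].split()[0])
    comment = lines[1]
    atoms = []
    for line in lines[2:2 + natoms]:
        sym, x, y, z = line.split()[:4]   # drop the trailing charge column
        atoms.append(f"{sym} {x} {y} {z}")
    extras = lines[2 + natoms:]           # frequencies, SMILES, InChI, ...
    return "\n".join([str(natoms), comment] + atoms), extras

# A tiny invented example in the same style as the sample above:
raw = """2
gdb 1 157.7 0.0
C 0.0 0.0 0.0 -0.4
O 0.0 0.0 1.2 0.4
2169.8
CO CO
InChI=1S/CO/c1-2 InChI=1S/CO/c1-2
"""
std, extras = to_standard_xyz(raw)
```

The point being that once the layout is formally described, no human needs to edit 133,885 files by hand.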

Overall, this sort of data publication must be warmly welcomed by the community, and I do hope that more chemistry data is routinely made available in an appropriate manner. The presentation in ready-to-reuse form will no doubt improve as the value of such data becomes more fully appreciated. And ultimately, humans need to be excluded from much of this process (editing 133,885 sets of XYZ coordinates as described above is not for humans to do).


‡Your computer however might balk at opening a folder with 133,885 items in it. Try this only on a very fast machine with lots of memory and ideally an SSD!


References

  1. R. Ramakrishnan, P.O. Dral, M. Rupp, and O.A. von Lilienfeld, "Quantum chemistry structures and properties of 134 kilo molecules", Scientific Data, vol. 1, 2014. https://doi.org/10.1038/sdata.2014.22
  2. R. Ramakrishnan, P.O. Dral, M. Rupp, and O.A. von Lilienfeld, "Quantum chemistry structures and properties of 134 kilo molecules", 2014. https://doi.org/10.6084/m9.figshare.978904
  3. P.P. Bera, K.W. Sattelmeyer, M. Saunders, H.F. Schaefer, and P.V.R. Schleyer, "Mindless Chemistry", The Journal of Physical Chemistry A, vol. 110, pp. 4287-4290, 2006. https://doi.org/10.1021/jp057107z
  4. P. Murray-Rust, H.S. Rzepa, J.J.P. Stewart, and Y. Zhang, "A global resource for computational chemistry", Journal of Molecular Modeling, vol. 11, pp. 532-541, 2005. https://doi.org/10.1007/s00894-005-0278-1

Data nightmares: B40 and counting its π-electrons

Saturday, July 19th, 2014

Whilst clusters of carbon atoms are well known, my eye was caught by a recent article describing the detection of a cluster of boron atoms, B40 to be specific.[1] My interest was in how the σ- and π-electrons were partitioned. In a C40 cluster, one can reliably predict that each carbon would contribute precisely one π-electron. But boron, being more electropositive, does not always play like that. Having one electron fewer per atom, one might imagine that a fullerene-like boron cluster would have no π-electrons. But the element has a propensity[2] to promote its σ-electrons into the π-manifold, leaving a σ-hole. So how many π-electrons does B40 have? These sorts of clusters are difficult to build using regular structure editors, and so coordinates are essential. The starting point for a set of coordinates with which to compute a wavefunction was the supporting information. Here is the relevant page: [image]. The coordinates are certainly there (that is not always the case), but you have to know a few tricks to make them usable.

  1. Open Adobe Reader, select the coordinates and copy them.
  2. Paste into any application which recognises text. I used an old stalwart on the Mac, BBEdit. It is reliable!
  3. But no, it produces a row of skull-and-crossbones characters (the authors of the program clearly have a sense of humour). [image]
  4. Thinking that BBEdit might have let me down (for the first time), I tried Word. A little less humour, but the same result. [image]
  5. There are lots of web sites out there that claim to convert PDF files directly to Word files. Again, no luck; the coordinates are now entirely missing! [image]
  6. Right, time for the big guns. Adobe Acrobat XI converts .PDF to .DOC, and (if you jump through a lot of hoops to register etc.) they even give you a 30-day trial. Well, at least it gives numbers. But notice that the line breaks are missing, and all the numbers flow from one line to another. [image]
  7. Another copy/paste from Word to BBEdit, and now I have all the numbers; adding 40 line breaks is all that is needed (there is sometimes some skill in knowing where to add them, by the way). The time taken from step 1 to step 7 was about 90 minutes (including a necessary cup of tea to recover from steps 1-5, and the realisation that the time was not wasted, since I could blog the experience!).
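Step 7 itself can be automated. A minimal sketch, assuming each atom appears as an element symbol followed by three reals, that re-inserts the line breaks lost in the PDF-to-DOC conversion (the run-on string below is invented for illustration):

```python
# Sketch: split a run-on coordinate string back into one line per atom.
import re

ATOM = re.compile(r"([A-Z][a-z]?)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)")

def rebreak_coordinates(blob):
    """Emit one 'Element x y z' line per match found in the blob."""
    return "\n".join(" ".join(m.groups()) for m in ATOM.finditer(blob))

# Invented two-atom run-on string of the kind Acrobat produced:
blob = "B 0.123456 -1.234567 0.000001 B -0.654321 2.345678 -0.000002"
```

This is where the "skill in knowing where to add them" lives: the regular expression encodes the assumption about what an atom line looks like.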

Well, I am sure you know what is coming next: my usual rant about how little most chemists truly value data, and particularly its integrity and its semantics. And how little almost all journals understand data. Notice that the original article was published in Nature Chemistry. Note also a new journal from that stable, Scientific Data. The journal clearly thinks there is mileage in receiving scholarly articles about scientific data, and what they call data descriptors (they even got me to write a data descriptor a year or so back). It’s a shame then that the same publisher allowed the decimation of the core data related to an article about B40.

They have a widely read blog, perhaps they can comment?

One more point to make about data: a phrase has recently been coined: deposition with recognition. Here, I show how my own data has been recognised:

There are various other ways as well, and perhaps I will leave those to another post. To return to the chemistry (where we should have been at the start): I ran the calculation (B3LYP+D3/TZVP) and published the newly enhanced data, citing it in the usual way.[3],[4] To answer my question: for the D2d geometry, B40 has 24 π-electrons (there is some ambiguity; it could be 26). On average, each boron retains only ~0.65 s-electrons, balanced by ~2.35 p-electrons. The most stable π-pair is shown below. At the centre of the ring is a strongly diatropic ring current (NICS = -42 ppm)[5], suggesting aromaticity (26 electrons = 4n+2).

[image: the most stable π-pair of B40]

I conclude by pondering whether the properties of any such boron cluster may in time prove to be directly related to the number of σ-to-π promotions.


Sadly, line breaks in lists of atom coordinates date back to an era of about 50 years ago, when text files were first treated differently from binary files. Three different “standards” emerged for specifying a line break in a text file (DOS, Mac and Unix), and there has been much confusion ever since when moving these text files across operating systems. The modern way of doing it is to make line breaks redundant by instead marking up the file. The standard chemical markup, invented in 1996 and formally published in 1999[6], is CML. You will find such CML coordinates in the deposited data from this calculation.[3] You will not have any problems with line breaks!
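For completeness, the three historical conventions can be flattened in two passes, replacing DOS `\r\n` first and then any bare classic-Mac `\r`:

```python
# Sketch: normalise DOS (\r\n) and classic Mac (\r) line breaks to Unix (\n).

def normalize_newlines(text):
    """The \r\n replacement must run first, or DOS files gain blank lines."""
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

The order of the two replacements is the whole trick; done the other way round, every DOS line break would become two Unix ones.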

Publication assigns a DataCite DOI. This takes about 48 hours to propagate to CrossRef, which is here used by the KCite WordPress plugin to retrieve the metadata and compose a citation. If KCite queries CrossRef before the metadata has propagated, it does not generate a citation. If you are reading this and see no citation, please revisit after 48 hours have elapsed.

The diatropicity is inverted to paratropicity (NICS = +28 ppm) when two electrons are removed to create the dication.[7] This inversion is normally a good test of aromaticity/antiaromaticity.


References

  1. H. Zhai, Y. Zhao, W. Li, Q. Chen, H. Bai, H. Hu, Z.A. Piazza, W. Tian, H. Lu, Y. Wu, Y. Mu, G. Wei, Z. Liu, J. Li, S. Li, and L. Wang, "Observation of an all-boron fullerene", Nature Chemistry, vol. 6, pp. 727-731, 2014. https://doi.org/10.1038/nchem.1999
  2. H.S. Rzepa, "The distortivity of π-electrons in conjugated boron rings", Physical Chemistry Chemical Physics, vol. 11, pp. 10042, 2009. https://doi.org/10.1039/b911817a
  3. H.S. Rzepa, "Gaussian Job Archive for B40", 2014. https://doi.org/10.6084/m9.figshare.1111454
  4. H.S. Rzepa, "B 40", 2014. https://doi.org/10.14469/ch/24884
  5. H.S. Rzepa, "Gaussian Job Archive for B40", 2014. https://doi.org/10.6084/m9.figshare.1111518
  6. P. Murray-Rust, and H.S. Rzepa, "Chemical Markup, XML, and the Worldwide Web. 1. Basic Principles", Journal of Chemical Information and Computer Sciences, vol. 39, pp. 928-942, 1999. https://doi.org/10.1021/ci990052b
  7. H.S. Rzepa, "Gaussian Job Archive for B40(2+)", 2014. https://doi.org/10.6084/m9.figshare.1111534

The price of information: Evaluating big deal journal bundles

Thursday, July 3rd, 2014

Increasingly, our access to scientific information is becoming a research topic in itself. Thus an analysis of big deal journal bundles[1] has attracted much interesting commentary (including one from a large scientific publisher[2]). In the UK, our funding councils have been pro-active in promoting the so-called GOLD publishing model, where the authors (aided by grants from their own institution or others) pay the perpetual up-front publication costs (more precisely the costs demanded by the publishers, which is not necessarily the same thing) so that their article is removed from the normal subscription pay wall erected by the publisher and becomes accessible to anyone. As the proportion of GOLD content increases, it was anticipated (hoped?) that the costs of accessing the remaining non-GOLD articles via a pay-walled subscription would decrease.

But, as was shown[1], the publishers have hitherto arranged for the prices of these subscriptions to be covered by non-disclosure clauses, which makes it quite difficult for us (the readers of these journals, and of course the main sources of their content as well) to find out if this model is actually (starting) to work. Certainly, the entire system does not yet appear to be in any sort of steady-state equilibrium; perhaps it never will achieve this in the current model. For example, although extra funds have been made available to promote GOLD publishing, these cover only a small fraction of the total output of a typical research university. One could respond to this in several ways:

  1. Find the missing funds from somewhere else, which probably means less money for the research itself. This of course is the model that maintains or increases a publisher’s incomes.
  2. Decrease the costs of GOLD publishing. Currently a typical article processing charge ranges from £500-5000, depending on the prestige of the journal. Is it beyond the realm of possibility that this range could change to e.g. £50-500?
  3. Simply persuade everyone to publish less. Perhaps ten times less? Every group might be restricted to one or two block-buster articles a year, and the rest of their output goes into open repositories? Or indeed into blogs! These two options of course are unlikely to increase publishers’ incomes.

Well, after 350 years of scientific publishing, we appear to have arrived at a critical point. A crossroads, if you like. But who should be in charge of deciding what direction is now taken? Should it not be the very people who create and then “consume” scientific information and knowledge?

References

  1. T.C. Bergstrom, P.N. Courant, R.P. McAfee, and M.A. Williams, "Evaluating big deal journal bundles", Proceedings of the National Academy of Sciences, vol. 111, pp. 9425-9430, 2014. https://doi.org/10.1073/pnas.1403006111
  2. C. Woolston, "Secret publishing deals exposed", Nature, vol. 510, p. 447, 2014. https://doi.org/10.1038/510447f

Test of JSmol in WordPress: the background story.

Sunday, June 8th, 2014

A word of explanation about this test page for experimenting with JSmol. Many moons ago I posted about how to include a generated 3D molecular model in a blog post, and I have used that method on many posts here ever since. It relied on Java, first introduced in 1996, or almost 20 years ago, as the underlying software. Like most software technologies, much has changed, and Java itself (as a compiled language) has had to move to improve its underlying security. In the last year, the Java code itself (in this case Jmol) has needed to be digitally signed in a standard manner, and this has meant that many an old site that used unsigned older versions has started to throw up increasingly alarming messages.

To continue to experience the intended effect of e.g. Jmol, the user in turn has had to accept or tinker with their local Java settings; this has indeed become increasingly intrusive, and less experienced users often do not wish to engage with any of this activity. About two years ago, the Jmol community started having concerted discussions about what to do regarding Java, and they also started to converge with other developers and communities on a solution based on JavaScript (which, despite the name, operates in an entirely different way from Java). Some of this early activity I tried to capture in a datument written during the summer of 2012[1]. The magnitude of the problem was considerable: how to refactor tens of thousands of lines of Java code into JavaScript. The story of how this miracle was accomplished must be written by people like Bob Hanson and Takanori Nakane, and perhaps some day it will be. However, inserting all this wonderful technology into e.g. a WordPress blog still needed doing, and this task was undertaken by Jim Hu (there are many others who are part of this effort, and they all need to be thanked).

I volunteered to test, but so that Jim could see the effects of this testing, this (public) test page was created. Behind the scenes, the bugs have been winkled out, although much still remains to be done. This page will no doubt continue to evolve as this is done, and when it all works, I will no doubt add a postscript. So keep watching this space. It has two examples, each of which should produce a box with a molecule, as per this page.

  1. [jsmol pdb='1PRC' caption='Load 1PRC' commands='' id='a1' debug='true']
  2. [jsmol caption='Load local file' fileurl='http://rzepa.net/blog/wp-content/uploads/2014/06/test2.pdb' id='a2' commands='=spacefill 23%;wireframe 0.15;color cpk;' debug='true']

References

  1. H.S. Rzepa, "Chemical datuments as scientific enablers", Journal of Cheminformatics, vol. 5, 2013. https://doi.org/10.1186/1758-2946-5-6

A newcomer in the game of how we find and use data.

Saturday, May 17th, 2014

I remember a time when tracking down a particular property of a specified molecule was an all-day effort, spent in the central library (or further afield). Then came the likes of STN Online (~1980) and later Beilstein, but only if your institution had a subscription. Let me then cut to the chase: consider this URL: http://search.datacite.org/ui?q=InChIKey%3DLQPOSWKBQVCBKS-PGMHMLKASA-N The site is DataCite, which collects metadata about cited data! Most of that data is open, in the sense that it can be retrieved without a subscription (but see here that it is not always made easy to do so). So the above is a search for cited data which contains the InChIKey LQPOSWKBQVCBKS-PGMHMLKASA-N. This produces the result:
[image: the DataCite search result]
This tells you who published the data (but, oddly, its date is given merely to the nearest year? It is beta software after all). The advanced equivalent of this search looks like this:

[image: the DataCite advanced search]

where the subject of the search is now the InChIKey. If you are familiar with the various molecular search engines, you will appreciate that this generic data search is still fairly primitive. But SEO (search-engine optimisation), achieved by improving the quality of the metadata, would help to improve that experience.
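The search shown above can of course be composed programmatically. Only the query-string construction is sketched here; the endpoint is the beta UI quoted in the text, and a live query would simply fetch the resulting URL.

```python
# Sketch: build a DataCite search URL for data carrying a given InChIKey.
from urllib.parse import urlencode

def datacite_query_url(inchikey):
    """Compose the q=InChIKey=<key> query against the DataCite search UI."""
    return "http://search.datacite.org/ui?" + urlencode({"q": "InChIKey=" + inchikey})
```

A software agent of the kind discussed below could loop such queries over a whole list of InChIKeys.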

The important thing about DataCite is that it searches only the metacontent of digital repositories, wherein one may expect to find properly curated data, and in particular the possibility of finding not merely highly processed data but also the original (instrumental or computational) datafile from which the metadata was abstracted. Rather than a visual graph, one might expect to also find the original data (to however many decimal places). Rather than just molecular coordinates, one might also find a full wavefunction describing the electron-density distribution, or a full spectral analysis; in the original form as deposited by researchers, and not in a processed form as supplied by an “added value” resource. Don’t get me wrong; validated data is wonderful, but validation has to be done according to a schema, and such schemas change, improve and evolve over time.

The other important point I think which the above introduces is the concept that DataCite (and similar organisations) might act as a portal, through which software agents might act to validate/aggregate data. The utopian world would be that every organisation that produces data captures it in a form that DataCite and others can find. Unless of course the data is in itself also their business model, and they wish to exert a monopoly over it. One might appreciate monopolies if the alternative is not having access to the data at all, but perhaps at the expense of innovation? I cannot help but feel that once data citation as shown above becomes a generally accepted best practice amongst scientists, then entirely new ways of adding value to it will emerge in abundance. It would be interesting to see whether the current more monopolistic models survive this transition by upping their own game.

Disambiguation/provenance of claimed scientific opinion and research.

Monday, May 5th, 2014

My name is displayed pretty prominently on this blog, but it is not always easy to find out who the real person is behind many a blog. In science, I am troubled by such anonymity. Well, a new era is about to hit us. When you come across an Internet resource, or an opinion/review of some scientific topic, I argue here that you should immediately ask: “what is its provenance?”

In the 350-year history of scientific dissemination[1], provenance has almost always been provided by publishers. Arguably, that (together with arranging anonymous peer review) was their most important role. Not that they ever met with their authors, or always established that a real person or a real group actually existed! But with the explosion of vanity publication, and a host of horror stories about articles for sale to authors keen to have a publication to their name, perhaps the role of provenance needs rethinking.

ORCiD is a project that seems to be gaining serious momentum in achieving a mechanism for the disambiguation and provenance of researchers. Thus Brian Kelly (who has played an important role in the modern internet in the UK since 1993 or earlier) encourages all researchers to sign up (although I cannot help noting, rather cheekily, that he does not add his own ORCiD as provenance for his blog). ResearcherID was in fact an earlier organisation to offer such a service, but it is run by a commercial publisher and it is hosted at a “.com”. ORCiD at least claims to be an open (.org)anisation, and carries an open-source license. It seems that some UK universities (home to some researchers) have decided to sign up to ORCiD, and most, I suspect, are planning to deploy these resources amongst their researchers, and quite possibly their students as well (postgraduate initially, maybe even undergraduate eventually).

I jumped the gun somewhat, getting mine more than a year ago. Better the devil you know, etc., etc.! It is orcid.org/0000-0002-8635-8390. What happened next? Well, I publish data at Figshare, who themselves signed up to be an early member of ORCiD. This gives them access to the API (application programming interface), and so by supplying my ORCiD to Figshare, I gain access by proxy to the ORCiD features on offer. The most immediate impact is that ORCiD lists all the data objects I have published at Figshare, thus establishing a trust between them and my ORCiD identity. Mind you, no-one at ORCiD has ever met me, or checked on who I am. I think that task is going to be delegated eventually to e.g. my university (I am not absolutely certain how the linkage between my ORCiD and my employer, who clearly know me since they pay my salary, will be formalised). Because my employer has also now become an ORCiD member, we will be adding ORCiD API access to our own SPECTRa-DSpace data repository shortly, so that the data held there will also be added to my ORCiD lists.
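As a hedged sketch of what such "ORCiD API access" can yield: given a works summary from ORCiD's public API (pub.orcid.org), one can collect the DOIs it lists. The JSON shape assumed below is my reading of that API, and the sample record is invented; the version path in the URL may well differ from what was current when this was written.

```python
# Sketch: pull DOIs out of an ORCiD public works summary.
import json
import urllib.request

def fetch_works(orcid_id):
    """Live call (requires network): fetch a public works summary as JSON."""
    req = urllib.request.Request("https://pub.orcid.org/v3.0/%s/works" % orcid_id,
                                 headers={"Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def dois_from_works(works):
    """Collect DOI-type external identifiers from a works summary."""
    dois = []
    for group in works.get("group", []):
        for eid in group.get("external-ids", {}).get("external-id", []):
            if eid.get("external-id-type") == "doi":
                dois.append(eid.get("external-id-value"))
    return dois

# Invented sample record in the assumed shape:
sample = {"group": [{"external-ids": {"external-id": [
    {"external-id-type": "doi", "external-id-value": "10.6084/m9.figshare.1164282"}]}}]}
```

This is the kind of linkage, identity to DOI-identified outputs, that the paragraph above anticipates.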

And as the major journal publishers start to do the same, a formal linkage between my identity (perhaps as verified by my employer), journal-published articles (narratives) and my data publications (via the identifiers known as DOIs) will come into being.

How, you might reasonably ask, is any of this useful? In truth, I am not sure anyone really knows exactly where it is heading. For example, impactstory.org/about is one added-value site which attempts to gather altmetrics about the impact your research is having. But although the preceding link tells you who founded the organisation, you do not get the kind of provenance I am describing above; none of the founders cite their ORCiDs! You do get their @Twitter accounts though; I wonder what that tells us about the modern interpretation of provenance? Well, my impact can be seen here; in truth it is not quite the impact I imagined my scientific career was having, but I suppose these are early days. What I am pleased to tell you is that ImpactStory reports the impact not only of the articles I have published, but also of the data. Two data sets are described as both discussed and highly viewed; although, as usual, you do not get to learn why the data is being discussed!

Where next? Well, to go back to the start of this post; blogs. It would be nice to formally link this blog to my ORCiD ID (this is not done simply by quoting it here, but via the ORCiD API). If/when I work out how to do this, I will no doubt post the event!

References

  1. H. Oldenburg, "Epistle dedicatory", Philosophical Transactions of the Royal Society of London, vol. 1, pp. i-ii, 1665. https://doi.org/10.1098/rstl.1665.0001

Trigonal bipyramidal or square pyramidal: Another ten minute exploration.

Friday, May 2nd, 2014

This is rather cranking the handle, but taking my previous post and altering the search definition of the crystal structure database from 4- to 5-coordinate metals, one gets the following.

Fe …

Co …

Ni …

Cu …

Trigonal bipyramidal coordination has angles of 90, 120 and 180°; square pyramidal has no 120° angles, and its 180° angles may be somewhat reduced. Thus the Fe and Co series show plenty of 120° angles, whereas the Ni and Cu series show hardly any; the Ni series instead has many values around 160°. Attempting any correlation with spin states is clearly going to be a lot of really hard work (I might next try another simple search in which bond lengths can be shown to correlate closely with low/medium/high spin states). I will not attempt a more finely grained analysis of the above plots here; I just wanted to point out how simple and quick they are to generate.
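For readers who want a single number rather than a plot, the trigonal bipyramidal/square pyramidal continuum is commonly quantified by Addison's τ5 parameter, computed from the two largest L-M-L angles. A minimal sketch (the angle lists below are idealised geometries, not database hits):

```python
def tau5(angles):
    """Addison tau-5 descriptor for five-coordinate centres:
    (beta - alpha) / 60, where beta >= alpha are the two largest
    L-M-L angles in degrees. 1.0 = ideal trigonal bipyramidal,
    0.0 = ideal square pyramidal."""
    beta, alpha = sorted(angles, reverse=True)[:2]
    return (beta - alpha) / 60.0

# Idealised examples:
tbp = [180, 120, 120, 120] + [90] * 6   # trigonal bipyramid -> tau5 = 1.0
sp  = [180, 180] + [90] * 8             # square pyramid (metal in basal plane) -> tau5 = 0.0
```

Applied to the angle pairs from the searches above, τ5 would sort the hits along the very continuum the scatter plots display.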

Tetrahedral or square planar? A ten minute exploration.

Wednesday, April 30th, 2014

I love experiments where the insight-to-time-taken ratio is high. This one pertains to exploring the coordination chemistry of the transition metal region of the periodic table; specifically the tetra-coordination of the series headed by Mn-Ni. Is the geometry tetrahedral, square planar, or other? One can get a statistical answer in about ten minutes.
[Search definition: Tet-SP.jpg]

The (CCDC database) search definition required is shown above. The central atom defines the column of the periodic table; it is specified to have precisely four other atoms bonded to it, which can be any element. These four bonds are specified as acyclic (to avoid any bias introduced by rings), and two angles are defined subtending the central atom. And off we go, specifying along the way that the hits must be refined to an R-factor of < 0.05, with no disorder and no errors.
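The refinement criteria amount to a simple predicate over the hit list. Sketched below on a hypothetical set of hits; the field names and refcodes are illustrative only, not the CCDC's API:

```python
# Hypothetical hit records (refcode, R-factor, disorder flag, errors flag)
# standing in for the output of a database search.
hits = [
    {"refcode": "AAAAAA", "r_factor": 0.032, "disorder": False, "errors": False},
    {"refcode": "BBBBBB", "r_factor": 0.071, "disorder": False, "errors": False},
    {"refcode": "CCCCCC", "r_factor": 0.041, "disorder": True,  "errors": False},
]

# Keep only hits refined to R < 0.05 with no disorder and no errors.
clean = [h for h in hits
         if h["r_factor"] < 0.05 and not h["disorder"] and not h["errors"]]
```

Here only the first record survives the filter; the same three conditions are what the search interface applies behind the scenes.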

Mn, (Tc), Re

Fe, Ru, Os

Co, Rh, Ir

Ni, Pd, Pt

Square planar coordination will manifest as pairs of angles of either 90° or 180°, whilst tetrahedral coordination will reveal only angles of ~109.5°.

  1. Both the Mn and the Fe series show a (red) hotspot at the tetrahedral value.
  2. The Co series shows a tetrahedral hot spot AND a somewhat less abundant square planar double-hot spot for the combination 90/180 and 180/90.
  3. The Ni series reveals the hottest spots to correspond to square planar, but with a significant tetrahedral cluster.

This quick survey can be followed up by more detailed explorations of the clusters. For example, can one go to the literature and find the typical spin state for, e.g., the Ni series in each geometry? Unfortunately, the CCDC database does not record the spin state of any individual compound; one has to go to the original literature to find out. What a shame that the linkage between two quite different properties is (as far as I know) not available in any easily searchable form. Alternatively, one can narrow the searches down to individual rows 1, 2 or 3 of the transition series and then compare the behaviour. The possibilities are considerable.

Then there are the outliers in each plot. Some (many?) may prove to be due to faulty data (whilst we have specified no errors, they can still occur), but others may be due to an unusual structural feature, or perhaps even an as yet unrecognised phenomenon! Set as a student experiment, one might ask each student to explore, say, 3 outliers and express an opinion as to what causes them to deviate. Enjoy!