Posts Tagged ‘data’
Monday, April 8th, 2019
The conventional procedure for reporting analyses or new results in science is to compose an “article”, perhaps augment it with “supporting information” or “SI”, submit it to a journal which undertakes peer review, revise as necessary for acceptance, and finally publish. If errors in the original are later identified, a separate corrigendum can be submitted to the same journal, although this is relatively rare. Any new information appearing post-publication is then considered for a new article, and the cycle continues. Here I consider the possibilities for variations in this sequence of events.
The new disruptors in the processes of scientific communication are the “data”, which can now be given a separate existence (as FAIR data) from the article and its co-published “SI”. Nowadays both the “article+SI” and any separate “data” have another, mostly invisible component: the “metadata”. Few authors ever see this metadata. For the article, it is generated by the publisher (as part of the service to the authors) and sent to CrossRef, which acts as the global registration agency for this particular metadata. For the data, it is assembled when the data is submitted to a “data repository”, either by the authors providing the information manually, by automated workflows installed in the repository, or by a combination of both. It might also be assembled by the article publisher as part of a complete metadata package covering both article and data, rather than separately from the article metadata. The metadata about data is then registered with the global agency DataCite (and occasionally with CrossRef, for historical reasons).‡ Few depositors ever inspect this metadata after it is registered; even fewer authors are involved in decisions about that metadata, or have any input to the processes involved in its creation.
Let me analyse a recent example.
- For the article[1] you can see the “landing page” for the associated metadata as https://search.crossref.org/?q=10.1021/acsomega.8b03005 and actually retrieve the metadata using https://api.crossref.org/v1/works/10.1021/acsomega.8b03005, albeit in a rather human-unfriendly manner.† This may be because metadata as such is considered by CrossRef as something just for machines to process and not for humans to see!
- This metadata indicates "references-count": 22, which is a bit odd since 37 references are actually cited in the article. It is not immediately obvious why there is a difference of 15 (I am querying this with the editor of the journal). None of the references themselves are included in the metadata record, because the publisher does not currently support their liberation via Open References, which makes it difficult to track the missing ones down.
- Of the 37 citations listed in the article itself,[1] #22, #24 and #37 are different in kind, being citations to data sources. The first of these, #22, is an explicit reference to the data partner for the article.
- An alternative method of invoking a metadata record:
https://data.datacite.org/application/vnd.datacite.datacite+xml/10.1021/acsomega.8b03005
retrieves a subset of the article metadata available using the CrossRef query,‡ but again with no included references and again nothing for the data citation #22.
- Citation #22 in the above does have its own metadata record, obtainable using:
https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/4751
- This has an entry
<relatedIdentifier relatedIdentifierType="DOI" relationType="IsReferencedBy">10.1021/acsomega.8b03005</relatedIdentifier>
which points back to the article.[1]
- To summarise, the article noted above[1] has a metadata record that does not include any information about the references/citations (apart from an ambiguous count). A human reading the article can, however, easily identify one citation pointing to the article data, which it turns out DOES have a metadata record that both human and machine can identify as pointing back to the article. Let us hope the publisher (the American Chemical Society) corrects this asymmetry in the future; it can be done, as shown here![2]
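The back-link from data to article really is machine-actionable. A minimal sketch of how a machine might extract it, using only Python's standard library (the XML is the fragment quoted above, wrapped in a root element of my own so that it parses standalone):

```python
import xml.etree.ElementTree as ET

# The fragment quoted above, wrapped in a hypothetical root element.
fragment = """<relatedIdentifiers>
  <relatedIdentifier relatedIdentifierType="DOI"
      relationType="IsReferencedBy">10.1021/acsomega.8b03005</relatedIdentifier>
</relatedIdentifiers>"""

root = ET.fromstring(fragment)

# Collect every DOI that the dataset declares as referencing it:
# this is the machine-readable back-link from data to article.
back_links = [rel.text for rel in root.iter("relatedIdentifier")
              if rel.get("relationType") == "IsReferencedBy"]
print(back_links)
```

In a real workflow the fragment would of course be fetched from the DataCite endpoint shown above rather than embedded as a string.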
For both types of metadata record, it is the publisher that retains any rights to modify them. Here however we encounter an interesting difference. The publishers of the data are, in this case, also the authors of the article! A modification to this record was made post-publication by this author so as to include the journal article identifier once it had been received from the publisher,[1] as in 2 above. Subsequently, these topics were discussed at a workshop on FAIR data, during which further pertinent articles[3], [4], [5] relating to the one discussed above[1] were shown in a slide by one of the speakers. Since this was deemed to add value to the context of the data for the original article, identifiers for these articles were also appended to the metadata record of the data.
This now raises the following questions:
- Should a metadata record be considered a living object, capable of being updated to reflect new information received after its first publication?
- If metadata records are an intrinsic part of both a scientific article and any data associated with that article, should authors be fully aware of their contents (if only as part of due diligence to correct errors or to query omissions)?
- Should the referees of such works also be made aware of the metadata records? It is of course enough of a challenge to get referees to inspect data (whether as SI or as FAIR), never mind metadata! Put another way, should metadata records be considered as part of the materials reviewed by referees, or something independent of referees and the responsibility of their publishers?
- More generally, how would/should the peer-review system respond to living metadata records? Should there be guidelines regarding such records? Or ethical considerations?
I pose these questions because I am not aware of much discussion around these topics; I suggest there probably should be!
‡Actually, CrossRef and DataCite exchange each other’s metadata. However, each uses a somewhat different schema, so some components may be lost in this transit.
†JSON, which is not particularly human-friendly.
References
- A. Barba, S. Dominguez, C. Cobas, D.P. Martinsen, C. Romain, H.S. Rzepa, and F. Seoane, "Workflows Allowing Creation of Journal Article Supporting Information and Findable, Accessible, Interoperable, and Reusable (FAIR)-Enabled Publication of Spectroscopic Data", ACS Omega, vol. 4, pp. 3280-3286, 2019. https://doi.org/10.1021/acsomega.8b03005
- S. Arkhipenko, M.T. Sabatini, A.S. Batsanov, V. Karaluka, T.D. Sheppard, H.S. Rzepa, and A. Whiting, "Mechanistic insights into boron-catalysed direct amidation reactions", Chemical Science, vol. 9, pp. 1058-1072, 2018. https://doi.org/10.1039/c7sc03595k
- T. Monaretto, A. Souza, T.B. Moraes, V. Bertucci‐Neto, C. Rondeau‐Mouro, and L.A. Colnago, "Enhancing signal‐to‐noise ratio and resolution in low‐field NMR relaxation measurements using post‐acquisition digital filters", Magnetic Resonance in Chemistry, vol. 57, pp. 616-625, 2018. https://doi.org/10.1002/mrc.4806
- D. Barache, J. Antoine, and J. Dereppe, "The Continuous Wavelet Transform, an Analysis Tool for NMR Spectroscopy", Journal of Magnetic Resonance, vol. 128, pp. 1-11, 1997. https://doi.org/10.1006/jmre.1997.1214
- U.L. Günther, C. Ludwig, and H. Rüterjans, "NMRLAB—Advanced NMR Data Processing in Matlab", Journal of Magnetic Resonance, vol. 145, pp. 201-208, 2000. https://doi.org/10.1006/jmre.2000.2071
Tags:Academic publishing, American Chemical Society, author, Business intelligence, Company: DataCite, CrossRef, data, Data management, DataCite, editor, EIDR, Information, Information science, JSON, Knowledge representation, Metadata repository, Records management, Technology/Internet, The Metadata Company
Posted in Chemical IT | No Comments »
Saturday, February 16th, 2019
The title of this post comes from the site www.crossref.org/members/prep/. Here you can explore how your favourite publisher of scientific articles exposes metadata for their journals.
Firstly, a reminder that when an article is published, the publisher collects information about the article (the “metadata”) and registers this information with CrossRef in exchange for a DOI. This metadata in turn is used to power e.g. a search engine which allows “rich” or “deep” searching of the articles to be undertaken. There is also an API (Application Programming Interface) which allows services to be built offering deeper insights into what are referred to as scientific objects. One such service is “Event Data”, which attempts to create links between various research objects such as publications, citations, data and even commentaries in social media. A live feed can be seen here.
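To give a flavour of what such a registered record looks like when retrieved via the API, here is a minimal sketch of inspecting one. The JSON is a heavily abbreviated, hypothetical stand-in for a real works response (a real record carries many more fields); the ORCID shown is my own:

```python
import json

# Abbreviated stand-in for a CrossRef works response.
sample = json.loads("""
{
  "status": "ok",
  "message": {
    "DOI": "10.1021/acsomega.8b03005",
    "title": ["Workflows Allowing Creation of Journal Article ..."],
    "reference-count": 22,
    "author": [
      {"given": "Henry S.", "family": "Rzepa",
       "ORCID": "http://orcid.org/0000-0002-8635-8390"}
    ]
  }
}
""")

record = sample["message"]
print(record["DOI"], record["reference-count"])

# Which authors carry an ORCID in the registered metadata?
authors_with_orcid = [a["family"] for a in record["author"] if "ORCID" in a]
```

The point is that each of the categories reported below (references, ORCIDs, abstracts, and so on) corresponds to a field a publisher may or may not have deposited in such records.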
So here are the results for the metadata provided by six publishers familiar to most chemists, with categories including:
- References
- Open References
- ORCID IDs
- Text mining URLs
- Abstracts

[Participation report charts from the CrossRef members page for each of: RSC, ACS, Elsevier, Springer-Nature, Wiley, Science]
One immediately notices the large differences between publishers. Thus most have 0% metadata for the article abstracts, but one (the RSC) has 87%! Another striking difference lies in which publishers support open references (OpenCitations): the RSC and Springer Nature are 99-100% compliant, whilst the ACS is at 0%. Yet another variation is the adoption of ORCID (Open Researcher and Contributor ID), where the learned-society publishers (RSC, ACS) achieve > 80%, but the commercial publishers are in the lower range of 20-49%.
To me the most intriguing category was the Text mining URLs. From the help pages, “The Crossref REST API can be used by researchers to locate the full text of content across publisher sites. Publishers register these URLs – often including multiple links for different formats such as PDF or XML – and researchers can request them programmatically“. Here the RSC is at 0% and the ACS at 8%, but the commercial publishers are at 80+%. I tried to find out more at e.g. https://www.springernature.com/gp/researchers/text-and-data-mining but the site was down when I tried. This can be quite a controversial area. Sometimes the publisher exerts strict control over how the text mining can be carried out and how any results can be disseminated. Aaron Swartz famously fell foul of this.
I am intrigued as to how, as a reader with no particular pre-assembled toolkit for text mining, I can use this metadata provided by the publishers to enhance my science. After all, 80+% of articles with some of the publishers apparently have a mining URL that I could use programmatically. If anyone reading this can send some examples of the process, I would be very grateful.
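For anyone similarly curious, here is a minimal sketch of what programmatic use of those registered links might look like. The JSON is an abbreviated, hypothetical stand-in for a CrossRef works record (the "link" array, with its "intended-application" tag, is where such URLs are registered; the URLs themselves are invented):

```python
import json

# Abbreviated, hypothetical works record carrying text-mining links.
sample = json.loads("""
{
  "message": {
    "DOI": "10.1039/example",
    "link": [
      {"URL": "https://publisher.example/article.pdf",
       "content-type": "application/pdf",
       "intended-application": "text-mining"},
      {"URL": "https://publisher.example/article.xml",
       "content-type": "application/xml",
       "intended-application": "text-mining"}
    ]
  }
}
""")

# Pick out the full-text URLs registered for text mining, keyed by format.
links = sample["message"]["link"]
mining = {l["content-type"]: l["URL"]
          for l in links if l.get("intended-application") == "text-mining"}
print(mining.get("application/xml"))
```

A mining toolkit would then fetch the XML variant (usually the most machine-friendly format) subject to whatever licence the publisher attaches.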
Finally I note the absence of any metadata in the above categories relating to FAIR data. Such data also has the potential for programmatic procedures to retrieve and re-use it (some examples are available here[1]), but apparently publishers do not (yet) collect metadata relating to FAIR. Hopefully they soon will.
References
- A. Barba, S. Dominguez, C. Cobas, D.P. Martinsen, C. Romain, H.S. Rzepa, and F. Seoane, "Workflows Allowing Creation of Journal Article Supporting Information and Findable, Accessible, Interoperable, and Reusable (FAIR)-Enabled Publication of Spectroscopic Data", ACS Omega, vol. 4, pp. 3280-3286, 2019. https://doi.org/10.1021/acsomega.8b03005
Tags:Aaron Swartz, Academic publishing, API, Business intelligence, CrossRef, data, Data management, Elsevier, favourite publisher, Identifiers, Information, Information science, Knowledge, Knowledge representation, metadata, mining, ORCiD, PDF, Pre-exposure prophylaxis, Publishing, Publishing Requirements for Industry Standard Metadata, Records management, Research Object, Scholarly communication, Scientific literature, search engine, social media, Technical communication, Technology/Internet, text mining, Written communication, XML
Posted in Interesting chemistry | 1 Comment »
Saturday, December 29th, 2018
The traditional structure of the research article has been honed and perfected over more than 350 years by its custodians, the publishers of scientific journals. Nowadays, for some journals at least, it might be viewed as much a profit centre as a perfected mechanism for scientific communication. Here I take a look at the components of such articles to try to envisage their future, with the focus on molecules and chemistry.
The formula which is mostly adopted by authors when they sit down to describe their chemical discoveries is more or less as follows:
1. An introduction, setting the scene for the unfolding narrative.
2. Results. This is where much of the data from which the narrative is derived is introduced. Such data can be presented in the form of:
- Tables
- Figures and schemes
- Numerical and logical data embedded in narrative text
3. Discussion, where the models constructed from the data are illustrated and new inferences presented. Very often categories 2 and 3 are conflated into one single narrative.
4. Conclusions, where everything is brought together to describe the essential aspects of the new science.
5. Bibliography, where previous articles pertinent to the narrative are listed.
In the last decade or so, the management of research data has developed as a field of its own, with three phases:
6. Setting out a data management plan at the start of the project, often a set of aspirations together with putative actions,
7. the day-to-day management of the data as it emerges, in the form of an electronic laboratory notebook (ELN),
8. the publication of selected data from the ELN into a repository, together with the registration of metadata describing the properties of the data.
In the latter category, item 8 can be said to be a game-changer, a truly disruptive influence on the entire process. The key aspect is that it constitutes independent publication of data, to sit alongside the object constructed from items 1-5. More disruption emerges from the open citations project, whereby category 5 above can be released by publishers to adopt its own separate existence. So of the five essential anatomical components of a research article, two are already starting to achieve their own independence. Clearly the re-invention of the anatomy of the research article is well under way.
Next I take a look at what sorts of object might be found in category 8, drawing very much on our own experience of implementing items 7 and 8 over the last twelve years or so. I start by observing that in category 2 above, figures are perhaps the objects most in need of disruptive re-invention. In the 1980s, authors were much taken by the introduction of colour as a means of conveying information within a figure more clearly, although the significant costs then had to be borne directly by those authors (and with a few journals this persists to this day). By the early 1990s, the introduction of the Web[1] offered the opportunity not only of colour but of an extra dimension (or at least the illusion of one), by means of interactivity for three-dimensional models. Some examples resulting from combining figures from category 2 with data from category 8 are described below.
Example 1 illustrates how a figure from category 2 above can be augmented with active hyperlinks specifying the DOI of the data in category 8 from which the figure is derived, thus creating a direct and contextual connection between the research article and the research data it is based upon. These links are embedded only in the Acrobat (PDF) version of the article as part of the production process undertaken by the journal publisher. Download Figure 9 from the link here and try it for yourself or try the entire article from the journal, where more figures are so enhanced.
Example 2 takes this one stage further. The hyperlinks in the published figure in example 1 were embedded in software capable of resolving them, namely a PDF viewer. But that is all this software allows. By relocating the hyperlink into a Web browser instead, one can add further functionality in the form of JavaScript, perhaps better described as workflows (supported by browsers but not by Acrobat). There are three such workflows in example 2.
- The first uses an image map to associate a region of the figure with a data object defined by a DOI.
- The second interrogates the metadata specifically associated with the DOI (the same DOIs seen in the figure itself) to see if any so-called ORE metadata is available (ORE = Object Re-use and Exchange). If there is, it uses this information to retrieve the data itself and pass it through to
- the third workflow represented by a set of JavaScripts known as JSmol. These interpret the data received and construct an interactive visual 3D molecular model representing the retrieved data.
All this additional workflowed activity is implemented in a data repository. It is not impossible that it could also be implemented at the journal publisher end of things, but it is an action that would have to be supported by multiple publishers. Arguably this sort of enhancement is far better suited and more easily implemented by a specialised data publisher, i.e. a data repository.
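The chain of three workflows can be caricatured in a few lines. This is only an illustrative sketch, not the repository's actual implementation: the DOI, the metadata record and the helper names are all hypothetical, and retrieval over the network is stubbed out with an in-memory dictionary standing in for the registered ORE metadata:

```python
# Stand-in for registered metadata: dataset DOI -> ORE locator of the data file.
ORE_METADATA = {
    "10.14469/hpc/0000": {  # hypothetical dataset DOI
        "ore": "https://data.example/files/molecule.xyz",
    },
}

def resolve_figure_region(region_id):
    """Workflow 1: an image-map region of the figure maps to a dataset DOI."""
    return {"region-1": "10.14469/hpc/0000"}[region_id]

def locate_data(doi):
    """Workflow 2: interrogate the (stubbed) metadata for an ORE locator."""
    return ORE_METADATA.get(doi, {}).get("ore")

def load_into_viewer(url):
    """Workflow 3: hand the retrieved data to a 3D viewer such as JSmol
    (simulated here by returning a status string)."""
    return f"viewer loaded {url}"

# Chaining the three workflows: figure region -> DOI -> data -> 3D model.
doi = resolve_figure_region("region-1")
status = load_into_viewer(locate_data(doi))
print(status)
```

The essential design point survives the simplification: each step consumes only the output of the previous one (a DOI, then a URL, then data), which is what lets the whole chain live in a browser rather than a PDF viewer.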
Example 3 does the same thing for a table.
Example 4 enhances in a different manner. Conventionally, NMR data is added to the supporting information file associated with a journal article, but such data is already heavily processed and interpreted. The raw instrumental data is never submitted to the journal and is usually available only by direct request from the original researchers (at least if the request is made whilst they are still contactable!). The data repository provides a new mechanism for making such raw instrumental (and indeed computational) data an integral part of the scientific process.
Example 5 shows how a bibliography can be linked to a secondary bibliography (citations 35 and 36 in this example in the narrative article) and perhaps in the future to Open Citations semantic searches for further cross references.
So by deconstructing the components of the standard scientific article, re-assembling some of them in a better-suited environment and then linking the two sets of components to each other, one can start to re-invent the genre and hopefully add more tools for researchers to use to benefit their basic research processes. The scope for innovation seems considerable. The issues of course are (a) whether publishers see this as a viable business model, or whether they instead wish to protect their current model of the research article, and (b) whether authors wish to undertake the learning curve and additional effort to go in this direction. As I have noted before, the current model is deficient in various ways; I do not think it can continue without significant reinvention for much longer. And I have to ask: if reinvention does emerge, will science be the prime beneficiary?
References
- H.S. Rzepa, B.J. Whitaker, and M.J. Winter, "Chemical applications of the World-Wide-Web system", Journal of the Chemical Society, Chemical Communications, pp. 1907, 1994. https://doi.org/10.1039/c39940001907
Tags:Academic publishing, Acrobat, Articles, chemical discoveries, data, Data management, ELN, Information, Molecules, Narrative, PDF, Publishing, Research, Scholarly communication, Science, Scientific Journal, Scientific method, Technical communication, Technology/Internet, Web browser
Posted in Chemical IT | No Comments »
Tuesday, August 7th, 2018
Harnessing FAIR data is an event being held in London on September 3rd; no doubt most speakers will espouse its virtues and speculate about how to realize its potential. Admirable aspirations indeed, but capturing hearts and minds also needs lots of real life applications! Whilst assembling a forthcoming post on this blog, I realized I might have one nice application which also pushes the envelope a bit further, in a manner that I describe below.
The post I refer to above is about using quantum chemical calculations to chart possible mechanistic pathways for the reaction between a carboxylic acid and an amine to form an amide. The FAIR data for the entire project is collected at DOI: 10.14469/hpc/4598. Part of what makes it FAIR is the metadata not only collected about this data but also formally registered with the DataCite agency. Registration in turn enables Finding; it is this aspect I want to demonstrate here.
The metadata for the above DOI includes information such as:
- The ORCID persistent identifier (PID) for the creator of the data (in this instance myself)
- Date stamps for the original creation date and subsequent modifications.
- A rights declaration, in this case the CC0 license which describes how the data can be re-used.
- Related identifiers, in this case describing members of this collection.
The data itself is held in the members of the collection, each of which is described by a more specific set of metadata in addition to the more general types in the above list (e.g. 10.14469/hpc/4606).
- One important additional metadata descriptor is the ORE locator (Object Re-use and Exchange, itself almost a synonym for FAIR). This allows a machine to deduce a direct path to the data file itself, and hence to retrieve it automatically if desired. It is important to note that the DOI itself (i.e. 10.14469/hpc/4606) points only to the “landing page” for the dataset, and does not necessarily describe the direct path to any specific file in the dataset. The ORE path can be used with e.g. software such as JSmol to directly load a molecule based only on its DOI. You can see an example of this here.
- Each molecule-based dataset contains additional specific metadata relating to the molecule itself. For example this is how the InChiKey, an identifier specific to that molecule, is expressed in metadata;
<subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">PVXKWVPAMVWJSQ-UHFFFAOYSA-N</subject>
The advantage of expressing the metadata in this way is that a general search of the type:
https://search.datacite.org/works?query=subjectScheme:inchikey+subject:CZABGBRSHXZJCF-UHFFFAOYSA-N
can be used to track down any molecule with metadata corresponding to the above InChIkey.
- Here is more metadata, introduced in this blog. It relates to the (computed) value of the Gibbs energy (in Hartree units†), as returned by the Gaussian program:
<subject subjectScheme="Gibbs_Energy" schemeURI="https://goldbook.iupac.org/html/G/G02629.html" valueURI="http://gaussian.com/thermo/">-649.732417</subject>
I argue here that it represents a unique identifier for a molecule calculation using the quantum mechanical procedures implemented in e.g. Gaussian. This identifier differs from the InChIKey in that it can be truncated to provide different levels of information.
- At the coarsest level, a search of the type
https://search.datacite.org/works?query=subjectScheme:Gibbs_energy+subject:-649.*
should reveal all molecules with the same number of atoms and electrons whose Gibbs energy has been calculated, but not necessarily with the same InChI (i.e. they may be isomers, or transition states, etc). This level might be useful for revealing most (not necessarily all‡) molecules involved in say a reaction mechanism. It should also be insensitive to the program system used, since most quantum codes will return a value for the Gibbs energy if the same procedures have been used (i.e. DFT method, basis set, solvation model and dispersion correction) accurate to probably 0.01 Hartree.
- The top level of precision, however, is high enough to almost certainly relate to a specific molecule, and probably one computed using a specific program;
https://search.datacite.org/works?query=subjectScheme:Gibbs_energy+subject:-649.732417
- The searcher can experiment with different levels of precision to narrow or broaden the search.
- I should also address the issue (before someone asks) of why I have used the Gibbs energy rather than the Total energy. Put simply, the Gibbs energy is far more useful in a chemical context. It can be used to relate the relative Gibbs energies of different isomers of the same molecule to e.g. the equilibrium constant that might be measured. Or the difference in Gibbs energies between a reactant and a transition state can be used to derive the free-energy activation barrier for a reaction. The total energy is not so useful in such contexts, although of course it too could be added as a subject in the metadata above if a real use for it is found.
- The searcher can also use Boolean combinations of metadata, such as specifying both the InChIKey and the Gibbs Energy, along with say the ORCID of the person who may have published the data;
https://search.datacite.org/works?query=
subjectScheme:Gibbs_energy+subject:-649.*+
subjectScheme:inchikey+subject:CZABGBRSHXZJCF-UHFFFAOYSA-N+
ORCID:0000-0002-8635-8390♥
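Since these query strings are just structured text, the different levels of precision amount to truncating the subject value before appending a wildcard. A minimal sketch of assembling such queries (the helper functions are my own invention, not part of any DataCite library; the query syntax follows the examples above):

```python
SEARCH = "https://search.datacite.org/works?query="

def gibbs_subject(energy, decimals):
    """Render a Gibbs energy (Hartree) at the chosen precision; coarser
    truncation broadens the search via a trailing wildcard."""
    text = f"{energy:.6f}"
    if decimals < 6:
        whole, frac = text.split(".")
        text = f"{whole}.{frac[:decimals]}*"
    return f"subjectScheme:Gibbs_energy+subject:{text}"

def inchikey_subject(key):
    return f"subjectScheme:inchikey+subject:{key}"

# Coarsest level: isomers/transition states sharing atoms and electrons.
coarse = SEARCH + gibbs_subject(-649.732417, 0)
# Full precision: almost certainly one specific calculation.
exact = SEARCH + gibbs_subject(-649.732417, 6)
# Boolean combination with an InChIKey, as in the example above.
combined = SEARCH + gibbs_subject(-649.732417, 0) + "+" + \
    inchikey_subject("CZABGBRSHXZJCF-UHFFFAOYSA-N")
print(coarse)
```

Varying the `decimals` argument between 0 and 6 reproduces the narrowing and broadening of the search described above.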
I have tried to show above how FAIR data implies some form of rich (registered) metadata. And how the metadata can be used to Find (the F in FAIR) data with very specific properties, thus Harnessing FAIR data.
†It is a current limitation of the V4.1 DataCite schema that there appears to be no way to specify the data type of the subject, including any units. ‡In theory, a range query of the type:
https://search.datacite.org/works?query=
subjectScheme:Gibbs_energy+subject:[-649.1 TO -649.8]
should be more specific, but I have not yet gotten it to work, probably because the lack of data-typing means it is not recognised as a range of numeric values. ♥Implicit in this search is the grouping
https://search.datacite.org/works?query=(subjectScheme:Gibbs_energy+subject:-649.*)
+
(subjectScheme:inchikey+subject:CZABGBRSHXZJCF-UHFFFAOYSA-N)
+ORCID:0000-0002-8635-8390
Currently however DataCite do not correctly honour this form of grouping.
Tags:Academic publishing, chemical context, Code, data, DataCite, energy, free energy activation barrier, Identifiers, Information, ISO/IEC 11179, ORCiD, quantum chemical calculations, real life applications, Technical communication
Posted in Interesting chemistry | 9 Comments »
Thursday, December 7th, 2017
FAIR data is increasingly accepted as a description of what research data should aspire to: Findable, Accessible, Inter-operable and Re-usable, with Context added by rich metadata (and also that it should be Open). But there are two sides to data: one is the raw data emerging from, say, an instrument or a software simulation; the other is data to which some kind of model has been applied to produce semi- or even fully processed/interpreted data. Here I illustrate a new example of how both kinds of data can be made to co-exist.
I will start with a recent publication[1] with the title Crystallographic Snapshot of an Arrested Intermediate in the Biomimetic Activation of CO2. The nature of this intermediate caught the eye of another research group, who responded with their own critique[2]‡ along with the comment “However, since we have no access to the original crystallographic data …” They might have been referring to the semi-processed data (containing the so-called hkl structure factors) but they may also have been alluding to the raw image data captured directly from the diffractometer cameras. That traditionally has not been available via the CSD (Cambridge structural database), but would be required for a complete re-analysis of the crystal structure. Now the first example of how both FAIR (processed) data and raw data can co-exist has appeared.
The latest version of the CSD database shows an entry resulting from the following publication[3] and the deposited data has its own DOI there (10.5517/ccdc.csd.cc1n9ppb). That entry in turn has a DOI pointer to the Raw data (10.14469/hpc/2300) held in a different location and the pointer is reciprocated (⇌) with the latter pointing back to the former. Both datasets point to the original article, thus completing a holy triangle.†

There is more. The Raw dataset (10.14469/hpc/2300) declares it is a member of a superset, called Crystal structure data for Synthesis and Reactions of Benzannulated Spiroaminals; Tetrahydrospirobiquinolines (10.14469/hpc/2297), where you can find information about six other related structures. That collection is in turn a member of a superset called Synthesis and Reactions of Benzannulated Spiroaminals; Tetrahydrospirobiquinolines (10.14469/hpc/2099), where DOIs to other types of data associated with this project can be found, such as Computational data (10.14469/hpc/2098) and NMR data (10.14469/hpc/2294). Although a human can, with some determination, follow these associations up, down and across, the system is also designed to be followed by automated algorithms that could traverse this web quickly and efficiently.
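Such an automated traversal can be sketched in a few lines. The graph below is hand-coded from the DOIs named above, using relation names that mirror the DataCite vocabulary; a real crawler would instead fetch each record's relatedIdentifiers from the DataCite API rather than rely on this in-memory stand-in.

```python
from collections import deque

# Hand-coded stand-in for the DataCite relatedIdentifiers of the DOIs
# discussed above; a real crawler would fetch these from the DataCite API.
RELATIONS = {
    "10.14469/hpc/2300": [("IsPartOf", "10.14469/hpc/2297")],
    "10.14469/hpc/2297": [("HasPart", "10.14469/hpc/2300"),
                          ("IsPartOf", "10.14469/hpc/2099")],
    "10.14469/hpc/2099": [("HasPart", "10.14469/hpc/2297"),
                          ("HasPart", "10.14469/hpc/2098"),
                          ("HasPart", "10.14469/hpc/2294")],
    "10.14469/hpc/2098": [("IsPartOf", "10.14469/hpc/2099")],
    "10.14469/hpc/2294": [("IsPartOf", "10.14469/hpc/2099")],
}

def traverse(start):
    """Breadth-first walk up, down and across the web of datasets."""
    seen, queue = {start}, deque([start])
    while queue:
        doi = queue.popleft()
        for _relation, target in RELATIONS.get(doi, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Starting from the raw crystallographic dataset reaches the whole project,
# including the computational and NMR data.
print(sorted(traverse("10.14469/hpc/2300")))
```

Starting from any node, the walk reaches every dataset in the project, which is exactly what makes this web machine-navigable.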
So you can now see that a crystal structure held in the CSD could be the starting point for a journey of FAIR data discovery, in a manner that has not hitherto been possible. How quickly the CSD will become populated by links to Raw (and other) data remains to be seen. I have not yet discovered any mechanism for specifying a CSD query which stipulates that Raw data must be available, but no doubt this will come.
To end, back to the Biomimetic Activation of CO2 referred to at the start. With no access to the original data, recourse was made to computational modelling.[2] Which is where I came in, since I wanted access to the original (computational) data. Sadly it did not appear to be available with the article,[2] in much the same manner as the original complaint. Perhaps, when FAIR data becomes fully accepted as part of how science is done nowadays, such complaints will become ever rarer!
‡In fact the original authors did respond[4] with an acknowledgement that their original conclusions were not correct.
†Almost. The article [3] cites DOI: 10.14469/hpc/2099 (Ref 28), but it does not cite DOI: 10.5517/ccdc.csd.cc1n9ppb because the latter had not been minted yet at the time the final proofs were corrected, and there is no mechanism to add it at a later stage.
References
- S.L. Ackermann, D.J. Wolstenholme, C. Frazee, G. Deslongchamps, S.H.M. Riley, A. Decken, and G.S. McGrady, "Crystallographic Snapshot of an Arrested Intermediate in the Biomimetic Activation of CO<sub>2</sub>", Angewandte Chemie International Edition, vol. 54, pp. 164-168, 2014. https://doi.org/10.1002/anie.201407165
- J. Hurmalainen, M.A. Land, K.N. Robertson, C.J. Roberts, I.S. Morgan, H.M. Tuononen, and J.A.C. Clyburne, "Comment on “Crystallographic Snapshot of an Arrested Intermediate in the Biomimetic Activation of CO<sub>2</sub>”", Angewandte Chemie International Edition, vol. 54, pp. 7484-7487, 2015. https://doi.org/10.1002/anie.201411654
- J. Almond-Thynne, A.J.P. White, A. Polyzos, H.S. Rzepa, P.J. Parsons, and A.G.M. Barrett, "Synthesis and Reactions of Benzannulated Spiroaminals: Tetrahydrospirobiquinolines", ACS Omega, vol. 2, pp. 3241-3249, 2017. https://doi.org/10.1021/acsomega.7b00482
- S.L. Ackermann, D.J. Wolstenholme, C. Frazee, G. Deslongchamps, S.H.M. Riley, A. Decken, and G.S. McGrady, "Corrigendum: Crystallographic Snapshot of an Arrested Intermediate in the Biomimetic Activation of CO<sub>2</sub>", Angewandte Chemie International Edition, vol. 54, pp. 7470-7470, 2015. https://doi.org/10.1002/anie.201504197
Tags:computing, Context, data, Data management, Information, Knowledge, Raw data, software simulations, Technology/Internet
Posted in Chemical IT, crystal_structure_mining | No Comments »
Thursday, March 30th, 2017
In an era when alternative facts and fake news afflict us, the provenance of scientific data becomes ever more important. Especially if that data is openly available and exploitable by others, for valid scientific reasons but potentially also by those with other motives. Here I consider the audit trail that might serve to establish data provenance in one typical situation in chemistry: the acquisition of NMR instrumental data.
Here I describe how such data is generated in my department; details may vary elsewhere.
- The prospective user of the NMR service is allocated a service ID. In our case, that ID relates to the research group rather than to individual researchers. This ID is parochial; it does not reference any other information about the user in the institute. Only the service manager has the information to associate this ID with real users, and this information is normally not distributed.
- When a sample is submitted, this ID is used to create a new folder containing the data as a sub-folder of the group ID and located on the NMR data servers.
- The dataset itself‡ contains a number of files that contain an audit trail (with names such as audita.txt, auditp.txt) with the fields: ##AUDIT TRAIL= $$ (NUMBER, WHEN, WHO, WHERE, PROCESS, VERSION, WHAT). Typically, none of these files has propagated the original user ID under which the data was collected; to do so would require a programmatic connection between the local authentication systems and the spectrometer software, a connection that is normally missing. Thus the first break in the provenance trail.
- In principle, other audit trails can be inferred from these files, such as the unique identity of the instrument provided by its manufacturer. Further information, such as the probe used to collect the data (probes can readily be changed over) or any calibration data used in setting up the instrument for the data collection, is by and large not recorded. To my knowledge, although an instrument has a unique serial number, the serial numbers of swappable components such as probes are not recorded by the collection software. Thus the second break in the provenance trail.
- This data then needs to be processed by further software. In this case we use the MestreNova system for this task. Each dataset has editable assigned properties; below I show those that can be associated with the spectrum (accessed with MestreNova using Edit/Properties). All this comes from the information collected by the instrument. The user’s identity can be inserted into the “title” field, the display of which is off by default.

- There is also a section for parameters, a synonym for which might be metadata, accessed in this program from View/Tables/Parameters. If Author was entered as a parameter in the dataset by the spectrometer software, the Mnova document would retrieve that information. Equally, an ORCID identifier for the author, entered at the time of data collection and thus stored in the dataset, could be read by Mnova, stored and displayed if configured to do so. It would be fair to say, however, that this option is rarely, if ever, systematically implemented by NMR instrument data collection software, and so is never propagated to the data processing software (as highlighted in red below). Thus a third break in the provenance trail.
There is also an alternative, and this time formal, metadata field that can be populated, by default (as shown below) with the type of spectrum and the nucleus. These properties are not controlled in the sense of only allowing terms that are present in a specified dictionary; the jargon for such control is a metadata schema. None is used here, since dissemination of this information is not intended; the software accepts whatever information it is given.
There are thus several opportunities to collect the identity of the experimenter, and thus attribute provenance to the collected data, but this depends very much on the will of researchers, institutions or publishers to enforce specific policies around it. The fourth break in the provenance trail.
- The dataset can then be uploaded (DOI: 10.14469/hpc/1291), at which stage provenance can finally be added using the ORCID credentials of the person publishing the dataset, who of course may or may not be the person who actually recorded the data! The full metadata for this specific collection can be seen at data.datacite.org/10.14469/hpc/1291. To put it another way, this is the first point in the provenance chain where the metadata is controlled by a schema and is also discoverable in a standard programmatic manner, i.e. via the preceding link. The provenance is now formally associated with the ORCID identifier using the DataCite metadata schema. You should be aware of a local policy:† access to the repository at https://data.hpc.imperial.ac.uk is only allowed by cross-authentication with http://orcid.org/ using the user’s ORCID. This identifier is then automatically propagated to the metadata held at e.g. data.datacite.org/10.14469/hpc/1095. Currently, however, none of the metadata originally recorded in either the instrumental file set or the processed MestreNova file is forwarded to the metadata record held at DataCite; again a loss of information, and potentially of provenance.
- The peer-reviewed article resulting from the interpretation of this data however can be associated with the provenance introduced in the previous stage; see data.datacite.org/10.14469/hpc/1267 and the IsReferencedBy property.
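The audit-trail files mentioned in the steps above can be read mechanically. Below is a minimal sketch of doing so; the sample text is synthetic but follows the general JCAMP-style layout of an auditp.txt file, with the field names taken from the header quoted earlier. Treat it as an illustration of the idea rather than a parser for every vendor's format.

```python
import re

# Synthetic sample in the general style of an NMR auditp.txt audit-trail
# file; the field names come from the ##AUDIT TRAIL header itself.
SAMPLE = """##TITLE= Audit trail
##AUDIT TRAIL=  $$ (NUMBER, WHEN, WHO, WHERE, PROCESS, VERSION, WHAT)
(   1,<2017-03-30 10:15:00.000 +0100>,<nmrsu>,<spect>,<go4>,<3.2>,
      <created by zg>)
(   2,<2017-03-30 10:20:00.000 +0100>,<nmrsu>,<spect>,<proc1d>,<3.2>,
      <Start of raw data processing>)
##END="""

FIELDS = ("NUMBER", "WHEN", "WHO", "WHERE", "PROCESS", "VERSION", "WHAT")

def parse_audit(text):
    """Return one dict per audit entry, keyed by the declared field names."""
    entries = []
    # Each entry is a parenthesised tuple; angle brackets delimit the values.
    for number, rest in re.findall(r"\(\s*(\d+),(.*?)\)", text, re.DOTALL):
        values = [number] + re.findall(r"<(.*?)>", rest, re.DOTALL)
        entries.append(dict(zip(FIELDS, values)))
    return entries

for entry in parse_audit(SAMPLE):
    print(entry["NUMBER"], entry["WHO"], entry["PROCESS"])
```

Note that the WHO field here holds only a generic spectrometer account name, which is precisely the first break in the provenance trail described above.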
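To make the "standard programmatic manner" of the repository stage concrete, here is a sketch that pulls an ORCID out of a DataCite-style JSON record. The record below is a trimmed, hand-written stand-in (with a placeholder name and ORCID) for the kind of response the DataCite API returns for a DOI, not the actual registered metadata for any of the datasets above.

```python
import json

# Trimmed, hand-written stand-in for a DataCite JSON:API response; the
# creator name and ORCID are placeholders, not real registered metadata.
RECORD = json.loads("""{
  "data": {
    "id": "10.14469/hpc/1291",
    "attributes": {
      "creators": [
        {"name": "Example Depositor",
         "nameIdentifiers": [
           {"nameIdentifier": "https://orcid.org/0000-0002-1234-5678",
            "nameIdentifierScheme": "ORCID"}]}
      ]
    }
  }
}""")

def orcids(record):
    """Collect the ORCID identifiers claimed by the creators of a dataset."""
    found = []
    for creator in record["data"]["attributes"]["creators"]:
        for ident in creator.get("nameIdentifiers", []):
            if ident.get("nameIdentifierScheme") == "ORCID":
                found.append(ident["nameIdentifier"])
    return found

print(orcids(RECORD))
```

It is this schema-controlled creator/nameIdentifier structure that finally makes the provenance machine-readable, in contrast to the free-text fields of the earlier stages.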
Now imagine if there were a common thread, based on the ORCID, running through all the stages of acquiring, processing and publishing this scientific data.
- Providing an ORCID could be made an essential requirement of access to the instrument.
- This information would be propagated to the dataset …
- by inclusion in one or more of the audit trail files.
- At this stage, further persistent identifiers associated with the instrument manufacturer could be added, which help identify not only the instrument used, but sub-components such as the changeable probe. This would allow access to any calibration curves or probe sensitivity and other aspects.
- The ORCID and other relevant information could be picked up by the software used to convert the data into spectra and propagated into the metadata containers for this software …
- where its use is controlled by a specified schema.
- At this stage, the ORCID and information such as the nucleus recorded, the sample temperature etc can be propagated on to the final metadata records.
- And the reader of the article describing this work would have a formally defined provenance audit trail they could follow back to the start of the experiment or forward to a published article. In this case, the data claims provenance (acquired from peer review) from the article, but it should also work in reverse with the article claiming provenance from the data on which it is based. The indexing of this bidirectional exchange is one of the exciting features that we should see emerging from CrossRef (holders of metadata about articles) and DataCite (holders of metadata about research data) in the near future.
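The chain imagined above can be sketched as a toy pipeline in which the one field that must survive every stage is the ORCID. Every function and field name here is invented for illustration; no real instrument or repository software exposes this interface.

```python
# Toy pipeline: the ORCID captured at the instrument survives each stage.
# All function and field names are invented for illustration only.

def acquire(sample_id, orcid):
    """Stage 1: the instrument records who ran the experiment, and on what."""
    return {"sample": sample_id, "orcid": orcid,
            "instrument": "NMR-400", "probe_serial": "P-0042"}

def process(raw):
    """Stage 2: processing software propagates, rather than discards,
    the acquisition metadata while adding its own."""
    return {**raw, "software": "processing-suite", "nucleus": "13C"}

def deposit(processed):
    """Stage 3: the repository maps the surviving fields into a
    schema-controlled metadata record."""
    return {"creator_orcid": processed["orcid"],
            "subjects": [processed["nucleus"]],
            "instrument": processed["instrument"]}

record = deposit(process(acquire("S1", "https://orcid.org/0000-0002-1234-5678")))
print(record["creator_orcid"])
```

The point of the sketch is simply that provenance survives only if each stage passes the identifier on; drop it anywhere and every downstream record inherits the break.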
We are clearly a little way from having the infrastructures described above for establishing such data audit trails. To do so will require cooperation from instrument manufacturers, at least in the example as charted above, as well as researchers, institutions, publishers, peer-reviewers and funding bodies. The first step would be to ensure that all scientists who intend collecting, processing and publishing data should claim an ORCID. That remark is directed specifically at undergraduate, postgraduate and post-doctoral researchers, not just at their supervisor or their PI (principal investigator). At a point when the discussion about alternate facts and perhaps even alternate data risks a general loss of confidence in science, we should be pro-active in establishing trust in the scientific processes.
‡ You can see an example obtained by this process at DOI: 10.14469/hpc/1095
† This requirement is a strong driver for the uptake of ORCID amongst our student population.
Tags:Acquisition, Archival science, author, collection software, Company: NMR, data, Data management, data processing software, Evidence law, instrument data collection software, local authentication systems, Mestrenova, MestreNova system, Nuclear magnetic resonance, principal investigator, Provenance, Scientific method, service manager, spectrometer software, supervisor, Technology/Internet, Terminology
Posted in Chemical IT | 2 Comments »
Wednesday, April 30th, 2014
I love experiments where the insight-to-time-taken ratio is high. This one pertains to exploring the coordination chemistry of the transition metal region of the periodic table; specifically the tetra-coordination of the series headed by Mn-Ni. Is the geometry tetrahedral, square planar, or other? One can get a statistical answer in about ten minutes.
The (CCDC database) search definition required is shown above. The central atom defines the column of the periodic table; it is specified to have precisely four other atoms bonded to it, which can be of any element. These four bonds are specified as acyclic (to avoid any bias introduced by rings), and two angles subtending the central atom are defined. And off we go, specifying along the way that the hits must be refined to an R-factor of < 0.05, have no disorder, and no errors.

Mn, (Tc), Re

Fe, Ru, Os

Co, Rh, Ir

Ni, Pd, Pt
Square planar coordination will manifest as pairs of angles of either 90° or 180°, whilst tetrahedral coordination will reveal only angles close to 109.5°.
- Both the Mn and the Fe series show a (red) hotspot at the tetrahedral value.
- The Co series shows a tetrahedral hot spot AND a somewhat less abundant square planar double-hot spot for the combination 90/180 and 180/90.
- The Ni series reveals the hottest spots to correspond to square planar, but with a significant tetrahedral cluster.
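The angle signatures described above can be turned into a simple classifier. The ideal angle pairs are standard geometry, but the nearest-signature assignment rule below is my own simplification of how one might bin the hotspots in these plots.

```python
import math

# Ideal (angle1, angle2) signatures in degrees: square planar shows
# combinations of 90 and 180, tetrahedral only the 109.5 value.
SIGNATURES = {
    "tetrahedral":   [(109.5, 109.5)],
    "square planar": [(90.0, 90.0), (90.0, 180.0), (180.0, 90.0)],
}

def classify(a1, a2):
    """Assign an observed angle pair to the nearest ideal geometry."""
    best, best_dist = None, math.inf
    for name, pairs in SIGNATURES.items():
        for p1, p2 in pairs:
            dist = math.hypot(a1 - p1, a2 - p2)
            if dist < best_dist:
                best, best_dist = name, dist
    return best

print(classify(108.7, 110.2))  # a near-tetrahedral hit
print(classify(91.3, 178.4))   # a near-square-planar hit
```

Fed the angle pairs from such a database search, a few lines like this would reproduce the hotspot counts; the genuinely interesting entries are then the ones that sit far from every signature.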
This quick survey can be followed up by more detailed explorations of the clusters. For example, can one go to the literature and find the typical spin state for, e.g., the Ni series in each of the geometries? Unfortunately, the CCDC database does not record the spin state of any individual compound; one has to go to the original literature to find out. What a shame that the linkage between two quite different properties is (as far as I know) not available in any easily searchable form. Alternatively, one can narrow the searches down to individual rows 1, 2 or 3 of the transition series and then compare the behaviour. The possibilities are considerable.
Then there are the outliers in each plot. Some (many?) may prove to be due to faulty data (whilst we have specified no errors, they can still occur), but others may be due to an unusual structural feature, or perhaps even an as-yet unrecognized phenomenon! Set as a student experiment, one might ask each student to explore, say, 3 outliers and express an opinion as to what causes them to deviate. Enjoy!
Tags:data, search definition, transition metal region
Posted in Chemical IT, crystal_structure_mining, General | No Comments »