Posts Tagged ‘author’
Monday, April 8th, 2019
The conventional procedure for reporting analyses or new results in science is to compose an “article”, perhaps augment it with “supporting information” (“SI”), and submit it to a journal, which undertakes peer review, with revision as necessary before acceptance and finally publication. If errors in the original are later identified, a separate corrigendum can be submitted to the same journal, although this is relatively rare. Any new information which appears post-publication is then considered for a new article, and the cycle continues. Here I consider the possibilities for variations in this sequence of events.
The new disruptors in the processes of scientific communication are the “data”, which can now be given a separate existence (as FAIR data) from the article and its co-published “SI”. Nowadays both the “article+SI” and any separate “data” have another, mostly invisible component, the “metadata”. Few authors ever see this metadata. For the article, it is generated by the publisher (as part of the service to the authors) and sent to CrossRef, which acts as a global registration agency for this particular metadata. For the data, it is assembled when the data is submitted to a “data repository”, either by the authors providing the information manually, by automated workflows installed in the repository, or by a combination of both. It might also be assembled by the article publisher as part of a complete metadata package covering both article and data, rather than being separated from the article metadata. Then, the metadata about data is registered with the global agency DataCite (and occasionally with CrossRef for historical reasons).‡ Few depositors ever inspect this metadata after it is registered; even fewer authors are involved in decisions about that metadata, or have any inputs to the processes involved in its creation.
Let me analyse a recent example.
- For the article[1] you can see the “landing page” for the associated metadata at https://search.crossref.org/?q=10.1021/acsomega.8b03005 and actually retrieve the metadata using https://api.crossref.org/v1/works/10.1021/acsomega.8b03005, albeit in a rather human-unfriendly manner.† This may be because metadata as such is considered by CrossRef as something just for machines to process and not for humans to see! (A minimal retrieval sketch follows this list.)
- This metadata indicates "references-count": 22, which is a bit odd since 37 are actually cited in the article. It is not immediately obvious why there is a difference of 15 (I am querying this with the editor of the journal). None of the references themselves are included in the metadata record, because the publisher does not currently support liberation using Open References, which makes it difficult to track the missing ones down.
- Of the 37 citations listed in the article itself,[1] #22, #24 and #37 are of a different kind, being citations to data sources rather than to other articles. The first of these, #22, is an explicit reference to the data partner for this article.
- An alternative method of invoking a metadata record:
https://data.datacite.org/application/vnd.datacite.datacite+xml/10.1021/acsomega.8b03005
retrieves a sub-set of the article metadata available using the CrossRef query,‡ but again with no included references and again nothing for the data citation #22.
- Citation #22 in the above does have its own metadata record, obtainable using:
https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/4751
- This has an entry
<relatedIdentifier relatedIdentifierType="DOI" relationType="IsReferencedBy">10.1021/acsomega.8b03005</relatedIdentifier>
which points back to the article.[1]
- To summarise, the article noted above[1] has a metadata record that does not include any information about the references/citations (apart from an ambiguous count). A human reading the article can, however, easily identify one citation pointing to the article data, which it turns out DOES have a metadata record that both human and machine can identify as pointing back to the article. Let us hope the publisher (the American Chemical Society) corrects this asymmetry in the future; it can be done, as shown here![2]
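For those who want to reproduce the two lookups above, the following is a minimal sketch (Python 3 with the requests library) that fetches the CrossRef record for the article and the DataCite record for its data citation, then lists any relatedIdentifier entries in the latter. The endpoint URLs are the ones quoted in the list above; the JSON and XML field names reflect my reading of the public CrossRef and DataCite schemas rather than anything stated in this post, so treat them as assumptions to be checked.

```python
# Minimal sketch: retrieve the article metadata (CrossRef, JSON) and the data
# metadata (DataCite, XML) discussed above, and list relatedIdentifiers.
# Field names are assumptions based on the public schemas.
import requests
import xml.etree.ElementTree as ET

ARTICLE_DOI = "10.1021/acsomega.8b03005"
DATA_DOI = "10.14469/hpc/4751"

# 1. CrossRef metadata for the article
r = requests.get(f"https://api.crossref.org/v1/works/{ARTICLE_DOI}", timeout=60)
r.raise_for_status()
work = r.json()["message"]
print("Title:", work["title"][0])
print("reference-count:", work.get("reference-count"))
# the "reference" array is only populated when a publisher deposits open references
print("references actually included:", len(work.get("reference", [])))

# 2. DataCite metadata for the data DOI, via content negotiation
r = requests.get(
    f"https://data.datacite.org/application/vnd.datacite.datacite+xml/{DATA_DOI}",
    timeout=60,
)
r.raise_for_status()
root = ET.fromstring(r.content)

# Extract relatedIdentifier elements without assuming a particular schema version
for el in root.iter():
    if el.tag.endswith("relatedIdentifier"):
        print(el.get("relationType"), "->", el.text)
```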
For both types of metadata record, it is the publisher that retains any rights to modify them. Here however we encounter an interesting difference. The publishers of the data are, in this case, also the authors of the article! A modification to this record was made post-publication by this author so as to include the journal article identifier once it had been received from the publisher,[1] as in 2 above. Subsequently, these topics were discussed at a workshop on FAIR data, during which further pertinent articles[3], [4], [5] relating to the one discussed above[1] were shown in a slide by one of the speakers. Since this was deemed to add value to the context of the data for the original article, identifiers for these articles were also appended to the metadata record of the data.
This now raises the following questions:
- Should a metadata record be considered a living object, capable of being updated to reflect new information received after its first publication?
- If metadata records are an intrinsic part of both a scientific article and any data associated with that article, should authors be fully aware of their contents (if only as part of due diligence to correct errors or to query omissions)?
- Should the referees of such works also be made aware of the metadata records? It is of course enough of a challenge to get referees to inspect data (whether as SI or as FAIR), never mind metadata! Put another way, should metadata records be considered as part of the materials reviewed by referees, or something independent of referees and the responsibility of their publishers?
- More generally, how would/should the peer-review system respond to living metadata records? Should there be guidelines regarding such records? Or ethical considerations?
I pose these questions because I am not aware of much discussion around these topics; I suggest there probably should be!
‡ Actually, CrossRef and DataCite exchange each other’s metadata. However, each uses a somewhat different schema, so some components may be lost in this transit.
† JSON, which is not particularly human friendly.
References
- A. Barba, S. Dominguez, C. Cobas, D.P. Martinsen, C. Romain, H.S. Rzepa, and F. Seoane, "Workflows Allowing Creation of Journal Article Supporting Information and Findable, Accessible, Interoperable, and Reusable (FAIR)-Enabled Publication of Spectroscopic Data", ACS Omega, vol. 4, pp. 3280-3286, 2019. https://doi.org/10.1021/acsomega.8b03005
- S. Arkhipenko, M.T. Sabatini, A.S. Batsanov, V. Karaluka, T.D. Sheppard, H.S. Rzepa, and A. Whiting, "Mechanistic insights into boron-catalysed direct amidation reactions", Chemical Science, vol. 9, pp. 1058-1072, 2018. https://doi.org/10.1039/c7sc03595k
- T. Monaretto, A. Souza, T.B. Moraes, V. Bertucci‐Neto, C. Rondeau‐Mouro, and L.A. Colnago, "Enhancing signal‐to‐noise ratio and resolution in low‐field NMR relaxation measurements using post‐acquisition digital filters", Magnetic Resonance in Chemistry, vol. 57, pp. 616-625, 2018. https://doi.org/10.1002/mrc.4806
- D. Barache, J. Antoine, and J. Dereppe, "The Continuous Wavelet Transform, an Analysis Tool for NMR Spectroscopy", Journal of Magnetic Resonance, vol. 128, pp. 1-11, 1997. https://doi.org/10.1006/jmre.1997.1214
- U.L. Günther, C. Ludwig, and H. Rüterjans, "NMRLAB—Advanced NMR Data Processing in Matlab", Journal of Magnetic Resonance, vol. 145, pp. 201-208, 2000. https://doi.org/10.1006/jmre.2000.2071
Tags:Academic publishing, American Chemical Society, author, Business intelligence, Company: DataCite, CrossRef, data, Data management, DataCite, editor, EIDR, Information, Information science, JSON, Knowledge representation, Metadata repository, Records management, Technology/Internet, The Metadata Company
Posted in Chemical IT | No Comments »
Thursday, October 5th, 2017
We have heard a lot about OA or Open Access (of journal articles) in the last five years, often in association with the APC (Article Processing Charge) model of funding such OA availability. Rather less discussed is how the model of the peer review of these articles might also evolve into an Open environment. Here I muse about two experiences I had recently.
Organising the peer review of journal articles is often now seen as the single most important activity a journal publisher can undertake on behalf of the scientific community; the very reputation of the journal depends on this process being conducted responsibly, thoroughly and with integrity by the selected reviewers. Reviewers conduct this process voluntarily, mostly anonymously, without remuneration or recognition and often with short deadlines for completion. After one such process, I recently received an interesting follow-up email from the journal, suggesting I register my activity with Publons.com, a site set up to register and give non-anonymous credit for reviewing activities. I should say that Publons is a commercial company, set up in 2012 to “address the static state of peer-reviewing practices in scholarly communication, with a view to encourage collaboration and speed up scientific development”. Worthy aims, but like many a .com company nowadays, one might ask what the back-story might be. Many of the Internet giants (Google, Facebook, Twitter, etc.) do have back-stories, which often underpin their business models but which may only emerge years after their founding. With only a hazy idea of what Publons’ back-story might be, I went ahead and registered my reviewing activity.
After doing so, I then accessed my entry. You only learn that I have reviewed for a particular journal, but nothing about the actual process itself. I did not really think that this experiment had done much to encourage collaboration and speed up scientific development. It might be useful for early career researchers to get their name exposed however.

I can almost understand why the review itself might not be publicly displayed, but as a result you learn nothing about the factual basis of the review and whether it might have been conducted responsibly, thoroughly and with integrity. Instead, I now suspect that the presence of my name on this site might merely encourage other publishers to deluge me with requests for further (freely donated) refereeing.
Discussing this at lunch, a colleague (thanks Ed!) reminded me of the venerable journal Organic Syntheses. Here, authors submit a synthetic procedure and openly identified “checkers” are invited to repeat the procedure and comment on it. The two roles are kept separate (i.e. the checkers do not become co-authors), but the checkers do get credit for their activity. Thus if you view a typical recent entry[1] you will see a full biography and affiliation of the checkers given at the end, with footnotes often describing their own observations where these differ from those of the authors.
This set me thinking whether an open peer review process might also contain such an element of checking, as well as informed comment, nay opinion, about the article itself and the conclusions it makes. The opportunity arose when I was contacted by an author who was about to submit a computational article to a journal. This journal allowed open peer review. If I agreed to review, my name would be attached to the article if accepted for publication. I undertook this on the basis that I would use this review to conduct some limited checking of the computations and other assumptions underpinning the conclusions in the submitted article. I also wanted this open process to include the data on which my review was based. Most importantly if anyone wished to replicate my replication, the barriers to doing so should be as low as is possible. Shortly thereafter, I received a formal invitation from the journal and I set about my task. Crucially, all my own calculations supporting the review were archived in a data repository, albeit under embargo. In my cover letter I included the DOI for my data and the embargo access code, so that the authors (and the editor of the journal if they so wished) could inspect the data against which I wrote my review.
Then followed standard procedures, whereby the authors took my comments into consideration, revised the article and the final version was indeed accepted and published.[2] You will find the two referees/checkers listed, although unlike Organic Syntheses, there is no bibliographic information about them or their affiliation. I did ask the journal if they could at least link my ORCID identifier to my name, but that request was refused. If my name had been a common one, then disambiguating it into a unique identity could be a challenge. There was also no mechanism to associate my identity on the journal with any data on which I had based my review. Really, the only open aspect of this process was just my (potentially ambiguous) name, nothing else. No follow-up was received from the journal to add the review to Publons.
The next stage was to contact the author who had originally set the process under way, to ask whether they would mind my releasing the data on which my review had been based. They agreed, as they also did to my telling this story. The overall outcome is thus a published article with the reviewers (if not their reviews or any supporting evidence for those reviews) openly named. In this specific case, there is also an open dataset with a formal link back to the article in the form of a DOI (10.14469/hpc/2640; I suspect this aspect is unique, even precedent-setting), but one driven by the reviewer and not the journal. It would be nice to have bidirectional links between the article and the review data, but I do not know of any publisher currently operating such a mechanism (if anyone knows of one, please tell).
Now to the broader questions about the process described above. I think that the aspiration to encourage collaboration and speed up scientific development may indeed have been promoted by this association between the article and the data assembled by the reviewer. Whether the final article was improved as a result of the processes described here I will leave the authors to comment on if they wish. As with the checkers employed by Organic Syntheses, such a review process takes not just time, but resources. Resources that currently have to be freely donated by the reviewers and their host institution and which clearly cannot become expensive, time-consuming or onerous. As it happens, that was not the case here; my contributions were facilitated by my having sufficient expertise to perform the tasks I undertook quite quickly.
I will raise one more issue; that of whether to add my review to the dataset which is now openly available. In fact it is not included, in part because it related to the initially submitted version of the MS. The final MS version has been revised and so many of the comments in my review may only make sense if you have the first version to hand. It would be perhaps unreasonable to make the first drafts of manuscripts routinely available (although historians of science would probably love that!) alongside the reviews of that first draft. But I could also see a case for doing so if the community agreed to it. One to discuss for the future I think. There is also the associated issue of what should happen to any dataset associated with a review in the event that the final article is rejected and not accepted. Should the data remain permanently under embargo and the reviewer’s identity permanently anonymous? Perhaps opening up even such datasets might nevertheless encourage collaboration and speed up scientific development, but I fancy some would consider that a step too far!
References
- J. Zhu, "Preparation of N-Trifluoromethylthiosaccharin: A Shelf-Stable Electrophilic Reagent for Trifluoromethylthiolation", Organic Syntheses, vol. 94, pp. 217-233, 2017. https://doi.org/10.15227/orgsyn.094.0217
- L. Li, M. Lei, Y. Xie, H.F. Schaefer, B. Chen, and R. Hoffmann, "Stabilizing a different cyclooctatetraene stereoisomer", Proceedings of the National Academy of Sciences, vol. 114, pp. 9803-9808, 2017. https://doi.org/10.1073/pnas.1709586114
Tags:Academic publishing, article processing charge, author, Company: Facebook, Company: Publons, Company: Twitter, editor, Electronic publishing, Entertainment/Culture, Hybrid open access journal, Internet giants, OA, Open access, Organic Syntheses, Public sphere, Publishing, Scholarly communication, search engines, Social Media & Networking, Technology/Internet
Posted in Chemical IT, General | 5 Comments »
Monday, May 29th, 2017
The title here is taken from a presentation made by Ian Bruno from the CCDC at the recent conference on Open Science. It also touches on a recurring theme here: the issues that can arise in assigning identifiers to any given molecule.
The structure was represented as shown[1] by the original authors, in which the bonding from S to Sn is indicated with both solid lines (a bond) and dotted lines (an “interaction”).
Why would this matter? Well, to make any entry in the Cambridge Structural Database (CSD) findable (the F of FAIR), it has to be given a unique identifier. There are in general three such identifiers assigned by the CCDC:
- The Refcode, in this case XONHIS. These six- or seven-letter codes are historically the oldest and, at the outset at least, attempted where possible to carry some semantic hint of the compound in the name, even if only occasionally successfully.
- The CCDC deposition number, in this case 650011. This is the number that an author receives immediately upon deposition, and you often find these identifiers quoted in supporting information files.
- The DOI (digital object identifier), in this case 10.5517/ccptd3z, which can be used to view the structure even if access to the full CSD is not available to the user. In that sense, the DOI is the FAIRest of the first three of these identifiers.
- However, the CCDC reported that they are considering adding a fourth, very common identifier based on the InChI (International Chemical Identifier). This comes as a full string, from which the structure of the molecule can at least in part be inferred, together with a shortened, (almost) unique string (the InChIKey) which has the advantage of being “Googlable”. Both are helpfully FAIR.
It is this 4th identifier that is at issue here. InChIs are derived from atom connection tables; you need to define all bonds present in the molecule. And it is here that the dotted “bond”/”interaction” above becomes a problem. This is the representation shown in the CSD database, which reveals that all the Sn…S interactions are classified as “bonds”, along with some creative(!) representations of the C…S bonds.
So the InChI will very much depend on whether all the Sn…S contacts are termed as bonds or as interactions. To help clarify that, it is useful to show the typical range of lengths of such contacts. Below is a simple search for all Sn and S systems where the pair are either close in space (< 3.5Å) or have a bond specified between the two atoms.

The main cluster occurs at ~2.5Å, but there is some evidence of a second peak at about 3.0Å. The third distribution up to 3.5Å is probably a continuum of very weak dispersion interaction, which most molecules exhibit. The values for XONHIS are 2.521 and 2.996Å, which match the two clusters above.
So perhaps a quantum calculation can shed some light (DOI: 10.14469/hpc/2593)? The values on the right are the optimised bond lengths which are pretty similar to the crystal structure. On the left are the calculated Wiberg bond orders (B3LYP+D3BJ/Def2-TZVPP/chloroform calculation). These reveal both “bonds” have an order less than 1. The value of ~0.6 is probably not contentious, but it does graphically show that when a compound is indexed as having a “single bond” between two atoms, the quantitative bond order may be substantially less. What however would one make of a bond order of 0.214? Should it be classified as a bond, albeit a much weaker one than normal? Or should it instead simply be a rather strong “interaction” which is not classified as a bond? And perhaps one should have in mind the question “how sensitive is this result to the quantum mechanical procedure used?”
Why does this distinction matter? Well, the InChI algorithm is based on simple connectivity: are two atoms connected by a bond or not? There are no nuances here. At the moment, this decision can be made by an algorithm based on the distance between any atom pair (whether computed or measured), but more often I suspect it derives from a “molfile”, itself often produced from a human-drawn representation in a structure drawing program. It does rather boil down to the individual preferences of the human drawing the molecule. Due in part to such uncertainties, it was estimated that only 22% of structures in the CSD can be used to generate a reliable InChI. Hydrogen bonds are almost always classified as non-bonds, which means their presence is rarely systematically flagged during the indexing of the structures. Organometallics often pose some of the greatest representational problems (there are many others).
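To make the “bond or not” point concrete, here is a small RDKit sketch. It deliberately does not attempt the tin compound itself (RDKit’s valence model and InChI’s handling of metals complicate that); instead it uses ammonia borane, a stand-in where the same dative-bond question arises, and shows that declaring or omitting a single N–B bond in the connection table yields two entirely different InChIs. The molecule and the code are my own illustration, not anything taken from the CSD entry discussed above.

```python
# Illustration that an InChI depends entirely on which contacts are declared
# as bonds in the connection table. Ammonia borane stands in for the Sn...S
# case above: the dative N-B contact poses the same "bond or interaction?"
# question, but remains easy for RDKit to sanitise.
from rdkit import Chem

with_bond = Chem.MolFromSmiles("[NH3+][BH3-]")   # contact declared as a bond
without_bond = Chem.MolFromSmiles("N.B")         # same atoms, no bond declared

print(Chem.MolToInchi(with_bond))     # one connected molecule, H3N-BH3
print(Chem.MolToInchi(without_bond))  # two disconnected components
```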
I will end by observing another class of structure that I deal with, “reaction transition states”. As you might imagine these forms are full of pairs of atoms with ambiguous bond lengths and hence connectivity. We currently have no truly reliable method for assigning useful identifiers to them. So lots of challenges for the future then!
References
- R. Reyes-Martínez, R. Mejia-Huicochea, J.A. Guerrero-Alvarez, H. Höpfl, and H. Tlahuext, "Synthesis, heteronuclear NMR and X-ray crystallographic studies of two dinuclear diorganotin(IV) dithiocarbamate macrocycles", Arkivoc, vol. 2008, pp. 19-30, 2007. https://doi.org/10.3998/ark.5550190.0009.503
Tags:author, Bruno, chemical identifier, Digital Object Identifier, Ian Bruno, Identifier, InChI algorithm
Posted in Chemical IT | 2 Comments »
Friday, April 28th, 2017
Research data (and its management) is rapidly emerging as a focal point for the development of research dissemination practices. An important aspect of ensuring that such data remains fit for purpose is identifying what curation activities need to be associated with it. Here I revisit one particular case study associated with the molecular structure of a product identified from a photolysis reaction[1] and the curation of the crystallographic data associated with this study.
This particular dataset (CSD, dataDOI: 10.5517/cctnx5j) is associated with an article entitled “Single-Crystal X-ray Structure of 1,3-Dimethylcyclobutadiene by Confinement in a Crystalline Matrix”.[1] Data for crystal structures supporting a research article are required (at least in part) to be deposited into the Cambridge Structural Database (internal reference MUWMEX), where a significant level of curation is performed. Although the definition of the term curation has evolved over the last few years, here I take it to include the following:
- Identification of appropriate metadata describing the data. For molecules, this would include any identifiers such as the name of the molecule and the connectivities of the atoms constituting that molecule.
- The submission of this metadata to a suitable aggregator, such as DataCite, and its inclusion in any other databases associated with the data (a sketch of what such a record might contain follows this list). These two steps are part of the FAIR data guidelines,[2] covering the F (findable) and A (accessible).
- Performing any validation tests for the data that can be identified. For crystal structure data in CIF format, this is performed by the checkCIF utility and helps to ensure the I (interoperable) of FAIR. The R refers in part to the licenses under which the data can be re-used.
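To make the metadata in point 2 more tangible, below is a rough sketch of what a minimal DataCite-style record for a crystal-structure dataset might carry, written as a Python dictionary. The field names follow the core DataCite metadata schema as I understand it; the creator and title values are invented placeholders rather than the actual MUWMEX record, and only the two DOIs are taken from this post.

```python
# A hypothetical, minimal DataCite-style metadata record for a crystal-structure
# dataset. Field names follow the core DataCite schema (identifier, creators,
# titles, publisher, publicationYear, resourceType, subjects, relatedIdentifiers);
# placeholder values are marked as such and are NOT the real CSD record.
dataset_metadata = {
    "identifier": {"identifier": "10.5517/cctnx5j", "identifierType": "DOI"},
    "creators": [{"creatorName": "Crystallographer, A. N."}],   # placeholder
    "titles": ["CSD Entry MUWMEX: crystal structure dataset"],  # placeholder
    "publisher": "Cambridge Crystallographic Data Centre",
    "publicationYear": "2010",
    "resourceType": {"resourceTypeGeneral": "Dataset", "resourceType": "CIF"},
    "subjects": [
        # this is where curated chemical names and identifiers could be carried
        "assigned chemical name (as curated)",
        "chemical name synonym (as curated)",
        "InChI / InChIKey, if assignable",
    ],
    "relatedIdentifiers": [
        {   # link back to the article that the dataset supports
            "relatedIdentifier": "10.1126/science.1188002",
            "relatedIdentifierType": "DOI",
            "relationType": "IsSupplementTo",
        }
    ],
}
```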
On (it has to be said rare) occasions, these procedures can lead to a disparity between the authors’ conclusions, arrived at on the basis of their acquired data, and the metadata identified by the independent curators. This difference is most obviously illustrated in this case study by the chemical names inferred by the curation process for the structure represented by the data in the CSD:
- chemical name: “tetrakis(Guanidinium) 25,26,27,28-tetrahydroxycalix(4)arene-5,11,17,23-tetrasulfonate 1,5-dimethyl-2-oxabicyclo[2.2.0]hex-5-en-3-one clathrate trihydrate“
- chemical name synonym: “tetrakis(Guanidinium) tetra-p-sulfocalix(4)arene 1,3-dimethylcyclobutadiene carbon dioxide clathrate trihydrate“.
Only the synonym agrees with the title given by the original authors in their publication.[1] One might indeed strongly argue that these two names are not in fact synonyms, since they refer to quite different chemical structures with different atom connectivities. A search of the database for the sub-structure corresponding to 1,3-dimethylcyclobutadiene does not reveal any hits and so the information implied by this synonym is not recorded in the index created for the CSD database.
I asked the scientific editors of the CSD for some guidance on the curation procedures applied to crystal structure datasets and they have kindly allowed me to quote some of this.
- “In cases such as this, we as editors are sometimes faced with conflicting information and have to try our best to strike a balance between the data presented in the CIF, a published interpretation and our knowledge based on the information already in the CSD”.
- “In areas where there is a particular conflict between these, we often would include a comment (usually in the Remarks or Disorder field as appropriate)”. For this particular dataset, one finds the following under the Disorder field:
- “Under UV radiation the clathrated pyrone molecule converts to a disordered mixture of square-planar 1, 3-dimethylcyclobutadiene and rectangular-bent 1, 3-dimethylcyclobutadiene in van der Waals contact with a carbon dioxide molecule. The ratio of the square-planar to rectangular-bent 1, 3-dimethylcyclobutadiene clathrate is modelled with occupancies 0.6292:0.3708”.
- It is not entirely obvious however whether this last comment originates from the original authors or from the data curators. It does not resolve the difference between the assigned chemical name and the indicated chemical name synonym.
- “In the case of MUWMEX, I think that the editor produced a diagram (below) which seems chemically reasonable based on the crystallographic data with which we were provided and tried to cover the situation regarding disorder, van der Waals contacts etc in the ‘Disorder’ field. At this point, it is left to the CSD user to decide for themselves.”

We have arrived at a point where the CSD user must indeed decide what the species described by this dataset actually is. Ideally, the best recourse would be to acquire the original data in full and repeat the crystallographic analysis. This is an aspect of the curation of crystallographic data that is not conducted as part of the current processes, and it would require, as a minimum, a superset of the deposited data known as the hkl (structure factor) information to be present. Again, to quote the CSD scientific editors:
- “With regard to your question: Is there any mechanism in the Conquest search to identify structures where the hkl information is present? I understand that it is not currently possible to do this in ConQuest. It is, however, possible … to access structure factor data (where available) using Access Structures.”
For MUWMEX, the hkl information is not present in the CSD dataset, and in 2010, when the structure was published, it would have had to be obtained directly from the authors. By 2016, however, its presence in deposited datasets was becoming far more common. It is worth pointing out that even the hkl information is not the complete data recorded for the experiment; that is represented by the original image files recording the X-ray diffractions. The latter are hardly ever available as FAIR data, even nowadays.
I hope I have here illustrated at least some of the challenging aspects of curating scientific data and the issues that can arise when derived metadata (in this case the name and the atom connectivities of a molecule) reveal conflicts with the original interpretations. This for an area of chemistry where both the data deposition and its curation is a very mature subject, having operated for ~52 years now. It is still a process that requires the intervention of skilled curators of the data, but perhaps even more importantly it reveals the need to identify even more strictly what the provenance of the interpretations is. Should the CSD curation rest merely at the stage of teasing out and flagging inconsistencies and allowing the user to then take over to resolve the conflicts? Should it be more active, in re-analyzing data for each entry where conflicts have been detected? Perhaps the latter is not practical now, but it might be in the near future. What is certain is that with increasing availability of FAIR data these sorts of issues will increasingly come to the fore. And not just for the very well understood case of crystallographic data but for many other types of data.
References
- Y. Legrand, A. van der Lee, and M. Barboiu, "Single-Crystal X-ray Structure of 1,3-Dimethylcyclobutadiene by Confinement in a Crystalline Matrix", Science, vol. 329, pp. 299-302, 2010. https://doi.org/10.1126/science.1188002
- M.D. Wilkinson, M. Dumontier, I.J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J. Boiten, L.B. da Silva Santos, P.E. Bourne, J. Bouwman, A.J. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C.T. Evelo, R. Finkers, A. Gonzalez-Beltran, A.J. Gray, P. Groth, C. Goble, J.S. Grethe, J. Heringa, P.A. ’t Hoen, R. Hooft, T. Kuhn, R. Kok, J. Kok, S.J. Lusher, M.E. Martone, A. Mons, A.L. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van Schaik, S. Sansone, E. Schultes, T. Sengstag, T. Slater, G. Strawn, M.A. Swertz, M. Thompson, J. van der Lei, E. van Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wolstencroft, J. Zhao, and B. Mons, "The FAIR Guiding Principles for scientific data management and stewardship", Scientific Data, vol. 3, 2016. https://doi.org/10.1038/sdata.2016.18
Tags:assigned chemical name, author, chemical name, chemical name synonym, chemical names, chemical structures, editor, indicated chemical name synonym, Knowledge, radiation, Research, Scientific method, Technology/Internet, X-ray
Posted in Chemical IT, crystal_structure_mining | 5 Comments »
Thursday, March 30th, 2017
In an era when alternative facts and fake news afflict us, the provenance of scientific data becomes ever more important, especially if that data is available as open access and exploitable by others, for valid scientific reasons but potentially also by those with other motives. Here I consider the audit trail that might serve to establish data provenance in one typical situation in chemistry, the acquisition of NMR instrumental data.
Here I describe how such data is generated in my department; details may vary elsewhere.
- The prospective user of the NMR service is allocated a service ID. In our case, that ID relates to the research group rather than to individual researchers. This ID is parochial; it does not reference any other information about the user in the institution. Only the service manager has the information to associate this ID with real users, and this information is not normally distributed.
- When a sample is submitted, this ID is used to create a new folder containing the data as a sub-folder of the group ID and located on the NMR data servers.
- The dataset itself‡ contains a number of files that hold an audit trail (with names such as audita.txt and auditp.txt) recording the fields ##AUDIT TRAIL= $$ (NUMBER, WHEN, WHO, WHERE, PROCESS, VERSION, WHAT); a minimal parsing sketch appears after this list. Typically, none of these files propagates the original user ID under which the data was collected; to do so would require a programmatic connection between the local authentication systems and the spectrometer software used, a connection that is normally missing. Thus the first break in the provenance trail.
- In principle other audit trails can be inferred from these files, such as the unique identity of the instrument provided by its manufacturer. Further information such as e.g. the probe used to collect the data (probes can be readily changed over) or any calibration data used in setting up the instrument for the data collection are by and large not recorded. To my knowledge, although an instrument can have a unique serial number, such serial numbers of swappable components such as probes are not recorded by the collection software. Thus the second break in the provenance trail.
- This data then needs to be processed by further software. In this case we use the MestreNova system for this task. Each dataset has editable assigned properties; below I show those that can be associated with the spectrum (accessed with MestreNova using Edit/Properties). All this comes from the information collected by the instrument. The user’s identity can be inserted into the “title” field, the display of which is off by default.

- There is also a section for parameters (a synonym for which might be metadata), accessed in this program from View/Tables/Parameters. If Author had been entered as a parameter in the dataset by the spectrometer software, the Mnova document would retrieve that information. Equally, an ORCID identifier for the author, entered at the time of data collection and thus stored in the dataset, could be read by Mnova, stored and displayed if configured to do so. It would be fair to say, however, that this option is rarely if ever systematically implemented by NMR instrument data-collection software and so is never propagated to the data-processing software (as highlighted in red below). Thus a third break in the provenance trail.
There is also an alternative, and this time formal, metadata field that can be populated, by default as shown below with the type of spectrum and nucleus. These properties are not controlled in the sense of only allowing terms that are present in a specified dictionary; the jargon for such control is a metadata schema. No schema is used here, since dissemination of this information is not intended; the software accepts whatever information it is given.
There are thus several opportunities to collect the identity of the experimenter and so attribute provenance to the collected data, but this depends very much on the will of researchers, institutions or publishers to enforce specific policies around it. The fourth break in the provenance trail.
- The dataset can then be uploaded (DOI: 10.14469/hpc/1291), at which stage provenance can finally be added using the ORCID credentials of the person publishing the dataset, who of course may or may not be the person who actually recorded the data! The full metadata for this specific collection can be seen at data.datacite.org/10.14469/hpc/1291. Or to put it another way, this is the first point in the provenance chain where the metadata is controlled by a schema and is also discoverable in a standard programmatic manner, i.e. the preceding link. The provenance is now formally associated with the ORCID identifier using the DataCite metadata schema. You should be aware that a local policy† is that access to the repository at https://data.hpc.imperial.ac.uk is only allowed by cross-authentication with http://orcid.org/ using the user’s ORCID. This identifier is then automatically propagated to the metadata held at e.g. data.datacite.org/10.14469/hpc/1095. Currently, however, none of the metadata originally recorded in either the instrumental file set or the processed MestreNova file is forwarded on to the metadata record held at DataCite; again a loss of information, and potentially of provenance.
- The peer-reviewed article resulting from the interpretation of this data can, however, be associated with the provenance introduced in the previous stage; see data.datacite.org/10.14469/hpc/1267 and the IsReferencedBy property.
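To make the audit-trail files in point 3 of the list above more tangible, here is a minimal parsing sketch. It assumes the Bruker/JCAMP-like layout implied by the quoted field list, namely a ##AUDIT TRAIL= header followed by parenthesised records whose text fields are wrapped in angle brackets; real files vary between instruments and software versions, so this is illustrative rather than a validated reader.

```python
# Illustrative parser for NMR audit-trail files such as auditp.txt, assuming
# a JCAMP-like layout of parenthesised records with <angle-bracketed> fields,
# matching the field list (NUMBER, WHEN, WHO, WHERE, PROCESS, VERSION, WHAT).
import re
from pathlib import Path

FIELDS = ["NUMBER", "WHEN", "WHO", "WHERE", "PROCESS", "VERSION", "WHAT"]

def read_audit_trail(path):
    text = Path(path).read_text(errors="replace")
    entries = []
    # each record resembles "(  1,<when>,<who>,<where>,<process>,<version>,<what>)"
    for record in re.findall(r"\(\s*\d+\s*,.*?\)\s*$", text, re.DOTALL | re.MULTILINE):
        number = record.strip("()\n ").split(",", 1)[0].strip()
        values = [number] + re.findall(r"<(.*?)>", record, re.DOTALL)
        entries.append(dict(zip(FIELDS, values)))
    return entries

for entry in read_audit_trail("auditp.txt"):
    # WHO is typically an instrument login, not the researcher or an ORCID:
    # the first break in the provenance trail discussed above.
    print(entry.get("NUMBER"), entry.get("WHEN"), entry.get("WHO"))
```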
Now imagine if there was a common thread in all the stages of acquiring, processing and publishing this scientific data based on the ORCID.
- Providing an ORCID could be made an essential requirement of access to the instrument.
- This information would be propagated to the dataset …
- by inclusion in one or more of the audit trail files.
- At this stage, further persistent identifiers associated with the instrument manufacturer could be added, which help identify not only the instrument used, but sub-components such as the changeable probe. This would allow access to any calibration curves or probe sensitivity and other aspects.
- The ORCID and other relevant information could be picked up by the software used to convert the data into spectra and propagated into the metadata containers for this software …
- where its use is controlled by a specified schema.
- At this stage, the ORCID and information such as the nucleus recorded, the sample temperature etc can be propagated on to the final metadata records.
- And the reader of the article describing this work would have a formally defined provenance audit trail they could follow back to the start of the experiment or forward to a published article. In this case, the data claims provenance (acquired from peer review) from the article, but it should also work in reverse with the article claiming provenance from the data on which it is based. The indexing of this bidirectional exchange is one of the exciting features that we should see emerging from CrossRef (holders of metadata about articles) and DataCite (holders of metadata about research data) in the near future.
We are clearly a little way from having the infrastructures described above for establishing such data audit trails. To do so will require cooperation from instrument manufacturers, at least in the example as charted above, as well as from researchers, institutions, publishers, peer-reviewers and funding bodies. The first step would be to ensure that all scientists who intend to collect, process and publish data claim an ORCID. That remark is directed specifically at undergraduate, postgraduate and post-doctoral researchers, not just at their supervisor or their PI (principal investigator). At a point when the discussion about alternative facts, and perhaps even alternative data, risks a general loss of confidence in science, we should be pro-active in establishing trust in the scientific processes.
‡ You can see an example obtained by this process at DOI: 10.14469/hpc/1095
† This requirement is a strong driver for the uptake of ORCID amongst our student population.
Tags:Acquisition, Archival science, author, collection software, Company: NMR, data, Data management, data processing software, Evidence law, instrument data collection software, local authentication systems, Mestrenova, MestreNova system, Nuclear magnetic resonance, principal investigator, Provenance, Scientific method, service manager, spectrometer software, supervisor, Technology/Internet, Terminology
Posted in Chemical IT | 2 Comments »
Friday, November 25th, 2016
Another conference, a Cambridge satellite meeting of OpenCon, and I quote here its mission: “OpenCon is a platform for the next generation to learn about Open Access, Open Education, and Open Data, develop critical skills, and catalyze action toward a more open system of research and education” targeted at students and early career academic professionals. But they do allow a few “late career” professionals to attend as well!
I could only attend the morning session, for which the keynote speaker was Erin McKiernan.
The presentation was entitled How open science helps researchers succeed, presented as an exploration of an article of the same name written by Erin and colleagues and published in eLife.[1] Erin has created a support page at http://whyopenresearch.org to augment the presentation and it is well worth a visit.
One striking point made was the assertion that Open publications get more citations!

As with many metrics of the impacts of the science publication processes, a citation itself lacks the context of why it was made (see this post for further discussion), but the expectation is that a citation is “good”. From my perspective as a chemist, I did wonder why molecular science was missing from the graphic above. Do open chemistry publications also get more citations?
Which brings me to another point made during the talk, the increasingly controversial aspect of (journal) impact factors and the pressure placed on early career researchers to publish only in those with “high” impact factors, and for their careers to be assessed at least in part based on these and the anticipated “h-index”. The audience was indeed encouraged to go visit http://www.ascb.org/Dora/ (Declaration on Research Assessment, or Putting science into the assessment of research). Have you signed it yet?
Another manifestation of the modern trend to analyse impact metrics is the site Impactstory.org. This is a scripted resource that starts from your ORCID identifier and (optionally) your Twitter account (yes, apparently tweets matter!) to derive a more complex, alternative metric of an individual’s impacts. I had not tried this one before, and so I submitted my ORCID and my Twitter account, and watched as the system went off to http://orcid.scopusfeedback.com (Scopus is an Elsevier product) to attempt to create my profile. It ground away for quite a while, reporting initially that I had no publications! This was followed by an unexpected error; I did not get my impact back! But this experiment served to highlight one aspect that was discussed at the meeting: data and other research objects. The graphic above refers only to the citation of journal articles; it does not yet include the citation of data. However, ORCID DOES include data and research objects as works. And because the granularity of my data and research objects is very fine (one molecule = one work), I have quite a few. In fact ~200,000! ORCID gets to about 8000 before it gives up. I suspect http://orcid.scopusfeedback.com queries ORCID, gets back ~8000 entries and crashes. No doubt the programmer tasked with implementing this resource did not anticipate that any individual could accumulate 8000+ entries! Or factor in that the vast majority of these would of course not be journal articles but data. If the site gets back to me about the crash I experienced, I will update here.
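As an aside, the granularity point can be checked directly against the ORCID public API rather than through a third-party aggregator. The sketch below assumes the v3.0 public works endpoint and that its JSON response carries one “group” element per work; the ORCID iD shown is a placeholder to be replaced with a real one.

```python
# Minimal sketch: count the works attached to an ORCID record via the public
# API, to illustrate how fine-grained records (one molecule = one work) can
# run to many thousands of entries. Endpoint and JSON layout are assumptions
# about the v3.0 public API; the iD below is a placeholder.
import requests

ORCID_ID = "0000-0000-0000-0000"  # placeholder: substitute the record of interest

resp = requests.get(
    f"https://pub.orcid.org/v3.0/{ORCID_ID}/works",
    headers={"Accept": "application/json"},
    timeout=120,
)
resp.raise_for_status()
groups = resp.json().get("group", [])
print(f"{ORCID_ID} has {len(groups)} works registered")
```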
Simon Deakin was the next speaker, with (open) data as the focus and the worries many researchers have about being scooped by others who re-use their open data without proper attribution. The discussion teased out that if data is properly deposited, it will indeed have full associated metadata and in particular a date stamp that could help protect an author’s interests.
It was really good to meet so many early career researchers who espouse the open ethos. Perhaps, in 20 years’ time, another graphic akin to the one above might demonstrate that open researchers get more promotions!
References
- E.C. McKiernan, P.E. Bourne, C.T. Brown, S. Buck, A. Kenall, J. Lin, D. McDougall, B.A. Nosek, K. Ram, C.K. Soderberg, J.R. Spies, K. Thaney, A. Updegrove, K.H. Woo, and T. Yarkoni, "How open science helps researchers succeed", eLife, vol. 5, 2016. https://doi.org/10.7554/elife.16800
Tags:Academia, author, chemist, City: Cambridge, Company: Twitter, ELife, Erin McKiernan, keynote speaker, Max Planck Society, programmer, Simon Deakin, Social Media & Networking, speaker, Technology/Internet, Wellcome Trust
Posted in Chemical IT, General | 3 Comments »
Tuesday, August 16th, 2016
This week the ACS announced its intention to establish a “ChemRxiv preprint server to promote early research sharing”. This was first tried quite a few years ago, following the example of, especially, the physicists. As I recollect, the experiment lasted about a year, attracted few submissions and even fewer of high quality. Will the concept succeed this time, in particular as promoted by a commercial publisher rather than by a community of scientists (as was the original physicists’ model)?
The RSC (itself a highly successful commercial publisher) has picked up on this and run its own commentary. You will find quotes from yours truly there, along with Peter Murray-Rust, a long time ardent promoter of community driven open science. One interesting aspect is that the ACS runs around 50 journals, and the decision on whether each will accept preprints for publication will (shortly = next few weeks) be made by the individual editors. I wonder if the eventual list of those supporting the project will bring any surprises (bets on J. Am. Chem. Soc. preprints anyone)?
But I want to pick up on the declared aspiration “to promote early research sharing”. Here I couple research sharing with data sharing: if you share your research, you should also share the data resulting from that research. We are now entering a new era of data sharing (in part as a result of mandates by various funding bodies), and so one has to ask whether a preprint server will encourage people to create and share FAIR data (data which is findable, accessible, interoperable and re-usable) as a model to replace the current one of “supporting information” held in enormous PDF files (mostly unFAIR on at least three counts). This question is indeed posed in the RSC commentary. What I would like to see happen are projects such as that described here, which create what were described as “first class research objects” and which I think amply fulfil the criteria of being FAIR. So, will ChemRxiv preprint servers help promote such FAIR data sharing as part of early research sharing? We will find out soon.
The ACS supports OA (Open Access) sharing of articles, provided the authors pay (or arrange payment of) the appropriate APC, or article processing charge. These charges are complex, being subject to various discounts (for example, depending on whether you as an author are an ACS member), but are generally not insignificant (> $1000). I wondered whether preprints might be subject to an APC, and so I asked the ACS. The response was “we don’t anticipate any submission or usages fees at this time”. I take that to mean free at point of submission, and free at point of readership, “at this time”.
Finally, let me now summarise, as I understand it, the current family of “research publications”:
- The preprint
- The final author version as submitted to a journal
- The “version of record” (VoR) as published by the journal
- Any FAIR published data associated with the article
All four of these are attempts at “research sharing”. Each may reside in a different location, and each may have its own DOI. And of course we cannot easily know how much overlap there is between them. Thus, how might 1-3 differ in terms of the story or “narrative” of scientific claims? Does 4 agree with or support 1-3? Does 4 agree with any data subsets contained in 1-3? If keeping abreast of the current research literature is a challenge, imagine having to cope with, and reconcile, up to four versions of each “publication”!
Lots of food for thought here. We have not heard the last of these themes.
Tags:Academia, Academic publishing, article processing charge, author, Data publishing, Data sharing, food, Grey literature, Open access, Open science, PDF, Peter Murray-Rust, pre-print server, Preprint, preprint server, Public sphere, Publishing, Scholarly communication, Technology/Internet
Posted in Chemical IT | 1 Comment »
Saturday, June 20th, 2015
The university sector in the UK has quality inspections of its research outputs conducted every seven years, going by the name of REF or Research Excellence Framework. The next one is due around 2020, and already preparations are under way! Here I describe how I have interpreted one of its strictures; that all UK funded research outputs (i.e. research publications in international journals) must be made available in open unrestricted form within three months of the article being accepted for publication, or they will not be eligible for consideration in 2020.
At the outset, I should say that one infrastructure to help researchers adhere to the guidelines is being implemented in the form of the Symplectic system. This allows a researcher to upload the final accepted version of a manuscript. At Imperial College, a digital repository called Spiral serves this purpose and also acts as the front end for collecting informative metadata to enhance discoverability. The final accepted version is then converted by the publisher into a version-of-record. This contains styling unique to the publisher and the content is subjected to further scrutiny by the authors as proof corrections. In an ideal world, these latter changes should also be faithfully propagated back to the final accepted version, as would all the supporting information associated with the article. Since most authors do not exactly enjoy the delights of proof corrections, this final reconciliation of the two versions may not always be assiduously undertaken.
I became concerned about the existence of two versions of any given scientific report, and that the task of ensuring total fidelity in the content of both versions may make unwelcome demands on the author’s time. Much better if the publisher could grant permission for the author to archive the version-of-record into a digital repository.
Some experiments were needed, and I decided to start them in reverse, by archiving my oldest publications. Since Symplectic now provides a system to do this, I began by using it. Symplectic identifies each publisher’s policies for archival, of which the most liberal are known as ROMEO GREEN. To quote from the definition, this colour allows the author to “archive pre-print and post-print or publisher’s version/PDF”. In an afternoon I had processed most of my ROMEO GREEN articles. You know how it is sometimes: you do not read the fine print! And so the library soon informed me that archival of ROMEO GREEN articles was in fact only permitted on the author’s “personal web page”. Spiral, as an institutional repository, does not apparently constitute a personal web page for me, and so none of my Symplectic submissions could be accepted for archival there.
Time to rethink the experiment. Firstly, I very much wanted the reprints to be held by a proper digital repository rather than a conventional web page. Why? I wanted my reprints to adhere as much as possible to FAIR: findable, accessible, interoperable and re-usable. Well, at least the first two of those (the last two relate more to data). A repository is designed to hold metadata in a formal and standards-based manner and metadata helps achieve FAIR. So I asked the Royal Society of Chemistry (as a ROMEO GREEN publisher) whether a personal web page hosted on a digital repository would qualify. I was soon informed that I had proposed a neat solution here, and they couldn’t see an issue.
Now, all I had to do was find a repository where I could create such a personal web page. The chemistry department at Imperial College has for ten years hosted a DSpace repository called SPECTRa,[1] which already has the functionality for individuals to create personal collections. I had also picked up on the increasing attention being given to Zenodo, like the World-Wide Web itself an offshoot of CERN (of Large Hadron Collider fame) and born from the need for researchers to archive the outputs of their researches more permanently. These outputs include software, videos, images, presentations, posters, publications and (most obviously for CERN) datasets. I thought I would include them in my experiment as well. The results are summarised below.
The last line of this table includes a link to another design feature of a repository, facilitating the ability to harvest the content. The ContentMine project (“The right to read is the right to mine!“) has shown how such harvesting of facts from the literature can be automated on a vast scale, and (IMHO) represents an example of those disruptive innovations that have the power to change the world forever. It also enshrines the idea that scientific facts funded by the public purse should be capable of being openly liberated from their containers. A harvestable repository seems an ideal container for achieving this.
My experiment is part of what might be seen as the increasingly subtle interplay between:
- scientific authors, whose creative endeavour the research is, and without whom scientific publishers would not exist
- publishers who create a business model from the content freely given them by authors but also (especially if a commercial publisher) need to be accountable to their shareholders.
- the funding councils, many of whom now wish the outcomes of the research they fund to be openly available to all
- the local libraries/administrators who have to adhere to/enforce all the rules contractually handed down to them by publishers whose direct customers they are, but who also need to serve their community of readers and authors.
- researchers who would rather do research than fret about the above, and who would rather spend limited resources doing that research rather than diverting an increasing amount of their attention into the above system.
- readers, who need unimpeded access to the research endeavours of others, but often have little influence on the policies and actions of all the other stakeholders, since they are NOT considered customers (of the publishers).
- etc. etc.
My experiment was in part designed to explore these rules, their interpretations and their boundaries. For the time being at least I seem to have found an arrangement that allows me to distribute versions-of-record of my own work, thanks to a generous and far-sighted learned society publisher. Watch this space!
References
- J. Downing, P. Murray-Rust, A.P. Tonge, P. Morgan, H.S. Rzepa, F. Cotterill, N. Day, and M.J. Harvey, "SPECTRa: The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories", Journal of Chemical Information and Modeling, vol. 48, pp. 1571-1581, 2008. https://doi.org/10.1021/ci7004737
- H.S. Rzepa, and B.C. Challis, "The Mechanism Of Diazo-Coupling To Indoles And The Effect Of Steric Hindrance On The Rate Limiting Step", Zenodo, 1975. https://doi.org/10.5281/zenodo.18758
- H.S. Rzepa, "Hydrogen transfer reactions of indoles", 1974. http://doi.org/10044/1/20860
- H.S. Rzepa, "Hydrogen Transfer Reactions Of Indoles", Zenodo, 1974. https://doi.org/10.5281/zenodo.18777
- H.S. Rzepa, "C 25 H 34 Cl 1 N 3 O 1", 2015. https://doi.org/10.14469/ch/191342
- H.S. Rzepa, A. Lobo, M.S. Andrade, V.S. Silva, and A.M. Lourenco, "Chiroptical properties of streptorubin B – the synergy between theory and experiment.", 2015. https://doi.org/10.5281/zenodo.18632
Tags:Academia, Academic publishing, Archival science, author, Data management, Digital library, EPrints, Institutional repository, Knowledge, Knowledge representation, Library science, metadata, Open access, PDF, personal web page, Preprint, Publishing, Repository, researcher, ROMEO GREEN, Science, Technology/Internet, United Kingdom, web server
Posted in Chemical IT | No Comments »
Sunday, May 10th, 2015
Blogging in chemistry remains something of a niche activity, albeit one with a variety of different styles. The most common is commentary or opinion on the scientific literature or on conferences, serving to highlight what the author considers interesting or important developments. There are even metajournals that aggregate such commentaries. The question therefore occasionally arises: should blogs aspire to any form of permanence, or are they simply creatures of their time?
In this blog, as you might have noticed, I take a slightly different tack. One focus is on exploring, perchance in more detail than might be found in the standard text-book, some of the dogmas of chemistry. It happens that occasionally when writing a conventional scientific article, I find myself wishing to cite such sources. This of itself raises interesting issues (such as should one cite what might be considered material that has not been peer-reviewed in the conventional manner) but the most important would be whether one should cite evanescent sources. So this brings me to the topic of this post; can a post be archived in a sense that achieves a greater perceived permanence? Nowadays, permanence tends to be associated with a digital object identifier, or DOI. So one can boil this question down to: can one assign a DOI to a blog post?
Well, if you came to this post via the main page, you may indeed have spotted that some do have a DOI. This is an experiment I have been running with an organisation known as The Winnower, who provide a WordPress extension to archive any individual post and assign it a (CrossRef) DOI. The archived version also includes metadata that points back to the original post.
This archival is not yet perfect. In its current state it does not (yet) capture:
- Comments on any post (which could be considered a form of open peer review)
- Enhancements such as the links to Jmol/JSmol that I associate with some of the posts
- The ORCID identifier, which adds a layer of additional provenance.
- We of course do not yet know what life expectancy archiving organisations will themselves achieve (could it be 100 years, for example?).
It does capture the citation list when there is one, and since I include citations to my data sources (for the computations performed in support of many of my posts) the archive is I think accordingly rendered more valuable.
What brought this post on? Well, the Journal of Chemical Education has put out a call for articles on chemical information for a special issue. I decided to contribute by aggregating some of my teaching-related posts; indeed, individually they could perhaps only have appeared here, as opposed to via a more traditional means of dissemination such as the JCE journal itself. And I wanted to cite them using the DOI rather than simply the URL of the post. It’s an experiment, and one which I do not yet know if anyone else will try. That in some ways is the point of a blog; it is an interesting experimental vehicle!
Tags:author, chemical information, Digital Object Identifier, the JCE journal, the Journal of Chemical Education
Posted in Chemical IT | 5 Comments »
Sunday, October 13th, 2013
I reminisced about the wonderfully naive but exciting Web period of 1993-1994. This introduced us to server-log analysis for the first time, and to hits-on-a-web-page. One of our first attempts at crowd-sourcing and analysis was to run an electronic conference in heterocyclic chemistry and to look at how the attendees visited the individual posters and presentations by analysing the server logs.

You can read all about that analysis here. Below is one interesting graphic, showing the 24-hour distribution of accesses. Remember, this was before Google and its analytics even existed (and yes, we were also doing Google-like searches before they did).

But let me get to the actual point of this post. A decade or so ago, all universities in the UK were asked to undertake a quality review exercise of their research outputs. One of the metrics of such outputs is the scientific publication, and each research group leader had to collect their most important four articles published in the previous few years and submit them (as paper) to a review panel. This poor panel was faced with a mountain of paperwork (literally!) when they arrived to do their job. It was soon decided that a better (electronic) system had to be devised. So now we have a product called Symplectic (which as it happens originated in the physics department here at Imperial College), which tirelessly gathers such outputs. More accurately, it gathers the meta-data for research publications, since most publishers do not allow actual reprints to be so harvested! And when it finds a new article, it informs its author, and asks them to check that the meta-data is accurate.
So it was a few days ago that I received such an alert. I checked the meta-data (adding in fact some which associates the scientific work with a particular resource, our High-Performance-Computing unit, and also the NMR systems here) but then the following thumbnail‡ caught my eye. The wonderful Symplectic system had computed this for me.

This I had to see. Expanded, it shows as follows. An altmetric measures attention, and attention (however transient) is apparently itself measured by tweets, Facebook, news outlets, science blogs, Mendeley and CiteULike.

Well, things have certainly moved on from the days of analysing server-logs! Now, would an aspiring tenure-track young scientist, presenting an altmetric score of 28 to their head of department, expect to get tenure on this basis? Of course, we are back to the old hoary chestnut: is attention necessarily good? You cannot tell from the above whether we have produced worthy science, or science to be scorned.
Well, the above represents a 20 year period in the evolution of science and how it is communicated. Whether this represents positive progress I leave you to decide. And if one of your altmetric scores is > 28, you have done better than us!
‡Does the icon look familiar? See here.
Tags:aspiring tenure-track young scientist, author, Google, head of department, Imperial College, research group leader, United Kingdom, Web-period
Posted in Chemical IT, General | No Comments »