Technology/Internet « Henry Rzepa's blog

Posts Tagged ‘Technology/Internet’

Managing (open) NMR data: a working example using Mpublish.

Monday, August 1st, 2016

In March, I posted from the ACS meeting in San Diego on the topic of Research data: Managing spectroscopy-NMR, and noted a talk by MestreLab Research on how a tool called Mpublish in the forthcoming release of their NMR analysis software MestreNova could help. With that release now out, the opportunity arose to test the system.

I will start with a reminder that NMR data associated with a published article is (or should be) open and free: one should not need a subscription to the journal to access it (although one might in order to find it). Now, NMR data as it emerges from a spectrometer is highly sophisticated, comprising a collection of (sometimes) binary proprietary files containing the measured free induction decays (FID). Turning this raw data into an interpretable NMR spectrum, the visual form of the data that so appeals to human beings, is non-trivial. This requires what may be highly sophisticated software, which in turn means that it may be a commercial product. Of course there are also examples of non-commercial open software packages that are best-of-breed; indeed in its early life-cycle MestreNova was known as MESTREC before becoming a commercial product. Could one achieve the benefits of both open and fully functional NMR data, with no loss from the original instrument, coupled with the ability to apply top-quality software for its analysis in an open manner? This is a demonstration of how Mpublish achieves this.

Invoke the URL data.datacite.org/chemical/x-mnpub/10.14469/hpc/1087 from a browser
This action queries the metadata deposited with DataCite for the doi 10.14469/hpc/1087 and retrieves the first instance of any file associated with that dataset that has the format type chemical/x-mnpub. You can directly view this metadata by invoking just data.datacite.org/10.14469/hpc/1087 where you can find both mnpub and mnova formats listed. A command such as data.datacite.org/chemical/x-mnpub/10.14469/hpc/1087 allows the file retrieval to be incorporated into automated workflows based just on the doi and the media type desired. Note my parenthetical comment above about finding data; here you only need its doi to retrieve it!
The URL above downloads a small text file with the suffix .mnpub which contains in essence two components:
- A URL pointing directly to an .mnova file at the repository for which the doi has been issued
- A signature key used to verify that the public key of the publisher (the data repository in this instance) was counter-signed by Mestrelab.
If you now download the application program and install it (for the purpose of this demonstration, ignore any requests to license the program and use it unlicensed) and open the .mnpub file with it, you should get the result below. The application program checks the signature key and, if it is valid, proceeds to download the full data file (a .mnova file in this case) and to analyze and display it within the program. The data is fully active; it can be manipulated and analysed. Notice in the picture below that the red arrow points to the state of the license, in this case not present.
It is also possible to apply this procedure to the raw data as it emerges from the (Bruker) spectrometer, and compressed into a .zip archive. The MestreNova software will automatically process the contents by applying various default parameters, although the result may not correspond exactly to that present in e.g. the equivalent .mnova file (which may have had specific parameters applied).
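The retrieval step above lends itself to scripting. The following is a minimal sketch that only builds the URL pattern quoted in this post (media type followed by the doi, on data.datacite.org). The function name is hypothetical, the http scheme is an assumption (the post quotes the address without one), and no request is actually made, since the service may no longer respond at this address.

```python
def datacite_media_url(doi: str, media_type: str = "") -> str:
    """Build a data.datacite.org URL from a DOI and an optional media type.

    Hypothetical helper: it only reproduces the URL pattern quoted in the post.
    """
    doi = doi.strip().strip("/")
    media_type = media_type.strip().strip("/")
    base = "http://data.datacite.org"  # assumption: the post gives no scheme
    if media_type:
        # Request the first file of the given media type for this dataset.
        return f"{base}/{media_type}/{doi}"
    # Without a media type, the URL points at the metadata view.
    return f"{base}/{doi}"


if __name__ == "__main__":
    print(datacite_media_url("10.14469/hpc/1087", "chemical/x-mnpub"))
    print(datacite_media_url("10.14469/hpc/1087"))
```

A workflow would pass the resulting URL to whatever HTTP client it already uses and save the response as a .mnpub file.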

It is my hope that anyone who records NMR data and processes it using software such as MestreNova will now consider using the mechanism above to accompany their submitted articles, rather than just automatically pasting a static image of the spectrum into a PDF file as "supporting information". This is part of what is meant by "managed research data" (RDM).

One cannot help but note that many types of scientific instrument nowadays come with bespoke software for analysing the data they produce. Very often this software is unavailable to anyone who has not purchased the instrument itself. To make the data available to others, the processed data and its visual interpretation often have to be reduced, with much consequent information loss, to a lowest common denominator format such as Acrobat/PDF. Here we see a mechanism for avoiding any such information loss whilst enabling, for that dataset only, the full potential for (re)analysing the data. It will be interesting to see if other examples of this model or its equivalent emerge in the near future.

Tags:Acrobat, analysis software, chemical, Chemistry, City: San Diego, format type chemical/x-mnpub, media type, Mestrenova, non-commercial open software packages, Nuclear magnetic resonance, Nuclear magnetic resonance spectra database, Nuclear magnetic resonance spectroscopy, PDF, public key, Science, Scientific method, spectroscopy, Technology/Internet

Data-free research data management? Not an oxymoron.

Tuesday, May 24th, 2016

I occasionally post about "RDM" (research data management), an activity that has recently become a formalised and essential part of the research process. I say recently formalised, since researchers have of course kept research notebooks recording their activities and their data since the dawn of science, but not always in an open and transparent manner. The desirability of doing so was revealed by the 2009 "Climategate" events. In the UK, Climategate was apparently the catalyst which persuaded the funding councils (such as the EPSRC, the Royal Society, etc) to formulate policies which required all their funded researchers to adopt the principles of RDM by May 2015 and in their future research. An early career researcher here, anxious to conform to the funding body instructions, sent me an email a few days ago asking about one aspect of RDM which got me thinking.

The question related to the divide between data as a separate research object (and which therefore has to be managed), and data as an inseparable part of the article narrative, which is of course ostensibly managed by the journal publication processes. Such data may often be the description of a process rather than simply tables of numbers or graphs. In chemistry it may include chemical names and chemical terms as part of an experimental procedure. For one nice illustration of such embedded data, go look at the chemical tagger page. Here the data is blending with the semantics, and the two are not easily separated. So, when such separation is not easily achieved, should the specific processes required by RDM as illustrated in the five bullet points below actually be followed?

1. Specify a data management plan to be followed, as for example points 2-5 below.
2. Decide upon a location for your data, separated into one for "live" or working data (the purpose simply being to ensure it is properly backed up) and the other for a sub-set of formally "published data" which has to be available for at least ten years after its publication.
3. Use 2 to gather metadata (see 6-14 below) and in return get a DOI representing the location of the combined metadata + data, from a suitable registration authority such as DataCite.
4. Quote this DOI(s) in the article describing the results of analysing the data and presenting hypotheses, and conversely once the article itself is allocated its own DOI from a registration authority such as CrossRef, update the metadata in item 3 so as to achieve a bidirectional link between the data and its narrative (and we assume that DataCite and CrossRef will also increasingly exchange the metadata they each hold about the items).
5. Add both the data and the article DOIs to any institutional CRIS or current research information system (parenthetically, I regard this last stipulation as rather redundant if items 3 and 4 are working effectively, but it's a good interim measure whilst the overall system matures).

So, should step 2 be included if the data itself is inextricably intertwined with the narrative and cannot be separated? The slightly surprising advice I would suggest is yes! And the answer is that it IS possible to generate metadata (data about the, possibly entwined, data) which CAN be processed in such a step. What forms would such metadata take?

6. Identification of the researcher(s) involved. This would nowadays take the form of an ORCID (Open Researcher and Contributor ID).
7. Identification of the hosting institution where the data has been produced. There is currently no equivalent to an ORCID for institutions, but it is very likely to come in the future.
8. A date stamp formalising when the (meta)data is actually deposited.
9. A title for the project being described. Here we see a blurring between the narrative/article and the data; a title is the shortest possible description of the narrative/article, and it may also apply to the data object(s) or it could have its own title.
10. A slightly fuller abstract of the project being described. Here we see further blurring between the narrative and data objects.
11. One can include "related identifiers", in particular the DOIs of any other relevant articles that might have been published which may expand the context of the data, and also the DOIs of any other relevant datasets which may have been allocated in step 2 above.
12. It is also beneficial to include "chemical identifiers". These can take the form of InChI strings and InChI keys, which allow discretely defined molecular objects which were the object of the research to be tracked and which relate to both the narrative and any other data objects.
13. If specific software has been used to analyse data, it too can be included as a "related identifier" (e.g. [1]).
14. Potentially at least, if a well-defined instrument has been involved, it too could be included with its own "related identifier". With both items 13 and 14, other issues may need addressing, such as versioning etc, but this no doubt will be sorted in due course.
15. etc.

So items such as 6-14 can be collected and sent to e.g. DataCite with a DOI received in return as part of item 2 of the RDM processes. No "pure" data need be involved, only metadata. Nonetheless such metadata can only increase the visibility and discoverability of the research, as illustrated in how such metadata can be searched for the components described above.
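To make the idea of a "data-free" deposition concrete, here is a minimal sketch of such a metadata-only record as a plain Python data structure. The class and field names are hypothetical and are not the DataCite schema element names. The example values are placeholders borrowed from identifiers quoted elsewhere in these posts, not a real deposition.

```python
from dataclasses import dataclass, field, asdict
from datetime import date


@dataclass
class MetadataRecord:
    """Hypothetical container for metadata items 6-12 in the list above."""
    creators_orcid: list          # item 6: ORCIDs of the researchers
    institution: str              # item 7: hosting institution
    deposited: str                # item 8: date stamp (ISO 8601)
    title: str                    # item 9
    abstract: str                 # item 10
    related_identifiers: list = field(default_factory=list)   # items 11, 13, 14
    chemical_identifiers: dict = field(default_factory=dict)  # item 12

    def missing_fields(self):
        """Return the names of fields that are still empty."""
        return [name for name, value in asdict(self).items() if not value]


# Placeholder values for illustration only.
record = MetadataRecord(
    creators_orcid=["0000-0002-8635-8390"],
    institution="Imperial College London",
    deposited=date(2016, 5, 24).isoformat(),
    title="Example project title (placeholder)",
    abstract="",
    related_identifiers=["10.5281/zenodo.19272"],
    chemical_identifiers={"InChIKey": "CULPUXIDFLIQBT-UHFFFAOYSA-N"},
)

if __name__ == "__main__":
    print(record.missing_fields())
```

No measured data appears anywhere in the record; a check such as `missing_fields()` could be run before the metadata is sent to a registration agency.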

References

H.S. Rzepa, "KINISOT. A basic program to calculate kinetic isotope effects using normal coordinate analysis of transition state and reactants.", 2015. https://doi.org/10.5281/zenodo.19272

Tags:Academic publishing, chemical identifiers, chemical names and chemical terms, chemical tagger page, CrossRef, Data management, Data management plan, DataCite, Identifiers, ORCiD, RDM, researcher, Royal Society, Singular spectrum analysis, Technical communication, Technology/Internet

Collaborative FAIR data sharing.

Sunday, April 17th, 2016

I want to describe a recent attempt by a group of collaborators to share the research data associated with their just published article.[1]

I am here introducing things in a hierarchical form (i.e. not necessarily the serial order in which actions were taken).

The data repository selected for the data sharing is described by (m3data) doi: 10.17616/R3K64N[2]
A collaborative project collection was established on this repository (doi: 10.14469/hpc/244[3]). This data collection has some of the following attributes:
Its metadata is sent here: https://search.datacite.org/ui?&q=10.14469/hpc/244 where it can be queried for other details.
The project collaborators are all identified by their ORCID, used to obtain further individual information about the researchers. This information is also propagated to the metadata sent to DataCite.
In the section labelled associated DOIs there is a link to the recently published peer-reviewed article, which itself cites the data via doi: 10.14469/hpc/244 and which thus establishes a bidirectional link between the article and its data.
Also in the associated DOIs section are other DOIs (to two figures and two tables) held in a separate location. One example (doi: 10.14469/hpc/332[4]) illustrates the original type of data sharing we started about 10 years ago. This form has been variously called a "WEO" or Web-enhanced object (by the ACS) or interactivity boxes (RSC, etc). In such WEOs, we wrap the data into an interactive visual appearance using Jmol or JSmol software. The data itself is directly available to the reader using the Jmol export functions (right mouse click in the visual window).
- In this specific example the WEO has been assigned its DOI using the repository noted above.[2]
- We have in the past also used Figshare[5] for this purpose, see e.g. 10.6084/m9.figshare.1181739^‡
- The WEO can itself reference a more complete set of data used to create the visual appearance, for example data that allows the wavefunction of the molecule to be computed, doi: 10.6084/m9.figshare.2581987.v1[6]. In this instance this is held on the Figshare[5] repository.
The collection has another section labelled Members. These are individual datasets associated with the collection and held on the SAME repository as the collection itself. In this case, there are five such members, two of which are listed below:
1. 10.14469/hpc/281[7] contains a variety of other data such as outputs from an IRC (intrinsic reaction coordinate), energy profile diagrams and ZIP archives of other calculations.
2. 10.14469/hpc/272[8] itself contains five members, one of which is e.g.
  - 10.14469/hpc/267[9] which contains a ZIP archive with NMR data (see here for how this might be packaged in the future) and a file for a GPC (chromatography) instrument.
  - This last item also contains a new section labelled Metadata, which includes e.g. the InChI key and InChI string for the molecule whose properties are reported.
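The hierarchy just described (a collection whose members may themselves contain members) can be pictured as a small tree. The sketch below is illustrative only: the dictionary layout and the `walk` function are my own assumptions, not the repository's data model, and it includes only the members explicitly listed in this post.

```python
# Only the DOIs named in the post; the real collection has further members.
COLLECTION = {
    "doi": "10.14469/hpc/244",
    "members": [
        {"doi": "10.14469/hpc/281", "members": []},
        {
            "doi": "10.14469/hpc/272",
            "members": [
                {"doi": "10.14469/hpc/267", "members": []},
            ],
        },
    ],
}


def walk(node, depth=0):
    """Yield (depth, doi) pairs for a collection and all nested members."""
    yield depth, node["doi"]
    for child in node.get("members", []):
        yield from walk(child, depth + 1)


if __name__ == "__main__":
    for depth, doi in walk(COLLECTION):
        print("  " * depth + doi)
```

Each node has its own DOI, which is what lets contributors be credited and date-stamped separately.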

If this mode of presenting data seems a little more complex than a single monolithic PDF file, it's because it's designed for:

- collaboration between scientists, potentially at different locations and institutions.
- attribution of provenance/credit for the individual items (via ORCID).
- separate date stamping by the various contributors.
- providing bi-directional links between data and publications.
- holding what we call FAIR (findable, accessible, interoperable and reusable) data, rather than just data encapsulated in a PDF file.
- collecting, storing and sending metadata for aggregation in a formal way, i.e. to DataCite using a formal schema to render the metadata properly searchable.

Thus 10.14469/hpc/244 represents our most complex attempt yet at such collaborative FAIR data sharing with multiple contributors. The tools for packaging many of the datasets are still quite limited (see again here) and the design is still being optimised (call it α). When the repository[2] has been more extensively tested, we intend to make it available as open source for others to experiment with. And of course, when this happens the source code too will have its own DOI!

^‡A refactoring of the Figshare site in December 2015 meant that the DOI no longer points directly to the WEO, and you have to follow a manually inserted link on that page to see it.

References

C. Romain, Y. Zhu, P. Dingwall, S. Paul, H.S. Rzepa, A. Buchard, and C.K. Williams, "Chemoselective Polymerizations from Mixtures of Epoxide, Lactone, Anhydride, and Carbon Dioxide", Journal of the American Chemical Society, vol. 138, pp. 4120-4131, 2016. https://doi.org/10.1021/jacs.5b13070
Re3data.Org., "Imperial College Research Computing Service Data Repository", 2016. https://doi.org/10.17616/r3k64n
C. ROMAIN, "Chemo-Selective Polymerizations Using Mixtures of Epoxide, Lactone, Anhydride and CO2", 2016. https://doi.org/10.14469/hpc/244
H. Rzepa, "Table S8: Comparison of two different basis sets for selected intermediates for CHO/PA ROCOP.", 2016. https://doi.org/10.14469/hpc/332
Re3data.Org., "figshare", 2012. https://doi.org/10.17616/r3pk5r
P. Dingwall, "Gaussian Job Archive for C6H10O", 2016. https://doi.org/10.6084/m9.figshare.2581987.v1
C. ROMAIN, "Figure 9, Figure S18, Figure S19: ROCOP of PA/CHO + IRC", 2016. https://doi.org/10.14469/hpc/281
C. ROMAIN, "Table 1 : Polymerizations Using Lactone, Epoxide, and CO2", 2016. https://doi.org/10.14469/hpc/272
C. ROMAIN, "Table 1, entry 1 : Polymerizations Using Lactone, Epoxide, and CO2", 2016. https://doi.org/10.14469/hpc/267

Tags:10.17616, Academic publishing, DataCite, energy profile diagrams, Figshare, Identifiers, Open science, ORCiD, PDF, Scholarly communication, Technical communication, Technology/Internet, Web-enhanced object

Metametadata: data about data about (chemical) data.

Saturday, April 16th, 2016

Scientists are familiar with the term data, at least in a scientific or chemical context, but appreciating metadata (the prefix meta meaning "after" or "beyond") is slightly more subtle, in the sense of using it to mean data about data. The challenge lies in clarifying where the boundary between data and its metadata lies and in specifying and controlling the vocabulary used for these metadata descriptions. Items in a chemical metadata dictionary might include e.g. subject classifications such as Organic Molecular Chemistry or identifiers such as InChIkey. But what could metametadata be? Here I briefly show some examples by way of illustration.

Let me start by defining a data repository as a store of both data and the metadata describing it. The metadata is to be exposed in a standard manner which allows it to be aggregated by other agencies. Nowadays, it is becoming common to identify such a data object together with its metadata using a persistent identifier, or DOI. But to decide if any particular repository and the data objects contained therein is generally useful to you, you need information about the metadata itself. Technically, this is defined using a schema[1] describing the metadata (which might e.g. identify any dictionaries used); hence metametadata. Now you need to store the metametadata and so I introduce the concept of a registry which does this. This metametadata object is itself assigned a DOI^‡ and here I list these DOIs for a personal selection of some chemically oriented examples, in this case deriving from the largest registry of research data repositories, re3data.org. You can search for your own entry at their site: http://service.re3data.org/search.

Data repository	The repository metametadata DOI^♣
Figshare	10.17616/R3PK5R[2]
Zenodo	10.17616/R3QP53[3]
Cambridge Structural Database	10.17616/R36011[4]
Crystallography Open Database	10.17616/R37S31[5]
Oxford University Research Archive	10.17616/R3Q056[6]
Open Notebook Science	10.17616/R3859D[7]
Usefulchem	10.17616/R3Z89N[8]
Chemotion	10.17616/R34P5T[9]
Chemspider	10.17616/R38P4P[10]
Chemical Database Service	10.17616/R36P42[11]
Imperial College HPC data repository.	r3d100011965[12],[13]
Imperial College SPECTRa repository.[14]	10.17616/R30316[15]

Not all of the repositories listed in the table above assign formal DOIs to their data collections, meaning that the metadata for their entries cannot be aggregated in a searchable manner using e.g. search.datacite.org/ui (or search.datacite.org/api for the machine version). Currently, the metametadata does not fully carry this information, an aspect which I gather will be rectified in a future revision of the re3data schema.[1]
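As a small illustration of how the table can feed an automated workflow, the sketch below turns a few of the DOIs listed above into resolver links. The dictionary and function names are hypothetical. The lower-casing simply mirrors the form used in the reference list of this post and relies on DOI names being case-insensitive.

```python
# A subset of the table above: repository name -> re3data DOI.
REPOSITORY_DOIS = {
    "Figshare": "10.17616/R3PK5R",
    "Zenodo": "10.17616/R3QP53",
    "Chemotion": "10.17616/R34P5T",
}


def doi_url(doi: str) -> str:
    """Return the https://doi.org/ resolver link for a DOI (lower-cased)."""
    return "https://doi.org/" + doi.strip().lower()


if __name__ == "__main__":
    for name, doi in REPOSITORY_DOIS.items():
        print(f"{name}\t{doi_url(doi)}")
```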

Importantly, both metadata and (repository) metametadata can be searched using APIs (application programming interfaces), ensuring that the entire flow of meta information can be subject to automated software analysis rather than just visual inspection by a human. This should allow a rich and open infrastructure for handling research objects or data to be built up using hierarchical metadata. The examples above indeed show that the chemical space is already the largest component of the Natural Sciences space.

Although the edifice is still largely in its infancy, already I think we can start to see an alternative open approach emerging to "Googling" for data, or the even older traditional bespoke (i.e. non-open) services offered by commercial human-based abstractors of chemical metadata.

^‡This DOI is information about the metametadata, and hence it is metametametadata, or m3data. Sorry!

^♣The citations at the foot of this post are generated entirely automatically (by a WordPress plugin called Kcite) from the m3data associated with each entry, i.e. the DOI listed. Were the persistent identifier for the entry ever to be changed, this would propagate automatically to the citation, unlike the static entries in the table.

References

J. Rücknagel, P. Vierkant, R. Ulrich, G. Kloska, E. Schnepf, D. Fichtmüller, E. Reuter, A. Semrau, M. Kindling, H. Pampel, M. Witt, F. Fritze, S. Van De Sandt, J. Klump, H. Goebelbecker, M. Skarupianski, R. Bertelmann, P. Schirmbacher, F. Scholze, C. Kramer, C. Fuchs, S. Spier, and A. Kirchhoff, "Metadata Schema for the Description of Research Data Repositories", 2015. https://doi.org/10.2312/re3.008
Re3data.Org., "figshare", 2012. https://doi.org/10.17616/r3pk5r
Re3data.Org., "Zenodo", 2013. https://doi.org/10.17616/r3qp53
Re3data.Org., "The Cambridge Structural Database", 2013. https://doi.org/10.17616/r36011
Re3data.Org., "Crystallography Open Database", 2013. https://doi.org/10.17616/r37s31
Re3data.Org., "Oxford University Research Archive", 2014. https://doi.org/10.17616/r3q056
Re3data.Org., "ONSchallenge", 2013. https://doi.org/10.17616/r3859d
Re3data.Org., "UsefulChem", 2014. https://doi.org/10.17616/r3z89n
Re3data.Org., "chemotion", 2013. https://doi.org/10.17616/r34p5t
Re3data.Org., "ChemSpider", 2013. https://doi.org/10.17616/r38p4p
Re3data.Org., "Chemical Database Service", 2012. https://doi.org/10.17616/r36p42
https://doi.org/
H. Rzepa, "Imperial College High Performance Computing Service Data Repository Metadata Schema", 2016. https://doi.org/10.14469/hpc/382
J. Downing, P. Murray-Rust, A.P. Tonge, P. Morgan, H.S. Rzepa, F. Cotterill, N. Day, and M.J. Harvey, "SPECTRa: The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories", Journal of Chemical Information and Modeling, vol. 48, pp. 1571-1581, 2008. https://doi.org/10.1021/ci7004737
Re3data.Org., "SPECTRa Project", 2013. https://doi.org/10.17616/r30316

Tags:Academic publishing, automated software analysis, BASE, chemical context, Chemical Database Service, chemical metadata, chemical metadata dictionary, chemical space, City: Cambridge, Data dictionary, Data management, Identifiers, Knowledge representation, programmer, Registry of Research Data Repositories, search.datacite.org/api, SPECTRa, Technology/Internet

Publishing embargoes.

Wednesday, April 13th, 2016

Publishing embargoes seem a relatively new phenomenon, probably starting in areas of science when the data produced for a scientific article was considered more valuable than the narrative of that article. However, the concept of the embargo seems to be spreading to cover other aspects of publishing, and I came across one recently which appears to take such embargoes into new and uncharted territory.

One example (there are many others) of embargoes continuing to operate in the era of open science and open data relates to crystallographically derived coordinates for macromolecules. Biomolecular structures are allowed to be embargoed for a maximum of one year before becoming openly available or “released” (considered a friendlier term than embargo). A more recent phenomenon is that of embargoes on press releases, which may be prepared by authors and/or publishers to accompany the appearance of any article considered especially newsworthy. The publisher will then request that the press release is only released to coincide with the actual publication time and date of the article itself. Both of these types of embargo are more or less accepted by both parties. But in the last five years or so, new types of embargo have been introduced and it is these I want to discuss here.

1. The self-archive or “green open access” version of an article, in the form of the last author version of an accepted manuscript prior to copy-editing and other operations by a publisher. Such Green OA versions are now a mandatory requirement from funders (in the UK), arising from the need to conduct a “REF” or research excellence framework assessment of all (UK) universities every seven years or so. In order to allow assessors and funding councils unencumbered access to these research outputs, the authors must self-archive their publications in a suitable institutional repository. In general therefore, there should always exist two versions of any scientific paper authored within these guidelines, the AV (author version) and VoR (Version of Record, held by the publisher, and carrying the guarantee of peer review). Publishers now embargo author versions until the VoR version has been published, and sometimes even up to 18 months beyond this period.
2. The “supporting information” or SI embargo. This is closely related to the crystallographic data embargo noted above, but it applies in general to most other data and information associated with an article. Until very recently, most SI was in fact handled by the publisher themselves, and so it was released at the same time as the article. Since it is becoming more common to deposit data and SI in a separate repository, some publishers mandate that the release dates of this material must not precede the article itself. Deposition of such data has also become a mandatory requirement from (UK) funders since May 2015, and I have blogged about such “research data management” often here. In effect, both the scientific article and the data supporting it achieve their own DOIs or persistent digital identifiers, allowing easy and independent access to either the article OR its data. In fact, assigning such a DOI has a more subtle effect; creating a DOI means that metadata describing the object is also created and then aggregated by the agency issuing the DOI such as CrossRef or DataCite. Importantly, one should note that SI which is handled purely by the publisher will not have its own separate DOI and it will not have its own metadata. The data metadata for example can include the DOI for the article, and vice versa. I have shown examples of the utility of such metadata for data in an earlier post.
3. So now we come to the most recent embargo, which has surfaced since around May 2015, as increasingly data has become a first class object in its own right with its own DOI and importantly its own metadata. There is now evidence that some publishers are requesting that this very metadata about data is also subjected to an embargo, not to be released before the article which makes use of that data is itself released. So data can be deposited in “dark form” prior to a publication, but the metadata (which carries the date stamp and provenance for the deposition) may have to be “dark” or embargoed. Actually, this is not yet very common; for example I asked the Royal Society of Chemistry what their policy was, with the reply “the Royal Society of Chemistry wouldn’t require metadata about the data files to be embargoed”.

We live in an era where the very careers of researchers can be determined by their claim to priority about scientific discoveries. The date stamps for priority continue to be largely controlled and issued by publishers and some may decide that it will be in their business interests to extend their control to data. Perhaps they may even wish to control all aspects of publication including the data and its metadata, acting as self-proclaimed research facilitators.

At this moment, this has not happened; both data and its metadata can remain open and FAIR. Which is where I think we should go in the future in the interests of open science itself.

Tags:Academic publishing, Embargo, Open access, Publishing, Royal Society of Chemistry, Technology/Internet, Uncharted, Uncharted Territory

Global initiatives in research data management and discovery: searching metadata.

Monday, March 7th, 2016

The upcoming ACS national meeting in San Diego has a CINF (chemical information division) session entitled "Global initiatives in research data management and discovery". I have highlighted here just one slide from my contribution to this session, which addresses the discovery aspect of the session.

Data, if you think about it, is rarely discoverable other than by intimate association with a narrative or journal article. Even then, the standard procedure is to identify the article itself as being of interest, and then to dig out the "supporting information", which normally takes the form of a single paginated PDF document. If you are truly lucky, you might also get a CIF file (for crystal structures). But such data has little life of its own outside of its parent, the article. Put another way, it has no metadata it can call its own (metadata is data about an object, in this case research data). An alternative is to try to find the data by searching conventional databases such as CAS, Beilstein/Reaxys or CSD, and there of course the searches can be very precise. But someone has to pay the bills for such accessibility.

We are now starting to see quite different solutions to finding data (the F in FAIR data, the other letters representing accessibility, interoperability and re-usability). These solutions depend on metadata being a part of the solution from the outset, rather than any afterthought produced as a commercial solution. The collection of metadata is part of the overall process called RDM, or research data management, perhaps even the most important part of it. In exchange for identifying metadata about one's data, one gets back a "receipt" in the form of a persistent identifier for the data, more commonly known as a DOI. The agency that issues the DOI also undertakes to look after the donated metadata, and to make it searchable. The table below shows eight searches of such metadata, one example of how to acquire statistics relating to the usage of the data and one search of how to find repositories containing the data.

Search queries enabled by the use of metadata in data publication
#	Search query^*	Instances retrieved:
1	http://search.datacite.org/ui?q=alternateIdentifier:InChIKey:*	InChI key
2	http://search.datacite.org/ui?q=alternateIdentifier:InChI:*	InChI identifier
3	http://search.datacite.org/ui?q=alternateIdentifier:InChIKey:CULPUXIDFLIQBT-UHFFFAOYSA-N	InChI key CULPUXIDFLIQBT-UHFFFAOYSA-N
4	http://search.datacite.org/ui?q=ORCID:0000-0002-8635-8390+alternateIdentifier:InChIKey:*	ORCID 0000-0002-8635-8390 AND (boolean) InChI key.
5	http://search.datacite.org/ui?q=ORCID:0000-0002-8635-8390+alternateIdentifier:InChI:InChI=1S/C9H11N5O3*	ORCID 0000-0002-8635-8390 AND (boolean) InChI string 1S/C9H11N5O3 with the * wildcard.
6	http://search.datacite.org/ui?q=has_media:true&fq=prefix:10.14469	Has content media^‡ for Publisher 10.14469 (Imperial College)
7	http://search.datacite.org/ui?q=format:chemical/x-*	Data format type chemical/x-*
8	http://search.datacite.org/api?&q=prefix:10.14469& fq=alternateIdentifier:InChIKey:& fl=doi,title,alternateIdentifier& wt=json&rows=15 http://api.labs.datacite.org/works?q=prefix:10.14469+AND+alternateIdentifier:InChIKey:	First 15 hits in JSON format, batch query mode
9	http://stats.datacite.org/?fq=datacentre_facet:"BL.IMPERIAL – Imperial College London"	resolution statistics for publisher 10.14469 (Imperial College) per month
10	http://service.re3data.org/search?query=&subjects[]=31 Chemistry	Research data repository search for Chemistry (135 hits)

^‡In this instance the three MIME media types are chemical/x-wavefunction, chemical/x-gaussian-checkpoint and chemical/x-gaussian-log. See [1] for chemical MIME (Multipurpose Internet Mail Extensions).

Anyone familiar with the standard ways of finding data (CAS, CSD, Reaxys) will appreciate that the above does not yet have the finesse to find e.g. sub-structures of chemical structures, synthetic procedures or molecular properties. My including it here is primarily to show some of the potential such systems have, and to remark particularly that the batch query capability of this infrastructure could indeed be used in the future to construct much more sophisticated systems. Oh, and to the end-user at least, the searches shown above do not require institutional licenses to use. Both the data and its metadata are free, mostly with a CC0 or CC BY 3.0 license for re-use (the R of FAIR).
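As an illustration of the batch query capability, here is a minimal Python sketch of how a query in the style of row 8 of the table might be assembled. The endpoint and the field names (prefix, alternateIdentifier, fl, wt, rows) are taken from the table; the function name build_batch_query and the trailing * wildcard in the filter (borrowed from rows 1 and 2) are my assumptions. The sketch only constructs the URL and does not contact the service, which may well have changed since these queries were recorded.

```python
from urllib.parse import urlencode

# Endpoint as listed in row 8 of the table; it may no longer respond.
SEARCH_API = "http://search.datacite.org/api"


def build_batch_query(prefix, identifier_type="InChIKey", rows=15):
    """Build a batch-mode search URL for one publisher prefix (hypothetical helper).

    The '*' wildcard in the filter is an assumption copied from rows 1 and 2.
    """
    params = {
        "q": f"prefix:{prefix}",
        "fq": f"alternateIdentifier:{identifier_type}:*",
        "fl": "doi,title,alternateIdentifier",  # fields to return
        "wt": "json",                           # response format
        "rows": rows,                           # number of hits requested
    }
    # Keep ':', ',' and '*' readable instead of percent-encoding them.
    return f"{SEARCH_API}?{urlencode(params, safe=':,*')}"


url = build_batch_query("10.14469")
print(url)
```

Passing the resulting URL to any HTTP client would, in principle, return the first 15 hits as JSON, ready to be fed into a further script.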

If more of interest related to this topic emerges at the ACS session, I will report back here.

References

H.S. Rzepa, P. Murray-Rust, and B.J. Whitaker, "The Application of Chemical Multipurpose Internet Mail Extensions (Chemical MIME) Internet Standards to Electronic Mail and World Wide Web Information Exchange", Journal of Chemical Information and Computer Sciences, vol. 38, pp. 976-982, 1998. https://doi.org/10.1021/ci9803233


LEARN Workshop: Embedding Research Data as part of the research cycle

Monday, February 1st, 2016

I attended the first (of a proposed five) workshops organised by LEARN (an EU-funded project that aims to ...Raise awareness in research data management (RDM) issues & research policy) on Friday. Here I give some quick bullet points relating to things that caught my attention and/or interest. The program (and Twitter feed) can be found at https://learnrdm.wordpress.com where others' comments can also be seen.

Henry Oldenburg, founder member and first secretary of the Royal Society, was the first Open Scientist.
About 100 people attended the workshop. Of these ~3-5 identified themselves as researchers creating data, and the rest comprised research data managers, administrators, librarians, publishers (but see below) etc. Many were new to their posts.
Not publishing scientific data should become recognised as scientific malpractice.
Central libraries should pro-actively disperse their knowledge to data scientists in departments.
If a scientist is concerned that openly publishing their data might give advantage to their competitors, they are urged to counteract this by "being cleverer than the others".
The three great bastions of open science are (a) Open Data, (b) Open access articles and (c) doing science openly. Examples of this third category include open notebook science (ONS), a form notably pioneered by Jean-Claude Bradley. One attribute of ONS was noted as "no insider knowledge".
Learned societies should endow medals for Open Science.
(Some) publishers are reinventing themselves as Research Facilitators.

The plenaries are all well worth dipping into (certainly the video and in some cases all the slides are scheduled to appear).

If you are a researcher (undergraduate students, PGs, PDRAs, early career researchers and academics) you should immediately track down your local evangelist/expert in RDM and ask what the local infrastructures are (or will be shortly built).


A visualization of the anomeric effect from crystal structures.

Thursday, August 27th, 2015

The anomeric effect is best known in sugars, occurring in sub-structures such as RO-C-OR. Its origins relate to how the lone pairs on each oxygen atom align with the adjacent C-O bonds. When the alignment is 180°, one oxygen lone pair can donate into the C-O σ* empty orbital and a stabilisation occurs. Here I explore whether crystal structures reflect this effect.

Scheme

The torsion angles along each O-C bond are specified, along with the two C-O distances. All the bonds are declared acyclic, and the usual constraints of R < 5%, no disorder and no errors are specified.

You can see from the plot below that the hotspot occurs when both RO-CO torsions are ~65°. From this we will assume that the two (unseen)^‡ lone pairs at any one of the oxygens are distributed approximately tetrahedrally around each oxygen, and if this is true then one of them must by definition be oriented ~ 180° with respect to the same RO-CO bond (the other is therefore oriented -60°). This allows it to be antiperiplanar to the adjacent C-O bond and hence interact with its σ* empty orbital. So the hotspot corresponds to structures where BOTH oxygen atoms have lone pairs which interact with the adjacent O-C anti bond.
There is a tiny cluster for which both RO-CO torsions are ~180° and hence neither oxygen has an antiperiplanar lone pair.
Only slightly larger are clusters where one torsion is ~65° and the other ~180°, meaning that only one oxygen has an antiperiplanar lone pair.
A plot of the two C-O lengths indeed shows an overall hotspot at ~1.40Å for both distances. If the search is filtered to include only torsions in the range 150-180°, the hotspot value increases to 1.415Å for both. If one torsion is restricted to 40-80° and the other to 150-180° the hotspot shows one C-O bond is about 0.012Å shorter than the other.
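The geometric argument above can be sketched in a few lines of Python. This is an idealised model only: it assumes, as in the text, that the two lone pairs sit approximately tetrahedrally, which in a Newman projection along the O-C bond places them ±120° from the R group. The function names and the 30° window used to call a lone pair antiperiplanar (matching the 150-180° range of the search filter) are my own choices for illustration, not part of the crystal structure search itself.

```python
def wrap(angle):
    """Map an angle in degrees onto the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def lone_pair_torsions(ro_co_torsion):
    """Idealised torsions of the two oxygen lone pairs about the O-C bond.

    Assumes the lone pairs lie +/-120 degrees from the R substituent.
    """
    return tuple(wrap(ro_co_torsion + offset) for offset in (120.0, -120.0))


def has_antiperiplanar_lone_pair(ro_co_torsion, window=30.0):
    """True if either lone pair lies within `window` degrees of 180."""
    return any(abs(lp) >= 180.0 - window
               for lp in lone_pair_torsions(ro_co_torsion))


def count_anomeric_interactions(torsion_1, torsion_2):
    """Number of oxygens (0, 1 or 2) with an antiperiplanar lone pair."""
    return sum(has_antiperiplanar_lone_pair(t) for t in (torsion_1, torsion_2))


for pair in [(65.0, 65.0), (65.0, 180.0), (180.0, 180.0)]:
    print(pair, count_anomeric_interactions(*pair))
```

Under these assumptions a torsion of 65° puts one lone pair at -175° (antiperiplanar) and the other at -55°, whereas a torsion of 180° puts them at -60° and +60°, so the three clusters in the plot correspond to two, one and zero anomeric interactions respectively.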

Scheme

I also include a further constraint, that the diffraction data must be collected below 140K. The hotspot moves to ~ 55/60° indicating values free of some vibrational noise.

Scheme

Interestingly, replacing oxygen with nitrogen reveals relatively few examples of the effect (C(NR₂)₄ is an exception). Replacing O by divalent S produces only 13 hits, with the surprising result (below) that in all of them only one S sets up an anomeric interaction. Arguably, the number of examples is too low to draw any firm conclusions from this observation.

Scheme

^‡Most diffractometers measure low angle scattering of X-rays by high density electrons. These are the core electrons associated with a nucleus rather than the valence electrons associated with lone pairs. Hence very few positions of valence lone pairs have ever been crystallographically measured.


Single Figure (nano)publications, reddit AMAs and other new approaches to research reporting

Wednesday, August 5th, 2015

I recently received two emails, each with a subject line along the lines of "new approaches to research reporting". The traditional 350 year-old model of the (scientific) journal is undergoing upheavals at the moment with the introduction of APCs (article processing charges), a refereeing crisis and much more. Some argue that brand new thinking is now required. Here are two such innovations (and I leave you to judge whether that last word should have an appended ?).

To set the scene for the first, I will quote the abstract: “The single figure publication is a novel, efficient format by which to communicate scholarly advances. It will serve as a forerunner of the nano-publication, a modular unit of information critical for machine-driven data aggregation and knowledge integration.”[1] The kernel of this suggestion is (again I quote) “We offer the idea of the micro-publication unit, the single figure publication (SFP), to provide scholars with a real-world, manageable method to inform research.” I was struck by the overlap between this suggestion and the one you may find on many of the posts on this blog, where what I refer to as FAIR Data is assigned a digital object identifier (DOI) and included in the citation lists at the end of the post. The key phrase in the above abstract is “machine-driven data aggregation and knowledge integration”, although the article does not really go into any mechanisms for easily achieving this. It is my argument that the act of assigning a DOI carries with it the association that there is machine searchable metadata which can be retrieved and used for the aggregation and knowledge mining. The authors of this article, Do and Mobley, advocate adoption of nanopublications defined by inclusion of just a single figure (notably, not a table of results!) and some accompanying context which they claim would reduce the unit of publication to a more tractable size. This does raise the question of whether science needs more publications (in chemistry alone there are said to be more than a million published each year) or whether we should instead be concentrating our efforts on improving the data side of things by increasing its semantic content and formalising its structures, its preservation and curation. I certainly argue that far too little effort has been poured into these latter activities.
You only have to look at the typical SI (supporting information) associated with many chemistry articles to realise that in many cases they are still hardly fit for purpose. There is one concept introduced by Do and Mobley that also deserves mention. Their nanopublications are structured to be read by machines, not people. They will therefore not be refereed by people (my inference). They do not really discuss how else the quality will be assessed, but of course if you treat their nanopublication as essentially FAIR data, then it does become possible to develop methods of machine refereeing.
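To make the point about machine-retrievable metadata concrete, here is a minimal Python sketch of how a program might ask for the metadata behind a DOI rather than the human-readable landing page. It relies on DOI content negotiation, which to my understanding is supported for DOIs registered with Crossref and DataCite; the helper name metadata_request and the choice of the CSL JSON media type are assumptions made for illustration. The sketch only builds the request and does not send it.

```python
from urllib.request import Request

# Media type for citation metadata in CSL JSON form (an assumed, commonly used choice).
CSL_JSON = "application/vnd.citationstyles.csl+json"


def metadata_request(doi, media_type=CSL_JSON):
    """Build (but do not send) an HTTP request for the metadata of a DOI.

    Hypothetical helper: the Accept header asks the resolver for metadata
    instead of redirecting to the landing page.
    """
    return Request(f"https://doi.org/{doi}", headers={"Accept": media_type})


# DOI of the Do and Mobley article cited as [1].
req = metadata_request("10.12688/f1000research.6742.1")
print(req.full_url, req.get_header("Accept"))
```

Sending the request with urllib.request.urlopen(req) and parsing the reply with the json module would be the natural next step for any aggregation or mining workflow.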

The second email alerted me to an article[2] in the Winnower, a forum that offers a bridge between “traditional scholarly publishing tools to traditional and non-traditional scholarly outputs—because scholarly communication doesn’t just happen in scholarly journals“. Here, the concept of scholarly communication is extended to the New Reddit Journal of Science and introduces the concept pioneered by reddit of the AMA, or “ask me anything” environment. I occasionally publish some of the posts on this blog to the Winnower, receiving in return the increasingly ubiquitous DOI. I have also occasionally quoted these DOIs in articles submitted to conventional chemistry journals. What we see now is the propagation of a Winnower DOI on to e.g. https://www.reddit.com/r/science/ where anyone^† can post a question related to the original research reporting. I must state that I do have some reservations about this. Whilst it is likely that the majority of traditional scholarly reporting is likely to receive no AMAs (just as a very high proportion of research articles attract few if any citations in other articles over a period of decades), it is also likely that the quality of posted AMAs may turn out to be very low. At which point the original researcher has to make a judgement as to whether to devote any of their increasingly precious and fragmented time to answering them. And if few if any answers are posted in response to an AMA, the system seems unlikely to flourish.

But what we see here are two serious attempts to develop new approaches to research reporting, and no doubt others will emerge. To quote Yogi Berra, the future is not what it used to be.

^†Anyone can also post to this blog to ask similar questions. But note that associating an ORCID with such comments is highly recommended. I do not think that reddit currently supports ORCID, but I would argue if the intent is serious, it certainly should.

References

L. Do, and W. Mobley, "Single Figure Publications: Towards a novel alternative format for scholarly communication", F1000Research, vol. 4, pp. 268, 2015. https://doi.org/10.12688/f1000research.6742.1
RobustTempComparison, and r/Science, "Science AMA Series: Climate models are more accurate than previous evaluations suggest. We are a bunch of scientists and graduate students who recently published a paper demonstrating this, Ask Us Anything!", The Winnower. https://doi.org/10.15200/winn.143871.12809


The 2015 Bradley-Mason prize for open chemistry.

Friday, June 26th, 2015

Open principles in the sciences in general and chemistry in particular are increasingly preached from funding councils down, but it can be more of a challenge to find innovative practitioners. Part of the problem perhaps is that many of the current reward systems for scientists do not always help promote openness. Jean-Claude Bradley was a young scientist who was passionately committed to practising open chemistry, even though when he started he could not have anticipated any honours for doing so. A year ago a one day meeting at Cambridge was held to celebrate his achievements, followed up with a special issue of the Journal of Cheminformatics. Peter Murray-Rust and I both contributed and following the meeting we decided to help promote Open Chemistry via an annual award to be called the Bradley-Mason prize. This would celebrate both “JC” himself and Nick Mason, who also made outstanding contributions to the cause whilst studying at Imperial College. The prize was initially to be given to an undergraduate student at Imperial, but was also extended to postgraduate students who have promoted and showcased open chemistry in their PhD researches.

Peter and I are delighted to announce the inaugural winners of this prize.

The postgraduate winner is Tom Phillips for his open blog describing his experiences as a PhD student and for leading by example. He has published his instrumental codes on Github (and now Zenodo[1]) and data and codes for reproducing the graphs in his work on the “lab on a chip” in Figshare[2] and through his blog has encouraged other research students to do the same. Tom has worked assiduously to ensure that all the articles describing his PhD work are or will be open access.[3]

The undergraduate winner is Tom Arrow for his “spare time” involvement with Wikimedia (the foundation that underpins the open Wikipedia), including participating in a Wikimedia EU hackathon in Lyon, France, and feeding his experiences and skills back into his undergraduate environment as well as enhancing the teaching Wiki used by his fellow students. Tom took the lead in introducing us to Wikidata[4] for storing chemical data in an open Wikibase data repository and in promoting its use for enriching Wikipedia chemistry pages and showcasing open data in undergraduate teaching environments.

References

T. Phillips, and S. Macbeth, "pumpy: Zenodo release", 2015. https://doi.org/10.5281/zenodo.19033
T. Phillips, J.H. Bannock, and J.D. Mello, "Data for microscale extraction and phase separation using a porous capillary", 2015. https://doi.org/10.6084/m9.figshare.1447208
T.W. Phillips, J.H. Bannock, and J.C. deMello, "Microscale extraction and phase separation using a porous capillary", Lab on a Chip, vol. 15, pp. 2960-2967, 2015. https://doi.org/10.1039/c5lc00430f
D. Vrandečić, and M. Krötzsch, "Wikidata", Communications of the ACM, vol. 57, pp. 78-85, 2014. https://doi.org/10.1145/2629489

