Posts Tagged ‘City: San Diego’

Managing (open) NMR data: a working example using Mpublish.

Monday, August 1st, 2016

In March, I posted from the ACS meeting in San Diego on the topic of Research data: Managing spectroscopy-NMR, and noted a talk by MestreLab Research on how a tool called Mpublish in the forthcoming release of their NMR analysis software MestreNova could help. With that release now out, the opportunity arose to test the system.

I will start with a reminder that NMR data associated with a published article is (or should be) openly free: one should not need a subscription to the journal to access it (although one might need one in order to find it). Now, NMR data as it emerges from a spectrometer is highly sophisticated, comprising a collection of (sometimes) binary proprietary files containing the measured free induction decays (FIDs). Turning this raw data into an interpretable NMR spectrum, the visual form of the data that so appeals to human beings, is non-trivial. It requires what may be highly sophisticated software, which in turn means that it may be a commercial product. Of course there are also examples of non-commercial open software packages that are best-of-breed; indeed in its early life-cycle MestreNova was known as MESTREC before becoming a commercial product. Could one achieve the benefits of both: open and fully functional NMR data with no loss from the original instrument, coupled with the ability to apply top-quality software for its analysis in an open manner? This is a demonstration of how Mpublish achieves this.

  1. Invoke the URL data.datacite.org/chemical/x-mnpub/10.14469/hpc/1087 from a browser
  2. This action queries the metadata deposited with DataCite for the doi 10.14469/hpc/1087 and retrieves the first instance of any file associated with that dataset that has the format type chemical/x-mnpub. You can directly view this metadata by invoking just data.datacite.org/10.14469/hpc/1087 where you can find both mnpub and mnova formats listed. A command such as data.datacite.org/chemical/x-mnpub/10.14469/hpc/1087 allows the file retrieval to be incorporated into automated workflows based just on the doi and the media type desired. Note my parenthetical comment above about finding data; here you only need its doi to retrieve it!
  3. The URL above downloads a small text file with the suffix .mnpub which contains in essence two components:

    • A URL pointing directly to an .mnova file at the repository for which the doi has been issued
    • A signature key used to verify that the public key of the publisher (the data repository in this instance) was counter-signed by Mestrelab.
  4. Now download and install the application program (for the purpose of this demonstration, ignore any requests to license the program and use it unlicensed), then open the .mnpub file with it; you should get the result shown below. The application program checks the signature key and, if it is valid, proceeds to download the full data file (a .mnova file in this case) and to analyze and display it within the program. The data is fully active; it can be manipulated and analysed. Notice that in the picture below the red arrow points to the state of the license, in this case not present.
    [Figure: mn, the MestreNova window displaying the downloaded data, with a red arrow marking the license state]
  5. It is also possible to apply this procedure to the raw data as it emerges from the (Bruker) spectrometer, and compressed into a .zip archive. The MestreNova software will automatically process the contents by applying various default parameters, although the result may not correspond exactly to that present in e.g. the equivalent .mnova file (which may have had specific parameters applied).
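
The retrieval in steps 1 and 2 lends itself to scripting. What follows is only a minimal sketch, assuming the data.datacite.org resolver behaves as described above; the function names are hypothetical, and the `http://` scheme is an assumption (the steps above quote the address without one). The format of the .mnpub file itself is not interpreted here.

```python
from urllib.request import urlopen

# Assumption: the resolver is reached over plain http, as for the other
# DataCite addresses quoted in these posts.
BASE = "http://data.datacite.org"


def datacite_media_url(doi, media_type=None, base=BASE):
    """Build the resolver URL for a DOI.

    With a media type (e.g. "chemical/x-mnpub") the URL asks for the first
    file of that type registered for the dataset; without one it points at
    the view of the deposited metadata.
    """
    doi = doi.strip()
    if media_type:
        return f"{base}/{media_type}/{doi}"
    return f"{base}/{doi}"


def fetch(url, timeout=30):
    """Download the resource behind `url` and return its raw bytes.

    Hypothetical helper: whether the service answers, and with what,
    depends on the resolver still operating as described in step 2.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.read()
```

Calling `datacite_media_url("10.14469/hpc/1087", "chemical/x-mnpub")` reproduces the address given in step 1, so a workflow needs only the doi and the desired media type.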

It is my hope that anyone who records NMR data and processes it using software such as MestreNova will now consider using the mechanism above to accompany their submitted articles, rather than just automatically pasting a static image of the spectrum into a PDF file as "supporting information". This is part of what is meant by "research data management" (RDM).

One cannot help but note that many types of scientific instrument nowadays come with bespoke software for analysing the data they produce. Very often this software is unavailable to anyone who has not purchased the instrument itself. To make the data available to others, the processed data and its visual interpretation often have to be reduced, with much consequent information loss, to a lowest common denominator format such as Acrobat/PDF. Here we see a mechanism for avoiding any such information loss whilst enabling, for that dataset only, the full potential for (re)analysing the data. It will be interesting to see if other examples of this model or its equivalent emerge in the near future.


Global initiatives in research data management and discovery: searching metadata.

Monday, March 7th, 2016

The upcoming ACS national meeting in San Diego has a CINF (chemical information division) session entitled "Global initiatives in research data management and discovery". I have highlighted here just one slide from my contribution to this session, which addresses the discovery aspect of the session.

Data, if you think about it, is rarely discoverable other than by intimate association with a narrative or journal article. Even then, the standard procedure is to identify the article itself as being of interest, and then to dig out the "supporting information", which normally takes the form of a single paginated PDF document. If you are truly lucky, you might also get a CIF file (for crystal structures). But such data has little life of its own outside of its parent, the article. Put another way, it has no metadata it can call its own (metadata is data about an object, in this case research data). An alternative is to try to find the data by searching conventional databases such as CAS, Beilstein/Reaxys or CSD, and there of course the searches can be very precise. But someone has to pay the bills for such accessibility.

We are now starting to see quite different solutions to finding data (the F in FAIR data, the other letters representing accessibility, interoperability and re-usability). These solutions depend on metadata being a part of the solution from the outset, rather than any afterthought produced as a commercial solution. The collection of metadata is part of the overall process called RDM, or research data management, perhaps even the most important part of it. In exchange for identifying metadata about one's data, one gets back a "receipt" in the form of a persistent identifier for the data, more commonly known as a DOI. The agency that issues the DOI also undertakes to look after the donated metadata, and to make it searchable. The table below shows eight searches of such metadata, one example of how to acquire statistics relating to the usage of the data and one search of how to find repositories containing the data.

Search queries enabled by the use of metadata in data publication

| # | Search query* | Instances retrieved |
|---|---|---|
| 1 | http://search.datacite.org/ui?q=alternateIdentifier:InChIKey:* | InChI key |
| 2 | http://search.datacite.org/ui?q=alternateIdentifier:InChI:* | InChI identifier |
| 3 | http://search.datacite.org/ui?q=alternateIdentifier:InChIKey:CULPUXIDFLIQBT-UHFFFAOYSA-N | InChI key CULPUXIDFLIQBT-UHFFFAOYSA-N |
| 4 | http://search.datacite.org/ui?q=ORCID:0000-0002-8635-8390+alternateIdentifier:InChIKey:* | ORCID 0000-0002-8635-8390 AND (boolean) InChI key |
| 5 | http://search.datacite.org/ui?q=ORCID:0000-0002-8635-8390+alternateIdentifier:InChI:InChI=1S/C9H11N5O3* | ORCID 0000-0002-8635-8390 AND (boolean) InChI string 1S/C9H11N5O3 with the * wildcard |
| 6 | http://search.datacite.org/ui?q=has_media:true&fq=prefix:10.14469 | Has content media for publisher 10.14469 (Imperial College) |
| 7 | http://search.datacite.org/ui?q=format:chemical/x-* | Data format type chemical/x-* |
| 8 | http://search.datacite.org/api?&q=prefix:10.14469&fq=alternateIdentifier:InChIKey:*&fl=doi,title,alternateIdentifier&wt=json&rows=15 | First 15 hits in JSON format, batch query mode |
|   | http://api.labs.datacite.org/works?q=prefix:10.14469+AND+alternateIdentifier:InChIKey:* | |
| 9 | http://stats.datacite.org/?fq=datacentre_facet:"BL.IMPERIAL – Imperial College London" | Resolution statistics for publisher 10.14469 (Imperial College) per month |
| 10 | http://service.re3data.org/search?query=&subjects[]=31 Chemistry | Research data repository search for Chemistry (135 hits) |

In this instance the three MIME media types are chemical/x-wavefunction, chemical/x-gaussian-checkpoint and chemical/x-gaussian-log. See [1] for chemical MIME (multipurpose internet mail extensions).
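
Because the searches listed in this post are just URLs, they are easy to assemble programmatically. The following is a minimal sketch under stated assumptions: the endpoints and field names (`alternateIdentifier`, `ORCID`, `prefix`, `fq`, `fl`, `wt`, `rows`) are copied from the queries quoted in this post, the helper names are hypothetical, and nothing here checks that the services still respond at these addresses.

```python
from urllib.parse import urlencode

# Endpoints as quoted in this post; their continued availability is assumed.
SEARCH_UI = "http://search.datacite.org/ui"
SEARCH_API = "http://search.datacite.org/api"


def ui_query(*terms):
    """Combine field:value terms with '+', the form used when two
    conditions are combined in the interactive searches."""
    return f"{SEARCH_UI}?q=" + "+".join(terms)


def api_batch_query(q, fq, fields, rows=15):
    """Build a batch-mode query that asks for selected fields as JSON."""
    params = {
        "q": q,                   # main query, e.g. a DOI prefix
        "fq": fq,                 # filter query narrowing the hits
        "fl": ",".join(fields),   # fields to return for each hit
        "wt": "json",             # response format
        "rows": rows,             # number of hits to return
    }
    # Keep ':', '*' and ',' readable rather than percent-encoding them.
    return f"{SEARCH_API}?" + urlencode(params, safe=":*,")
```

For instance, `ui_query("ORCID:0000-0002-8635-8390", "alternateIdentifier:InChIKey:*")` rebuilds the combined ORCID and InChI key search, and `api_batch_query("prefix:10.14469", "alternateIdentifier:InChIKey:*", ["doi", "title", "alternateIdentifier"])` gives a batch query of the same kind as the JSON example.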


Anyone familiar with the standard ways of finding data (CAS, CSD, Reaxys) will appreciate that the above does not yet have the finesse to find e.g. sub-structures of chemical structures, synthetic procedures or molecular properties. My including it here is primarily to show some of the potential such systems have, and to remark particularly that the batch query capability of this infrastructure could indeed be used in the future to construct much more sophisticated systems. Oh, and to the end-user at least, the searches shown above do not require institutional licenses to use. Both the data and its metadata are free, mostly with a CC0 or CC BY 3.0 license for re-use (the R of FAIR).

If more of interest related to this topic emerges at the ACS session, I will report back here.

References

  1. H.S. Rzepa, P. Murray-Rust, and B.J. Whitaker, "The Application of Chemical Multipurpose Internet Mail Extensions (Chemical MIME) Internet Standards to Electronic Mail and World Wide Web Information Exchange", Journal of Chemical Information and Computer Sciences, vol. 38, pp. 976-982, 1998. https://doi.org/10.1021/ci9803233

Discovery based research experiences: gauche effects in group 16 elements.

Wednesday, March 2nd, 2016

The upcoming ACS national meeting in San Diego has a CHED (chemical education division) session entitled Implementing Discovery-Based Research Experiences in Undergraduate Chemistry Courses. I had previously explored what I called extreme gauche effects in the molecule F-S-S-F. Here I take this a bit further to see what else can be discovered about molecules containing bonds between group 16 elements (QA= O, S, Se, Te). 

[Figure: OO-SQ, the search definition]

The search definition is shown above, with DIST1 being the QA-QA bond length, the QA-QA bond being acyclic, each QA bearing only two bonded atoms and NM being any non-metal. The first result shown is for QA=S.

[Figure: S-S, the distribution of hits for QA=S]

  1. The first discovery is that the most common torsion (red-hot spot) is about 90°, but there appears to be a statistically significant distortion towards longer S-S distances as the torsion deviates from this angle. For those who are so inclined it would perhaps be worth improving my term "appears to be" with a more formal numerical analysis of the distribution shown above and its significance. Any offers?
  2. The other discovery worth exploring is the number of occurrences with an angle of 180°. With F-S-S-F itself (not a solid), I had previously noted that this angle actually represented a transition state in the torsion! So what might be inferred from these examples?
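
For anyone wishing to recompute the two quantities discussed here from atomic coordinates, the sketch below shows the standard geometry: the QA-QA bond length and the magnitude of the NM-QA-QA-NM torsion. It is an illustration only; the function names are hypothetical, Cartesian coordinates are assumed (crystallographic fractional coordinates would need converting first), and it is not how the search software itself derives its values.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def bond_length(p, q):
    """Distance between two atoms given as (x, y, z) tuples."""
    return math.dist(p, q)


def torsion_magnitude_deg(p0, p1, p2, p3):
    """Magnitude (0 to 180 degrees) of the torsion p0-p1-p2-p3.

    For the searches here p0 and p3 would be the NM atoms and
    p1, p2 the two QA atoms.
    """
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals to the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return abs(math.degrees(math.atan2(y, x)))
```

The sign of the torsion is deliberately discarded, so mirror-image conformations map onto the same value.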

The next search includes the further constraint that the temperature at which the data was recorded be <140 K. This reduces vibrational "noise" and so should increase the significance.

[Figure: S-S-140, the S-S distribution for data recorded below 140 K]

  1. Here we discover the same "V"-shaped distribution as before, possibly more significant statistically than the previous search. Again, a proper statistical analysis of the significance of this result is desirable.
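
As one possible starting point for such an analysis, the sketch below fits the S-S distance against the deviation of the torsion from 90° and estimates significance with a permutation test. This is a suggestion rather than an established protocol: the choice of 90° as the centre simply follows the hot spot noted earlier, torsions are assumed to be supplied as magnitudes between 0° and 180°, the function names are hypothetical, and the hits themselves would have to be exported from the search.

```python
import random
from statistics import mean


def v_shape_slope(torsions_deg, distances, centre=90.0):
    """Least-squares slope of distance against |torsion - centre|.

    A positive slope means bonds lengthen as the torsion moves away
    from the centre, i.e. a "V"-shaped distribution.
    """
    x = [abs(t - centre) for t in torsions_deg]
    mx, my = mean(x), mean(distances)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all torsions are equally far from the centre")
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, distances))
    return sxy / sxx


def permutation_p_value(torsions_deg, distances, n_perm=2000, seed=0):
    """One-sided p-value for the slope being this large by chance.

    The distances are shuffled relative to the torsions n_perm times;
    the p-value is the fraction of shuffles whose slope is at least
    as large as the observed one.
    """
    rng = random.Random(seed)
    observed = v_shape_slope(torsions_deg, distances)
    shuffled = list(distances)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if v_shape_slope(torsions_deg, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Running both functions on the full set of hits and on the low-temperature subset would allow the two slopes and p-values to be compared directly.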

The next search is for QA = Se or Te.

[Figure: X-X, the distribution of hits for QA = Se or Te]

  1. The Se and Te distributions can clearly be distinguished, with a weak "V-shape" visible for Se, but absent for Te. Again, those hits at 180°!
  2. There are a few instances "in-between" the two distributions, which appear to be Se-Te systems.

Finally, QA=QB = O.

[Figure: O-O, the distribution of hits for QA = O]

  1. The discovery here is the apparent absence of any "V-shaped" distribution.
  2. The hot spot now occurs at 180°, but with a tail down to 60° or less. Clearly, the definition of "NM" as any non-metal probably needs to be explored further for specific instances to see what influence the nature of NM has. NM for example could be another O, which might be a severe perturbation. 

So here I have tried to tease out seven directions for further discovery. I am attending/presenting at the session I noted at the top and will report back on any interesting observations.