Posts Tagged ‘JSON’

Global initiatives in research data management and discovery: searching metadata.

Monday, March 7th, 2016

The upcoming ACS national meeting in San Diego has a CINF (chemical information division) session entitled "Global initiatives in research data management and discovery". I have highlighted here just one slide from my contribution to this session, which addresses the discovery aspect of the session.

Data, if you think about it, is rarely discoverable other than by intimate association with a narrative or journal article. Even then, the standard procedure is to identify the article itself as being of interest, and then to dig out the "supporting information", which normally takes the form of a single paginated PDF document. If you are truly lucky, you might also get a CIF file (for crystal structures). But such data has little life of its own outside of its parent, the article. Put another way, it has no metadata it can call its own (metadata is data about an object, in this case research data). An alternative is to try to find the data by searching conventional databases such as CAS, Beilstein/Reaxys or CSD, and there of course the searches can be very precise. But (someone) has to pay the bills for such accessibility.

We are now starting to see quite different solutions to finding data (the F in FAIR data, the other letters representing accessibility, interoperability and re-usability). These solutions depend on metadata being a part of the solution from the outset, rather than an afterthought produced as a commercial solution. The collection of metadata is part of the overall process called RDM, or research data management, perhaps even the most important part of it. In exchange for identifying metadata about one's data, one gets back a "receipt" in the form of a persistent identifier for the data, more commonly known as a DOI. The agency that issues the DOI also undertakes to look after the donated metadata, and to make it searchable. The table below shows eight searches of such metadata, one example of how to acquire statistics relating to the usage of the data, and one search showing how to find repositories containing the data.

Search queries enabled by the use of metadata in data publication

| # | Search query* | Instances retrieved |
|---|---|---|
| 1 | http://search.datacite.org/ui?q=alternateIdentifier:InChIKey:* | InChI key |
| 2 | http://search.datacite.org/ui?q=alternateIdentifier:InChI:* | InChI identifier |
| 3 | http://search.datacite.org/ui?q=alternateIdentifier:InChIKey:CULPUXIDFLIQBT-UHFFFAOYSA-N | InChI key CULPUXIDFLIQBT-UHFFFAOYSA-N |
| 4 | http://search.datacite.org/ui?q=ORCID:0000-0002-8635-8390+alternateIdentifier:InChIKey:* | ORCID 0000-0002-8635-8390 AND (boolean) InChI key |
| 5 | http://search.datacite.org/ui?q=ORCID:0000-0002-8635-8390+alternateIdentifier:InChI:InChI=1S/C9H11N5O3* | ORCID 0000-0002-8635-8390 AND (boolean) InChI string 1S/C9H11N5O3 with the * wildcard |
| 6 | http://search.datacite.org/ui?q=has_media:true&fq=prefix:10.14469 | Has content media, for publisher 10.14469 (Imperial College) |
| 7 | http://search.datacite.org/ui?q=format:chemical/x-* | Data format type chemical/x-* |
| 8 | http://search.datacite.org/api?&q=prefix:10.14469&fq=alternateIdentifier:InChIKey:*&fl=doi,title,alternateIdentifier&wt=json&rows=15 and http://api.labs.datacite.org/works?q=prefix:10.14469+AND+alternateIdentifier:InChIKey:* | First 15 hits in JSON format, batch query mode |
| 9 | http://stats.datacite.org/?fq=datacentre_facet:"BL.IMPERIAL – Imperial College London" | Resolution statistics for publisher 10.14469 (Imperial College) per month |
| 10 | http://service.re3data.org/search?query=&subjects[]=31 Chemistry | Research data repository search for Chemistry (135 hits) |

In this instance (search 7) the three MIME media types are chemical/x-wavefunction, chemical/x-gaussian-checkpoint and chemical/x-gaussian-log. See[1] for chemical MIME (Multipurpose Internet Mail Extensions).


Anyone familiar with the standard ways of finding data (CAS, CSD, Reaxys) will appreciate that the above does not yet have the finesse to find e.g. sub-structures of chemical structures, synthetic procedures or molecular properties. I include it here primarily to show some of the potential such systems have, and to remark particularly that the batch query capability of this infrastructure could indeed be used in the future to construct much more sophisticated systems. Oh, and to the end-user at least, the searches shown above do not require institutional licenses to use. Both the data and its metadata are free, mostly with a CC0 or CC BY 3.0 license for re-use (the R of FAIR).
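As a rough illustration of the batch query mode (search 8 in the table), here is a minimal Python sketch. It only assembles the query URL and unpacks a reply; it does not contact the server, since the endpoint shown in the table may no longer be live. The function names `build_batch_query` and `extract_hits` are my own, and the reply layout (`response` containing `docs`) is an assumption based on the Solr-style parameters (`fq`, `fl`, `wt`, `rows`) in the query. The DOI in the sample reply is a made-up placeholder.

```python
import json
from urllib.parse import urlencode

# Endpoint copied from search 8 in the table; it may no longer be live.
BASE = "http://search.datacite.org/api"


def build_batch_query(prefix, identifier_type="InChIKey", rows=15):
    """Assemble a query URL of the same shape as search 8."""
    params = [
        ("q", f"prefix:{prefix}"),
        ("fq", f"alternateIdentifier:{identifier_type}:*"),
        ("fl", "doi,title,alternateIdentifier"),
        ("wt", "json"),
        ("rows", str(rows)),
    ]
    # Leave ':', '*' and ',' unescaped so the URL reads like the one in the table.
    return BASE + "?" + urlencode(params, safe=":*,")


def extract_hits(response_text):
    """Return (doi, alternateIdentifier) pairs from a JSON reply.

    Assumption: the reply follows the common Solr layout
    {"response": {"docs": [...]}}; adjust if the service differs.
    """
    docs = json.loads(response_text)["response"]["docs"]
    return [(d.get("doi"), d.get("alternateIdentifier")) for d in docs]


if __name__ == "__main__":
    print(build_batch_query("10.14469"))
    # Hypothetical reply, for illustration only (the DOI is a placeholder).
    sample = json.dumps({"response": {"docs": [
        {"doi": "10.14469/EXAMPLE",
         "alternateIdentifier": ["InChIKey:CULPUXIDFLIQBT-UHFFFAOYSA-N"]}
    ]}})
    print(extract_hits(sample))
```

Fetching the URL and feeding the reply text to `extract_hits` would then give a list that a script could loop over, which is the kind of building block a more sophisticated search system might start from.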

If more of interest related to this topic emerges at the ACS session, I will report back here.

References

  1. H.S. Rzepa, P. Murray-Rust, and B.J. Whitaker, "The Application of Chemical Multipurpose Internet Mail Extensions (Chemical MIME) Internet Standards to Electronic Mail and World Wide Web Information Exchange", Journal of Chemical Information and Computer Sciences, vol. 38, pp. 976-982, 1998. https://doi.org/10.1021/ci9803233

Five things you did not know about (fork) handles.

Tuesday, March 18th, 2014

OK, you have to be British to understand the pun in the title, a famous comedy skit about four candles. Back to science, and my mention of some crystal data now having a DOI in the previous post. I thought it might be fun to replicate the contents of one of my ACS slides here.

Firstly, a DOI is one implementation of a more generic (and quite old) concept known as a Handle. This is one form of a persistent digital identifier. Article DOIs have been in common use for at least ten years now, and even new chemistry students know about them! A DOI points to an article in a journal? Not quite, as it happens; in fact a DOI could lead to a whole lot more! Let me explain by showing you five examples:

  1. doi.org/10042/26065 resolves to a landing page. Crucially, this is NOT the article itself, which may remain obstinately behind a paywall to which you have no access.
  2. doi.org/10042/26065?locatt=filename:input.gjf resolves to a file input.gjf that may be present off the landing page, hence allowing a machine action to retrieve it.
  3. doi.org/10042/26065?locatt=mimetype:chemical/x-gaussian-input resolves to the first file matching the MIME type that may be present off the landing page, hence allowing a machine action to retrieve it.
  4. doi.org/10042/26065?locatt=id:1 resolves to the first file matching ID=1 that may be present off the landing page, hence allowing a machine action to retrieve it.
  5. doi.org/api/10042/26065 will return the JSON-encoded full handle record for processing in JavaScript, so that a machine now has access to all the information it might need to perform a machine action.

Now, items 2-5 are not generally available; they work only on our servers. We have placed them there to show how item 6 of the Amsterdam Manifesto could be made to work. There are other ways of course. But you can see them in action here[1] (the article is open access, so you should not get any paywall behaviour from the landing page).
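The URL patterns in the five examples above are regular enough to generate from a script. The following is a minimal Python sketch that just builds the strings; the helper names are hypothetical, the `https://` scheme is my assumption (the list gives the addresses without one), and, as noted, the `locatt` forms resolve only where the server supports them.

```python
# Assumption: the list above omits the scheme; https is used here.
RESOLVER = "https://doi.org"

# The three locatt selectors shown in items 2-4.
LOCATT_KEYS = {"filename", "mimetype", "id"}


def landing_url(doi):
    """Item 1: the plain DOI, which resolves to a landing page."""
    return f"{RESOLVER}/{doi}"


def locatt_url(doi, key, value):
    """Items 2-4: ask the resolver for a specific file off the landing page."""
    if key not in LOCATT_KEYS:
        raise ValueError(f"unsupported locatt key: {key!r}")
    return f"{RESOLVER}/{doi}?locatt={key}:{value}"


def record_url(doi):
    """Item 5: the JSON-encoded handle record."""
    return f"{RESOLVER}/api/{doi}"


if __name__ == "__main__":
    doi = "10042/26065"
    print(landing_url(doi))
    print(locatt_url(doi, "filename", "input.gjf"))
    print(locatt_url(doi, "mimetype", "chemical/x-gaussian-input"))
    print(locatt_url(doi, "id", 1))
    print(record_url(doi))
```

A machine action would then fetch one of these URLs and process whatever comes back, for example a Gaussian input file from the `mimetype` form.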


Postscript. A few days ago, I asked my group of 1st year undergraduate students how they might go about tracking down a journal article from its authors, the journal name and the page numbers. The most common reply was “Google it”. Next came “go to the library and find it on the shelves”. One replied “from its DOI” (that student had done an internship in a pharma company before joining us). I used to teach a chemical information course here[2] between 1996 and 2010 where this sort of stuff was a staple. That course is no longer taught. Hence the aforementioned replies!

References

  1. A. Armstrong, R.A. Boto, P. Dingwall, J. Contreras-García, M.J. Harvey, N.J. Mason, and H.S. Rzepa, "The Houk–List transition states for organocatalytic mechanisms revisited", Chem. Sci., vol. 5, pp. 2057-2071, 2014. https://doi.org/10.1039/c3sc53416b
  2. "It:lectures-2011 - ChemWiki", 2019. http://doi.org/10042/a3v06