JSON « Henry Rzepa's blog

Posts Tagged ‘JSON’

Questions about the (metadata) components of a scientific article.

Monday, April 8th, 2019

The conventional procedure for reporting analysis or new results in science is to compose an “article”, augment that perhaps with “supporting information” or “SI”, and submit it to a journal which undertakes peer review, with revision as necessary for acceptance and finally publication. If errors in the original are later identified, a separate corrigendum can be submitted to the same journal, although this is relatively rare. Any new information which appears post-publication is then considered for a new article, and the cycle continues. Here I consider the possibilities for variations in this sequence of events.

The new disruptors in the processes of scientific communication are the “data”, which can now be given a separate existence (as FAIR data) from the article and its co-published “SI”. Nowadays both the “article+SI” and any separate “data” have another, mostly invisible component, the “metadata”. Few authors ever see this metadata. For the article, it is generated by the publisher (as part of the service to the authors), and sent to CrossRef, which acts as a global registration agency for this particular metadata. For the data, it is assembled when the data is submitted to a “data repository”, either by the authors providing the information manually, or by automated workflows installed in the repository, or by a combination of both. It might also be assembled by the article publisher as part of a complete metadata package covering both article and data, rather than being separated from the article metadata. Then, the metadata about data is registered with the global agency DataCite (and occasionally with CrossRef for historical reasons).^‡ Few depositors ever inspect this metadata after it is registered; even fewer authors are involved in decisions about that metadata, or have any input into the processes involved in its creation.

Let me analyse a recent example.

For the article[1] you can see the “landing page” for the associated metadata as https://search.crossref.org/?q=10.1021/acsomega.8b03005 and actually retrieve the metadata using https://api.crossref.org/v1/works/10.1021/acsomega.8b03005, albeit in a rather human-unfriendly manner.^† This may be because metadata as such is considered by CrossRef as something just for machines to process and not for humans to see!
  - This metadata indicates "references-count":22, which is a bit odd since 37 are actually cited in the article. It is not immediately obvious why there is a difference of 15 (I am querying this with the editor of the journal). None of the references themselves are included in the metadata record, because the publisher does not currently support releasing them as Open References, which makes it difficult to track the missing ones down.
    - This last inference can be tested using metadata from this article[2] using e.g.
      https://api.crossref.org/v1/works/10.1039/C7SC03595K or
      https://data.datacite.org/application/vnd.datacite.datacite+xml/10.1039/C7SC03595K
      which reveals a full citation list, including explicit citations to data objects as per: https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/1620
  - Of the 37 citations listed in the article itself,[1] #22, #24 and #37 are different, being citations to different data sources. The first of these, #22 is an explicit reference to its data partner for the article.
  - An alternative method of invoking a metadata record:
    https://data.datacite.org/application/vnd.datacite.datacite+xml/10.1021/acsomega.8b03005
    retrieves a sub-set of the article metadata available using the CrossRef query,^‡ but again with no included references and again nothing for the data citation #22.
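
The comparison between the declared reference count and the deposited reference list can be scripted. The following is a minimal sketch in Python, assuming that the CrossRef REST API wraps each record in a `message` object and lists openly released references under a `reference` key (taken here to be absent when the publisher has not released them). The helper names and the heavily trimmed sample record are illustrative only and are not real API output.

```python
import json
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/v1/works/"  # endpoint quoted in the post


def fetch_work(doi):
    """Retrieve the CrossRef record for a DOI (needs network access)."""
    with urlopen(CROSSREF_WORKS + doi) as response:
        # Assumption: the record itself sits under the "message" key.
        return json.load(response)["message"]


def reference_summary(work, cited_in_article=None):
    """Compare the declared reference count with the deposited reference list."""
    declared = work.get("references-count")
    # Assumption: "reference" is simply missing when references are not open.
    deposited = len(work.get("reference", []))
    summary = {"declared": declared, "deposited": deposited}
    if cited_in_article is not None and declared is not None:
        summary["unaccounted"] = cited_in_article - declared
    return summary


# Hypothetical, trimmed record mirroring what the post describes.
sample = {"DOI": "10.1021/acsomega.8b03005", "references-count": 22}
print(reference_summary(sample, cited_in_article=37))
```

With the counts quoted above (22 declared, 37 cited in the article) the summary reports 15 unaccounted references and no deposited reference list. Calling `fetch_work("10.1021/acsomega.8b03005")` would apply the same check to the live record.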
Citation #22 in the above does have its own metadata record, obtainable using:
https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/4751
- This has an entry
  <relatedIdentifier relatedIdentifierType="DOI" relationType="IsReferencedBy">10.1021/acsomega.8b03005</relatedIdentifier>
  which points back to the article.[1]
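
A machine can pick out such back-pointers with very little code. The sketch below is an illustration rather than a description of any existing tool: the XML fragment is modelled on the entry quoted above, while the wrapper elements, the kernel-4 namespace and the function name are assumptions. Element names are matched on their local part so that the schema version does not matter.

```python
import xml.etree.ElementTree as ET

# Fragment modelled on the entry quoted above; the surrounding elements and
# the namespace are assumptions about the full DataCite record.
SAMPLE = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.14469/hpc/4751</identifier>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsReferencedBy">10.1021/acsomega.8b03005</relatedIdentifier>
  </relatedIdentifiers>
</resource>"""


def related_identifiers(xml_text, relation_type=None):
    """Return (relationType, identifierType, value) for each relatedIdentifier."""
    root = ET.fromstring(xml_text)
    found = []
    for element in root.iter():
        # Strip any "{namespace}" prefix and compare the local element name.
        if element.tag.rsplit("}", 1)[-1] != "relatedIdentifier":
            continue
        relation = element.get("relationType")
        if relation_type is None or relation == relation_type:
            found.append((relation,
                          element.get("relatedIdentifierType"),
                          (element.text or "").strip()))
    return found


print(related_identifiers(SAMPLE, "IsReferencedBy"))
```

Run against the full record retrieved from the URL above, the same function would list every article the dataset declares itself to be referenced by.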
To summarise, the article noted above[1] has a metadata record that does not include any information about the references/citations (apart from an ambiguous count). A human reading the article can, however, easily identify one citation pointing to the article data, which it turns out DOES have a metadata record which both human and machine can identify as pointing back to the article. Let us hope the publisher (the American Chemical Society) corrects this asymmetry in the future; it can be done as shown here![2]

For both types of metadata record, it is the publisher that retains any rights to modify them. Here however we encounter an interesting difference. The publishers of the data are, in this case, also the authors of the article! A modification to this record was made post-publication by this author so as to include the journal article identifier once it had been received from the publisher,[1] as in 2 above. Subsequently, these topics were discussed at a workshop on FAIR data, during which further pertinent articles[3], [4], [5] relating to the one discussed above[1] were shown in a slide by one of the speakers. Since this was deemed to add value to the context of the data for the original article, identifiers for these articles were also appended to the metadata record of the data.

This now raises the following questions:

Should a metadata record be considered a living object, capable of being updated to reflect new information received after its first publication?
If metadata records are an intrinsic part of both a scientific article and any data associated with that article, should authors be fully aware of their contents (if only as part of due diligence to correct errors or to query omissions)?
Should the referees of such works also be made aware of the metadata records? It is of course enough of a challenge to get referees to inspect data (whether as SI or as FAIR), never mind metadata! Put another way, should metadata records be considered as part of the materials reviewed by referees, or something independent of referees and the responsibility of their publishers?
More generally, how would/should the peer-review system respond to living metadata records? Should there be guidelines regarding such records? Or ethical considerations?

I pose these questions because I am not aware of much discussion around these topics; I suggest there probably should be!

^‡Actually CrossRef and DataCite exchange each other’s metadata. However, each uses a somewhat different schema, so some components may be lost in this transit.

^†JSON, which is not particularly human friendly.

References

[1] A. Barba, S. Dominguez, C. Cobas, D.P. Martinsen, C. Romain, H.S. Rzepa, and F. Seoane, "Workflows Allowing Creation of Journal Article Supporting Information and Findable, Accessible, Interoperable, and Reusable (FAIR)-Enabled Publication of Spectroscopic Data", ACS Omega, vol. 4, pp. 3280-3286, 2019. https://doi.org/10.1021/acsomega.8b03005
[2] S. Arkhipenko, M.T. Sabatini, A.S. Batsanov, V. Karaluka, T.D. Sheppard, H.S. Rzepa, and A. Whiting, "Mechanistic insights into boron-catalysed direct amidation reactions", Chemical Science, vol. 9, pp. 1058-1072, 2018. https://doi.org/10.1039/c7sc03595k
[3] T. Monaretto, A. Souza, T.B. Moraes, V. Bertucci‐Neto, C. Rondeau‐Mouro, and L.A. Colnago, "Enhancing signal‐to‐noise ratio and resolution in low‐field NMR relaxation measurements using post‐acquisition digital filters", Magnetic Resonance in Chemistry, vol. 57, pp. 616-625, 2018. https://doi.org/10.1002/mrc.4806
[4] D. Barache, J. Antoine, and J. Dereppe, "The Continuous Wavelet Transform, an Analysis Tool for NMR Spectroscopy", Journal of Magnetic Resonance, vol. 128, pp. 1-11, 1997. https://doi.org/10.1006/jmre.1997.1214
[5] U.L. Günther, C. Ludwig, and H. Rüterjans, "NMRLAB—Advanced NMR Data Processing in Matlab", Journal of Magnetic Resonance, vol. 145, pp. 201-208, 2000. https://doi.org/10.1006/jmre.2000.2071

Tags:Academic publishing, American Chemical Society, author, Business intelligence, Company: DataCite, CrossRef, data, Data management, DataCite, editor, EIDR, Information, Information science, JSON, Knowledge representation, Metadata repository, Records management, Technology/Internet, The Metadata Company
Posted in Chemical IT

Global initiatives in research data management and discovery: searching metadata.

Monday, March 7th, 2016

The upcoming ACS national meeting in San Diego has a CINF (chemical information division) session entitled "Global initiatives in research data management and discovery". I have highlighted here just one slide from my contribution to this session, which addresses the discovery aspect of the session.

Data, if you think about it, is rarely discoverable other than by intimate association with a narrative or journal article. Even then, the standard procedure is to identify the article itself as being of interest, and then to dig out the "supporting information", which normally takes the form of a single paginated PDF document. If you are truly lucky, you might also get a CIF file (for crystal structures). But such data has little life of its own outside of its parent, the article. Put another way, it has no metadata it can call its own (metadata is data about an object, in this case research data). An alternative is to try to find the data by searching conventional databases such as CAS, Beilstein/Reaxys or CSD, and there of course the searches can be very precise. But someone has to pay the bills for such accessibility.

We are now starting to see quite different solutions to finding data (the F in FAIR data, the other letters representing accessibility, interoperability and re-usability). These solutions depend on metadata being a part of the solution from the outset, rather than any afterthought produced as a commercial solution. The collection of metadata is part of the overall process called RDM, or research data management, perhaps even the most important part of it. In exchange for identifying metadata about one's data, one gets back a "receipt" in the form of a persistent identifier for the data, more commonly known as a DOI. The agency that issues the DOI also undertakes to look after the donated metadata, and to make it searchable. The table below shows eight searches of such metadata, one example of how to acquire statistics relating to the usage of the data and one search of how to find repositories containing the data.

Search queries enabled by the use of metadata in data publication
#	Search query^*	Instances retrieved:
1	http://search.datacite.org/ui?q=alternateIdentifier:InChIKey:*	InChI key
2	http://search.datacite.org/ui?q=alternateIdentifier:InChI:*	InChI identifier
3	http://search.datacite.org/ui?q=alternateIdentifier:InChIKey:CULPUXIDFLIQBT-UHFFFAOYSA-N	InChI key CULPUXIDFLIQBT-UHFFFAOYSA-N
4	http://search.datacite.org/ui?q=ORCID:0000-0002-8635-8390+alternateIdentifier:InChIKey:*	ORCID 0000-0002-8635-8390 AND (boolean) InChI key.
5	http://search.datacite.org/ui?q=ORCID:0000-0002-8635-8390+alternateIdentifier:InChI:InChI=1S/C9H11N5O3*	ORCID 0000-0002-8635-8390 AND (boolean) + InChI string 1S/C9H11N5O3 with the * wildcard.
6	http://search.datacite.org/ui?q=has_media:true&fq=prefix:10.14469	Has content media^‡ for Publisher 10.14469 (Imperial College)
7	http://search.datacite.org/ui?q=format:chemical/x-*	Data format type chemical/x-*
8	http://search.datacite.org/api?&q=prefix:10.14469&fq=alternateIdentifier:InChIKey:&fl=doi,title,alternateIdentifier&wt=json&rows=15 http://api.labs.datacite.org/works?q=prefix:10.14469+AND+alternateIdentifier:InChIKey:	First 15 hits in JSON format, batch query mode
9	http://stats.datacite.org/?fq=datacentre_facet:"BL.IMPERIAL – Imperial College London"	resolution statistics for publisher 10.14469 (Imperial College) per month
10	http://service.re3data.org/search?query=&subjects[]=31 Chemistry	Research data repository search for Chemistry (135 hits)

^‡In this instance the three MIME media types are chemical/x-wavefunction, chemical/x-gaussian-checkpoint and chemical/x-gaussian-log. See[1] for chemical MIME (multipurpose internet mail extensions).
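
The queries in the table share a simple `field:value` grammar, so they can be composed programmatically for batch use. The sketch below only builds the URLs; it makes no request and does not check whether these endpoints still respond. The function names are hypothetical, and the parameter names (`q`, `fl`, `wt`, `rows`) are taken directly from the table.

```python
from urllib.parse import urlencode

SEARCH_UI = "http://search.datacite.org/ui"    # endpoints as listed in the table
SEARCH_API = "http://search.datacite.org/api"


def field(name, value):
    """Format one search term, e.g. field("ORCID", "0000-0002-8635-8390")."""
    return f"{name}:{value}"


def search_url(terms, base=SEARCH_UI, **extra):
    """Join terms with spaces (sent as '+') and append any extra parameters."""
    params = [("q", " ".join(terms))]
    params.extend(extra.items())
    # Keep ':', '*', '/' and ',' unescaped, as they appear in the table.
    return base + "?" + urlencode(params, safe=":*/,")


# Query 4 of the table: an ORCID combined with any InChI key.
print(search_url([field("ORCID", "0000-0002-8635-8390"),
                  field("alternateIdentifier", "InChIKey:*")]))

# A JSON batch query in the spirit of query 8.
print(search_url([field("prefix", "10.14469")], base=SEARCH_API,
                 fl="doi,title,alternateIdentifier", wt="json", rows=15))
```

Looping `search_url` over a list of ORCIDs or InChI keys is one way the batch capability mentioned in the table could be driven from a script.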

Anyone familiar with the standard ways of finding data (CAS, CSD, Reaxys) will appreciate that the above does not yet have the finesse to find e.g. sub-structures of chemical structures, synthetic procedures or molecular properties. My including it here is primarily to show some of the potential such systems have, and to remark particularly that the batch query capability of this infrastructure could indeed be used in the future to construct much more sophisticated systems. Oh, and to the end-user at least, the searches shown above do not require institutional licenses to use. Both the data and its metadata are free, mostly with a CC0 or CC BY 3.0 license for re-use (the R of FAIR).

If more of interest related to this topic emerges at the ACS session, I will report back here.

References

[1] H.S. Rzepa, P. Murray-Rust, and B.J. Whitaker, "The Application of Chemical Multipurpose Internet Mail Extensions (Chemical MIME) Internet Standards to Electronic Mail and World Wide Web Information Exchange", Journal of Chemical Information and Computer Sciences, vol. 38, pp. 976-982, 1998. https://doi.org/10.1021/ci9803233

Tags:Academic publishing, chemical, chemical information division, Chemical nomenclature, chemical structures, Chemical substance, chemical/x-wavefunction, Cheminformatics, City: San Diego, content media, data repository search, format type chemical/x-*, Identifiers, Imperial College, Imperial College London, International Chemical Identifier, JSON, media types, multipurpose internet media extensions, ORCiD, PDF, potential such systems, research data management, Search queries, Technical communication, Technology/Internet
Posted in Chemical IT

Five things you did not know about (fork) handles.

Tuesday, March 18th, 2014

OK, you have to be British to understand the pun in the title, a famous comedy skit about four candles. Back to science, and my mention of some crystal data now having a DOI in the previous post. I thought it might be fun to replicate the contents of one of my ACS slides here.

Firstly, a DOI is one implementation of a more generic (and quite old) concept known as a Handle. This is one form of a persistent digital identifier. Article DOIs have been in common use for at least ten years now, and even new chemistry students know about them!^‡ A DOI points to an article in a journal? Not quite, as it happens; a DOI could in fact lead to a whole lot more! Let me explain by showing you five examples:

1. doi.org/10042/26065 resolves to a landing page. Crucially, this is NOT the article itself, which may remain obstinately behind a paywall to which you have no access.
2. doi.org/10042/26065?locatt=filename:input.gjf resolves to a file input.gjf that may be present off the landing page, and hence allowing a machine action to retrieve it.
3. doi.org/10042/26065?locatt=mimetype:chemical/x-gaussian-input resolves to the first file matching the MIME type that may be present off the landing page, and hence allowing a machine action to retrieve it.
4. doi.org/10042/26065?locatt=id:1 resolves to the first file matching ID=1 that may be present off the landing page, and hence allowing a machine action to retrieve it.
5. doi.org/api/10042/26065 will return the JSON-encoded full handle record for processing in JavaScript, so that a machine now has access to all the information it might need to perform a machine action.

Now, items 2-5 are not generally available; they work only on our servers. We have placed them there to show how item 6 of the Amsterdam Manifesto could be made to work. There are other ways of course. But you can see them in action here[1] (the article is open access, so you should not get any paywall behaviour from the landing page).
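
For a script, the five forms above differ only in how the URL is put together. The following sketch composes the `locatt` variants and reads values out of a handle record. The `https://` scheme, the function names and the shape of the sample record (a `values` list whose entries carry a `type` and a `data.value`) are assumptions made for illustration; the sample landing URL is a placeholder and not the real target of this handle.

```python
RESOLVER = "https://doi.org/"  # scheme assumed; the post writes doi.org/...


def locatt_url(handle, selector, value):
    """Build a locatt-qualified URL such as items 2-4 in the list above."""
    return f"{RESOLVER}{handle}?locatt={selector}:{value}"


def values_of_type(record, wanted_type):
    """Collect the data values of one type from a JSON handle record."""
    return [entry["data"]["value"]
            for entry in record.get("values", [])
            if entry.get("type") == wanted_type]


# Hypothetical handle record; layout and contents are assumptions.
SAMPLE_RECORD = {
    "handle": "10042/26065",
    "values": [
        {"index": 1, "type": "URL",
         "data": {"format": "string", "value": "https://example.org/landing"}},
    ],
}

print(locatt_url("10042/26065", "filename", "input.gjf"))
print(locatt_url("10042/26065", "mimetype", "chemical/x-gaussian-input"))
print(values_of_type(SAMPLE_RECORD, "URL"))
```

Since items 2-5 work only on our servers, URLs built this way are a demonstration of the idea rather than something any resolver can be expected to honour.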

^‡Postscript. A few days ago, I asked my group of 1st year undergraduate students how they might go about tracking down a journal article from its authors, the journal name and the page numbers. The most common reply was “Google it”. Next came “go to the library and find it on the shelves”. One replied “from its DOI” (that student had done an internship in a pharma company before joining us). I used to teach a chemical information course here[2] between 1996 and 2010 where this sort of stuff was a staple. That course is no longer taught. Hence the aforementioned replies!

References

[1] A. Armstrong, R.A. Boto, P. Dingwall, J. Contreras-García, M.J. Harvey, N.J. Mason, and H.S. Rzepa, "The Houk–List transition states for organocatalytic mechanisms revisited", Chem. Sci., vol. 5, pp. 2057-2071, 2014. https://doi.org/10.1039/c3sc53416b
[2] "It:lectures-2011 - ChemWiki", 2019. http://doi.org/10042/a3v06

Tags:ACS, DOI, Google, Handle, JSON, LOCATT
Posted in Chemical IT
