Posts Tagged ‘Identifiers’

A search of some major chemistry publishers for FAIR data records.

Friday, April 12th, 2019

In recent years, findable data has become ever more important (the F in FAIR). Here I test that F using the DataCite search service.

First, an introduction to this service. It is a metadata database describing datasets and other research objects. One of its properties is relatedIdentifier, which records other identifiers associated with the dataset: typically the DOI of any published article associated with the data, but possibly also pointers to related datasets.

One can query thus:

  1. https://search.datacite.org/works?query=relatedIdentifiers.relatedIdentifier:*
    which retrieves the very healthy looking 6,179,287 works.
  2. One can restrict this to a specific publisher by the DOI prefix assigned to that publisher:
    ?query=relatedIdentifiers.relatedIdentifier:10.1021*
    which returns a respectable 210,240 works.
  3. It turns out that the major contributors to FAIR data currently are crystal structures from the CCDC. One can remove them from the search to see what is left over:
    ?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+NOT+(identifier:*10.5517*) 
    and one is down to 14,213 works, of which many nevertheless still appear to be crystal structures. These may be links to other crystal datasets.
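The three searches above differ only in their query string, so they can be generated from a DOI prefix. The following is a minimal sketch, assuming the search.datacite.org endpoint and field names exactly as quoted above; the function names are hypothetical, and the code only builds URLs without contacting the service.

```python
from urllib.parse import quote

BASE = "https://search.datacite.org/works?query="
CCDC_PREFIX = "10.5517"  # DOI prefix excluded in search 3 (CCDC crystal structures)


def related_to_prefix(prefix: str) -> str:
    """Search 2: records whose related identifier starts with a DOI prefix."""
    return f"relatedIdentifiers.relatedIdentifier:{prefix}*"


def excluding_ccdc(prefix: str) -> str:
    """Search 3: as search 2, but with CCDC-registered records removed."""
    return f"({related_to_prefix(prefix)})+NOT+(identifier:*{CCDC_PREFIX}*)"


def search_url(query: str) -> str:
    # Leave the characters used by the query syntax unescaped.
    return BASE + quote(query, safe=":*()+.")


# Prefixes that appear in this post: 10.1021 and 10.1039.
for prefix in ("10.1021", "10.1039"):
    print(search_url(related_to_prefix(prefix)))
    print(search_url(excluding_ccdc(prefix)))
```

Keeping the query fragments as separate functions makes it easy to swap in another publisher prefix without retyping the whole expression.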

I have performed searches 2 and 3 for some popular publishers of chemistry (the same set that were analysed here).

Publisher   Search 2   Search 3
ACS          210,240     14,213
RSC          138,147      1,279
Elsevier     185,351     56,373
Nature        12,316      8,104
Wiley        135,874      9,283
Science        3,384      2,343
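To compare publishers of very different size, it can help to express search 3 as a fraction of search 2, i.e. the share of linked datasets that do not carry a CCDC identifier. This is a small sketch using only the counts in the table; the variable and function names are illustrative.

```python
# (search 2, search 3) counts per publisher, copied from the table above.
counts = {
    "ACS": (210_240, 14_213),
    "RSC": (138_147, 1_279),
    "Elsevier": (185_351, 56_373),
    "Nature": (12_316, 8_104),
    "Wiley": (135_874, 9_283),
    "Science": (3_384, 2_343),
}


def non_ccdc_share(all_linked: int, without_ccdc: int) -> float:
    """Fraction of linked datasets remaining after CCDC records are removed."""
    return without_ccdc / all_linked


shares = {name: non_ccdc_share(*pair) for name, pair in counts.items()}

# Print publishers from the highest to the lowest non-CCDC share.
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<9}{share:6.1%}")
```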

These publishers all have significant numbers of datasets which at least accord with the F of FAIR. Many datasets may lack metadata pointing back to a published article, since that link can only be added once the DOI of the article has been assigned, in other words AFTER the publication of the dataset. So these numbers are probably underestimates.

How about the other way around? Rather than datasets that have a journal article as a related identifier, we could search for articles that have a dataset as a related identifier?

  1. ?query=(identifier:*10.1039*)+AND+(relatedIdentifiers.relatedIdentifier:*)
    returns a rather mysterious "nothing found". It may be that there is no mapping for this search between the CrossRef and DataCite metadata schemas.
  2. And just to show the searches are behaving as expected:
    ?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+AND+(identifier:*10.5517*)
    returns 196,027 works.
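The check query can be tested arithmetically: records matching search 2 either carry the CCDC prefix or they do not, so the two restricted counts should add up to the unrestricted one. A short sketch using the ACS numbers quoted in this post (the names are illustrative):

```python
acs_all = 210_240       # search 2 for prefix 10.1021
acs_not_ccdc = 14_213   # search 3: search 2 with identifier:*10.5517* excluded
acs_and_ccdc = 196_027  # check query: search 2 restricted to identifier:*10.5517*


def partition_is_consistent(total: int, excluded: int, included: int) -> bool:
    """True if the NOT and AND counts together account for every record."""
    return total == excluded + included


print(partition_is_consistent(acs_all, acs_not_ccdc, acs_and_ccdc))
```

For these numbers the check holds, which supports the statement that the searches are behaving as expected.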

It will also be of interest to show how these numbers change over time. Is there an exponential increase? We shall see.

Finally, we have not really explored adherence to e.g. the AIR of FAIR. That is for another post.

Harnessing FAIR data: A suggested useful persistent identifier (PID) for quantum chemical calculations.

Tuesday, August 7th, 2018

Harnessing FAIR data is an event being held in London on September 3rd; no doubt most speakers will espouse its virtues and speculate about how to realize its potential. Admirable aspirations indeed, but capturing hearts and minds also needs lots of real life applications! Whilst assembling a forthcoming post on this blog, I realized I might have one nice application which also pushes the envelope a bit further, in a manner that I describe below.

The post I refer to above is about using quantum chemical calculations to chart possible mechanistic pathways for the reaction between a carboxylic acid and an amine to form an amide. The FAIR data for the entire project is collected at DOI: 10.14469/hpc/4598. Part of what makes it FAIR is the metadata not only collected about this data but also formally registered with the DataCite agency. Registration in turn enables Finding; it is this aspect I want to demonstrate here.

The metadata for the above DOI includes information such as:

  1. The ORCID persistent identifier (PID) for the creator of the data (in this instance myself)
  2. Date stamps for the original creation date and subsequent modifications.
  3. A rights declaration, in this case the CC0 license which describes how the data can be re-used.
  4. Related identifiers, in this case describing members of this collection.

The data itself is held in the members of the collection, each of which is described by a more specific set of metadata in addition to the more general types in the above list (e.g. 10.14469/hpc/4606).

  1. One important additional metadata descriptor is the ORE locator (Object Re-use and Exchange, itself almost a synonym for FAIR). This allows a machine to deduce a direct path to the data file itself, and hence to retrieve it automatically if desired. It is important to note that the DOI itself (i.e. 10.14469/hpc/4606) points only to the “landing page” for the dataset, and does not necessarily describe the direct path to any specific file in the dataset. The ORE path can be used with e.g. software such as JSmol to directly load a molecule based only on its DOI. You can see an example of this here.
  2. Each molecule-based dataset contains additional specific metadata relating to the molecule itself. For example, this is how the InChIKey, an identifier specific to that molecule, is expressed in metadata:
    <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">PVXKWVPAMVWJSQ-UHFFFAOYSA-N</subject>
    The advantage of expressing the metadata in this way is that a general search of the type:
    https://search.datacite.org/works?query=subjectScheme:inchikey+subject:CZABGBRSHXZJCF-UHFFFAOYSA-N
    can be used to track down any molecule with metadata corresponding to the above InChIKey.
  3. Here is more metadata, introduced in this blog. It relates to the (computed) value of the Gibbs energy (the energy unit is in Hartree), as returned by the Gaussian program;
    <subject subjectScheme="Gibbs_Energy" schemeURI="https://goldbook.iupac.org/html/G/G02629.html" valueURI="http://gaussian.com/thermo/">-649.732417</subject>
    I argue here that it represents a unique identifier for a molecule calculation using the quantum mechanical procedures implemented in e.g. Gaussian. This identifier differs from the InChIKey in that it can be truncated to provide different levels of information.
    • At the coarsest level, a search of the type
      https://search.datacite.org/works?query=subjectScheme:Gibbs_energy+subject:-649.*
      should reveal all molecules with the same number of atoms and electrons whose Gibbs energy has been calculated, but not necessarily with the same InChI (i.e. they may be isomers, or transition states, etc). This level might be useful for revealing most (not necessarily all) molecules involved in say a reaction mechanism. It should also be insensitive to the program system used, since most quantum codes will return a value for the Gibbs energy if the same procedures have been used (i.e. DFT method, basis set, solvation model and dispersion correction) accurate to probably 0.01 Hartree.
    • The top level of precision however is high enough to almost certainly relate to a specific molecule and probably using a specific program;
      https://search.datacite.org/works?query=subjectScheme:Gibbs_energy+subject:-649.732417
    • The searcher can experiment with different levels of precision to narrow or broaden the search.
    • I would also address the issue (before someone asks) of why I have used the Gibbs energy rather than the Total energy. Put simply, the Gibbs energy is far more useful in a chemical context. It can be used to relate the relative Gibbs energies of different isomers of the same molecule to e.g. the equilibrium constant that might be measured. Or the difference in Gibbs energies between a reactant and a transition state can be used to derive the free energy activation barrier for a reaction. The total energy is not so useful in such contexts, although of course it too could be added as a subject in the metadata above if a real use for it is found.
  4. The searcher can also use Boolean combinations of metadata, such as specifying both the InChIKey and the Gibbs Energy, along with say the ORCID of the person who may have published the data;
    https://search.datacite.org/works?query=
    subjectScheme:Gibbs_energy+subject:-649.*+
    subjectScheme:inchikey+subject:CZABGBRSHXZJCF-UHFFFAOYSA-N+
    ORCID:0000-0002-8635-8390
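The idea of truncating the Gibbs energy to broaden or narrow a search can be sketched as follows. This assumes the query form quoted in the list above (subjectScheme:Gibbs_energy with a trailing * wildcard); the function names are hypothetical and nothing is sent to the search service.

```python
BASE = "https://search.datacite.org/works?query="


def truncated_patterns(value: str) -> list[str]:
    """Return search patterns from the coarsest (integer part) to the full value."""
    whole, _, fraction = value.partition(".")
    patterns = [f"{whole}.*"]
    # Add one more decimal digit at each step, keeping the wildcard.
    for digits in range(1, len(fraction)):
        patterns.append(f"{whole}.{fraction[:digits]}*")
    patterns.append(value)  # full precision, no wildcard
    return patterns


def gibbs_query(pattern: str) -> str:
    """Build a search URL for one Gibbs energy pattern."""
    return f"{BASE}subjectScheme:Gibbs_energy+subject:{pattern}"


for pattern in truncated_patterns("-649.732417"):
    print(gibbs_query(pattern))
```

Note that this is string truncation rather than numerical rounding: a pattern such as -649.73* only matches values whose text begins with exactly those characters.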

I have tried to show above how FAIR data implies some form of rich (registered) metadata. And how the metadata can be used to Find (the F in FAIR) data with very specific properties, thus Harnessing FAIR data.
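For readers who want to pick such subject entries out of a metadata record, here is a minimal sketch using the two subject elements quoted earlier. The enclosing subjects wrapper is an assumption added so that the fragment parses on its own, and XML namespaces (which a complete record would normally declare) are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# The two <subject> elements quoted in this post, inside an assumed wrapper element.
SUBJECTS_XML = """
<subjects>
  <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">PVXKWVPAMVWJSQ-UHFFFAOYSA-N</subject>
  <subject subjectScheme="Gibbs_Energy" schemeURI="https://goldbook.iupac.org/html/G/G02629.html" valueURI="http://gaussian.com/thermo/">-649.732417</subject>
</subjects>
"""


def subjects_by_scheme(xml_text: str) -> dict[str, str]:
    """Map each subjectScheme (lower-cased) to the text of its subject element."""
    root = ET.fromstring(xml_text.strip())
    return {
        element.get("subjectScheme", "").lower(): (element.text or "").strip()
        for element in root.iter("subject")
    }


subjects = subjects_by_scheme(SUBJECTS_XML)
gibbs_hartree = float(subjects["gibbs_energy"])  # energy in Hartree, as stated above
print(subjects["inchikey"], gibbs_hartree)
```

The scheme names are lower-cased because the metadata above writes Gibbs_Energy while the example queries use Gibbs_energy.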


It is a current limitation of the V4.1 DataCite schema that there appears no way to specify the data type of the subject, including any units. In theory, a range query of the type:
https://search.datacite.org/works?query=
subjectScheme:Gibbs_energy+subject:[-649.1 TO -649.8]

should be more specific, but I have not yet gotten it to work, probably because the lack of data-typing means it is not recognised as a range of numeric values. Implicit in this search is the grouping
https://search.datacite.org/works?query=(subjectScheme:Gibbs_energy+subject:-649.*)
+
(subjectScheme:inchikey+subject:CZABGBRSHXZJCF-UHFFFAOYSA-N)
+ORCID:0000-0002-8635-8390

Currently however DataCite do not correctly honour this form of grouping.
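Until a typed range query works, one possible workaround for the range search mentioned above is to run the coarse wildcard search and filter the returned subject values numerically on the client. This is only a sketch of that idea: the candidate strings other than -649.732417 are hypothetical, and no assumption is made about how the results are actually retrieved.

```python
def in_numeric_range(subject: str, bound_a: float, bound_b: float) -> bool:
    """True if the subject parses as a number between the two bounds (inclusive)."""
    try:
        value = float(subject)
    except ValueError:
        return False  # untyped subjects may hold arbitrary text
    # For negative energies -649.8 is the numerically lower bound, so sort first.
    low, high = sorted((bound_a, bound_b))
    return low <= value <= high


# Hypothetical subject strings, as they might come back from a -649.* search.
candidates = ["-649.732417", "-649.05", "-649.95", "not-a-number"]
hits = [s for s in candidates if in_numeric_range(s, -649.1, -649.8)]
print(hits)
```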

PIDapalooza 2018. A conference like no other!

Tuesday, January 23rd, 2018

Another occasional conference report (day 1). So why is one about “persistent identifiers” important, and particularly to the chemistry domain?

The PID most familiar to most chemists is the DOI (digital object identifier). In fact there are many; some 60 types have been collected by ORCID (themselves purveyors of researcher identifiers). They sometimes even have different names; in life sciences they tend to be known instead as accession numbers. One theme common to many (probably not all) is that they represent sources of metadata about the object being identified: further information which allows you (or a machine) to decide if acquiring the full object is worthwhile. So in no particular order, here are some of the things I learnt today.

  1. Mark Hahnel noted the recent launch of the Dimensions resource, which links research data with other research activities; I have not yet had a chance to learn its capabilities, but it seems an interesting alternative to other stalwarts such as Google Scholar.

    You can try this example: https://app.dimensions.ai/discover/publication?search_text=10.6084&search_type=kws&full_search=true which retrieves articles in which the data repository with prefix 10.6084 (Figshare) is cited. Try also the prefix 10.14469 which is the Imperial College repository.

  2. Andy Mabbett talked about the deployment and use of persistent identifiers (the Q numbers) in Wikidata, which increasingly underpin the basis for the various flavours of Wikipedia. He also noted their use of some 50 different identifiers.
  3. Johanna McEntyre noted some 5M published articles in life sciences which reference 1M+ ORCID identifiers, easily the domain with the fastest uptake of this type. Also noted was the new FREYA project; aiming to connect open identifiers for discovery, access and use of research resources.
  4. Tom Gillespie talked about RRIDs, or Research Resource Identifiers. These include hardware such as instruments, with around 6000 RRIDs systematized so far. They argue this area promotes both the A and I of FAIR (accessible and interoperable). Of course A and I mean many things to many people.
  5. Several other presentations talked about the finer detail of metadata, such as sub-classifications into e.g. descriptive/admin/technical, but I did rather miss demos showing how search queries of such fine-grained metadata could be constructed.

Apart from the presentations themselves, PIDapalooza is unusual for some other activities. Thus you could go get your PIDnails done, with a selection of 8 or so tasteful logos to choose from. There will be tattoos tomorrow (this is a conference for younger people after all). I may grab a photo or two to provide evidence!

 

Data-free research data management? Not an oxymoron.

Tuesday, May 24th, 2016

I occasionally post about "RDM" (research data management), an activity that has recently become a formalised and essential part of the research process. I say recently formalised, since researchers have of course kept research notebooks recording their activities and their data since the dawn of science, but not always in an open and transparent manner. The desirability of doing so was revealed by the 2009 "Climategate" events. In the UK, Climategate was apparently the catalyst which persuaded the funding councils (such as the EPSRC, the Royal Society, etc) to formulate policies which required all their funded researchers to adopt the principles of RDM by May 2015 and in their future research. An early career researcher here, anxious to conform to the funding body instructions, sent me an email a few days ago asking about one aspect of RDM which got me thinking.

The question related to the divide between data as a separate research object (and which therefore has to be managed), and data as an inseparable part of the article narrative, which is of course ostensibly managed by the journal publication processes. Such data may often be the description of a process rather than simply tables of numbers or graphs. In chemistry it may include chemical names and chemical terms as part of an experimental procedure. For one nice illustration of such embedded data, go look at the chemical tagger page. Here the data is blending with the semantics, and the two are not easily separated. So, when such separation is not easily achieved, should the specific processes required by RDM as illustrated in the five bullet points below actually be followed?

  1. Specify a data management plan to be followed, as for example points 2-5 below.
  2. Decide upon a location for your data, separated into one for "live" or working data (the purpose simply being to ensure it is properly backed up) and the other for a sub-set of formally "published data" which has to be available for at least ten years after its publication.
  3. Use 2 to gather metadata (see 6-14 below) and in return get a DOI representing the location of the combined metadata + data, from a suitable registration authority such as DataCite.
  4. Quote this DOI(s) in the article describing the results of analysing the data and presenting hypotheses, and conversely once the article itself is allocated its own DOI from a registration authority such as CrossRef, update the metadata in item 3 so as to achieve a bidirectional link between the data and its narrative (and we assume that DataCite and CrossRef will also increasingly exchange the metadata they each hold about the items).
  5. Add both the data and the article DOIs to any institutional CRIS or current research information system (parenthetically, I regard this last stipulation as rather redundant if items 3 and 4 are working effectively, but it's a good interim measure whilst the overall system matures).

So, should step 2 be included if the data itself is inextricably intertwined with the narrative and cannot be separated? The slightly surprising advice I would suggest is yes! And the answer is that it IS possible to generate metadata (data about the, possibly entwined, data) which CAN be processed in such a step. What forms would such metadata take?

  6. Identification of the researcher(s) involved. This would nowadays take the form of an ORCID (Open Researcher and Contributor ID).
  7. Identification of the hosting institution where the data has been produced. There is currently no equivalent to an ORCID for institutions, but it is very likely to come in the future.
  8. A date stamp formalising when the (meta)data is actually deposited.
  9. A title for the project being described. Here we see a blurring between the narrative/article and the data; a title is the shortest possible description of the narrative/article, and it may also apply to the data object(s) or it could have its own title.
  10. A slightly fuller abstract of the project being described. Here we see further blurring between the narrative and data objects.
  11. One can include "related identifiers", in particular the DOIs of any other relevant articles that might have been published which may expand the context of the data, and also the DOIs of any other relevant datasets which may have been allocated in step 2 above.
  12. It is also beneficial to include "chemical identifiers". These can take the form of InChI strings and InChI keys, which allow discretely defined molecular objects which were the object of the research to be tracked and which relate to both the narrative and any other data objects.
  13. If specific software has been used to analyse data, it too can be included as a "related identifier" (e.g. [1]).
  14. Potentially at least, if a well-defined instrument has been involved, it too could be included with its own "related identifier". With both 13-14, other issues may need addressing, such as versioning etc, but this no doubt will be sorted in due course.
  15. etc.

So items such as 6-14 can be collected and sent to e.g. DataCite with a DOI received in return as part of item 2 of the RDM processes. No "pure" data need be involved, only metadata. Nonetheless such metadata can only increase the visibility and discoverability of the research, as illustrated in how such metadata can be searched for the components described above.
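To make the notion of a metadata-only deposit more concrete, here is an illustrative sketch of how the items listed above might be gathered into a single record before being sent for registration. The key names are loosely modelled on DataCite property names but have not been validated against the schema, the function names are hypothetical, and every value apart from the software DOI from reference 1 is a placeholder.

```python
REQUIRED_KEYS = ("creators", "titles", "dates")


def build_record(orcid, title, abstract, deposited, related_dois, inchikeys):
    """Assemble a metadata-only record; no data files are involved."""
    return {
        "creators": [{"nameIdentifier": orcid, "nameIdentifierScheme": "ORCID"}],
        "titles": [title],
        "descriptions": [abstract],
        "dates": [{"date": deposited, "dateType": "Submitted"}],
        "relatedIdentifiers": [
            {"relatedIdentifier": doi, "relatedIdentifierType": "DOI"}
            for doi in related_dois
        ],
        "subjects": [
            {"subject": key, "subjectScheme": "inchikey"} for key in inchikeys
        ],
    }


def missing_keys(record):
    """List the required keys that are absent or empty in the record."""
    return [key for key in REQUIRED_KEYS if not record.get(key)]


record = build_record(
    orcid="0000-0000-0000-0000",            # placeholder, not a real researcher
    title="Placeholder project title",
    abstract="Placeholder abstract.",
    deposited="2016-05-24",                 # placeholder date
    related_dois=["10.5281/zenodo.19272"],  # the software DOI cited as reference 1
    inchikeys=[],                           # none supplied in this sketch
)
print(missing_keys(record))
```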

References

  1. H.S. Rzepa, "KINISOT. A basic program to calculate kinetic isotope effects using normal coordinate analysis of transition state and reactants.", 2015. https://doi.org/10.5281/zenodo.19272

Collaborative FAIR data sharing.

Sunday, April 17th, 2016

I want to describe a recent attempt by a group of collaborators to share the research data associated with their just published article.[1]

I am here introducing things in a hierarchical form (i.e. not necessarily the serial order in which actions were taken).

  1. The data repository selected for the data sharing is described by (re3data) doi: 10.17616/R3K64N[2]
  2. A collaborative project collection was established on this repository (doi: 10.14469/hpc/244[3]). This data collection has some of the following attributes:
  3. Its metadata is sent here: https://search.datacite.org/ui?&q=10.14469/hpc/244 where it can be queried for other details.
  4. The project collaborators are all identified by their ORCID, used to obtain further individual information about the researchers. This information is also propagated to the metadata sent to DataCite.
  5. In the section labelled associated DOIs there is a link to the recently published peer-reviewed article, which itself cites the data via doi: 10.14469/hpc/244 and which thus establishes a bidirectional link between the article and its data.
  6. Also in the associated DOIs section are other DOIs (to two figures and two tables) held in a separate location. One example is doi: 10.14469/hpc/332[4], which illustrates the original type of data sharing we started about 10 years ago. This form has been variously called a "WEO" or Web-enhanced object (by the ACS) or interactivity boxes (RSC, etc). In such WEOs, we wrap the data into an interactive visual appearance using Jmol or JSmol software. The data itself is directly available to the reader using the Jmol export functions (right mouse click in the visual window).

    • In this specific example the WEO has been assigned its DOI using the repository noted above.[2] 
    • We have in the past also used Figshare[5] for this purpose, see e.g. 10.6084/m9.figshare.1181739
    • The WEO can itself reference a more complete set of data used to create the visual appearance, for example data that allows the wavefunction of the molecule to be computed, doi: 10.6084/m9.figshare.2581987.v1[6]. In this instance this is held on the Figshare[5] repository.
  7. The collection has another section labelled Members. These are individual datasets associated with the collection and held on the SAME repository as the collection itself. In this case, there are five such members, two of which are listed below:

    1. 10.14469/hpc/281[7] contains a variety of other data such as outputs from an IRC (intrinsic reaction coordinate), energy profile diagrams and ZIP archives of other calculations.
    2. 10.14469/hpc/272[8] itself contains five members, one of which is e.g.

      • 10.14469/hpc/267[9] which contains a ZIP archive with NMR data (see here for how this might be packaged in the future) and a file for a GPC (chromatography) instrument.
      • This last item also contains a new section labelled Metadata, which includes e.g. the InChI key and InChI string for the molecule whose properties are reported.
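The hierarchy described above (a collection whose members may themselves have members) can be pictured as a small tree. The sketch below lists only the DOIs actually named in this post, so it is deliberately incomplete, and the structure and function name are illustrative rather than anything exposed by the repository.

```python
# Collection hierarchy as described above; only members named in the post are listed.
collection = {
    "doi": "10.14469/hpc/244",
    "members": [
        {"doi": "10.14469/hpc/281", "members": []},
        {
            "doi": "10.14469/hpc/272",
            "members": [{"doi": "10.14469/hpc/267", "members": []}],
        },
    ],
}


def walk(node, depth=0):
    """Yield (depth, resolver URL) for a node and all its members, depth-first."""
    yield depth, "https://doi.org/" + node["doi"]
    for member in node["members"]:
        yield from walk(member, depth + 1)


# Indent each DOI according to its depth in the collection.
for depth, url in walk(collection):
    print("  " * depth + url)
```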

If this mode of presenting data seems a little more complex than a single monolithic PDF file, it's because it's designed for:

  1. collaboration between scientists, potentially at different locations and institutions.
  2. attribution of provenance/credit for the individual items (via ORCID).
  3. separate date stamping by the various contributors.
  4. providing bi-directional links between data and publications.
  5. holding what we call FAIR (findable, accessible, interoperable and reusable) data, rather than just data encapsulated in a PDF file.
  6. collecting, storing and sending metadata for aggregation in a formal way, i.e. to DataCite using a formal schema to render the metadata properly searchable.

Thus 10.14469/hpc/244 represents our most complex attempt yet at such collaborative FAIR data sharing with multiple contributors. The tools for packaging many of the datasets are still quite limited (see again here) and the design is still being optimised (call it α). When the repository[2] has been more extensively tested, we intend to make it available as open source for others to experiment with. And of course, when this happens the source code too will have its own DOI!


A refactoring of the Figshare site in December 2015 meant that the DOI no longer points directly to the WEO, and you have to follow a manually inserted link on that page to see it.

References

  1. C. Romain, Y. Zhu, P. Dingwall, S. Paul, H.S. Rzepa, A. Buchard, and C.K. Williams, "Chemoselective Polymerizations from Mixtures of Epoxide, Lactone, Anhydride, and Carbon Dioxide", Journal of the American Chemical Society, vol. 138, pp. 4120-4131, 2016. https://doi.org/10.1021/jacs.5b13070
  2. Re3data.Org., "Imperial College Research Computing Service Data Repository", 2016. https://doi.org/10.17616/r3k64n
  3. C. Romain, "Chemo-Selective Polymerizations Using Mixtures of Epoxide, Lactone, Anhydride and CO2", 2016. https://doi.org/10.14469/hpc/244
  4. H. Rzepa, "Table S8: Comparison of two different basis sets for selected intermediates for CHO/PA ROCOP.", 2016. https://doi.org/10.14469/hpc/332
  5. Re3data.Org., "figshare", 2012. https://doi.org/10.17616/r3pk5r
  6. P. Dingwall, "Gaussian Job Archive for C6H10O", 2016. https://doi.org/10.6084/m9.figshare.2581987.v1
  7. C. Romain, "Figure 9, Figure S18, Figure S19: ROCOP of PA/CHO + IRC", 2016. https://doi.org/10.14469/hpc/281
  8. C. Romain, "Table 1: Polymerizations Using Lactone, Epoxide, and CO2", 2016. https://doi.org/10.14469/hpc/272
  9. C. Romain, "Table 1, entry 1: Polymerizations Using Lactone, Epoxide, and CO2", 2016. https://doi.org/10.14469/hpc/267