Posts Tagged ‘Identifiers’
Thursday, April 18th, 2019
In a previous post, I looked at the Findability of FAIR data in common chemistry journals. Here I move on to the next letter, the A = Accessible.
The attributes of A[1] include:
- (meta)data are retrievable by their identifier using a standardized communication protocol.
- the protocol is open, free and universally implementable.
- the protocol allows for an authentication and authorization procedure.
- metadata are accessible, even when the data are no longer available.
- The metadata should include access information that enables automatic processing by a machine as well as a person.
Items 1-2 are covered by associating a DOI (digital object identifier) with the metadata. Item 3 relates to data which is not necessarily also OPEN (FAIR and OPEN are complementary, but do not mean the same).
Item 4 mandates that a copy of the metadata be held separately from the data itself; currently the favoured repository is DataCite (and this metadata way well be duplicated at CrossRef, thus providing a measure of redundancy). It also addresses an interesting debate on whether the container for data such as a ZIP or other compressed archive should also contain the full metadata descriptors internally, which would not directly address item 4, but could do so by also registering a copy of the metadata externally with eg DataCite.
Item 4 also implies some measure of separation between the data and its metadata, which now raises an interesting and separate issue (introduced with this post) that the metadata can be considered a living object, with some attributes being updated post deposition of the data itself. Thus such metadata could include an identifier to the journal article relating to the data, information that only appears after the FAIR data itself is published. Or pointers to other datasets published at a later date. Such updating of metadata contained in an archive along with the data itself would be problematic, since the data itself should not be a living object.
Item 5 is the need for Accessibility to relate both to a human acquiring FAIR data and to a machine. The latter needs direct information on exactly how to access the data. To illustrate this, I will use data deposited in support of the previous post and for which a representative example of metadata can be found at (item 4) a separate location at:
data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/5496
This contains the components:
- <relatedIdentifier relatedIdentifierType="URL" relationType="HasMetadata" relatedMetadataScheme="ORE"schemeURI="http://www.openarchives.org/ore/
">https://data.hpc.imperial.ac.uk/resolve/?ore=5496</relatedIdentifier>
- <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart" relatedMetadataScheme="Filename" schemeURI="filename://aW5wdXQuZ2pm">https://data.hpc.imperial.ac.uk/resolve/?doi=5496&file=1</relatedIdentifier>
Item 6 is an machine-suitable RDF declaration of the full metadata record. Item 7 allows direct access to the datafile. This in turn allows programmed interfaces to the data to be constructed, which include e.g. components for immediate visualisation and/or analysis. It also allows access on a large-scale (mining), something a human is unlikely to try.
It would be fair to say that the A of FAIR is still evolving. Moreover, searches of the DataCite metadata database are not yet at the point where one can automatically identify metadata records that have these attributes. When they do become available, I will show some examples here.
Added: This search: https://search.test.datacite.org/works?
query=relatedIdentifiers.relatedMetadataScheme:ORE shows how it might operate.
References
- M.D. Wilkinson, M. Dumontier, I.J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J. Boiten, L.B. da Silva Santos, P.E. Bourne, J. Bouwman, A.J. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C.T. Evelo, R. Finkers, A. Gonzalez-Beltran, A.J. Gray, P. Groth, C. Goble, J.S. Grethe, J. Heringa, P.A. ’t Hoen, R. Hooft, T. Kuhn, R. Kok, J. Kok, S.J. Lusher, M.E. Martone, A. Mons, A.L. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van Schaik, S. Sansone, E. Schultes, T. Sengstag, T. Slater, G. Strawn, M.A. Swertz, M. Thompson, J. van der Lei, E. van Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wolstencroft, J. Zhao, and B. Mons, "The FAIR Guiding Principles for scientific data management and stewardship", Scientific Data, vol. 3, 2016. https://doi.org/10.1038/sdata.2016.18
Tags:Academic publishing, automatic processing, Data management, Digital Object Identifier, EIDR, FAIR data, Findability, Identifiers, Information, Information architecture, Information science, Knowledge, Knowledge representation, metadata, mining, Open Archives Initiative, RDF, Records management, representative, standardized communication protocol, Technical communication, Technology/Internet, Web design, Written communication, XML
Posted in Chemical IT | No Comments »
Thursday, April 18th, 2019
In a previous post, I looked at the Findability of FAIR data in common chemistry journals. Here I move on to the next letter, the A = Accessible.
The attributes of A[1] include:
- (meta)data are retrievable by their identifier using a standardized communication protocol.
- the protocol is open, free and universally implementable.
- the protocol allows for an authentication and authorization procedure.
- metadata are accessible, even when the data are no longer available.
- The metadata should include access information that enables automatic processing by a machine as well as a person.
Items 1-2 are covered by associating a DOI (digital object identifier) with the metadata. Item 3 relates to data which is not necessarily also OPEN (FAIR and OPEN are complementary, but do not mean the same).
Item 4 mandates that a copy of the metadata be held separately from the data itself; currently the favoured repository is DataCite (and this metadata way well be duplicated at CrossRef, thus providing a measure of redundancy). It also addresses an interesting debate on whether the container for data such as a ZIP or other compressed archive should also contain the full metadata descriptors internally, which would not directly address item 4, but could do so by also registering a copy of the metadata externally with eg DataCite.
Item 4 also implies some measure of separation between the data and its metadata, which now raises an interesting and separate issue (introduced with this post) that the metadata can be considered a living object, with some attributes being updated post deposition of the data itself. Thus such metadata could include an identifier to the journal article relating to the data, information that only appears after the FAIR data itself is published. Or pointers to other datasets published at a later date. Such updating of metadata contained in an archive along with the data itself would be problematic, since the data itself should not be a living object.
Item 5 is the need for Accessibility to relate both to a human acquiring FAIR data and to a machine. The latter needs direct information on exactly how to access the data. To illustrate this, I will use data deposited in support of the previous post and for which a representative example of metadata can be found at (item 4) a separate location at:
data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/5496
This contains the components:
- <relatedIdentifier relatedIdentifierType="URL" relationType="HasMetadata" relatedMetadataScheme="ORE"schemeURI="http://www.openarchives.org/ore/
">https://data.hpc.imperial.ac.uk/resolve/?ore=5496</relatedIdentifier>
- <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart" relatedMetadataScheme="Filename" schemeURI="filename://aW5wdXQuZ2pm">https://data.hpc.imperial.ac.uk/resolve/?doi=5496&file=1</relatedIdentifier>
Item 6 is an machine-suitable RDF declaration of the full metadata record. Item 7 allows direct access to the datafile. This in turn allows programmed interfaces to the data to be constructed, which include e.g. components for immediate visualisation and/or analysis. It also allows access on a large-scale (mining), something a human is unlikely to try.
It would be fair to say that the A of FAIR is still evolving. Moreover, searches of the DataCite metadata database are not yet at the point where one can automatically identify metadata records that have these attributes. When they do become available, I will show some examples here.
Added: This search: https://search.test.datacite.org/works?
query=relatedIdentifiers.relatedMetadataScheme:ORE shows how it might operate.
References
- M.D. Wilkinson, M. Dumontier, I.J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J. Boiten, L.B. da Silva Santos, P.E. Bourne, J. Bouwman, A.J. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C.T. Evelo, R. Finkers, A. Gonzalez-Beltran, A.J. Gray, P. Groth, C. Goble, J.S. Grethe, J. Heringa, P.A. ’t Hoen, R. Hooft, T. Kuhn, R. Kok, J. Kok, S.J. Lusher, M.E. Martone, A. Mons, A.L. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van Schaik, S. Sansone, E. Schultes, T. Sengstag, T. Slater, G. Strawn, M.A. Swertz, M. Thompson, J. van der Lei, E. van Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wolstencroft, J. Zhao, and B. Mons, "The FAIR Guiding Principles for scientific data management and stewardship", Scientific Data, vol. 3, 2016. https://doi.org/10.1038/sdata.2016.18
Tags:Academic publishing, automatic processing, Data management, Digital Object Identifier, EIDR, FAIR data, Findability, Identifiers, Information, Information architecture, Information science, Knowledge, Knowledge representation, metadata, mining, Open Archives Initiative, RDF, Records management, representative, standardized communication protocol, Technical communication, Technology/Internet, Web design, Written communication, XML
Posted in Chemical IT | No Comments »
Friday, April 12th, 2019
In recent years, findable data has become ever more important (the F in FAIR). Here I test that F using the DataCite search service.
Firstly an introduction to this service. This is a metadata database about datasets and other research objects. One of the properties is relatedIdentifier which records other identifiers associated with the dataset, being say the DOI of any published article associated with the data, but it could also be pointers to related datasets.
One can query thus:
- https://search.datacite.org/works?query=relatedIdentifiers.relatedIdentifier:*
which retrieves the very healthy looking 6,179,287 works.
- One can restrict this to a specific publisher by the DOI prefix assigned to that publisher:
?query=relatedIdentifiers.relatedIdentifier:10.1021*
which returns a respectable 210,240 works.
- It turns out that the major contributor to FAIR currently are crystal structures from the CCDC. One can remove them from the search to see what is left over:
?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+NOT+(identifier:*10.5517*)
and one is down to 14,213 works, of which many nevertheless still appear to be crystal structures. These may be links to other crystal datasets.
I have performed searches 2 and 3 for some popular publishers of chemistry (the same set that were analysed here).
| Publisher |
Search 2 |
Search 3 |
| ACS |
210,240 |
14,213 |
| RSC |
138,147 |
1,279 |
| Elsevier |
185,351 |
56,373 |
| Nature |
12,316 |
8,104 |
| Wiley |
135,874 |
9,283 |
| Science |
3,384 |
2,343 |
These publishers all have significant numbers of datasets which at least accord with the F of FAIR. A lot of data sets may not have metadata which in fact points back to a published article, since this can be something that has to be done only when the DOI of that article appears, in other words AFTER the publication of the dataset. So these numbers are probably low rather than high.
How about the other way around? Rather than datasets that have a journal article as a related identifier, we could search for articles that have a dataset as a related identifier?
- ?query=(identifier:*10.1039*)+AND+(relatedIdentifiers.relatedIdentifier:*)
returns rather mysterious nothing found. It might also be that there is no mapping of this search between the CrossRef and DataCite metadata schemas.
- And just to show the searches are behaving as expected:
?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+AND+(identifier:*10.5517*)
returns 196,027 works.
It will also be of interest to show how these numbers change over time. Is there an exponential increase? We shall see.
Finally, we have not really explored adherence to eg the AIR of FAIR. That is for another post.
Tags:Academic publishing, DataCite, Digital Object Identifier, Digital technology, Elsevier, Findability, Identifiers, Information, Information architecture, Information science, Knowledge, Knowledge representation, search service, Web design
Posted in Chemical IT | 1 Comment »
Saturday, February 16th, 2019
The title of this post comes from the site www.crossref.org/members/prep/ Here you can explore how your favourite publisher of scientific articles exposes metadata for their journal.
Firstly, a reminder that when an article is published, the publisher collects information about the article (the “metadata”) and registers this information with CrossRef in exchange for a DOI. This metadata in turn is used to power e.g. a search engine which allows “rich” or “deep” searching of the articles to be undertaken. There is also what is called an API (Application Programmer Interface) which allows services to be built offering deeper insights into what are referred to as scientific objects. One such service is “Event Data“, which attempts to create links between various research objects such as publications, citations, data and even commentaries in social media. A live feed can be seen here.
So here are the results for the metadata provided by six publishers familiar to most chemists, with categories including;
- References
- Open References
- ORCID IDs
- Text mining URLs
- Abstracts

RSC

ACS

Elsevier

Springer-Nature

Wiley

Science
One immediately notices the large differences between publishers. Thus most have 0% metadata for the article abstracts, but one (the RSC) has 87%! Another striking difference is those that support open references (OpenCitations). The RSC and Springer Nature are 99-100% compliant whilst the ACS is 0%. Yet another variation is the adoption of the ORCID (Open Researcher and Collaborator Identifier), where the learned society publishers (RSC, ACS) achieve > 80%, but the commercial publishers are in the lower range of 20-49%.
To me the most intriguing was the Text mining URLs. From the help pages, “The Crossref REST API can be used by researchers to locate the full text of content across publisher sites. Publishers register these URLs – often including multiple links for different formats such as PDF or XML – and researchers can request them programatically“. Here the RSC is at 0%, ACS is at 8% but the commercial publishers are 80+%. I tried to find out more at e.g. https://www.springernature.com/gp/researchers/text-and-data-mining but the site was down when I tried. This can be quite a controversial area. Sometimes the publisher exerts strict control over how the text mining can be carried out and how any results can be disseminated. Aaron Swartz famously fell foul of this.
I am intrigued as to how, as a reader with no particular pre-assembled toolkit for text mining, I can use this metadata provided by the publishers to enhance my science. After all, 80+% of articles with some of the publishers apparently have a mining URL that I could use programmatically. If anyone reading this can send some examples of the process, I would be very grateful.
Finally I note the absence of any metadata in the above categories relating to FAIR data. Such data also has the potential for programmatic procedures to retrieve and re-use it (some examples are available here[1]), but apparently publishers do not (yet) collect metadata relating to FAIR. Hopefully they soon will.
References
- A. Barba, S. Dominguez, C. Cobas, D.P. Martinsen, C. Romain, H.S. Rzepa, and F. Seoane, "Workflows Allowing Creation of Journal Article Supporting Information and Findable, Accessible, Interoperable, and Reusable (FAIR)-Enabled Publication of Spectroscopic Data", ACS Omega, vol. 4, pp. 3280-3286, 2019. https://doi.org/10.1021/acsomega.8b03005
Tags:Aaron Swartz, Academic publishing, API, Business intelligence, CrossRef, data, Data management, Elsevier, favourite publisher, Identifiers, Information, Information science, Knowledge, Knowledge representation, metadata, mining, ORCiD, PDF, Pre-exposure prophylaxis, Publishing, Publishing Requirements for Industry Standard Metadata, Records management, Research Object, Scholarly communication, Scientific literature, search engine, social media, Technical communication, Technology/Internet, text mining, Written communication, XML
Posted in Interesting chemistry | 1 Comment »
Saturday, February 16th, 2019
The title of this post comes from the site www.crossref.org/members/prep/ Here you can explore how your favourite publisher of scientific articles exposes metadata for their journal.
Firstly, a reminder that when an article is published, the publisher collects information about the article (the “metadata”) and registers this information with CrossRef in exchange for a DOI. This metadata in turn is used to power e.g. a search engine which allows “rich” or “deep” searching of the articles to be undertaken. There is also what is called an API (Application Programmer Interface) which allows services to be built offering deeper insights into what are referred to as scientific objects. One such service is “Event Data“, which attempts to create links between various research objects such as publications, citations, data and even commentaries in social media. A live feed can be seen here.
So here are the results for the metadata provided by six publishers familiar to most chemists, with categories including;
- References
- Open References
- ORCID IDs
- Text mining URLs
- Abstracts

RSC

ACS

Elsevier

Springer-Nature

Wiley

Science
One immediately notices the large differences between publishers. Thus most have 0% metadata for the article abstracts, but one (the RSC) has 87%! Another striking difference is those that support open references (OpenCitations). The RSC and Springer Nature are 99-100% compliant whilst the ACS is 0%. Yet another variation is the adoption of the ORCID (Open Researcher and Collaborator Identifier), where the learned society publishers (RSC, ACS) achieve > 80%, but the commercial publishers are in the lower range of 20-49%.
To me the most intriguing was the Text mining URLs. From the help pages, “The Crossref REST API can be used by researchers to locate the full text of content across publisher sites. Publishers register these URLs – often including multiple links for different formats such as PDF or XML – and researchers can request them programatically“. Here the RSC is at 0%, ACS is at 8% but the commercial publishers are 80+%. I tried to find out more at e.g. https://www.springernature.com/gp/researchers/text-and-data-mining but the site was down when I tried. This can be quite a controversial area. Sometimes the publisher exerts strict control over how the text mining can be carried out and how any results can be disseminated. Aaron Swartz famously fell foul of this.
I am intrigued as to how, as a reader with no particular pre-assembled toolkit for text mining, I can use this metadata provided by the publishers to enhance my science. After all, 80+% of articles with some of the publishers apparently have a mining URL that I could use programmatically. If anyone reading this can send some examples of the process, I would be very grateful.
Finally I note the absence of any metadata in the above categories relating to FAIR data. Such data also has the potential for programmatic procedures to retrieve and re-use it (some examples are available here[1]), but apparently publishers do not (yet) collect metadata relating to FAIR. Hopefully they soon will.
References
- A. Barba, S. Dominguez, C. Cobas, D.P. Martinsen, C. Romain, H.S. Rzepa, and F. Seoane, "Workflows Allowing Creation of Journal Article Supporting Information and Findable, Accessible, Interoperable, and Reusable (FAIR)-Enabled Publication of Spectroscopic Data", ACS Omega, vol. 4, pp. 3280-3286, 2019. https://doi.org/10.1021/acsomega.8b03005
Tags:Aaron Swartz, Academic publishing, API, Business intelligence, CrossRef, data, Data management, Elsevier, favourite publisher, Identifiers, Information, Information science, Knowledge, Knowledge representation, metadata, mining, ORCiD, PDF, Pre-exposure prophylaxis, Publishing, Publishing Requirements for Industry Standard Metadata, Records management, Research Object, Scholarly communication, Scientific literature, search engine, social media, Technical communication, Technology/Internet, text mining, Written communication, XML
Posted in Interesting chemistry | 1 Comment »
Tuesday, August 7th, 2018
Harnessing FAIR data is an event being held in London on September 3rd; no doubt most speakers will espouse its virtues and speculate about how to realize its potential. Admirable aspirations indeed, but capturing hearts and minds also needs lots of real life applications! Whilst assembling a forthcoming post on this blog, I realized I might have one nice application which also pushes the envelope a bit further, in a manner that I describe below.
The post I refer to above is about using quantum chemical calculations to chart possible mechanistic pathways for the reaction between a carboxylic acid and an amine to form an amide. The FAIR data for the entire project is collected at DOI: 10.14469/hpc/4598. Part of what makes it FAIR is the metadata not only collected about this data but also formally registered with the DataCite agency. Registration in turn enables Finding; it is this aspect I want to demonstrate here.
The metadata for the above DOI includes information such as;
- The ORCID persistent identifier (PID) for the creator of the data (in this instance myself)
- Date stamps for the original creation date and subsequent modifications.
- A rights declaration, in this case the CC0 license which describes how the data can be re-used.
- Related identifiers, in this case describing members of this collection.
The data itself is held in the members of the collection, each of which is described by a more specific set of metadata in addition to the more general types in the above list (e.g. 10.14469/hpc/4606).
- One important additional metadata descriptor is the ORE locator (Object Re-use and Exchange, itself almost a synonym for FAIR). This allows a machine to deduce a direct path to the data file itself, and hence to retrieve it automatically if desired. It is important to note that the DOI itself (i.e. 10.14469/hpc/4606) points only to the “landing page” for the dataset, and does not necessarily describe the direct path to any specific file in the dataset. The ORE path can be used with e.g. software such as JSmol to directly load a molecule based only on its DOI. You can see an example of this here.
- Each molecule-based dataset contains additional specific metadata relating to the molecule itself. For example this is how the InChiKey, an identifier specific to that molecule, is expressed in metadata;
<subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">PVXKWVPAMVWJSQ-UHFFFAOYSA-N</subject>
The advantage of expressing the metadata in this way is that a general search of the type:
https://search.datacite.org/works?query=subjectScheme:inchikey+subject:CZABGBRSHXZJCF-UHFFFAOYSA-N
can be used to track down any molecule with metadata corresponding to the above InChIkey.
- Here is more metadata, introduced in this blog. It relates to the (computed) value of the Gibbs energy (the energy unit is in Hartree†), as returned by the Gaussian program;
<subject subjectScheme="Gibbs_Energy" schemeURI="https://goldbook.iupac.org/html/G/G02629.html" valueURI="http://gaussian.com/thermo/">-649.732417</subject>
I here argue that it represents a unique identifier for a molecule calculation using the quantum mechanical procedures implemented in e.g. Gaussian. This identifier is different from the InChIkey, in that it can be truncated to provide different levels of information.
- At the coarsest level, a search of the type
https://search.datacite.org/works?query=subjectScheme:Gibbs_energy+subject:-649.*
should reveal all molecules with the same number of atoms and electrons whose Gibbs energy has been calculated, but not necessarily with the same InChI (i.e. they may be isomers, or transition states, etc). This level might be useful for revealing most (not necessarily all‡) molecules involved in say a reaction mechanism. It should also be insensitive to the program system used, since most quantum codes will return a value for the Gibbs energy if the same procedures have been used (i.e. DFT method, basis set, solvation model and dispersion correction) accurate to probably 0.01 Hartree.
- The top level of precision however is high enough to almost certainly relate to a specific molecule and probably using a specific program;
https://search.datacite.org/works?query=subjectScheme:Gibbs_energy+subject:-649.732417
- The searcher can experiment with different levels of precision to narrow or broaden the search.
- I would also address the issue (before someone asks) of why I have used the Gibbs energy rather than the Total energy. Put simply, the Gibbs energy is far more useful in a chemical context. It can be used to relate the relative Gibbs energies of different isomers of the same molecule to e.g. the equilibrium constant that might be measured. Or the difference in Gibbs energies between a reactant and a transition state can be used to derive the free energy activation barrier for a reaction. The total energy is not so useful in such contexts, although of course it too could be added as a subject in the metadata above if a real use for it is found.
- The searcher can also use Boolean combinations of metadata, such as specifying both the InChIKey and the Gibbs Energy, along with say the ORCID of the person who may have published the data;
https://search.datacite.org/works?query=
subjectScheme:Gibbs_energy+subject:-649.*+
subjectScheme:inchikey+subject:CZABGBRSHXZJCF-UHFFFAOYSA-N+
ORCID:0000-0002-8635-8390♥
I have tried to show above how FAIR data implies some form of rich (registered) metadata. And how the metadata can be used to Find (the F in FAIR) data with very specific properties, thus Harnessing FAIR data.
†It is a current limitation of the V4.1 DataCite schema that there appears no way to specify the data type of the subject, including any units. ‡In theory, a range query of the type:
https://search.datacite.org/works?query=
subjectScheme:Gibbs_energy+subject:[-649.1 TO -649.8]
should be more specific, but I have not yet gotten it to work, probably because of the lack of data-typing means it is not recognised as a range of numeric values. ♥Implicit in this search is the grouping
https://search.datacite.org/works?query=(subjectScheme:Gibbs_energy+subject:-649.*)
+
(subjectScheme:inchikey+subject:CZABGBRSHXZJCF-UHFFFAOYSA-N)
+ORCID:0000-0002-8635-8390
Currently however DataCite do not correctly honour this form of grouping.
Tags:Academic publishing, chemical context, Code, data, DataCite, energy, free energy activation barrier, Identifiers, Information, ISO/IEC 11179, ORCiD, quantum chemical calculations, real life applications, Technical communication
Posted in Interesting chemistry | 9 Comments »
Tuesday, January 23rd, 2018
Another occasional conference report (day 1). So why is one about “persistent identifiers” important, and particularly to the chemistry domain?
The PID most familiar to most chemists is the DOI (digital object identifier). In fact there are many; some 60 types have been collected by ORCID (themselves purveyors of researcher identifiers). They sometimes even have different names; in life sciences they tend to be known instead as accession numbers. One theme common to many (probably not all) is that they represent sources of metadata about the object being identified. Further information if which allows you (or a machine) to decide if acquiring the full object is worthwhile. So in no particular order, here are some of the things I learnt today.
- Mark Hahnel noted the recent launch of the Dimensions resource which links research data with other research activities; I have not yet had a chance to learn its capabilities, but it seems an interesting alternative to other stalwarts such as eg Google Scholar etc.
You can try this example: https://app.dimensions.ai/discover/publication?search_text=10.6084&search_type=kws&full_search=true which retrieves articles in which the data repository with prefix 10.6084 (Figshare) is cited. Try also the prefix 10.14469 which is the Imperial College repository.
- Andy Mabbett talked about the deployment and use of persistent identifiers (the Q numbers) in Wikidata, which increasingly underpin the basis for the various flavours of Wikipedia. He also noted their use of some 50 different identifiers.
- Johanna McEntyre noted some 5M published articles in life sciences which reference 1M+ ORCID identifiers, easily the domain with the fastest uptake of this type. Also noted was the new FREYA project; aiming to connect open identifiers for discovery, access and use of research resources.
- Tom Gillespie talked about RRID, or Research Resource Identifiers. Included in this are hardware, including instruments and with around 6000 RRIDs systematized so far. They argue this area promotes both the A and I of FAIR (accessible and inter-operable). Of course A and I mean many things to many people.
- Several other presentations talked about the finer detail of metadata, such as sub-classifications into e.g. descriptive/admin/technical, but I did rather miss demos showing how search queries of such fine-grained metadata could be constructed.
Apart from the presentations themselves, PIDapalooza is unusual for some other activities. Thus you could go get your PIDnails done, with a selection of 8 or so tasteful logos to choose from. There will be tattoos tomorrow (this is a conference for younger people after all). I may grab a photo or two to provide evidence!
Tags:Academic publishing, Andy Mabbett, Digital Object Identifier, Identifiers, Imperial College, Index, Information science, Johanna McEntyre, Knowledge, Mark Hahnel, ORCiD, Persistent identifier, Publishing, Quotation, researcher, Scholarly communication, SciCrunch, search engines, Technical communication, Technology/Internet, Tom Gillespie
Posted in Chemical IT | 1 Comment »
Tuesday, May 24th, 2016
I occasionally post about "RDM" (research data management), an activity that has recently become a formalised essential part of the research processes. I say recently formalised, since researchers have of course kept research notebooks recording their activities and their data since the dawn of science, but not always in an open and transparent manner. The desirability of doing so was revealed by the 2009 "Climategate" events. In the UK, Climategate was apparently the catalyst which persuaded the funding councils (such as the EPSRC, the Royal Society, etc) to formulate policies which required all their funded researchers to adopt the principles of RDM by May 2015 and in their future researches. An early career researcher here, anxious to conform to the funding body instructions, sent me an email a few days ago asking about one aspect of RDM which got me thinking.
The question related to the divide between data as a separate research object (and which therefore has to be managed), and data as an inseparable part of the article narrative, which is of course ostensibly managed by the journal publication processes. Such data may often be the description of a process rather than simply tables of numbers or graphs. In chemistry it may include chemical names and chemical terms as part of an experimental procedure. For one nice illustration of such embedded data, go look at the chemical tagger page. Here the data is blending with the semantics, and the two are not easily separated. So, when such separation is not easily achieved, should the specific processes required by RDM as illustrated in the five bullet points below actually be followed?
-
Specify a data management plan to be followed, as for example points 2-5 below.
-
Decide upon a location for your data, separated into one for "live" or working data (the purpose simply being to ensure it is properly backed up) and the other for a sub-set of formally "published data" which has to be available for at least ten years after its publication.
-
Use 2 to gather metadata (see 6-14 below) and in return get a DOI representing the location of the combined metadata + data, from a suitable registration authority such as DataCite.
-
Quote this DOI(s) in the article describing the results of analysing the data and presenting hypotheses, and conversely once the article itself is allocated its own DOI from a registration authority such as CrossRef, update the metadata in item 3 so as to achieve a bidirectional link between the data and its narrative (and we assume that DataCite and CrossRef will also increasingly exchange the metadata they each hold about the items).
-
Add both the data and the article DOIs to any institutional CRIS or current research information system (parenthetically, I regard this last stipulation as rather redundant if items 3 and 4 are working effectively, but its a good interim measure whilst the overall system matures).
So, should step 2 be included if the data itself is inextricably intertwined with the narrative and cannot be separated? The slightly surprising advice I would suggest is yes! And the answer is that it IS possible to generate metadata (data about the, possibly entwined, data) which CAN be processed in such a step. What forms would such metadata take?
-
Identification of the researcher(s) involved. This would nowadays take the form of an ORCID (Open Research and Collaborator Identifier).
-
Identification of the hosting institution where the data has been produced. There is currently no equivalent to an ORCID for institutions, but it is very likely to come in the future.
-
A date stamp formalising when the (meta)data is actually deposited.
-
A title for the project being described. Here we see a blurring between the narrative/article and the data; a title is the shortest possible description of the narrative/article, and it may also apply to the data object(s) or it could have its own title.
-
A slightly fuller abstract of the project being described. Here we see further blurring between the narrative and data objects.
-
One can include "related identifiers", in particular the DOIs of any other relevant articles that might have been published which may expand the context of the data, and also the DOIs of any other relevant datasets which may have been allocated in step 2 above.
-
It is also beneficial to include "chemical identifiers". These can take the form of InChI strings and InChI keys, which allow discretely defined molecular objects which were the object of the research to be tracked and which relate to both the narrative and any other data objects.
-
If specific software has been used to analyse data, it too can be included as a "related identifier" (e.g. [1]
-
Potentially at least, if a well-defined instrument has been involved, it too could be included with its own "related identifier". With both 13-14, other issues may need addressing, such as versioning etc, but this no doubt will be sorted in due course.
-
etc.
So items such as 6-14 can be collected and sent to e.g. DataCite with a DOI received in return as part of item 2 of the RDM processes. No "pure" data need be involved, only metadata. Nonetheless such metadata can only increase the visibility and discoverability of the research, as illustrated in how such metadata can be searched for the components described above.
References
- H.S. Rzepa, "KINISOT. A basic program to calculate kinetic isotope effects using normal coordinate analysis of transition state and reactants.", 2015. https://doi.org/10.5281/zenodo.19272
Tags:Academic publishing, chemical identifiers, chemical names and chemical terms, chemical tagger page, CrossRef, Data management, Data management plan, DataCite, Identifiers, ORCiD, RDM, researcher, Royal Society, Singular spectrum analysis, Technical communication, Technology/Internet
Posted in Chemical IT | No Comments »
Sunday, April 17th, 2016
I want to describe a recent attempt by a group of collaborators to share the research data associated with their just published article.[1]
I am here introducing things in a hierarchical form (i.e. not necessarily the serial order in which actions were taken).
-
The data repository selected for the data sharing is described by (m3data) doi: 10.17616/R3K64N[2]
-
A collaborative project collection was established on this repository (doi: 10.14469/hpc/244[3]). This data collection has some of the following attributes:
-
Its metadata is sent here: https://search.datacite.org/ui?&q=10.14469/hpc/244 where it can be queried for other details.
-
The project collaborators are all identified by their ORCID, used to obtain further individual information about the researchers. This information is also propagated to the metadata sent to DataCite.
-
In the section labelled associated DOIs there is a link to the recently published peer-reviewed article, which itself cites the data via doi: 10.14469/hpc/244 and which thus establishes a bidirectional link between the article and its data.
-
Also in the associated DOIs section are other DOIs (to two figures and two tables) held in a separate location. One example: doi: 10.14469/hpc/332[4]) which illustrates the original type of data sharing we started about 10 years ago. This form has been variously called a "WEO" or Web-enhanced object (by the ACS) or interactivity boxes (RSC, etc). In such WEOs, we wrap the data into an interactive visual appearance using Jmol or JSmol software. The data itself is directly available to the reader using the Jmol export functions (right mouse click in the visual window).
-
In this specific example the WEO has been assigned its DOI using the repository noted above.[2]
-
We have in the past also used Figshare[5]) for this purpose, see e.g. 10.6084/m9.figshare.1181739‡
-
The WEO itself can itself reference a more complete set of data used to create the visual appearance, for example data that allows the wavefunction of the molecule to be computed, doi: 10.6084/m9.figshare.2581987.v1[6] In this instance this is held on the Figshare[5] repository.
-
The collection has another section labelled Members. These are individual datasets associated with the collection and held on the SAME repository as the collection itself. In this case, there are five such members, two of which are listed below:
-
10.14469/hpc/281[7] contains a variety of other data such as outputs from an IRC (intrinsic reaction coordinate), energy profile diagrams and ZIP archives of other calculations.
-
10.14469/hpc/272[8] itself contains five members, one of which is e.g.
-
10.14469/hpc/267[9] which contains a ZIP archive with NMR data (see here for how this might be packaged in the future) and a file for a GPC (chromatography) instrument.
-
This last item also contains a new section labelled Metadata, which includes e.g. the InChI key and InChI string for the molecule whose properties are reported.
If this mode of presenting data seems a little more complex than a single monolithic PDF file, its because its designed for:
-
collaboration between scientists, potentially at different locations and institutions.
-
attribution of provenance/credit for the individual items (via ORCID).
-
separate date stamping by the various contributors.
-
providing bi-directional links between data and publications.
-
holding what we call FAIR (findable, accessible, interoperable and reusable) data, rather than just data encapsulated in a PDF file.
-
Collecting, storing and sending metadata for aggregation in a formal way, i.e. to DataCite using a formal schema to render the metadata properly searchable.
Thus 10.14469/hpc/244 represents our most complex attempt yet at such collaborative FAIR data sharing with multiple contributors. The tools for packaging many of the datasets are still quite limited (see again here) and the design is still being optimised (call it α). When the repository[2] has been more extensively tested, we intend to make it available as open source for others to experiment with. And of course, when this happens the source code too will have its own DOI!
‡A refactoring of the Figshare site in December 2015 meant that the DOI no longer points directly to the WEO, and you have to follow a manually inserted link on that page to see it.
References
- C. Romain, Y. Zhu, P. Dingwall, S. Paul, H.S. Rzepa, A. Buchard, and C.K. Williams, "Chemoselective Polymerizations from Mixtures of Epoxide, Lactone, Anhydride, and Carbon Dioxide", Journal of the American Chemical Society, vol. 138, pp. 4120-4131, 2016. https://doi.org/10.1021/jacs.5b13070
- Re3data.Org., "Imperial College Research Computing Service Data Repository", 2016. https://doi.org/10.17616/r3k64n
- C. ROMAIN, "Chemo-Selective Polymerizations Using Mixtures of Epoxide, Lactone, Anhydride and CO2", 2016. https://doi.org/10.14469/hpc/244
- H. Rzepa, "Table S8: Comparison of two different basis sets for selected intermediates for CHO/PA ROCOP.", 2016. https://doi.org/10.14469/hpc/332
- Re3data.Org., "figshare", 2012. https://doi.org/10.17616/r3pk5r
- P. Dingwall, "Gaussian Job Archive for C6H10O", 2016. https://doi.org/10.6084/m9.figshare.2581987.v1
- C. ROMAIN, "Figure 9, Figure S18, Figure S19: ROCOP of PA/CHO + IRC", 2016. https://doi.org/10.14469/hpc/281
- C. ROMAIN, "Table 1 : Polymerizations Using Lactone, Epoxide, and CO2", 2016. https://doi.org/10.14469/hpc/272
- C. ROMAIN, "Table 1, entry 1 : Polymerizations Using Lactone, Epoxide, and CO2", 2016. https://doi.org/10.14469/hpc/267
Tags:10.17616, Academic publishing, DataCite, energy profile diagrams, Figshare, Identifiers, Open science, ORCiD, PDF, Scholarly communication, Technical communication, Technology/Internet, Web-enhanced object
Posted in Chemical IT | No Comments »
Saturday, April 16th, 2016
Scientists are familiar with the term data, at least in a scientific or chemical context, but appreciating metadata (meaning "after", or "beyond") is slightly more subtle, in the sense of using it to mean data about data. The challenge lies in clarifying where the boundary between data and its metadata lies and in specifying and controlling the vocabulary used for these metadata descriptions. Items in a chemical metadata dictionary might include e.g. subject classifications such as Organic Molecular Chemistry or identifiers such as InChIkey. But what could metametadata be? Here I briefly show some examples by way of illustration.
Let me start by defining a data repository as a store of both data and the metadata describing it. The metadata is to be exposed in a standard manner which allows it to be aggregated by other agencies. Nowdays, it is becoming common to identify such a data object together with its metadata using a persistent identifier, or DOI. But to decide if any particular repository and the data objects contained therein is generally useful to you, you need information about the metadata itself. Technically, this is defined using a schema[1] describing the metadata (which might e.g. identify any dictionaries used); hence metametadata. Now you need to store the metametadata and so I introduce the concept of a registry which does this. This metametadata object is itself assigned a DOI‡ and here I list these DOIs for a personal selection of some chemically oriented examples, in this case deriving from the largest registry of research data repositories re3data.org. You can search for your own entry at their site: http://service.re3data.org/search.
Not all of the repositories listed in the table above assign formal DOIs to their data collections, meaning that the metadata for their entries cannot be aggregated in a searchable manner using e.g. search.datacite.org/ui (or search.datacite.org/api for the machine version). Currently, the metametadata does not fully carry this information, an aspect which I gather will be rectified in a future revision of the re3data schema.[1]
Importantly, both metadata and (repository) metametadata can be searched using APIs (application programmer interface), ensuring that the entire flow of meta information can be subject to automated software analysis rather than just visual inspections by a human.This should allow a rich and open infrastructure for handling research objects or data to be built up using hierarchical metadata. The examples above indeed show that the chemical space is already the largest component of the Natural Sciences space.
Although the edifice is still largely in its infancy, already I think we can start to see an alternative open approach emerging to "Googling" for data, or the even older traditional bespoke (i.e. non-open) services offered by commercial human-based abstractors of chemical metadata.
‡This DOI is information about the metametadata, and hence it is metametametadata, or m3data. Sorry! ♣The citations at the foot of this post are generated entirely automatically (by a WordPress plugin called Kcite) from the m3data associated with each entry, i.e. the DOI listed. Were the persistent identifier for the entry ever to be changed, this would propagate automatically to the citation, unlike the static entries in the table.
References
- J. Rücknagel, P. Vierkant, R. Ulrich, G. Kloska, E. Schnepf, D. Fichtmüller, E. Reuter, A. Semrau, M. Kindling, H. Pampel, M. Witt, F. Fritze, S. Van De Sandt, J. Klump, H. Goebelbecker, M. Skarupianski, R. Bertelmann, P. Schirmbacher, F. Scholze, C. Kramer, C. Fuchs, S. Spier, and A. Kirchhoff, "Metadata Schema for the Description of Research Data Repositories", 2015. https://doi.org/10.2312/re3.008
- Re3data.Org., "figshare", 2012. https://doi.org/10.17616/r3pk5r
- Re3data.Org., "Zenodo", 2013. https://doi.org/10.17616/r3qp53
- Re3data.Org., "The Cambridge Structural Database", 2013. https://doi.org/10.17616/r36011
- Re3data.Org., "Crystallography Open Database", 2013. https://doi.org/10.17616/r37s31
- Re3data.Org., "Oxford University Research Archive", 2014. https://doi.org/10.17616/r3q056
- Re3data.Org., "ONSchallenge", 2013. https://doi.org/10.17616/r3859d
- Re3data.Org., "UsefulChem", 2014. https://doi.org/10.17616/r3z89n
- Re3data.Org., "chemotion", 2013. https://doi.org/10.17616/r34p5t
- Re3data.Org., "ChemSpider", 2013. https://doi.org/10.17616/r38p4p
- Re3data.Org., "Chemical Database Service", 2012. https://doi.org/10.17616/r36p42
- https://doi.org/
- H. Rzepa, "Imperial College High Performance Computing Service Data Repository Metadata Schema", 2016. https://doi.org/10.14469/hpc/382
- J. Downing, P. Murray-Rust, A.P. Tonge, P. Morgan, H.S. Rzepa, F. Cotterill, N. Day, and M.J. Harvey, "SPECTRa: The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories", Journal of Chemical Information and Modeling, vol. 48, pp. 1571-1581, 2008. https://doi.org/10.1021/ci7004737
- Re3data.Org., "SPECTRa Project", 2013. https://doi.org/10.17616/r30316
Tags:Academic publishing, automated software analysis, BASE, chemical context, Chemical Database Service, chemical metadata, chemical metadata dictionary, chemical space, City: Cambridge, Data dictionary, Data management, Identifiers, Knowledge representation, programmer, Registry of Research Data Repositories, search.datacite.org/api, SPECTRa, Technology/Internet
Posted in Chemical IT | No Comments »