Posts Tagged ‘API’

A nice example of open data (in London).

Sunday, March 5th, 2017

Living in London, travelling using public transport is often the best way to get around. Before setting out on a journey one checks the status of the network. Doing so today I came across this page: our open data from Transport for London. 

  1. I learnt that by making TfL travel data openly available, some 11,000 developers (sic!) have registered for access, out of which some 600 travel apps have emerged.
  2. The data is in XML, which makes it readily inter-operable.[1]
  3. This encourages crowd-sourced innovation.
  4. They have taken the trouble to produce an API (application programming interface) which allows rich access to the data and to information about e.g. AccidentStats, AirQuality, BikePoint, Journey, Line, Mode, Occupancy, Place, Road, Search, StopPoint and Vehicle.
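As a sketch of what such an API makes possible: the snippet below builds a line-status query URL following TfL's published category names, and parses a much-simplified, invented response of the kind such an API returns (treat the exact paths and response fields as assumptions; consult the TfL API documentation before relying on them).

```python
import json
from urllib.parse import urljoin

# Base URL and endpoint path follow TfL's published category names
# (Line, Status); verify against the current documentation before use.
TFL_BASE = "https://api.tfl.gov.uk/"

def line_status_url(line_id):
    """Build the URL for a line-status query."""
    return urljoin(TFL_BASE, f"Line/{line_id}/Status")

# Parsing a simplified, invented response shape:
sample_response = json.loads("""
[{"name": "Victoria",
  "lineStatuses": [{"statusSeverityDescription": "Good Service"}]}]
""")

for line in sample_response:
    for status in line["lineStatuses"]:
        print(line["name"], "-", status["statusSeverityDescription"])
```

The point is less the specific endpoint than that a documented, uniform URL scheme lets third-party developers write such code at all.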

Chemists could learn some lessons here! Of course, there are quite a few chemical databases with APIs that are examples of open data, but the “ESI” (electronic supporting information) sources which almost all published articles rely upon to disseminate data are clearly struggling to cope. Take for example this recent article[2], where much of the data has been dropped into the inevitable PDF “coffin”, which runs to a breathtaking 907 pages. To give the authors their due, they also provide 20 CIF files which ARE good sources of data. Rarely commented on, but clearly missing from the information associated with this (indeed almost any) article, is the metadata about the data. Thus the metadata for these CIF files amounts to just a compound number, e.g. 229. To find out the context, one has to scour the article (or the 907 pages of the ESI) to identify compound 229 (I strongly suspect it’s a molecule because of the implied semantics of the term, not because it’s been explicitly declared). You will not find the metadata at e.g. data.datacite.org, an open aggregator and global search engine based on deposited metadata.
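The sparseness of that metadata can be made concrete. A CIF file's only top-level identifier is its data-block name; the sketch below (the CIF fragment is invented for illustration) extracts it, showing just how little context travels with the file.

```python
# A minimal, invented CIF fragment: the data-block name ("229") is
# effectively the only metadata identifying the compound.
cif_text = """\
data_229
_cell_length_a 10.123
_cell_length_b 11.456
"""

def data_block_name(cif):
    """Return the name of the first data_ block in a CIF string."""
    for line in cif.splitlines():
        if line.startswith("data_"):
            return line[len("data_"):]
    return None

print(data_block_name(cif_text))  # just "229": no formula, no name, no context
```

Nothing here tells a machine (or a human) what "229" is; that is precisely the gap that registered, deposited metadata would fill.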

I have commented elsewhere on this blog that other types of data could also be enhanced in the manner that CIF crystallographic files exemplify. For example the Mpublish NMR project, examples of which are shown here, and for which typical data AND its metadata can be seen at DOI: 10.14469/hpc/1053. I fancy that if this method had been adopted,[2] those 907 pages might have shrunk somewhat, although of course not entirely. But my hope is that gradually the innovative chemistry community will find ways of exhuming more and more data from the PDF coffin, and in the process reducing the paginated lengths of the PDF-based ESI further, perchance eventually even to zero?

If you are yourself preparing an article and sweating over the ESI at this very moment, do please take a look at the Mpublish method and how perhaps it can help make your NMR data at least more useful to others.


I understand an article describing this project is in preparation. If you cannot wait, this recent application of the Mpublish project has some details.[3]

References

  1. P. Murray-Rust, and H.S. Rzepa, "Chemical Markup, XML, and the Worldwide Web. 1. Basic Principles", Journal of Chemical Information and Computer Sciences, vol. 39, pp. 928-942, 1999. https://doi.org/10.1021/ci990052b
  2. J.M. Lopchuk, K. Fjelbye, Y. Kawamata, L.R. Malins, C. Pan, R. Gianatassio, J. Wang, L. Prieto, J. Bradow, T.A. Brandt, M.R. Collins, J. Elleraas, J. Ewanicki, W. Farrell, O.O. Fadeyi, G.M. Gallego, J.J. Mousseau, R. Oliver, N.W. Sach, J.K. Smith, J.E. Spangler, H. Zhu, J. Zhu, and P.S. Baran, "Strain-Release Heteroatom Functionalization: Development, Scope, and Stereospecificity", Journal of the American Chemical Society, vol. 139, pp. 3209-3226, 2017. https://doi.org/10.1021/jacs.6b13229
  3. M.J. Harvey, A. McLean, and H.S. Rzepa, "A metadata-driven approach to data repository design", Journal of Cheminformatics, vol. 9, 2017. https://doi.org/10.1186/s13321-017-0190-6

Goldilocks Data.

Wednesday, April 8th, 2015

Last August, I wrote about data galore, the archival of data for 133,885 (134 kilo) molecules into a repository, together with an associated data descriptor[1] published in the new journal Scientific Data. Since six months is a long time in the rapidly evolving field of RDM, or research data management, I offer an update in the form of some new observations.

Firstly, 131 kilo molecules are now offered in a new, different form at http://gdb.koitz.info/gdbrowse/, and it is worth comparing the differences between the presentation of the two sets of otherwise identical data.

  1. The original archive had a single assigned DOI[2] from where you could download a ZIP file to be unpacked and navigated on your own computer. The exposed metadata for the deposition (by which I mean in this case, metadata registered with DataCite, the registration authority used by Figshare) was limited to general information about the 133,885 molecules such as the authorship and license. The granularity is coarse, not extending to descriptions of individual molecules.
  2. The new version forgoes the ZIP archive, replacing it with a proper database (based on MongoDB) containing information about 130,832 molecules. This allows one to search the data at the individual molecule level (formula, InChI descriptor, mass, etc.) using the tools provided. To the end-user, this is much more useful; the data is both discoverable and re-usable.

There is no overlap between these two presentations of the data. There also appears to be no API (application programming interface) which might allow one to write code to construct one’s own searches. The apparent absence of an API also means that really only a human navigating the set menus can discover and re-use that data; the data might not be mineable by a machine, for example. The absence of an API is not that unusual; only some of the best-known molecular databases offer one, the RCSB Protein Data Bank being a good example. More significantly, each instance of such a molecule-based database is likely to have its own way of accessing the data, and even if a documented API were available, one would still have to write specific code for each such resource.

So the first bowl contains what I suggest is cold porridge, and the second is perhaps equivalent to a table d’hôte menu. Does Goldilocks have a third option? I would argue yes; she could have:

  1. We recently published data for 158 kilo molecules[3] for which each molecule carries its own metadata. That metadata can be queried using any search engine that supports the basic metadata standards:
    http://search.datacite.org/ui?q=has_media:true&fq=prefix:10.14469
    is an example. Or armed with the metadata schema, one could also write one’s own search engine and in theory at least, that code should serve to query ANY repository that supports these standards.

You could argue that all that has happened is one has simply replaced a specific database API (if it exists) with a specific metadata schema. But these metadata schemas are controlled standards, the components of which should be self-describing (and one can see the schema components by invoking the link above).
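A sketch of what such a standards-based query looks like in code, here against the DataCite REST API (the /dois endpoint and prefix filtering are documented, but treat the exact parameter names as assumptions to be checked against the current API reference):

```python
from urllib.parse import urlencode

# DataCite's REST API exposes registered DOI metadata; the /dois
# endpoint and prefix filter follow its documentation, but verify
# current parameter names before use.
DATACITE = "https://api.datacite.org/dois"

def metadata_query(prefix, text=None):
    """URL querying all DOIs under a registrant prefix, optionally
    filtered by a free-text search over the deposited metadata."""
    params = {"prefix": prefix}
    if text:
        params["query"] = text
    return DATACITE + "?" + urlencode(params)

print(metadata_query("10.14469", text="NMR"))
```

Because the metadata schema is a controlled standard, this one piece of code can in principle query any repository whose records are registered with DataCite, which is exactly the advantage claimed for the third bowl of porridge.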

As the archival of data (RDM) becomes increasingly important, communities will have to start making decisions about which flavour of data-porridge to offer Goldilocks. For molecular data at least, I suggest the third option is highly desirable and perhaps likely to be the most persistent. Parochial databases very much depend on a specialised team of people to maintain them in perpetuity, which I gather now means 20 years. At the very least, we should start to have a debate about how the future will evolve. Let us not leave this debate merely in the hands of a small number of large organisations that are likely to make decisions based on their own business models. After all, it starts off at least as our data, not theirs! Arguably, we as authors have now largely lost control over how our stories (journal articles) are managed; let us not cede the same for data.

References

  1. R. Ramakrishnan, P.O. Dral, M. Rupp, and O.A. von Lilienfeld, "Quantum chemistry structures and properties of 134 kilo molecules", Scientific Data, vol. 1, 2014. https://doi.org/10.1038/sdata.2014.22
  2. R. Ramakrishnan, P.O. Dral, M. Rupp, and O.A. von Lilienfeld, "Quantum chemistry structures and properties of 134 kilo molecules", 2014. https://doi.org/10.6084/m9.figshare.978904
  3. Y. Zhang, H.S. Rzepa, J.J.P. Stewart, P. Murray-Rust, M.J. Harvey, N. Mason, A. McLean, and Imperial College High Performance Computing Service, "Revised Cambridge NCI database", 2014. https://doi.org/10.14469/ch/2

Disambiguation/provenance of claimed scientific opinion and research.

Monday, May 5th, 2014

My name is displayed pretty prominently on this blog, but it is not always easy to find out who the real person is behind many a blog. In science, I am troubled by such anonymity. Well, a new era is about to hit us. When you come across an Internet resource, or an opinion/review of some scientific topic, I argue here that you should immediately ask: “what is its provenance?”

In the 350-year history of scientific dissemination[1], provenance has almost always been provided by publishers. Arguably, that was their most important role (along with arranging anonymous peer review). Not that they ever met their authors, or always established that a real person or a real group actually existed! But with the explosion of vanity publication, and a host of horror stories about articles for sale to authors keen to have a publication to their name, perhaps the role of provenance needs rethinking.

ORCiD is a project that seems to be gaining serious momentum in providing a mechanism for disambiguation and provenance of researchers. Thus Brian Kelly (who has played an important role in the modern internet in the UK since 1993 or earlier) encourages all researchers to sign up (although I cannot help noting, rather cheekily, that he does not add his own ORCiD as provenance for his blog). ResearcherID was in fact an earlier service of this kind, but it is run by a commercial publisher and hosted at a “.com“. ORCiD at least claims to be an open (.org)anisation, and carries an open source license. Some UK universities (home to some researchers) have decided to sign up to ORCiD, and most, I suspect, are planning to deploy these resources amongst their researchers, and quite possibly their students as well (postgraduate initially, maybe even undergraduate eventually).

I jumped the gun somewhat, getting mine more than a year ago. Better the devil you know, etc etc! It is orcid.org/0000-0002-8635-8390. What happened next? Well, I publish data@Figshare, who themselves signed up to be an early member of ORCiD. This gives them access to the API (application programming interface), and so by supplying my ORCiD to Figshare, I gain access by proxy to the ORCiD features on offer. The most immediate impact is that ORCiD lists all the data-objects I have published at Figshare, thus establishing a trust between them and my ORCiD identity. Mind you, no-one at ORCiD has ever met me, or checked on who I am. I think that task is going to be delegated eventually to e.g. my university (I am not absolutely certain how the linkage between my ORCiD and my employer, who clearly know me since they pay my salary, will be formalised). Because my employer has also now become an ORCiD member, we will be adding ORCiD API access to our own SPECTRa-DSpace data repository shortly, so that the data held there will also be added to my ORCiD lists.
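That works listing can also be read back programmatically via ORCiD's public API. The sketch below builds the relevant URL (the /works path follows the public v3.0 API documentation) and parses a much-simplified, invented response shape; treat both as assumptions to be checked against the current API reference.

```python
import json

# ORCiD's public API exposes a researcher's works list as JSON;
# the v3.0 path below follows the public documentation, but verify it.
ORCID_API = "https://pub.orcid.org/v3.0"

def works_url(orcid_id):
    """URL for the public works listing of an ORCiD identity."""
    return f"{ORCID_API}/{orcid_id}/works"

# Parsing a simplified, invented response shape:
sample = json.loads("""
{"group": [{"work-summary": [{"title": {"title": {"value": "A data set"}}}]}]}
""")

titles = [s["title"]["title"]["value"]
          for g in sample["group"] for s in g["work-summary"]]
print(titles)
```

This kind of machine readability is what makes the trust linkage useful: any third party (a repository, a funder, an aggregator) can verify what an identity has published without a human clicking through profile pages.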

And as the major journal publishers start to do the same, a formal linkage between my identity (perhaps as verified by my employer), journal-published articles (narratives) and my data publications (via the identifiers known as DOIs) will come into being.

How, you might reasonably ask, is this in the least useful? In truth, I am not sure anyone really knows exactly where this is heading. For example, impactstory.org/about is one added-value site which attempts to gather altmetrics about the impact your research is having. But hey, although the preceding link tells you who founded this organisation, you do not get the kind of provenance I am describing above; none of the founders cite their ORCiDs! You do get their @Twitter accounts though; I wonder what that tells us about the modern interpretation of provenance? Well, my impact can be seen here; in truth it’s not quite the impact I imagined my scientific career was having, but I suppose these are early days. What I am pleased to tell you is that ImpactStory tells you not only about the impact the articles I have published have had, but also the data. Two data sets are described as both discussed and highly viewed. Although, as usual, you do not get to learn why the data is being discussed!

Where next? Well, to go back to the start of this post; blogs. It would be nice to formally link this blog to my ORCiD ID (this is not done simply by quoting it here, but via the ORCiD API). If/when I work out how to do this, I will no doubt post the event!

References

  1. H. Oldenburg, "Epistle dedicatory", Philosophical Transactions of the Royal Society of London, vol. 1, pp. i-ii, 1665. https://doi.org/10.1098/rstl.1665.0001

Digital repositories. An update.

Saturday, July 21st, 2012

I blogged about this two years ago and thought a brief update might be in order now. To support the discussions here, I often perform calculations, and most of these are then deposited into a DSpace digital repository, along with metadata. Anyone wishing to have the full details of any calculation can retrieve these from the repository. Now in 2012, such repositories are more important than ever. 

In the UK, the main funding organisations are increasingly requiring researchers to deposit their primary data in such open archives, and some disciplines are better than others at this (chemistry, however, does not in general rank very highly in terms of deposition of data). Our DSpace server is a local one running at Imperial College, but a few months back I became aware of Figshare, which aspires to operate on a much wider and more general scale. So I have injected one of the calculations reported in another post (the IRC for the sodium tolyl thiolate reaction with dichlorobutenone) into Figshare, making use of the API which has recently been developed for this purpose and implemented by Matt Harvey. As with DSpace, it issues a DOI, which can then be quoted wherever appropriate (and particularly in scientific articles). This particular deposition is 10.6084/m9.figshare.93096

This repository is still undergoing a lot of development, but already one can see many interesting features, such as export to Endnote or Mendeley, and a QR barcode for devices with cameras. I would encourage anyone who regularly generates e.g. computational chemistry data, or knows a group that does, to encourage them to make use of such facilities.

Postscript: If you have a look at this deposition in Figshare you may already notice some of the developments I note above. Matt Harvey (who, with Mark Hahnel of Figshare, developed our publish script) has added to the entry:

* A data descriptor document URL

* Wikipedia and PubChem links (automatically resolved from InChI/InChIKey searches)

* Links to ChemSpider searches

* Links to all other objects in the SPECTRa DSpace repository with a common InChI/InChIKey

The blog post as a scientific article: citation management

Monday, February 27th, 2012

Sometimes, as a break from describing chemistry, I take to describing the (chemical/scientific) creations behind the (WordPress) blog system. It is fascinating how there seem to be increasing signs of convergence between the blog post and the journal article. Perhaps prompted by the transclusion of tools such as Jmol and LaTeX into wikis and blogs, I list the following interesting developments in both genres.

  1. Improved equation display for Chemistry Central articles using MathJax. This is a way of rendering equations in the pages of both a blog and a journal article. This blog is now so empowered, although in fact I employ few equations on these pages.
  2. Citation management and metadata gathering. This blog plugin takes the form of a numbered citation[1], as here, which converts the specified DOI to a listing at the bottom of the post in the manner of a conventional scientific article (conventional document citation managers such as EndNote do this as well). It is actually much more than that, since the plugin automatically uses the CrossRef API to retrieve metadata for the quoted Digital Object Identifier (DOI), thus enhancing the metadata associated with the post and its discoverability. Dublin Core is already present in the post, as is FOAF output, and I occasionally trawl using the Calais archive tagger (although this is not very good at finding chemistry tags).
  3. I installed Chemicalize a year or so ago. This scans the blog text for chemical terms, and adds a hover/popup image of structures it identifies (it is also responsible for the occasional doubled Gravatar image you may see here! Apologies!).
  4. I noted the addition of ChemDoodle to this blog previously. There may be newcomers to this type of non-Java-based molecular rendering which I need to track down.
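The metadata retrieval in item 2 above can be sketched in a few lines. The CrossRef REST API serves work metadata at /works/&lt;DOI&gt; (a documented endpoint, though the response below is a simplified, invented shape; check the API reference for the full record structure):

```python
import json
from urllib.parse import quote

# CrossRef's REST API returns JSON metadata for a registered DOI.
CROSSREF = "https://api.crossref.org/works/"

def metadata_url(doi):
    """URL for the CrossRef metadata record of a DOI."""
    return CROSSREF + quote(doi, safe="/")

# Extracting citation fields from a simplified, invented response:
sample = json.loads("""
{"message": {"title": ["The past, present and future of Scientific discourse"],
             "container-title": ["Journal of Cheminformatics"]}}
""")

msg = sample["message"]
print(msg["title"][0], "-", msg["container-title"][0])
```

A citation plugin needs little more than this: given the DOI an author quotes, it can fetch the title, journal, authors and year, and format the reference list automatically.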

So you can see that building a chemical/science-savvy blog can be great fun! It is also significant that science/chemistry publishers are starting to do this. I bring only one example to your attention, although this introduces a host of other issues that perhaps I should leave for another post.

References

  1. H.S. Rzepa, "The past, present and future of Scientific discourse", Journal of Cheminformatics, vol. 3, 2011. https://doi.org/10.1186/1758-2946-3-46