Archive for the ‘Chemical IT’ Category

A comparison of searches based on metadata records from three (update: five) research repositories.

Tuesday, September 28th, 2021

In the previous blog post, I looked at the metadata records registered with DataCite for some chemical computational modelling files as published in three different repositories. Here I take it one stage further, by looking at how searches of the DataCite metadata store for three particular values of the metadata associated with this dataset compare.

Search 1: The metadata value of -1705.490787 is actually the Gibbs free energy computed for the molecule associated with the data set, a molecule which featured in this blog post. https://commons.datacite.org/?query=*\-170* is an un-fielded search for the truncated string -170* (where * is a wild card character and \ is said to “escape” the minus sign, since on its own a minus can also indicate a Boolean NOT operator), resulting in 70,918 works matching the query. From what we know about the dataset in question, the vast majority of these are false positives. How can we reduce them?

Search 1a: https://commons.datacite.org/?query=subjects.subject:\-170* is a fielded search, specifying that the string must occur in the subject field (62 works) but this still has 57 false positives.

Search 1b: https://commons.datacite.org/?query=subjects.subject:\-1705.490787* (in fact a precision of -1705.4* is also sufficient) removes all the false positives (5 works). But are there any false negatives? In fact, for other reasons, we know that there are two works in the Figshare repository where the value of -1705.490787 appears in the keyword items on the landing page of e.g. 10.6084/m9.figshare.16685497 and is indexed and searchable locally, but does not appear in the registered metadata and hence is not included in the results of the above searches.
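The three variants of Search 1 differ only in whether a field is named and how much of the value is kept. A minimal Python sketch of how such URLs could be assembled is shown here; the helper names `fielded_query` and `search_url` are illustrative and not part of any DataCite tooling, and the escaping rule (a backslash before every minus sign) simply mirrors the URLs quoted above.

```python
from urllib.parse import quote

COMMONS = "https://commons.datacite.org/"


def fielded_query(term, field=None):
    """Return a query string, escaping minus signs so they are not read as NOT."""
    escaped = term.replace("-", "\\-")
    return f"{field}:{escaped}" if field else escaped


def search_url(query):
    """Build a search URL, leaving wildcards, colons and backslashes readable."""
    return COMMONS + "?query=" + quote(query, safe="*:\\")


search_1 = search_url(fielded_query("*-170*"))                        # un-fielded
search_1a = search_url(fielded_query("-170*", "subjects.subject"))    # fielded
search_1b = search_url(fielded_query("-1705.490787*", "subjects.subject"))

for url in (search_1, search_1a, search_1b):
    print(url)
```

The blanket replacement of minus signs is only appropriate for numeric values like this one; the InChI key searches further on leave their hyphens unescaped.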

Search 2: A further, formally much stronger constraint on the search is https://commons.datacite.org/?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:\-1705.490787* whereby a subjectScheme is added to search 1b, constrained to the value Gibbs_Energy. This now returns 3 works, two fewer than search 1b. There are two further false negatives because, as noted previously, the subjectScheme term is not defined in the Zenodo repository metadata record, where the missing two items are located.

Search 2a: https://commons.datacite.org/?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:*1705.490787*+AND+subjects.schemeUri:*goldbook* is even further constrained to specify a Gibbs_Energy according to the IUPAC Gold Book definition.

Search 2b: https://commons.datacite.org/?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:*1705.490787*+AND+subjects.schemeUri:*goldbook*+AND+subjects.valueUri:*gaussian* is the highest level of constraint, implying not only that the term Gibbs_Energy is specified by the IUPAC Gold Book definition, but that its value is that determined by (in this example) the Gaussian implementation.
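Searches 2a and 2b are built by adding one more field:value clause at a time, joined by AND. The sketch below, with the hypothetical helper name `and_query`, reproduces that ladder of constraints; the field names and values are taken from the URLs above, and nothing is sent to DataCite.

```python
COMMONS_QUERY = "https://commons.datacite.org/?query="

# Each clause narrows the search further; the order follows Searches 2a and 2b.
CLAUSES = [
    ("subjects.subjectScheme", "Gibbs_Energy"),
    ("subjects.subject", "*1705.490787*"),
    ("subjects.schemeUri", "*goldbook*"),
    ("subjects.valueUri", "*gaussian*"),
]


def and_query(clauses):
    """Join (field, value) pairs into a single Boolean AND query string."""
    return "+AND+".join(f"{field}:{value}" for field, value in clauses)


search_2a = COMMONS_QUERY + and_query(CLAUSES[:3])  # scheme + value + Gold Book
search_2b = COMMONS_QUERY + and_query(CLAUSES)      # ... + Gaussian valueUri

print(search_2a)
print(search_2b)
```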

So to summarise what we have thus far established, we can successfully eliminate false positives by specifying a fielded search with a requirement that the field specifically relates to Gibbs_Energy. But because of omissions in the metadata records, we also have four false negatives resulting from doing this.
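The bookkeeping behind this summary can be written out explicitly. The sketch below assumes that seven works are genuinely relevant (the five returned by Search 1b plus the two Figshare works whose keywords never reach the registered metadata); the class name `SearchOutcome` is purely illustrative.

```python
from dataclasses import dataclass

# Assumption: seven relevant works in total, i.e. the five found by Search 1b
# plus the two Figshare works that are missing from the DataCite metadata.
RELEVANT_TOTAL = 7


@dataclass(frozen=True)
class SearchOutcome:
    name: str
    returned: int        # number of works reported by the search
    true_positives: int  # returned works that really belong to the dataset

    @property
    def false_positives(self) -> int:
        return self.returned - self.true_positives

    @property
    def false_negatives(self) -> int:
        return RELEVANT_TOTAL - self.true_positives


OUTCOMES = [
    SearchOutcome("1a", returned=62, true_positives=5),
    SearchOutcome("1b", returned=5, true_positives=5),
    SearchOutcome("2", returned=3, true_positives=3),
]

for o in OUTCOMES:
    print(f"Search {o.name}: {o.false_positives} false positives, "
          f"{o.false_negatives} false negatives")
```

Tightening the constraint from Search 1b to Search 2 removes no further false positives but doubles the false negatives, which is the trade-off described above.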

Search 3: https://commons.datacite.org/?query=subjects.subject:VELNVPXNOKVVTC-VJKZSTDTSA-N searches for another subject term, the InChI key for the molecule relating to the data (5 works). Here again, however, context for the string VELNVPXNOKVVTC-VJKZSTDTSA-N is missing, although again the string is long enough to ensure it is unique. But we could go one step further.

Search 4: https://commons.datacite.org/?query=subjects.subjectScheme:inchikey+AND+subjects.subject:VELNVPXNOKVVTC-VJKZSTDTSA-N constrains the subject term to only those strings describing an InChIkey (3 works). The reduction from 5 works is again due to Zenodo not specifying the subjectScheme and Figshare not even containing the InChIkey in its metadata record.

Search 4a: https://commons.datacite.org/?query=subjects.subjectScheme:inchikey+AND+subjects.schemeUri:*inchi-trust*+AND+subjects.subject:VELNVPXNOKVVTC-VJKZSTDTSA-N constrains the inchikey further by specifying the authority for the scheme definition as the InChI Trust. 

Search 5: https://commons.datacite.org/?query=subjects.subject:InChI=1S/C25H39NO9* is search 3, but on the InChI string rather than the InChI key, and with the same results as before (5 works). Here, the string is deliberately truncated to retain only the molecular formula of the molecule.

Search 5a: https://commons.datacite.org/?query=subjects.subjectScheme:inchi+AND+subjects.subject:InChI=1S/C25H39NO9* is search 4, with the subjectScheme changed to inchi and the subject truncated to only the molecular formula component of an InChI (3 works).

Search 5b: https://commons.datacite.org/?query=subjects.subject:InChI=1S/C25H39NO9/c1-6-26-20-24-13-9-12-14\(31-2\)10-23\(29,16\(13\)17\(12\)33-4\)25\(26,30* truncates much less of the InChI string, extending it to the molecular connection table. Notice how characters such as ( or ) have been escaped with a \ prefix. Such characters are used for grouping in the search query and so must be escaped to be included in the query.

Search 5c: https://commons.datacite.org/?query=subjects.subject:InChI=1S/C25H39NO9/c1-6-26-20-24-13-9-12-14\(31-2\)10-23\(29,16\(13\)17\(12\)33-4\)25\(26,30\)19\(34-5\)18\(24\)22\(11-27,21\(28\)35-20\)8-7-15\(24\)32-3\/h12-20,27,29-30H,6-11H2,1-5H3* For a string of this length (and InChI strings can get very long!) an unidentified error can occur, suggesting that the full InChI string is best not used for such searches.
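The truncation and escaping used in Searches 5 and 5b can be sketched in a few lines of Python. The function names below are illustrative only; the InChI string is the one registered for this dataset, and only parentheses are escaped, as in the URLs above.

```python
INCHI = (
    "InChI=1S/C25H39NO9/c1-6-26-20-24-13-9-12-14(31-2)10-23(29,16(13)17(12)33-4)"
    "25(26,30)19(34-5)18(24)22(11-27,21(28)35-20)8-7-15(24)32-3"
    "/h12-20,27,29-30H,6-11H2,1-5H3"
    "/t12-,13-,14+,15+,16-,17+,18-,19+,20+,22+,23-,24+,25+/m1/s1"
)


def inchi_prefix(inchi, layers):
    """Keep only the first `layers` slash-separated layers of an InChI string."""
    return "/".join(inchi.split("/")[:layers])


def escape_grouping(term):
    """Escape ( and ), which the query syntax otherwise treats as grouping."""
    return term.replace("(", "\\(").replace(")", "\\)")


def subject_query(prefix):
    """Fielded wildcard search on a (possibly truncated) subject string."""
    return "subjects.subject:" + escape_grouping(prefix) + "*"


# Search 5: version layer plus molecular formula layer only.
formula_only = subject_query(inchi_prefix(INCHI, 2))

# Search 5b: cut part-way through the connection table, as in the URL above.
partial = INCHI[: INCHI.index(")19(")]
connection_partial = subject_query(partial)

print(formula_only)
print(connection_partial)
```

Cutting at a layer boundary with `inchi_prefix` keeps the query short, which matters given the error reported for the full-length string in Search 5c.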

Search 6: 

From these experiments, we learn that the quality and completeness/richness of the metadata record is vital to ensure no false positives or negatives are returned by the search. Ensuring such metadata richness is something that a repository should do, and it is interesting that two of the best known repositories both currently have failings in this regard. I might try one or two other popular repositories to see how they behave and will report back if I find anything interesting.


Thus https://commons.datacite.org/doi.org?query=subjects.subjectScheme:*inchikey* reveals all entries that specify an InChIkey in the subject metadata (185,414 works) but https://commons.datacite.org/doi.org?query=subjects.subjectScheme:*inchikey*+AND+subjects.schemeUri:*inchi-trust* reveals that only 1748 of these further specify the InChI Trust as the authority. Two more repositories, Mendeley Data and Harvard Dataverse, have been populated with the same data. See here.


This post has DOI: 10.14469/hpc/9162

A comparison of descriptive metadata across different data repositories.

Tuesday, September 28th, 2021

The number of repositories which accept research data across a wide spectrum of disciplines is on the up. Here I report the results of an experiment in which chemical modelling data was deposited in three such repositories and the richness of the metadata describing the essential properties of the three depositions was compared.

The three repositories are as follows:

  1. Figshare as a repository dates from 2012. The computational chemistry dataset used was manually uploaded. Most of the metadata was entered manually by copy/paste operations and included three keywords which comprised the InChI key for the molecule, the corresponding InChI string and the calculated Gibbs Energy obtained from the computed vibrational frequencies.
  2. Zenodo started in 2013 and has been updated several times since then. The same data and metadata were used as for Figshare, including the same keywords, but with the difference that the upload was not manual but automated using the Zenodo API as implemented in the new computational portal described in the previous post (DOI: 10.14469/hpc/9010). Publication here was a simple button click and so is a much shorter process than that for Figshare.
  3. The original 2006 version of the Imperial College data repository was based on DSpace, and was updated to version 2 in 2016 with entirely new code. It too is populated by publication from the same portal as used for Zenodo.
  4. Mendeley data:
  5. Harvard Dataverse:

Each deposition results in the generation of a DOI, and these, together with the link that allows access to the associated metadata can be seen in the table below.

Repository | Dataset DOI | Dataset metadata
Figshare | 10.6084/m9.figshare.16685497 | XML, JSON
Zenodo | 10.5281/zenodo.5511966 | XML, JSON
Imperial College | 10.14469/hpc/9031 | XML, JSON
Harvard Dataverse | 10.7910/DVN/4BWOYK | XML, XML, Codebook
Mendeley Data | 10.17632/dgtvds3xn5.1 | XML, JSON
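For readers who prefer to retrieve these records programmatically, one possible route is DOI content negotiation, in which the DOI is resolved with an Accept header naming the DataCite XML or JSON media type. The sketch below is an assumption about how one might do this and not necessarily how the links in the table were produced; the function names are illustrative.

```python
import urllib.request

# Media types used for DataCite metadata in DOI content negotiation.
DATACITE_XML = "application/vnd.datacite.datacite+xml"
DATACITE_JSON = "application/vnd.datacite.datacite+json"


def metadata_request(doi, media_type=DATACITE_XML):
    """Build (but do not send) a content-negotiation request for a DOI."""
    return urllib.request.Request(
        "https://doi.org/" + doi, headers={"Accept": media_type}
    )


def fetch_metadata(doi, media_type=DATACITE_XML):
    """Send the request and return the metadata record as text (needs network)."""
    with urllib.request.urlopen(metadata_request(doi, media_type), timeout=30) as resp:
        return resp.read().decode("utf-8")


request = metadata_request("10.14469/hpc/9031")
print(request.full_url, request.get_header("Accept"))
# record = fetch_metadata("10.14469/hpc/9031")  # uncomment to download the XML
```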

I would note that manual deposition can be rather dependent on how fastidious the depositor is and how they interpret the descriptive keywords that Figshare and Zenodo accept. Automated deposition is a more controlled process, in which the required keywords are a property programmed into the submitting portal tool. Such a process also allows metadata to describe relationships between different datasets, such as a dataset collection, which are inherited from the project descriptor on the portal. Additionally, the automated process can then be augmented by manual editing of the metadata record, for example by adding the DOI for this descriptive post to the metadata records retrospectively. In the case of e.g. Zenodo, retrospective changes to the metadata record require a new DOI to be generated to reflect the changes.

You can inspect the results of these three depositions yourself by downloading the respective metadata records and viewing the downloaded file using a simple text or XML editor. 

  1. All three repositories contain the ORCID of the depositor, as e.g. from Figshare:
    <creator> 
    <creatorName>Rzepa, Henry S.</creatorName> 
    <givenName>Henry S.</givenName> 
    <familyName>Rzepa</familyName> 
    <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="https://orcid.org">
    https://orcid.org/0000-0002-8635-8390
    </nameIdentifier>
    </creator>

    The widespread addition of the unique ORCID researcher identifier is very welcome.

  2. The more interesting component is keyword metadata, populated manually in Figshare and using the automated API in the other two repositories.
    1. Below is the Figshare metadata entry, which displays the assigned categories (from a controlled list) in the <subject> container:
      <subjects>
          <subject>Computational Chemistry</subject>
          <subject>Organic Chemistry</subject>
          <subject subjectScheme="Fields of Science and Technology (FOS)" schemeURI="http://www.oecd.org/science/inno/38235147.pdf">FOS: Chemical sciences</subject>
        </subjects>

      The context of these keywords is clearly defined by the value of the subjectScheme (chemical sciences) but this term is very broad and does not relate very specifically to the deposited data. The more chemically specific keywords themselves are only displayed on the landing page for the entry as shown below and are not expressed in any metadata container, which means that they are not indexed and hence not searchable using the DataCite metadata store.

    2. Zenodo interprets this differently, with the keywords now included in the <subject> container.
      <subjects>
          <subject>-1705.490787</subject>
          <subject>InChI=1S/C25H39NO9/c1-6-26-20-24-13-9-12-14(31-2)10-23(29,16(13)17(12)33-4)25(26,30)19(34-5)18(24)22(11-27,21(28)35-20)8-7-15(24)32-3/h12-20,27,29-30H,6-11H2,1-5H3/t12-,13-,14+,15+,16-,17+,18-,19+,20+,22+,23-,24+,25+/m1/s1</subject>
          <subject>VELNVPXNOKVVTC-VJKZSTDTSA-N</subject>
        </subjects>

      However, you might be wondering what the keyword -1705.490787 is all about. Put simply, in this form of expression it has absolutely no context. I previously explained why it might be useful if context is added, it being a persistent identifier for (some) quantum chemical calculations in the form of a computed total energy corrected thermally into a Gibbs energy. The persistence in this case is acquired not by registration with an agency but generation by an algorithm. That algorithm in turn would require additional metadata for its specification, but that is something I will not address in this post. At any rate, because it is part of the metadata record, it is search-enabled in the Zenodo version.

    3. Imperial follows the Zenodo approach, with further addition of context:
      <subjects>
          <subject subjectScheme="Gibbs_Energy" schemeURI="https://doi.org/10.1351/goldbook.G02629" valueURI="http://gaussian.com/thermo/">-1705.490787</subject>
          <subject subjectScheme="inchi" schemeURI="http://www.inchi-trust.org/">InChI=1S/C25H39NO9/c1-6-26-20-24-13-9-12-14(31-2)10-23(29,16(13)17(12)33-4)25(26,30)19(34-5)18(24)22(11-27,21(28)35-20)8-7-15(24)32-3/h12-20,27,29-30H,6-11H2,1-5H3/t12-,13-,14+,15+,16-,17+,18-,19+,20+,22+,23-,24+,25+/m1/s1</subject>
          <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">VELNVPXNOKVVTC-VJKZSTDTSA-N</subject>
      </subjects>

      The context is added by the attributes subjectScheme, schemeURI and valueURI. The top level context is the definition provided by the IUPAC Gold Book, and the actual implementation of the algorithm is described on the Gaussian site (although the algorithm there is not explicit in a machine implementable sense). These additions allow an indexed search of the numerical value (as a simple string and not as a floating point number) which can also be constrained by specifying the value of e.g. the subjectScheme, so that any other random number specified as a keyword which does not have this attribute is excluded. This also allows a search where the floating point number is replaced by wild-cards (*), which would then retrieve ANY reported Gibbs energy, which could in turn be constrained by say the nature of the molecule as expressed using InChI.

  3. The final aspect of the metadata analysed here is the relatedIdentifier record. This is increasingly recognised as a crucial component for the construction of so-called PID graphs, which are generated to reveal connections between entities in the research landscape such as data, people, organisations, funders, publications and any other object that is assigned a registered PID (such as perhaps in the future connecting data to its origins from a large instrument). So here are these records for the three repositories:
    1. Although the landing page for the Figshare record has three such entries, including pointers to the other two depositions being discussed here, they are not propagated to the metadata record and so cannot participate in any generated PID graph.
    2. Zenodo has the following record
      <relatedIdentifiers>
          <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.5511965</relatedIdentifier>
        </relatedIdentifiers>

      which relates to an earlier version of the metadata for this entry.

    3. The Imperial record is:
      <relatedIdentifiers>
          <relatedIdentifier relatedIdentifierType="URL" relationType="HasMetadata">https://data.hpc.imperial.ac.uk/resolve/?ore=9031</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://data.hpc.imperial.ac.uk/resolve/?doi=9031&file=1</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://data.hpc.imperial.ac.uk/resolve/?doi=9031&file=2</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://data.hpc.imperial.ac.uk/resolve/?doi=9031&file=3</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://data.hpc.imperial.ac.uk/resolve/?doi=9031&file=4</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="DOI" relationType="References">10.5281/zenodo.5511966</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="DOI" relationType="References">10.6084/m9.figshare.16685497</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="DOI" relationType="IsPartOf">10.14469/hpc/9158</relatedIdentifier>
        </relatedIdentifiers>

      where a large number of related PIDs would result in a rich PID graph. These entries include relationType="HasMetadata", which is a pointer to additional metadata expressed using a different schema (ORE) and which provides a machine-actionable manifest for the files present, specifying the Media Types of each file and a machine method of accessing them. relationType="HasPart" provides an access URL for each specific item in the fileset. relationType="References" is the analogue of the Figshare entries above, citing the other two repositories we are discussing here, and finally relationType="IsPartOf" indicates the deposition is part of a larger collection (in this case the collection generated for this blog) and which could also correspond to e.g. a project comprising multiple researchers at multiple institutions, or say a PhD dissertation containing multiple chapters. The extensive nature of this list of identifiers means that the PID graph would reveal many connections.
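To illustrate how such a record could feed both a context-aware search and a PID graph, here is a minimal Python sketch that reads the subject entries together with their subjectScheme, and turns each relatedIdentifier into a labelled edge. The fragment is an abridged copy of the Imperial record shown above, wrapped in a `<resource>` element for parsing; the function names are illustrative, and the `local` helper is there because full DataCite records carry an XML namespace that this fragment omits.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<resource>
  <subjects>
    <subject subjectScheme="Gibbs_Energy" schemeURI="https://doi.org/10.1351/goldbook.G02629" valueURI="http://gaussian.com/thermo/">-1705.490787</subject>
    <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">VELNVPXNOKVVTC-VJKZSTDTSA-N</subject>
  </subjects>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="References">10.5281/zenodo.5511966</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="References">10.6084/m9.figshare.16685497</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsPartOf">10.14469/hpc/9158</relatedIdentifier>
  </relatedIdentifiers>
</resource>"""


def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def subjects(root):
    """Return (subjectScheme, value) pairs; the scheme is None if absent."""
    return [(el.get("subjectScheme"), el.text.strip())
            for el in root.iter() if local(el.tag) == "subject"]


def pid_edges(root, source_doi):
    """Return (source, relationType, target) triples for a PID graph."""
    return [(source_doi, el.get("relationType"), el.text.strip())
            for el in root.iter() if local(el.tag) == "relatedIdentifier"]


root = ET.fromstring(FRAGMENT)
print(subjects(root))
for edge in pid_edges(root, "10.14469/hpc/9031"):
    print(edge)
```

Run against the Zenodo fragment instead, `subjects` would return the same values but with a scheme of None for each, which is exactly the loss of context discussed above.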

I have only covered three repositories here; many more could be added to the list and analyzed for their metadata records. The bottom line is that generally the more metadata that is added, the richer the resulting services and analyses based on PIDs can become. It can only be hoped that this aspect of the operation of repositories continues to improve over time and eventually most will broadcast very rich metadata, including at the very specific subject level. This should enrich the research landscapes, especially at the finely grained subject level.

In the next post, I will analyse the results of searches enabled by this metadata.


Figshare also has an available API, which has not been implemented in the current version of this portal. Policies regarding editing of metadata vary. Some repositories allow edits to update the record held by DataCite against the existing DOI. Others require the generation of a new DOI for each new version of the metadata, no matter how small a change (e.g. spelling mistakes in the title etc). An unsolved problem in DataCite metadata is datatypes and units. The Gibbs energy entry is a floating point data type, with units of Hartree. How this information can be added is still being discussed.

HPC Access and Metadata Portal (CHAMP).

Monday, September 13th, 2021

You might have noticed, if you have read any of my posts here, that many of them have been accompanied since 2006 by supporting calculations, normally based on density functional theory (DFT), and that these calculations are accompanied by a persistent identifier pointer to a data repository publication. I have hitherto not gone into the detail here of the infrastructures required to do this sort of thing, but recently one of the two components has been updated to V2, after being at V1 for some fourteen years[1] and this provides a timely opportunity to describe the system a little more.

The original design was based on what we called a portal to access the high performance computing (HPC) resources available centrally. These are controlled by a commercial package called PBS which provides a command line driven interface to batch queues. Whilst powerful, PBS can also be complex, and for everyday routine use it seemed more convenient to package up this interface into a Web-accessed portal. This portal also included the ability to specify the resources needed (such as memory, number of CPUs, etc) to run the desired compute program, in our case the Gaussian 16 package, and completed things by adding a simple interface to a data repository for use when the calculation was completed.
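To give a flavour of what such a portal hides from the user, here is a hedged sketch of how a batch script might be generated from the resources chosen on a web form. It is not the portal's actual code, which is not shown in this post; the function name and the example job are hypothetical, and the directives assume PBS Professional style resource requests and a `g16` command on the path.

```python
def pbs_script(job_name, input_file, ncpus, mem_gb, walltime):
    """Render a PBS batch script that runs Gaussian 16 on one input file."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",                            # job name in the queue
        f"#PBS -l select=1:ncpus={ncpus}:mem={mem_gb}gb",  # one node, CPUs, memory
        f"#PBS -l walltime={walltime}",                   # maximum run time
        "cd $PBS_O_WORKDIR",                              # directory of submission
        f"g16 {input_file}",                              # run the calculation
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job, standing in for values picked on the portal form.
script = pbs_script("dft_opt", "molecule.gjf", ncpus=8, mem_gb=16,
                    walltime="24:00:00")
print(script)
```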

The process of using this tool, which functions in essence as an Electronic Laboratory Notebook or ELN for computational chemistry, can be summarised as a workflow, which occurs horizontally in the screenshot of V1 above. Each job is assigned an internal ID, which is associated with a pre-configured project and given a searchable description. Its status in the PBS-controlled queues is indicated and when finished the associated input and output files become available for download, with an option to delete these if they are not in fact needed, and a final option to publish to the accompanying tool which is a data repository. V1 of this portal was in fact written in the PHP scripting language and controlled behind the scenes using a MySQL database, which allows the entries to be filtered by search terms such as the assigned project or the description. This proved particularly useful when the number of entries reached large numbers (> 100,000 eventually) and meant that even 15-year old entries could be easily found and inspected!
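The filtering of job entries by project or description can be illustrated with a small database sketch. V1 used PHP and MySQL; the example below swaps in Python's built-in sqlite3 module so that it runs anywhere, and the table layout and sample rows are hypothetical and not the portal's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, project TEXT, "
    "description TEXT, status TEXT)"
)
# Hypothetical entries standing in for the portal's job list.
conn.executemany(
    "INSERT INTO jobs (project, description, status) VALUES (?, ?, ?)",
    [
        ("blog", "transition state search", "finished"),
        ("blog", "Gibbs energy of product", "finished"),
        ("teaching", "Gibbs energy of reactant", "queued"),
    ],
)


def find_jobs(conn, project=None, text=None):
    """Filter jobs by exact project and/or a substring of the description."""
    sql = "SELECT id, project, description, status FROM jobs WHERE 1=1"
    params = []
    if project is not None:
        sql += " AND project = ?"
        params.append(project)
    if text is not None:
        sql += " AND description LIKE ?"
        params.append(f"%{text}%")
    return conn.execute(sql + " ORDER BY id", params).fetchall()


for row in find_jobs(conn, project="blog", text="Gibbs"):
    print(row)
```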

Although this workflow proved highly robust, the underlying PHP system and associated code became increasingly unmaintainable and in 2021 we decided to refactor it for greater sustainability. We had noticed that in 2018, another group had taken the basic concept we had used in 2006, written a more flexible and portable open-source toolkit for building such a portal, calling it Open OnDemand: A Web-based client portal for HPC centers, and published a description.[2] In effect, a lot of the work in maintenance is now divested to a separate group and accordingly our software engineering group here at Imperial were far happier using such a tool. So now enter V2 of our own portal, which we now call HPC Access and Metadata Portal or CHAMP.

The workflow is very much the same as before, but with added flexibility that allows custom resources to be selected, which might include e.g. special grant-funded priority queues. Additionally, a new directory tool, provided by the Open OnDemand package, allows inspection of any job inputs or outputs and greatly facilitates minute-to-minute management/inspection of jobs to ensure the outputs are those expected for a properly functioning job.

If the job is deemed suitable for sharing, the publish button is pressed. This induces a workflow which, inter alia, converts the system specific checkpoint file to a formatted version which can be used on any system and generates a number of extra files needed for publication of the job.

Also of interest is the METADATA file, which generates calculation-specific metadata suitable for injection into the data repository. Currently, this includes the InChI string and Key for the molecule calculated and the Gibbs_Energy, the purpose of which was described in this post. In the future we plan to make this metadata even richer with further information. This calculation-specific metadata will later be conflated with generic metadata for the final publication on the actual repository. That full metadata record includes information about the person who ran the job (their ORCID etc), the institution they are at, the data licensing etc., garnered in part from the profile entry for that user on the CHAMP portal.
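As an illustration of how calculation-specific values could be turned into metadata ready for injection, the sketch below builds DataCite-style `<subject>` elements from a small dictionary. The layout of the METADATA file itself is not described here, so the input dictionary and the function name are assumptions; the scheme and value URIs are those found in the published Imperial College record for 10.14469/hpc/9031.

```python
import xml.etree.ElementTree as ET

# Context attributes for each kind of calculation-specific keyword.
SCHEMES = {
    "Gibbs_Energy": {
        "schemeURI": "https://doi.org/10.1351/goldbook.G02629",
        "valueURI": "http://gaussian.com/thermo/",
    },
    "inchi": {"schemeURI": "http://www.inchi-trust.org/"},
    "inchikey": {"schemeURI": "http://www.inchi-trust.org/"},
}


def subjects_element(values):
    """Build a <subjects> element from a {subjectScheme: value} mapping."""
    root = ET.Element("subjects")
    for scheme, value in values.items():
        attributes = {"subjectScheme": scheme, **SCHEMES[scheme]}
        ET.SubElement(root, "subject", attributes).text = value
    return root


# Values as they might be extracted from a finished calculation.
calculated = {
    "Gibbs_Energy": "-1705.490787",
    "inchikey": "VELNVPXNOKVVTC-VJKZSTDTSA-N",
}
element = subjects_element(calculated)
print(ET.tostring(element, encoding="unicode"))
```

Keeping the scheme URIs in one table means every published job carries the same context, rather than relying on each depositor to type it in.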

After publication, the CHAMP entry for the job is updated to include the DOI for the data publication, and hyperlinked to allow immediate access to this entry in the repository.

An information page about the job also includes a link to the final full published metadata record(s).

CHAMP currently includes workflows to publish to the Imperial College repository. Zenodo has also now been added, and other repositories may follow in the future as demand requires.

You can see here that I have described how an ELN was originally designed from scratch to control quantum calculations, and how an essential symbiotic partner to this resource was considered to be a data repository at the outset, even way back in 2006. Now, the first of these resources has been refactored into modern form and no doubt the repository end will also be in the future. The code is available for anyone to create a similar compute portal for themselves.

A different version of this description, including more details of the software engineering, will shortly be submitted to the Journal of Open Source Software, along with source code suitable for use with Open OnDemand at https://github.com/ImperialCollegeLondon/hpc_portal/.


Originally in the form of a Handle, which was replaced by the use of a DOI. The DOI for this post itself is 10.14469/hpc/9010

References

  1. M.J. Harvey, N.J. Mason, and H.S. Rzepa, "Digital Data Repositories in Chemistry and Their Integration with Journals and Electronic Notebooks", Journal of Chemical Information and Modeling, vol. 54, pp. 2627-2635, 2014. https://doi.org/10.1021/ci500302p
  2. D. Hudak, D. Johnson, A. Chalker, J. Nicklas, E. Franz, T. Dockendorf, and B. McMichael, "Open OnDemand: A web-based client portal for HPC centers", Journal of Open Source Software, vol. 3, pp. 622, 2018. https://doi.org/10.21105/joss.00622

Octopus publishing: dis-assembling the research article into eight components.

Friday, August 13th, 2021

In 2011, I suggested that the standard monolith that is the conventional scientific article could be broken down into two separate, but interlinked components, being the story or narrative of the article and the data on which the story is based. Later in 2018 the bibliography in the form of open citations was added as a distinct third component.[1] Here I discuss an approach that has taken this even further, breaking the article down into as many as eight components and described as “Octopus publishing” for obvious reasons. These are:

  1. The problem being addressed
  2. An original hypothesis/theoretical rationale for the problem
  3. A method or protocol for testing the hypothesis in the form of experiments or modelling
  4. The data resulting from these experiments
  5. Analysis of this data
  6. Interpretation of the analysis in terms of the original hypothesis
  7. Translation/application to a real world problem (an extrapolation if you like of the original problem)
  8. A review of any of the previous seven items.

Items 1-3, 5-6 and probably 7 are the dis-assembled components of the standard concept of a scientific publication and item 4 is the data component I refer to in my introduction above. Interestingly, the article bibliography is not separated out into these components, and is presumably distributed throughout the resulting fragments. The essential concept behind this dis-assembly is that each component can rest on its own, provided it is contextually and bi-directionally linked into the others. The author(s) of any individual component will get credit/recognition for that component. A conventional mapping would be that the same author set would be responsible for all the individual items, whilst recognising that each component could in fact have its separate authorship. Thus one might get credit for just suggesting a problem, or for suggesting a protocol for its testing, for acquiring just the data, or for proposing an original analysis or interpretation. Any author’s reputation would then be established by the integrated whole of their contributions across the whole range of article components (leaving aside how the weightings of each individual contribution would be decided). 
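One way to picture this dis-assembly is as a small data model in which every component carries its own authors and its own links to other components. The sketch below is purely illustrative and says nothing about how Octopus itself is implemented; the class names, the placeholder identifiers and the author names are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    PROBLEM = 1
    HYPOTHESIS = 2
    PROTOCOL = 3
    DATA = 4
    ANALYSIS = 5
    INTERPRETATION = 6
    APPLICATION = 7
    REVIEW = 8


@dataclass
class Component:
    pid: str                    # placeholder identifier, not a real DOI
    kind: Kind
    authors: list
    links: set = field(default_factory=set)


def link(a, b):
    """Record a bi-directional link between two components."""
    a.links.add(b.pid)
    b.links.add(a.pid)


def credit(components):
    """Count how many components each author has contributed to."""
    return Counter(author for c in components for author in c.authors)


problem = Component("pid:problem-1", Kind.PROBLEM, ["Author A"])
protocol = Component("pid:protocol-1", Kind.PROTOCOL, ["Author B"])
data = Component("pid:data-1", Kind.DATA, ["Author A", "Author B"])
link(problem, protocol)
link(protocol, data)

print(credit([problem, protocol, data]))
```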

I decided to try to give it a go via a prototype at https://science-octopus.org/publish which as you can see is one of the eight above. The process starts with the prospective author providing their ORCID credentials, against which unique metadata is presumably generated.

The next stage however is more interesting: unlike any other publication platform, Octopus requires all publications to be linked to others that already exist. At the current stage of development, the prototype only has a few entries linked to COVID-19 as a topic, which I did not feel able to add to. Selecting one of those at random, one is asked to link any associated data via a DOI, then the optional indication of appropriate keywords, followed by standard questions about funding sources, conflicts of interest and license declaration. I wanted to use CC0, but this was not an option.

Finally at this stage comes a question about any other related publications, with the list known to Octopus being offered as suggestions. Next came the main opportunity to insert prose related to the information already provided (this part constituting the conventional article component). This is also the only opportunity to add a bibliography, and the citations would be part of that document, albeit not identified as citations for the inner workings of the Octopus system. Then comes a Publish now button. No persistent identifier (DOI) is generated in this prototype system, but one will be in the production system. A final screen then has four options, including to write a review (of one’s own work?) or to Red Flag the item. The latter can e.g. arise from plagiarism or any other expression of concern.

Many questions arise about this new approach and I will note only three. One relates to “Why would I want to publish in Octopus” in the FAQ section. I quote: “The traditional system is not only slow and expensive, but the concept of ‘papers’ is not a good way of disseminating scientific work in the 21st century” and “Publish it now and establish priority – once it’s out in Octopus it’s yours”. There are many experiments that strive to address this generic issue that the conventional research paper is no longer entirely fit for purpose. One I can supply here is that of “Preprint servers” such as ChemRxiv where one significant motivation is also to “establish priority”.

An aspect of interest to myself is how the metadata for all these eight components will be expressed. Presumably when mature, all eight components will have their own DOI or persistent identifier (PID) and hence all eight will also have a metadata record. These records will formally establish the relationships between the components, and ultimately could be used to construct a PID Graph not only between those components but to the rest of the PID “multiverse” of articles, data, citations and other research objects. Will Octopus join in this PID graph world?

And finally, the greatest challenge to a new paradigm such as Octopus is how quickly will the existing established culture of “publish a blockbuster article in one of the top ten (chemistry) journals” to establish your career evolve into the dis-assembled approach described here? It has taken preprint servers the best part of 25 years to really get going in e.g. chemistry where there are now around 9600 preprints on a variety of topics. I suspect some subject disciplines may be harder to crack than others (and chemistry may well be amongst these!).


Around 2009 or so I ran a student experiment using a Wiki which had some aspects of this approach. Students were asked to do a project on the topic of either a molecule from a suggested list of around 30, or a molecule entirely of their own choosing. The students spontaneously split themselves into three groups. The first were students who wrote the story entirely by themselves, submitted it for credit and did not welcome others as co-authors. The second (largest) group were those that contributed to multiple topics, very much in the manner of Wikipedia itself. Their credit was the sum of their contributions. The final group chose not to tell a story about a molecule, but to help everyone else with the infrastructure of doing so (the protocol if you like) by writing templates which simplified authoring, or correcting errors in existing stories etc.

Arguably a measure of the impact of these 9600 preprints is how many of them have eventually appeared in fully peer reviewed form in journals. That statistic may not be known. Also of interest would be some analysis for those that did end up in journals of how they evolved between e.g. V1 of the preprint and the final “version of record”.

Metadata is structured according to a specified schema. Currently, journal publishers use the CrossRef schema for this purpose, which contains a full description of e.g. the bibliography of an article. Data publishers use the DataCite Schema, which has less focus on bibliography. It will be of interest to see how such schemas are applied to the eight components of an Octopus scientific record.


References

  1. D. Shotton, "Funders should mandate open citations", Nature, vol. 553, pp. 129-129, 2018. https://doi.org/10.1038/d41586-018-00104-7

Room-temperature superconductivity in a carbonaceous sulfur hydride!

Saturday, October 17th, 2020

The title of this post indicates the exciting prospect that a method of producing a room temperature superconductor has finally been achieved[1]. This is only possible at enormous pressures however; >267 gigapascals (GPa) or 2,635,023 atmospheres.

The system is made by milling a mixture of elemental carbon and sulfur, followed by adding hydrogen gas, compression to 4 GPa and finally laser-induced photolysis at 532 nm for several hours. The result of this is the production of three entirely unexotic molecules, H2S, CH4 and H2 in approximately stoichiometric quantities, which at this pressure form a complex bound by van der Waals attractions. Since in this blog, I am particularly interested in molecular structures, my eye was drawn to “Extended data Figure 6, A DFT-optimized structure for (H2S)(CH4)H2 (variant 2) at 4 GPa”. This structure was produced by DFT optimisation modelled at 4 GPa using the PBE functional and importantly the now standard Grimme dispersion correction (often indicated as GD3+BJ, and used frequently on this blog). Since this complex is bound by dispersion attractions, it might be tempting to conclude that the intermolecular features of this structure originate in part from the Grimme dispersion model as well as possible hydrogen bonding from quantum effects.

I would love to be able to play with this structure to e.g. measure properties such as hydrogen bonding lengths or perform e.g. a QTAIM analysis, but have not yet acquired the “extended data” of figure 6 in the form of coordinates.‡  I have italicised the term extended data, being unsure what the journal means by this. If the figure relates to the three-dimensional extended structure of the crystal form of this complex, then one might imagine that any extended data associated with this figure would indeed be the numerical coordinates. Since the authors express the hope that “chemical tuning” of this system might enable complexes exhibiting superconductivity at lower pressures, I fancy that these coordinates might help provide insight into how to achieve such tuning. This closing paragraph of mine arose because I still frequently fail to see even prestigious journals doing very much to encourage FAIR data associated with articles. In this instance, FAIR, at least to my mind, is more than just a Figure (with or without extended data), but is genuinely inter-operable (I) or re-usable (R) data such as indeed are coordinates. To this end, I am unconvinced that this “extended data figure” is indeed properly FAIR.


I have requested these from the authors, and hope to make them available in the form of a 3D rotatable model here on the blog. It would be interesting to know if this model has been tested at the enormous pressures in this experiment. Standard dispersion models pertain to normal pressures. As was done here for Na2He.

References

  1. E. Snider, N. Dasenbrock-Gammon, R. McBride, M. Debessai, H. Vindana, K. Vencatasamy, K.V. Lawler, A. Salamat, and R.P. Dias, "RETRACTED ARTICLE: Room-temperature superconductivity in a carbonaceous sulfur hydride", Nature, vol. 586, pp. 373-377, 2020. https://doi.org/10.1038/s41586-020-2801-z

Exploiting the power of persistent identifiers (PIDs) for locating all kinds of research object.

Saturday, August 29th, 2020

The folks at DataCite have announced a new research object discovery service which aims to give users a “comprehensive overview of connections between entities in the research landscape”. The portal https://commons.datacite.org acts as the entry point for three basic types of persistent identifiers (PIDs):

  1. Research works, using the DOI (digital object identifier) as a PID. This includes both research articles and research data as “works” or research objects and can be invoked using the prefix https://commons.datacite.org/doi.org?query= to the search query.
  2. People, using the ORCID as a PID via the prefix https://commons.datacite.org/orcid.org?query=
  3. Organisations, using ROR as a PID using the prefix https://commons.datacite.org/ror.org?query=
  4. If one wants to construct a search which combines any two, or all three of the above categories, then the search prefix is simply https://commons.datacite.org/?query=
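Since the four prefixes above differ only in one path segment, a search URL can be assembled mechanically. What follows is a minimal Python sketch; the function name commons_url and the dictionary keys are illustrative labels of my own choosing and are not defined anywhere by DataCite.

```python
# Prefixes exactly as listed above; the dictionary keys are illustrative only.
COMMONS_PREFIXES = {
    "works": "https://commons.datacite.org/doi.org?query=",
    "people": "https://commons.datacite.org/orcid.org?query=",
    "organisations": "https://commons.datacite.org/ror.org?query=",
    "all": "https://commons.datacite.org/?query=",
}


def commons_url(query: str, kind: str = "all") -> str:
    """Append an already-formed query string to the chosen prefix."""
    return COMMONS_PREFIXES[kind] + query


# The same query, first across all PID types and then for works only.
print(commons_url("titles.title:*amidation*"))
print(commons_url("titles.title:*amidation*", kind="works"))
```

The query string itself is passed through untouched, so any of the fielded queries discussed in these posts can be dropped in as they stand.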

To use this very modern type of discovery portal, one currently has to be familiar with how to construct a valid search query to be appended to any of the above prefixes. This is now well documented at https://support.datacite.org/docs/datacite-commons, although it still requires some work and patience to construct a precise search query. This in turn requires knowledge of the so-called “metadata schema”, on which the indexing is based.

This sort of activity is best illustrated using examples. As it happens I have already collected a decent set at https://doi.org/drrm, nicely illustrating that a search query, or a collection of search queries, can themselves be considered as a valid research object! That collection used the prefix https://search.datacite.org/works?query= which might usefully be considered as now obsoleted by https://commons.datacite.org/?query=. You can take any of the original queries and try them out here. I will show just two:

  1. https://commons.datacite.org/?query=titles.title:*amidation* The original search gives 170 hits, since it is based largely on DOIs for datasets only. The new version of the search yields 1016 hits, since it includes authors and organisations as well. The results look like this, indicating 846 hits come from the CrossRef registration agency (mostly journals) and the rest from DataCite (mostly data).

  2. https://commons.datacite.org/?query=media.media_type:chemical/x-mnpub*+AND+(subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*)+AND+(subjects.subjectScheme:NMR_Nucleus+AND+subjects.subject:11B)+AND+(subjects.subjectScheme:NMR_Solvent+AND+subjects.subject:CDCl3) is at the other end of the spectrum for specificity, constraining the search to some very specific chemical properties, the nature of which should be reasonably obvious from the syntax of the query. This specificity is why it continues to give just one hit.

The evolution of these search facilities gives an interesting pointer to what the future might hold. New registration agencies can be easily added to the above lists for including other kinds of research object. For example, instruments and their properties. One can combine these diverse properties into a single search, thus revealing scientific information or connections that may not be apparent from historical (chemical) abstracting agencies such as e.g. CAS or Reaxys. Importantly, all the metadata on which the indexing is based is fully open and not proprietary and currently at least searches such as the above are free at point of use (unlike the chemical registration agencies noted for which commercial licenses have to be purchased by organisations). The concept of searching for relationships across different types of PID is summarised by the term “PID Graph”. This in turn can reveal other properties of the objects, such as e.g. usage statistics and citations.

It is good to see this evolution of new ways of finding scientific information and I rather think that we have only just begun to see the potential of this approach; there is much more to come. Exciting times ahead I fancy!


This post has a PID: 10.14469/hpc/7366.


A cascading tutorial in finding rich NMR data using the Datacite datasearch engine.

Saturday, April 11th, 2020

In the previous post, I introduced three of a new generation of search engines specialising in the discovery of data. Data has some special features which make its properties slightly different from the conceptual (or natural language) searches we are used to performing for general information and so a search engine specifically for data is invariably going to reflect this. At the simplest level, the data search can retain much of the generic simplicity of a regular search, but to exploit the unique features of data, one really does have to move on to an advanced mode. Here, by introducing a set of search definitions that gradually increase in specificity and power, I hope to convey some of the flavour of one way in which this could be done.


Let me first introduce the search: we want to track down raw NMR FID data for the 11B nucleus associated with the chemical concepts of catalytic amidation.


To understand how to construct a search query which is specific to this set of constraints, one has to understand metadata and in particular its context of describing data. This is done via a specification known as a schema. We are going to exploit one of the better known schemas for describing data, that produced by DataCite[1] (DOI: 10.14454/f2wp-s162). It can be illustrated by just three small metadata components, which can be implemented in say an XML language and the properties controlled by their specification in the schema and shown below, with the actual value of the metadata highlighted in red.

  1. <titles>
      <title>
      16b. 2-((2-aminoethyl)-λ4-azaneyl)-2,4,6-tris(3,4,5-trifluorophenyl)-1,3,5,2,4,6-trioxatriborinan-2-uide
      </title>
    </titles>
    
  2. <descriptions>
      <description descriptionType="Other">NMR spectra for 1H, 13C, 19F and 11B nuclei.</description>
    </descriptions>
    
  3. <subjects>
      <subject subjectScheme="inchi" schemeURI="http://www.inchi-trust.org/">
      InChI=1S/C20H14B3F9N2O3/c24-12-3-9(4-13(25)18(12)30)21-35-22(10-5-14(26)19(31)15(27)6-10)37-23(36-21,34-2-1-33)11-7-16(28)20(32)17(29)8-11/h3-8H,1-2,33-34H2/q-1
       </subject>
      <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">BHYQUOWHUMNGMD-UHFFFAOYSA-N
       </subject>
      <subject subjectScheme="NMR_Nucleus">11B</subject>
      <subject subjectScheme="NMR_Solvent">CDCl3</subject>
    </subjects>
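
To show how a machine might read the third fragment, here is a minimal Python sketch that maps each subjectScheme to its value. It assumes the fragment exactly as printed above, with no XML namespace and with the long InChI entry left out for brevity; a full registered record carries a namespace declaration that would need handling. The function name subjects_by_scheme is purely illustrative.

```python
import xml.etree.ElementTree as ET

# Fragment 3 from above, with the long InChI entry omitted for brevity.
SUBJECTS_XML = """
<subjects>
  <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">BHYQUOWHUMNGMD-UHFFFAOYSA-N
  </subject>
  <subject subjectScheme="NMR_Nucleus">11B</subject>
  <subject subjectScheme="NMR_Solvent">CDCl3</subject>
</subjects>
"""


def subjects_by_scheme(xml_text: str) -> dict:
    """Map each subjectScheme attribute to its whitespace-stripped value."""
    root = ET.fromstring(xml_text.strip())
    return {
        subject.get("subjectScheme"): (subject.text or "").strip()
        for subject in root.iter("subject")
    }


for scheme, value in subjects_by_scheme(SUBJECTS_XML).items():
    # Each pair corresponds to a search clause of the form
    # subjects.subjectScheme:<scheme>+AND+subjects.subject:<value>
    print(scheme, "->", value)
```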
    

The metadata is registered with a store (MDS, DataCite in this instance) in this form and then indexed there. To search that index, we need to learn the query syntax and expression. This is illustrated below for various examples, which can be broken down into components:

  1. The prefix https://search.datacite.org/works?query= is common to all the queries, and hence is only shown for example 1.
  2. The syntax e.g. titles.title: derives from the hierarchy of the metadata, as in 1 above.
  3. Immediately followed by a search string. The * character means the string may be part of a longer string, both preceding and following the actual search string. A literal string would be enclosed in quotes, “…”
  4. Two or more separate queries can be related by a Boolean operator, as +AND+ or +OR+.
  5. The Boolean operations can be grouped using (…) to ensure the logic is unambiguous.
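
These five components lend themselves to being assembled by a few small helper functions. The Python sketch below is only an illustration; the helper names (term, all_of, any_of, title_or_description) are hypothetical and not part of any DataCite tooling. It builds the "keyword in either title or description" pattern for two keywords.

```python
PREFIX = "https://search.datacite.org/works?query="


def term(field: str, value: str, wildcard: bool = True) -> str:
    """One fielded clause, e.g. titles.title:*amidation*."""
    return f"{field}:*{value}*" if wildcard else f"{field}:{value}"


def all_of(*clauses: str) -> str:
    """Join clauses with the Boolean AND operator."""
    return "+AND+".join(clauses)


def any_of(*clauses: str) -> str:
    """Join clauses with OR and group them in parentheses."""
    return "(" + "+OR+".join(clauses) + ")"


def title_or_description(word: str) -> str:
    """The keyword may appear in either a title or a description."""
    return any_of(term("titles.title", word),
                  term("descriptions.description", word))


query = all_of(title_or_description("amidation"),
               title_or_description("catalytic"))
print(PREFIX + query)
```

Keeping the grouping inside any_of means the parentheses can never be forgotten, which is the most common way for a hand-typed query of this kind to go wrong.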

With the syntax dealt with, we can now proceed to some actual queries. The hits shown were obtained on the day this post was written, and may change with time (hopefully but not necessarily upwards). A brief attempt at a natural language expression of each search appears in the table below, with the Boolean operators indicated in red. Each example is elaborated below to show the logic of their evolution.

Examples 1-9 deal with keywords typically found in either the title or the description metadata fields. Because there are no hard and fast rules as to which of these two any particular keyword might be found in, searches have to be defined which allow both possibilities. Search 2 seeks to find datasets where both keywords are found in a title (or indeed titles, since multiple titles for the same dataset are allowed). Search 3 allows each term to be found in either the title(s) or the description(s) using grouping operators; the difference in hits shows the necessity of doing this. The search outlined at top also indicated we specifically wanted NMR data. Searches 4-6 search for this term in either the title or the description. We are now assuming that NMR really does relate to spectroscopy and not some other acronym in use by another community. This can be a real problem if the same term has different meanings across different subject areas. In examples 7-9, we now turn to boron, since 11B NMR requires a boron compound! Allowing any of the terms to appear either as a title or a description increases the hits compared to more restricted searches.

Time now to restrict the searches even more. In the previous searches, we had identified a potential discovery lead (i.e. one we might wish to follow up in more detail). Looking this lead up, we find its molecular formula, a very useful chemical search term. Because this is quite subject specific, we now turn to <subject> rather than <title> or <description>. Search 10 illustrates how this might be done. Search 11 is even more specific; whereas it is possible that two different chemical species might share a common molecular formula (as isomers), their chemical identifiers (InChI and InChIKey) should be far more specific. These latter two can be generated algorithmically for any given compound and so should return information about that specific molecule. Search 12 now combines this search with the 11B nucleus specified as a description, and search 13 generalises it to title as well.

We are now ready to go to the next level of refinement, that of media types. These are descriptors which identify the type of document in which the data is held. We are all familiar with e.g. .docx as belonging to the Microsoft Word family, originating in early computer operating systems where each document or file name had two components, with the suffix indicating the application (family) likely to be able to process it or the application to be used when the document is double clicked on the desktop. So in search 14, we combine a search of NMR in the title or description with the media type application/zip. We know that Bruker spectrometers export their data in a folder containing about 24 components and this is generally packaged up as a ZIP archive to make it tractable for submission and exchange. We do not know for sure what will be in the ZIP archive, but in combination with the title/description we may be reasonably optimistic (but not certain). However, a ZIP file identified and downloaded by this procedure still has to be accessed in a manner that will recognise any NMR data therein. This function must now be devolved to whatever program is used to access the ZIP file. 
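
As a hedged illustration of what "recognising any NMR data therein" might involve, the Python sketch below inspects the member names of a ZIP archive. It assumes, to the best of my knowledge of the usual Bruker layout, that raw data sits in a file called fid (or ser for multi-dimensional experiments) alongside an acquisition parameter file called acqus. The function name looks_like_bruker is hypothetical, and the check says nothing about whether the contents are valid.

```python
import io
import zipfile


def looks_like_bruker(archive) -> bool:
    """True if the ZIP holds a raw-data file (fid or ser) and an acqus file."""
    with zipfile.ZipFile(archive) as zf:
        # Reduce every member path to its final component.
        basenames = {name.rstrip("/").split("/")[-1] for name in zf.namelist()}
    return bool(basenames & {"fid", "ser"}) and "acqus" in basenames


# Build a tiny stand-in archive in memory; the folder name and the
# file contents are placeholders, not real spectrometer output.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("example/1/fid", b"\x00" * 8)
    zf.writestr("example/1/acqus", "##TITLE= placeholder\n")

print(looks_like_bruker(buffer))
```

A test of this kind only narrows the odds; it is still the processing program that must decide whether it can actually read what it finds.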

In search 15, we try to be a bit more specific by combining the molecular identifier (InChIKey) with 11B (an NMR active nucleus) in a title or description and a JCAMP-DX media type. This latter type is more clearly associated with NMR spectroscopic data in JCAMP format, so the expectation is that any hits for this search sequence should provide us with an actual NMR spectrum! There is a slight spanner in the works; we do not yet know whether to expect processed NMR data (i.e. a spectrum) or raw NMR data (i.e. an FID), since JCAMP can hold either (but not both; most examples in fact relate to spectra). Example 16 takes us to a media type which IS known to hold both raw and spectral data concurrently, the Mnova format. But this again leads to a new issue. Mnova is commercial software and to use it you need a license. It would be indeed cruel if you managed to find some data, but then had to pay money to view it in its commercial format (although of course that is how some journals operate). Example 17 addresses that problem. The media type is associated not with a data file as such, but with a single-use license file which can be read by Mnova to license the program to read the actual data file. You can now view the data in either FID or spectral form and process the data to your heart’s content. This largely encapsulates the aspiration of the acronym FAIR. We have Found and Accessed the data, Interoperated (i.e. converted an FID to a spectrum) and Re-used it (having checked the re-use license in the metadata) to e.g. analyze the spectrum.

Example 18 takes us to our final level. Previously the acronym NMR was used as a search term. You might be surprised to learn that it can have up to 33 meanings! In this context, we are interested in only one of them (nuclear magnetic resonance). So rather than imprecisely specify it in a title or a description, we are now going to (also) give it a more precise meaning using <subject>. The exact way in which to do this is still being debated; here is one possibility. Elaborating list item 3 above, we get
subjects.subjectScheme:NMR_Nucleus+AND+subjects.subject:11B
which is used to disambiguate from the other 32 possible meanings of NMR. Hence we are interested specifically in the 11B nucleus. We are controlling the data itself to relate to NMR data about that nucleus, using the media type. And example 19 now specifies also that the measurement must be made in a particular solvent. There are of course many other parameters which could be used.

# Search query Hits Plain(er) English description
General keywords such as Title and Description
1 https://search.datacite.org/works?query=titles.title:*amidation* 161 Amidation in title.
2 titles.title:*amidation*+AND+titles.title:*catalytic* 2 Amidation AND catalytic in title.
3 (titles.title:*amidation*+OR+descriptions.description:*amidation*)+AND+(titles.title:*catalytic*+OR+descriptions.description:*catalytic*) 28 Amidation in either title OR description AND Catalytic in either title OR description.
4 descriptions.description:*NMR* 17,978 NMR in description
5 descriptions.description:*NMR*+OR+titles.title:*NMR* 26,152 NMR in either title OR description.
6 titles.title:*boron*+AND+titles.title:*catalysed* 20 Boron AND Catalysed in title.
7 titles.title:*boron*+AND+titles.title:*catalysed*+AND+titles.title:*NMR* 1 Boron AND Catalysed AND NMR in title.
8 titles.title:*boron*+AND+titles.title:*catalysed*+AND+(titles.title:*NMR*+OR+descriptions.description:*NMR*) 3 Boron AND Catalysed in Title and NMR in either title OR description.
9 (titles.title:*boron*+OR+descriptions.description:*boron*)+AND+(titles.title:*catalysed*+OR+descriptions.description:*catalysed*)+AND+(titles.title:*NMR*+OR+descriptions.description:*NMR*) 6 Boron AND Catalysed AND NMR in either title OR description.
Discovery lead: 10.14469/hpc/2247
Subject keywords
10 subjects.subjectScheme:inchi+AND+subjects.subject:*C20H14B3F9N2O3* 1 Molecular formula in subject.
11 subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N* 1 InChIkey in subject.
12 subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+descriptions.description:*11B* 1 InChIKey in subject AND 11B in description.
13 subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+(descriptions.description:*11B*+OR+titles.title:*11B*) 1 InChIKey in subject AND 11B in either description OR title.
Discovery lead:10.14469/hpc/2365
14 media.media_type:application/zip+AND+(descriptions.description:*NMR*+OR+titles.title:*NMR*) 219 NMR in either title OR description AND media type which might contain (Bruker spectrometer) FID data. As it happens, all 219 ZIP files in this instance do.
15 media.media_type:chemical/x-jcamp*+AND+subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+(descriptions.description:*11B*+OR+titles.title:*11B*) 1 InChIKey in subject AND 11B in either description OR title AND Media type known to contain spectral NMR data (and possibly raw NMR data).
16 media.media_type:chemical/x-mnova*+AND+subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+descriptions.description:*11B* 1 InChIKey in subject AND 11B in description AND Media type known to contain both raw and spectral data (probably NMR)
17 media.media_type:chemical/x-mnpub*+AND+subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*+AND+descriptions.description:*11B* 1 InChIKey in subject AND 11B in description AND Media type known to contain a license for use of MestreNova.
18 media.media_type:chemical/x-mnpub*+AND+(subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*)+AND+(subjects.subjectScheme:NMR_Nucleus+AND+subjects.subject:11B) 1 InChIkey in subject AND 11B Nucleus in Subject AND Media type known to contain a license for use of MestreNova for the dataset.
19 media.media_type:chemical/x-mnpub*+AND+(subjects.subjectScheme:inchikey+AND+subjects.subject:*BHYQUOWHUMNGMD-UHFFFAOYSA-N*)+AND+(subjects.subjectScheme:NMR_Nucleus+AND+subjects.subject:11B)+AND+(subjects.subjectScheme:NMR_Solvent+AND+subjects.subject:CDCl3) 1 InChIkey in subject AND 11B Nucleus in Subject AND Media type known to contain both raw and spectral data AND solvent chloroform in subject.
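
The hit counts in the table were read off the web interface, but in principle a script could collect them. The Python sketch below rests on my understanding that DataCite offers a REST endpoint at https://api.datacite.org/dois accepting a query parameter, and that the JSON response reports the number of matches under meta.total; both should be checked against the current API documentation before relying on them. The function names are hypothetical, and counts obtained this way need not agree with those in the table.

```python
import json
import urllib.parse
import urllib.request

API = "https://api.datacite.org/dois"  # assumed REST endpoint, see lead-in


def api_url(table_query: str) -> str:
    """Turn a query written as in the table into a REST API URL."""
    # In the table "+" stands for a URL-encoded space, so decode it first.
    plain = table_query.replace("+", " ")
    return API + "?" + urllib.parse.urlencode({"query": plain})


def hits(response_text: str) -> int:
    """Read the total hit count from a JSON response body (assumed layout)."""
    return json.loads(response_text)["meta"]["total"]


def count(table_query: str) -> int:
    """Perform the network request; defined here but not called."""
    with urllib.request.urlopen(api_url(table_query)) as response:
        return hits(response.read().decode("utf-8"))


print(api_url("titles.title:*amidation*+AND+titles.title:*catalytic*"))
```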

The searches above are meant to be illustrative and to serve as a tutorial showing one way of constraining a data search to have very specific, in this example chemical, properties. Many of the examples could be tightened up further (thus making them look even more intimidating). Also, some of the precise ways of defining such constraints are still being debated. In the above, I use both the definitions found in the Schema coupled with the media types property. It would also be possible to e.g. dispense with the media types and achieve this using the other properties obtained from the schema. When the dust settles (if it ever does) on this, it is quite possible the searches will look rather different from the above. The purpose here was not to set any standards in stone, but simply to illustrate the potential of searching for data in this manner. Other methods may emerge; the Google dataset search system does not use the same schema for example and so the searches themselves would also look different.

It should also be mentioned that the examples in the table above are not likely, in their present form, to be willingly used by most chemists. These queries are largely formulated in a syntax more suited for machines than for humans. But there is nothing to prevent a more human-friendly “front end” being written that takes the quite complex syntax above and renders it more usable by people. Such a front end could also absorb queries formulated against different schemas and unify them for the user.


You can see a more complete set here. Of course, the 11B nucleus can have many properties other than NMR. Programs such as MestreNova can do this, but you will need a commercial license to process in this way. If there is a media type chemical/x-mnpub also associated with the ZIP file, then this can be used in lieu of such a license key for that dataset only. See examples 17-19. Bagit is one schema for adding metadata to a container such as ZIP to indicate the contents, albeit with the requirement that the software reading the ZIP file must process this information for it to be of use. This post has DOI: drrm.

References

  1. DataCite Metadata Working Group., "DataCite Metadata Schema for the Publication and Citation of Research Data v4.3", DataCite, 2019. https://doi.org/10.14454/f2wp-s162

New generations of globally aggregating search engines – for (chemical) data.

Tuesday, April 7th, 2020

Chemists have long been familiar with search engines that aspire to index a large proportion of the chemical literature. Think for example of the old-generation (and commercial) SciFinder (Scholar) and Reaxys or those that arrived in the 1990s in the online era such as the non-commercial PubChem or ChemSpider (there are more). But you may not be as familiar with the latest generation of global search engines and here I will focus on three relatively new ones that specialise specifically in tracking down data rather than just publications.

I will illustrate first using a regular or non-advanced search. The keyword will be obtusallene, which is selected largely because it is a relatively unique string which is likely to result in fewer false positives. It is a family of marine alkaloids containing, unusually, bromine and/or chlorine[1] and the citation here is to a journal article describing some of its chemistry. But what if you want to find data associated with such molecules?

  1. DataCite (the name gives a clue) specialises in finding data. It was launched ten years ago and has been rapidly expanding its index since. A regular search can be formulated using the string

    As these three advanced queries imply, there are many more ways of constraining the search, which I will describe at a later time.

  2. A more recent introduction is DataSetSearch from Google.
    • https://datasetsearch.research.google.com/search?query=obtusallene (20 hits). Google cites as its sources DataCite itself and the specific repository Figshare (for this search query). 
    • Which leaves a slight mystery. Whilst there is considerable overlap between the DataCite and Google searches, the latter should clearly be potentially a superset of the former, but in fact it is slightly less comprehensive (by at least 5 hits).
  3. My third new engine is OpenAIRE (a European project supporting Open Science). It is also the search engine provided by Zenodo.
    • https://explore.openaire.eu/search/find?keyword=obtusallene (20 hits on research data, 6 hits on publications, 5 hits on “other research products” and zero hits on “software”).
    • Which introduces not just data but other concepts associated with “research objects”, clearly more useful than data alone. One of these may well shortly be Instruments (as e.g. used to acquire data) and another is e.g. the software used to analyze the data.

I think these new-generation search engines specialising in data have lots of exciting potential. They are still maturing and I hope we will see some interesting new capabilities emerge which we have not had before.


All are on-line nowadays, but engines such as SciFinder had two previous existences, from about 1980 as CAS online using merely a terminal interface, and prior to that as printed copies to be searched manually.

References

  1. J. Clarke, K.J. Bonney, M. Yaqoob, S. Solanki, H.S. Rzepa, A.J.P. White, D.S. Millan, and D.C. Braddock, "Epimeric Face-Selective Oxidations and Diastereodivergent Transannular Oxonium Ion Formation Fragmentations: Computational Modeling and Total Syntheses of 12-Epoxyobtusallene IV, 12-Epoxyobtusallene II, Obtusallene X, Marilzabicycloallene C, and Marilzabicycloallene D", The Journal of Organic Chemistry, vol. 81, pp. 9539-9552, 2016. https://doi.org/10.1021/acs.joc.6b02008

The Persistent Identifier ecosystem expands – to instruments!

Saturday, March 21st, 2020

A PID or persistent identifier has been in common use in scientific publishing for around 20 years now. It was introduced as a DOI (Digital Object Identifier), and the digital object in this case was the journal article. From 2000 onwards, DOIs started appearing for most journal articles, journals having obtained them from a registration agency, CrossRef. This is a not-for-profit organisation set up by a publishers association for the purpose. Most readers of journal articles started to use this DOI as an easier way of navigating through invariably different and sometimes confusing metaphors set up by any given journal to navigate through its issues. Readers slowly learnt to prepend the URL http://dx.doi.org/ to the DOI to “resolve” it directly to what is known as the “landing page” of the article. More recently, the prefix recommendation has changed to the slightly shorter https://doi.org/ form. Few readers are aware, however, that the DOI can serve a much more interesting purpose than just taking you to the article landing page. This post will explore a few of these extras.

  1. Firstly, a DOI has something called metadata associated with it, and you can view this metadata by prepending a different prefix, such as https://api.crossref.org/works/ to a DOI (as in https://api.crossref.org/works/10.1021/acsomega.8b03005). This returns a “machine response”, since this is very much the audience this version of a resolved DOI is intended for. A simple example of why this can be useful can be seen at the end of this blog post.
  2. An alternative prefix is https://data.datacite.org/application/vnd.datacite.datacite+xml/ and this brings us to the next big deployment of persistent identifiers, starting around 2010 with the focus now on data. The PID is still called a DOI, but the digital object is now data (or software) rather than a journal article and the agency registering the metadata is now DataCite rather than CrossRef. So e.g. https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/4844 now returns metadata about data. The usefulness of this has in recent times become encapsulated by the expression FAIR data. The metadata can help you Find the data, Access it, how it might be Interoperable and how to Reuse it.
  3. In 2012 a third prepend of the type https://orcid.org/ was introduced to provide metadata about researchers, as in https://orcid.org/0000-0002-8635-8390
  4. Then in 2019, the growing ecosystem expanded to organisations, as with the new resolver  https://ror.org/ and with PID e.g. 041kmwe10, hence https://ror.org/041kmwe10
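
The pattern running through this list is simply "prefix plus identifier", which the short Python sketch below makes explicit. The prefixes are those quoted above; the dictionary keys and the function name resolve are illustrative labels of my own.

```python
# Prefixes as quoted in the list above; the keys are illustrative labels only.
RESOLVERS = {
    "landing": "https://doi.org/",
    "crossref-metadata": "https://api.crossref.org/works/",
    "datacite-metadata": "https://data.datacite.org/application/vnd.datacite.datacite+xml/",
    "orcid": "https://orcid.org/",
    "ror": "https://ror.org/",
}


def resolve(pid: str, how: str) -> str:
    """Prepend the chosen resolver prefix to a persistent identifier."""
    return RESOLVERS[how] + pid


# The examples used in the list above.
print(resolve("10.1021/acsomega.8b03005", "crossref-metadata"))
print(resolve("10.14469/hpc/4844", "datacite-metadata"))
print(resolve("0000-0002-8635-8390", "orcid"))
print(resolve("041kmwe10", "ror"))
```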

After this long introduction, it's time to turn to the latest proposed PID type. As the title suggests, it is for instrumentation and it is introduced at https://doi.org/10.5438/tdk2-2g94 (and yes, metadata at https://data.datacite.org/application/vnd.datacite.datacite+xml/10.5438/tdk2-2g94). An example describing the properties of an instrument can be found at DOI: 10.7914/SN/SH and in the chemistry community we can already start asking ourselves questions such as what types of instrument deserve their own PID, and what sort of information about the instrument might be usefully associated with the data and be of interest to other researchers.

This is early days yet for this latest proposal, but already one can start to see how this ecosystem might be operating in the future. Consider the scenario. A research team at a specified institute (PID) consisting of say four individuals (PID for each) uses a recently funded (PID) NMR spectrometer (PID) fitted with a special ultra-sensitive low temperature probe (PID), records a collection of individual solution spectra and then publishes both the collection and the raw data from which the spectra are derived, each with their own PID. With the help of quantum simulations (PID) of the spectra, they interpret the molecular structures and confirm this with a crystal structure (PID). A student graduates with a PhD based on this work (PID). Finally they publish their story (PID) in a journal that releases open citations (PID), thank their instrument funders (PID) and perhaps blog about it (PID). Since machines can access the metadata records of all these PIDs, the entire endeavour becomes linked with exchanged information. Starting at any single PID, one should easily be able to trace all the others and locate the data and other information associated with all the aspects of the project.
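
The claim that one can start at any single PID and trace all the others amounts to a traversal of a connected graph. The toy Python sketch below illustrates the idea with a breadth-first search; the node labels are hypothetical stand-ins for real PIDs and the links are invented for this scenario, not drawn from any actual metadata records.

```python
from collections import deque

# Hypothetical labels standing in for real PIDs; each pair is a relationship
# that one of the two metadata records would declare.
LINKS = [
    ("article", "dataset"),
    ("article", "author-1"),
    ("article", "funder"),
    ("dataset", "instrument"),
    ("dataset", "simulation"),
    ("dataset", "author-1"),
    ("instrument", "probe"),
    ("instrument", "institute"),
    ("author-1", "institute"),
    ("thesis", "author-1"),
]


def build_graph(links):
    """Store every link in both directions so it can be followed either way."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def reachable(graph, start):
    """Breadth-first search: every PID that can be traced from the start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


graph = build_graph(LINKS)
print(sorted(reachable(graph, "probe")))
```

The traversal only works if each relationship can be followed in both directions, which in practice depends on the relationships actually being declared in the registered metadata.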

I used the term future above, but in fact much of the above infrastructure is already operating, albeit in early days mode. So this is one to keep an eye out for; things might happen more quickly than you might think!


Documented at e.g. https://github.com/CrossRef/rest-api-doc#queries

If you want a more human readable version, use this JSON to XML converter.

This is how the citations at the end of this post are generated. In the post itself they are inserted using e.g. ⌈cite⌉10.1021/acsomega.8b03005⌈/cite⌉ and a plug-in then expands this to a query of the above resource and formats the response to generate the bibliographic details at the end.[1]

Documentation for how to implement this is found at https://github.com/rdawg-pidinst/schema/blob/master/schema.rst and before you ask, no this one does NOT have a PID!

This blog post has PID: 10.14469/hpc/7016
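The citation mechanism described in these notes, in which a marker containing a DOI is expanded into a query and a numbered reference, can be sketched roughly as follows. This is not the plug-in's actual code, which is not shown here; the function names are invented, the marker characters are copied as they appear in this post, and the use of the Crossref REST API route `https://api.crossref.org/works/{doi}` as the query target is my assumption.

```python
import re
from urllib.parse import quote

# Marker syntax copied from the example in the text; group 1 captures the DOI.
CITE_PATTERN = re.compile(r"⌈cite⌉\s*(.+?)\s*⌈/cite⌉")


def find_cited_dois(text):
    """Return the cited DOIs in order of first appearance, without repeats."""
    dois = []
    for doi in CITE_PATTERN.findall(text):
        if doi not in dois:
            dois.append(doi)
    return dois


def crossref_url(doi):
    """Build a query URL for one DOI (assumed route: the Crossref works endpoint)."""
    return "https://api.crossref.org/works/" + quote(doi, safe="/")


def number_citations(text):
    """Replace each marker with [n] and return the new text plus the DOI list."""
    dois = find_cited_dois(text)

    def replace(match):
        return f"[{dois.index(match.group(1)) + 1}]"

    return CITE_PATTERN.sub(replace, text), dois


if __name__ == "__main__":
    body, dois = number_citations("See ⌈cite⌉10.1021/acsomega.8b03005⌈/cite⌉ for details.")
    print(body)
    for doi in dois:
        print(crossref_url(doi))
```

Fetching each URL and formatting the returned JSON into a bibliographic entry is omitted, since that step depends on the response fields one chooses to display.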

References

  1. A. Barba, S. Dominguez, C. Cobas, D.P. Martinsen, C. Romain, H.S. Rzepa, and F. Seoane, "Workflows Allowing Creation of Journal Article Supporting Information and Findable, Accessible, Interoperable, and Reusable (FAIR)-Enabled Publication of Spectroscopic Data", ACS Omega, vol. 4, pp. 3280-3286, 2019. https://doi.org/10.1021/acsomega.8b03005

A Non-nitrogen Containing Morpholine Isostere; an application of FAIR data principles.

Sunday, August 4th, 2019

In the pipeline reports on an intriguing new ring system acting as an isostere for morpholine. I was interested in how the conformation of this ring system might be rationalised electronically and so I delved into the article.[1] Here I recount what I found.

The basis for the isosteric claim can be found in the conformational analysis reported in Figure 4. The N-diazine ring in A is found to be co-planar with the morpholine ring as shown in the diagram (the dihedral measured is indicated using boldened bonds). Compound D contains the cyclopropanated variation, which is postulated as isosteric at least in part on the basis that it too is co-planar, with a dihedral angle of ~170° (with a second slightly higher minimum having a value of ~10°) and hence might be capable of acting as an isostere to the morpholine.

Figure 4. Dihedral scanning plots for various pyrimidine fragments using DFT/6-31G**.

I was intrigued as to why a saturated sp3-carbon would exhibit the same behaviour as a nitrogen centre. The latter has a lone pair oriented at 90° to the aryl ring, the resulting conjugation favouring co-planarity. But how would that sp3-carbon centre do the same? Time to do some calculations, and hence on to the supporting information (SI) for the article in an effort to get a starting base, initially to replicate the calculation results shown above. I start by focusing on the value quoted above, ~170°. Note the ~, since I obtained that visually from the figure. In fact it must remain “~”, since no further geometrical information is available from the SI. Quickly I also realized that replication must also remain elusive, since the caption to the figure is the only information on the calculations which were used to produce Figure 4. DFT, you see, is a generic term, standing for density functional theory. But in that theory the functional has to be defined; there are possibly about 500 different functionals that have been used in the literature. We do get a citation to the method (ref 25 in the article) which is to the commercial Jaguar program system. Herein lies a problem. Programs implement what might be described as default calculation options and quite possibly it is the default option that has been invoked here. A licensed user of Jaguar can probably find out what that default option is and hence can expand DFT to the actual functional used. But unfortunately I am not a licensed user, and even if the default option could be tracked down to an online manual somewhere, there is no certainty it was actually used to produce Figure 4.

So here I make my first plea. The SI for this article is not fully FAIR! In this instance it contains no accessible data that can be used to replicate the results reported. At a minimum, if DFT based results are going to be reported, then FAIR data containing the input(s) used for the calculation and one or more outputs should be made available. Perhaps then if one is lucky, those outputs might declare any default assumptions, such as the precise DFT method used.

I therefore went ahead with my own calculations, deciding to use B3LYP (being my declared DFT functional) with the 6-311++G(d,p) basis, an improvement on the 6-31G** (≡ 6-31G(d,p)) basis set declared as used for Figure 4. I did two variations, one without a D3+BJ dispersion attraction correction and one with. It is now recognised that such corrections can be important, even for small molecules. Because we do not know the nature of the DFT method used in the article itself, we do not know if it incorporates such corrections or not. The results are shown below, with a FAIR data location of DOI: 10.14469/hpc/5990

The two minima from the new B3LYP+D3BJ/6-311++G(d,p) calculation have dihedral values of -139.4° and +55.6° with dispersion included and essentially the same without, indicating that dispersion has only a small effect on the conformational geometry (the top trace above is without dispersion). These values are different from the ones inferred from Figure 4, being closer to gauche than to co-planar. These new values can be rationalised as allowing good overlap between a C-C bond of the cyclopropane and the π-system of the aromatic ring (dihedrals 75 and 88° for the two minima vs 90° for the overlap of the N-lone pair). The values for the conformation implied in Figure 4 are 41 and 55°, which give less favourable hyperconjugative overlap. The rotational barriers are ~18 and 25 kJ/mol, rather higher than those obtained visually from Figure 4, but still indicating a relatively flexible molecule which can probably adopt a relatively low energy co-planar isosteric conformation in the correct environment.
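Since the whole argument rests on dihedral angles, it may help to show how such a value is obtained once atom coordinates are available. The following is a minimal sketch using the standard atan2 formula for the signed dihedral defined by four points. The coordinates in the usage example are invented test points, not atoms from any of the molecules discussed, and the helper names are my own.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle p0-p1-p2-p3 in degrees, in the range -180 to 180."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal to the plane containing p0, p1, p2
    n2 = _cross(b2, b3)  # normal to the plane containing p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees (360° periodic)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


if __name__ == "__main__":
    # Invented points: the fourth point is rotated by a quarter turn about the z axis.
    print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
    # Angles either side of 180° are closer than their raw values suggest.
    print(angular_difference(170.0, -170.0))
```

The second helper is there because dihedrals are periodic, so two values should be compared around the circle rather than by simple subtraction.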

There is however some more information about these molecules reported in the article,[1] being a small molecule crystal structure for a related compound 12b (quoted for Figure 5 as CCDC 1864315). To quote, “the small molecule crystal structure of 12b confirms coplanarity in the solid phase”. However, the dihedral angle for this crystal structure is not given either in the text or the SI. A search of the CCDC database reveals no entries in the May 2019 database (the data is clearly too new to have been indexed there) and unfortunately the article SI contains no atom coordinates. The calculations reported in Figure 4 and the ones in the plot above are of course for an isolated molecule. Once I manage to acquire the crystal coordinates, it should be possible to see if there are any intermolecular interactions which are a factor in explaining why the geometries of the isolated molecule and its crystal form might differ in co-planarity.

Until then I conclude that the inclusion of FAIR data pertaining to this co-planarity in the article itself would certainly have helped to resolve the origins of the difference between the geometries reported in the article and my own calculations reported here. It may still be, of course, that functionals other than B3LYP+D3BJ reproduce the crystal structure better. Nonetheless, I think there is a more rational electronic basis for the conformation of the N-aryl ring in the isolated molecule based on the dihedral angles reported here, whilst an attempt to replicate the values reported in the article itself[1] based on further information would also be useful.


Reprinted with permission from [1]. Copyright 2019 American Chemical Society.

In this article[2], making quite some waves, you can find a fascinating discussion of the perils of using “packaged” programs in which many “defaults” are allowed to persist by the user. In this particular case, the default was the size of the integration grid in the DFT calculation. This article makes the very alarming case that for many years the default size in at least one popular DFT program was not good enough to ensure that resulting calculated free energies were sufficiently accurate to sustain many conclusions for regio- and stereoselectivity out there in the wild. An awful lot of computational chemistry derived results might be wrong! You may only be slightly re-assured that the default grid sizes used for the calculations reported in this blog, at least for the last five years or so, are suitably larger than the one critiqued in this article.

From which you find a keyword integral=(acc2e=14,grid=ultrafine) defined, which ensures that not only is the integration grid declared, but also that the integral accuracy is pumped up beyond the program defaults of 12. We have found that this is very often helpful for calculation of frequencies. 

New entries, not yet available in the distributed database, can be accessed as e.g. https://www.ccdc.cam.ac.uk/structures/search?pid=ccdc:1864315. The dihedral is 167.7° for one conformation of compound 12b. This has now also been assigned DOI: 10.5517/ccdc.csd.cc20kz6s. The coordinates obtained from this source correspond to an absolute stereochemistry of 1R,6S, or 12a in the article.

 

References

  1. H. Hobbs, G. Bravi, I. Campbell, M. Convery, H. Davies, G. Inglis, S. Pal, S. Peace, J. Redmond, and D. Summers, "Discovery of 3-Oxabicyclo[4.1.0]heptane, a Non-nitrogen Containing Morpholine Isostere, and Its Application in Novel Inhibitors of the PI3K-AKT-mTOR Pathway", Journal of Medicinal Chemistry, vol. 62, pp. 6972-6984, 2019. https://doi.org/10.1021/acs.jmedchem.9b00348
  2. A.N. Bootsma, and S. Wheeler, "Popular Integration Grids Can Result in Large Errors in DFT-Computed Free Energies", 2019. https://doi.org/10.26434/chemrxiv.8864204.v1