
Raw data and the evolution of crystallographic FAIR data. Journals, processed and raw structure data.

Monday, March 28th, 2022

In my previous post on the topic, I introduced the concept that data can come in several forms, most commonly as “raw” or primary data and as a “processed” version of this data that has added value. In crystallography, the chemist is interested in the processed version, carried by a CIF file. On the rare occasions when a query arises about this processed component, it can, in principle at least, be resolved by taking a look at the original raw data, expressed as diffraction images. I established, with much-appreciated help from CCDC, that since 2016 around 65 datasets in the CSD (Cambridge Structural Database) have appeared with such associated raw data. The problem is how to easily reconcile the two sets of data (the raw data is not stored on the CSD), and one way of doing this is via the metadata associated with the datasets. In turn, if this metadata is suitably registered, one can query the metadata store for such associations, as was illustrated in the previous post on the topic. Here I explore the metadata records for five of these 65 sets to find out their properties, selected to illustrate the five data repositories that thus far host such data for compounds in the CSD database.

| Raw data repository | Raw data DOI | Raw data→CSD? | CSD→raw data? | ⇐Journal⇒ |
| --- | --- | --- | --- | --- |
| Zenodo | 10.5281/zenodo.4271549 | No | No | 10.1039/C6RA28567H |
| Imperial College research data repository | 10.14469/hpc/2298 | Yes | Yes | 10.1021/acsomega.7b00482 |
| RepoD, a Harvard Dataverse instance | 10.18150/repod.6628285 | No | No | 10.1021/acs.cgd.0c01252 |
| Cambridge University repository | 10.17863/CAM.21968 | No | No | 10.1016/j.inoche.2018.08.024 |
| ISIS neutron and muon source data journal | 10.5286/ISIS.E.RB1620465 | No | No | 10.1039/D0CC02418J |

Ideally, one is looking for links between the raw and processed data, expressed in the metadata in both directions. As you can see from the above, these bidirectional links are present in only one of the five sets. More commonly, both the raw and the processed data contain links to the journal article where the data is discussed. Links from the journal article back to the raw data are much rarer, although such links are slightly more likely to exist from the journal to the processed data. If you click on the link in any of the last three columns, a copy of the metadata will download for you to inspect; there you can verify whether the assertions made above are correct.

What the metadata records above demonstrate is a very small-scale so-called PID graph (DOI: 10.5438/jwvf-8a66),[1] where each DOI is a node in that graph and, if a connection exists, it is shown by a line connecting the nodes. The PID graph can be extended to include a third type of node, the journal article, and then it starts to get interesting! I will investigate whether I can generate the PID graph for the above (see the sketch below), although be prepared: it will not (yet) contain very many lines between nodes!
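To make this concrete, below is a minimal sketch (in Python, using the public DataCite REST API together with the networkx library; both the library choice and the field handling are my own assumptions, not anything from the DataCite report) of how the PID graph for the five datasets in the table might be assembled from their registered relatedIdentifiers.

```python
import requests
import networkx as nx

# Build a tiny PID graph: each DOI is a node, each relatedIdentifier a line.
# Journal DOIs registered with CrossRef will only ever appear as targets here.
dataset_dois = [
    "10.5281/zenodo.4271549", "10.14469/hpc/2298", "10.18150/repod.6628285",
    "10.17863/CAM.21968", "10.5286/ISIS.E.RB1620465",
]
graph = nx.DiGraph()
for doi in dataset_dois:
    graph.add_node(doi)
    resp = requests.get(f"https://api.datacite.org/dois/{doi}")
    if not resp.ok:
        continue  # skip anything not registered with DataCite
    attributes = resp.json()["data"]["attributes"]
    for rel in attributes.get("relatedIdentifiers") or []:
        if rel.get("relatedIdentifierType") == "DOI":
            graph.add_edge(doi, rel["relatedIdentifier"],
                           relation=rel.get("relationType"))
print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "lines")
```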

References

  1. M. Fenner, and A. Aryani, "Introducing the PID Graph", 2019. https://doi.org/10.5438/jwvf-8a66

Raw data: the evolution of FAIR data and crystallography.

Tuesday, March 1st, 2022

Scientific data in chemistry has come a long way in the last few decades. Originally entangled in scientific articles in the form of tables of numbers or diagrams, it was (partially) disentangled into supporting information when journals became electronic in the late 1990s. The next phase was the introduction of data repositories in the early noughties. Now associated with innovative commercial companies such as Figshare and later the non-commercial Zenodo, such repositories have also gradually spread into institutional forms, such as e.g. the earlier SPECTRa project of 2006,[1] and are still evolving.[2] Perhaps the best known, and certainly the oldest, example of curated data in chemistry is the CCDC (Cambridge Crystallographic Data Centre) CSD (Cambridge Structural Database), which has been operating for more than 55 years now. Curation is the important context here, since there you will find crystal diffraction data which has been refined into a structural model, firstly by the authors reporting the structure and then by the CSD, who amongst other operations validate the associated data using a utility called CheckCIF.[3] What is perhaps not realised by most users of this data source is that the original or “raw” data, as obtained from an X-ray diffractometer and from which the CSD data is derived, is not actually available from the CSD. This primary form of crystallographic data is the topic of this post.

Most chemical data now emerges from an instrument, where it is already partially processed internally before being offered. Such raw/primary data is perhaps best known in the form of NMR information, where it is offered by the instrument as an FID or free induction decay. Its transformation from this form into what all chemists know as a spectrum requires further software processing, including other operations such as peak integration. It is this processed spectrum that has traditionally been offered as part of a scientific article (often only in visual, or peak-listed, form) and rarely has the FID form been made available to anyone interested. It is important to state that the transformation to a spectrum also incurs a significant loss of data. An interesting project led by the editors of two organic chemistry journals[4],[5] had the aim of encouraging the submission of FAIR data to the journal, although in fact the project concentrated on the submission of raw NMR data. As it turned out, only a very small proportion of all the submissions to these journals over the period of a year actually provided such data (~113 datasets), in the form of ZIP archives containing anywhere between one and ~100 actual sets of raw NMR data per archive. One should make the point that raw data is not necessarily FAIR data. The latter requires rich metadata describing the data to become findable, accessible, interoperable and reusable (FAIR), and such metadata was not actually generated as part of this project.
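As an aside, the FID-to-spectrum transformation mentioned above is, at its core, a Fourier transform followed by phasing and peak analysis. The sketch below uses entirely synthetic data (all numbers are illustrative) and is not the processing pipeline of any particular spectrometer software:

```python
import numpy as np

# Synthetic FID: two decaying complex oscillations standing in for resonances.
sweep_width = 5000.0                    # spectral width in Hz (illustrative)
n_points = 4096
t = np.arange(n_points) / sweep_width   # acquisition time axis in seconds
fid = (np.exp(2j * np.pi * 440.0 * t - t / 0.5)
       + 0.6 * np.exp(2j * np.pi * 1205.0 * t - t / 0.3))

# The processing step chemists rely on: transform to the frequency domain.
spectrum = np.fft.fftshift(np.fft.fft(fid))
freqs = np.fft.fftshift(np.fft.fftfreq(n_points, d=1.0 / sweep_width))

# Usually only the (phased) real part is plotted or peak-listed, one reason
# the published spectrum carries less information than the raw FID.
strongest = freqs[np.argmax(np.abs(spectrum))]
print(f"strongest signal near {strongest:.0f} Hz")
```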

Here I will take a closer look at potentially FAIR raw data in the area of crystallography. This project is perhaps less well known than the previous one,[4],[5] hence the present post strives to make it better known. As with NMR, a useful starting point is to describe the various stages in the lifecycle of crystal data.

  1. A crystal is mounted in the diffractometer and X-ray diffraction images are recorded. These are considered the raw data and, as with most instruments, their form is determined both by the instrument itself and by the software used to start the refinement process into a molecular structure.
  2. This refinement then assigns a space group to the data and derives so-called structure factors or hkl data. This data can now be captured in a much more standard form known as a CIF (crystallographic information file), which is nowadays the format that is deposited with the CSD.
  3. A reduced form of the CIF file, containing a sub-set of the information but lacking the hkl data, is much the more common, and was the form originally sent to the CSD until a few years ago.
  4. Very often an image of the resulting model for the molecular structure is also included. Whilst it is based on the data in the CIF file, it does not contain reusable data as such and is considered as being made available only for human use and perception.

It is form 1 that is missing from the CSD datasets. Because it can be quite large (~0.5-9 Gbyte), the current recommendation is that it is stored not on the CSD but on local data repositories. So now we see a need to establish, if possible, bidirectional links between type 1 and types 2-4, and to identify which characteristics of FAIR each has. Primarily the F (findable) of FAIR will be explored here, by illustrating some searches for this data based on the metadata registered for it with DataCite.

  1. https://commons.datacite.org/?query=relatedIdentifiers.relatedIdentifier:10.5517/ccdc.csd*  (72 works)
    This simple search identifies any entry in any repository which cites, in its metadata record, the DOI for an entry in the CSD; such DOIs take the form 10.5517/ccdc.csd*, a prefix common to all entries.
  2. https://commons.datacite.org/?query=relatedIdentifiers.relatedIdentifier:*10.5517/ccdc.csd*+AND+(media.media_type:chemical/x-cif+OR+media.media_type:application/x-7z-compressed+OR+media.media_type:application/gzip+OR+media.media_type:application/zip) (8 works).
    This further constrains search 1 by requiring one of four media types to ALSO be present in the repository metadata record. These types are the standard compressed archive formats in which the raw crystal data is likely to be stored, along with the CIF media type that is clearly associated with crystal structure data. The Boolean OR indicates that any one of them can be present! One can now be a little more certain that these entries contain crystal structure data; that we cannot be absolutely certain is clearly a current deficiency of the metadata present for the entries!
  3. https://commons.datacite.org/?query=identifier:*10.5517/ccdc.csd*+AND+(relatedIdentifiers.relatedIdentifier:*10.14469/hpc/*) (7 works)
    The 8 works from search 2 originate from a repository with the prefix 10.14469/hpc/*, so now one can reverse the direction and ask how many are referenced in the metadata for each published item in the CSD. Around 327,064 entries in the CSD currently have a persistent DOI identifier associated with them, all starting with 10.5517/ccdc.csd (this is however only around 25% of the total depositions there), and so one can search for how many of these also reference a related identifier at 10.14469/hpc/*. Seven of them show up there.
  4. Also in the CSD metadata records is an item with the attribute relationType=”IsDerivedFrom”, carrying the meaning that the CSD data is itself derived from (raw) data held elsewhere. This information is captured during the deposition process with CCDC.

    It should be possible to incorporate this property into a search as above, but it is currently not working. When that is sorted, I will add it as search 4 here. This will give more idea of how many datasets in the CSD are actually associated with additional raw data (CCDC tell me it is around 65). A sketch of running such a metadata query programmatically follows below.
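The browser searches above can equally well be run by a machine against the DataCite REST API. Below is a minimal sketch; the query string is that of search 1, while the endpoint and pagination parameters are standard DataCite ones used here on my own initiative:

```python
import requests

# Reproduce search 1 programmatically: works whose metadata cites a CSD
# entry DOI (all such DOIs share the prefix 10.5517/ccdc.csd).
query = "relatedIdentifiers.relatedIdentifier:10.5517/ccdc.csd*"
resp = requests.get("https://api.datacite.org/dois",
                    params={"query": query, "page[size]": 25})
resp.raise_for_status()
payload = resp.json()
print("total works:", payload["meta"]["total"])
for record in payload["data"]:
    attrs = record["attributes"]
    title = (attrs.get("titles") or [{}])[0].get("title", "(no title)")
    print(attrs["doi"], "-", title)
```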

So these projects aiming to capture data from chemical instrumentation are just starting to reveal the potential of this modern system for storing data in two or more locations and for reconciling the various forms of this data, from raw form to derived or processed data. The interested user can then use whichever form is most relevant to their needs, and having found one form can then trace back to the other form(s). We might anticipate many developments in this area in the near future.


One has to expand the archive to find out how many actual raw datasets are inside, which is not ideal. 


This post has DOI: 10.14469/hpc/10177


References

  1. J. Downing, P. Murray-Rust, A.P. Tonge, P. Morgan, H.S. Rzepa, F. Cotterill, N. Day, and M.J. Harvey, "SPECTRa: The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories", Journal of Chemical Information and Modeling, vol. 48, pp. 1571-1581, 2008. https://doi.org/10.1021/ci7004737
  2. M.J. Harvey, A. McLean, and H.S. Rzepa, "A metadata-driven approach to data repository design", Journal of Cheminformatics, vol. 9, 2017. https://doi.org/10.1186/s13321-017-0190-6
  3. A.L. Spek, "Structure validation in chemical crystallography", Acta Crystallographica Section D Biological Crystallography, vol. 65, pp. 148-155, 2009. https://doi.org/10.1107/s090744490804362x
  4. A.M. Hunter, E.M. Carreira, and S.J. Miller, "Encouraging Submission of FAIR Data at The Journal of Organic Chemistry and Organic Letters", The Journal of Organic Chemistry, vol. 85, pp. 1773-1774, 2020. https://doi.org/10.1021/acs.joc.0c00248
  5. A.M. Hunter, E.M. Carreira, and S.J. Miller, "Encouraging Submission of FAIR Data at The Journal of Organic Chemistry and Organic Letters", Organic Letters, vol. 22, pp. 1231-1232, 2020. https://doi.org/10.1021/acs.orglett.0c00383

Database or data repository? – A brief and very selective history of data management in chemistry.

Wednesday, January 26th, 2022

Way back in the late 1980s or so, research groups in chemistry started to replace the filing of their paper-based research data by storing it in an easily retrievable digital form. This required a computer database, and initially these were accessible only on specific dedicated computers in the laboratory. From the 1990s onwards these gradually became accessible online, so that more than one person could use them in different locations. At least where I worked, the infrastructure to set up such databases was mostly not then available as part of the standard research provision and so had to be installed and maintained by the group itself. The database software took many different forms and it was not uncommon for each group in a department to come up with a different solution that suited its needs best. The result was a proliferation of largely non-interoperable solutions which did not communicate with each other. Each database had to be searched locally, and there could be ten or more such resources in a department. The knowledge of how the system operated also often resided in just one person and tended to evaporate when that guru left the group.

After the millennium, two newcomers started to appear, one called an ELN (electronic laboratory notebook) and the second a data repository. The first was a heavily customised database containing research data as obtained from instruments, computers, images/video, chemical structure drawings etc. ELNs, even to this day, have limitations of interoperability with other ELNs, and the contents of an ELN are often closed, requiring authentication credentials to access. The data repository also started to appear in chemistry around this period. Even in its early incarnations, it could be associated with an ELN “front end” as part of the data pipeline; an early example of this coupling is described here.[1] Another key phrase that became associated with repositories starting around 2014 was the concept of FAIR, including ideas such as the Findability (discoverability) and Interoperability of data, a theme often explored and illustrated on this blog.

The last seventeen years have seen organisations such as funding agencies and publishers increasingly mandating the use of such data management methods, using either a repository on its own or a combination of an ELN and repository, as routine operations in research activity and publication processes. The close coupling of an ELN and repository is still, however, uncommon.

A colleague recently alerted me to a computational chemistry repository first launched in 2014: www.iochem-bd.org. Reading the about text, I found these statements:

  • Chem-BD is a digital repository aimed to manage and store Computational Chemistry files.
  • Goals: Build a distributed database of computational chemistry results: reduce size and increase value.
  • Set a common data standard among all quantum chemistry legacy formats (XML – CML[2])

So this is both a database and a data repository, as well as espousing a commendable common data standard![2] I decided to explore the first two aspects here using this resource as an example.

  • Whilst the absolute distinction between the two types can be blurry, the crucial difference is that a database functions on curation via a structured index of the data, whilst a repository aspires to FAIR attributes primarily through its metadata as exposed by registration (metadata being data that describes the data).
  • A database holds this data index locally, and the Findability of the data is associated purely with the functionality of the database. The data structures are defined by a database schema, describing in detail all the terms indexed (a key and its value) and searched using the values of these key pairs. This schema is unlikely to be exactly the same as that of e.g. databases on related topics, largely because the database is self-contained and self-consistent.
  • A data repository also uses a schema (DOI: 10.14454/3w3z-sa82 and ref.[3]) to express the key pairs, but this time it is expressed as metadata. This metadata is registered externally to the repository using a registration agency.[3] The metadata for each deposited object is assigned a persistent identifier known as a DOI. Although the metadata might be indexed and searchable locally, it must also be capable of being searched in aggregated/federated form using services provided by registration or other agencies. This independence of the metadata is part of those FAIR criteria.
  • Whereas a database can be very finely grained in order to describe individual properties of an object, repository metadata tends to be more coarsely grained to describe the object as a whole, to place it in context and to impart provenance.
  • Both databases and repositories can have what is called an API (application programmer interface) to allow machine access (the A of FAIR) to the contents. Accessing the former would normally require bespoke code to be written, and possibly authentication credentials, whereas the information needed to access repository-held data is provided via the registered metadata (which does not normally require credentials). Access to the repository may also require code, but if the metadata is carefully standardised by adherence to the schema, the code can be made more general than that required for a database.
  • A typical entry in the www.iochem-bd.org repository has a DOI of 10.19061/iochem-bd-4-36
  • This DOI is registered with the CrossRef agency, the one normally used for registering journal articles, rather than DataCite, which is used for registering data and other research objects. The metadata for this DOI can be viewed using the resolution service https://api.crossref.org/works/10.19061/iochem-bd-4-36/transform/application/vnd.crossref.unixsd+xml and shows that it largely contains the bibliographic information typical of a journal article (a minimal retrieval sketch follows this list). So in this sense it is certainly a repository, but one using a metadata schema that is more frequently used for journal articles than for datasets.
  • The CrossRef metadata record also has an item <resource>https://www.iochem-bd.org/handle/10/235025</resource> which points to the so-called landing page for that item, but information about the properties of the actual data itself must be instead obtained directly from the repository. 
  • Because the metadata describing the data is held only at this repository and not elsewhere (a local metadata record), it can only be queried locally; the query cannot be made against the aggregated metadata provided by the registration agency. A machine query would have to be constructed by coding a suitable request using the API provided for the database aspect of this repository.
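To make the contrast concrete, here is a minimal sketch of what a machine can get from that CrossRef record, using CrossRef's public JSON REST API rather than the XML transform service quoted above (the field names are CrossRef's; the choice of what to print is mine):

```python
import requests

# Fetch the CrossRef metadata for an ioChem-BD entry and inspect what it
# offers a machine: largely bibliographic fields plus a landing-page URL.
doi = "10.19061/iochem-bd-4-36"
resp = requests.get(f"https://api.crossref.org/works/{doi}")
resp.raise_for_status()
message = resp.json()["message"]
print("type:        ", message.get("type"))
print("landing page:", message.get("resource", {}).get("primary", {}).get("URL"))
# Conspicuously absent: any file manifest, media types or subject metadata
# describing the data itself; for those one must code against the
# repository's own API.
```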

This example has served to highlight just a few of the often quite subtle distinctions between e.g. a database and a data repository, and that some examples can indeed be both. It also highlights that repositories can have the attributes of FAIR, which are themselves driven by asking “what could a machine do to obtain data?” rather than what a human could achieve by browsing. So another question that arises when evaluating the characteristics of a repository is whether each item held there has a FAIR-enabling metadata record describing the data, a record which is registered in a manner that can be aggregated and hence used to find and access content across multiple independent repositories.


This post has DOI 10.14469/hpc/10043


Indeed in that era, few online/Internet infrastructures were available as part of departmental resources. See also here.

In this last regard, I note a workshop devoted largely to such interoperability and machine access in chemistry coming up soon: https://www.cecam.org/workshop-details/1165

The CrossRef schema is not referenced using an assigned DOI: data.crossref.org/reports/help/schema_doc/5.3.1/

An example can be seen at DOI: 10.14469/hpc/10059. Here, invoking a hyperlink based purely on the data DOI and the required data media type in turn calls code (Javascript) which retrieves the metadata held for that DOI and parses it to identify whether it indicates the presence of a file manifest. If it does, it identifies the type of manifest (ORE in this case) and the media types the manifest points to, and finally uses that manifest to retrieve data filtered by media type and pipe it into a visualiser (JSmol). In this case the endpoint is visualisation, but the data could also be e.g. piped into an AI/ML program for analysis. Here only one instance of data is machine-retrieved, but in principle it could be a multitude of data files obtained from a multitude of different locations, based on a multitude of criteria as filtered by suitable searches of registered metadata.[4]


References

  1. M.J. Harvey, N.J. Mason, and H.S. Rzepa, "Digital Data Repositories in Chemistry and Their Integration with Journals and Electronic Notebooks", Journal of Chemical Information and Modeling, vol. 54, pp. 2627-2635, 2014. https://doi.org/10.1021/ci500302p
  2. P. Murray-Rust, and H.S. Rzepa, "Chemical Markup, XML, and the Worldwide Web. 1. Basic Principles", Journal of Chemical Information and Computer Sciences, vol. 39, pp. 928-942, 1999. https://doi.org/10.1021/ci990052b
  3. H. Cousijn, T. Habermann, E. Krznarich, and A. Meadows, "Beyond data: Sharing related research outputs to make data reusable", Learned Publishing, vol. 35, pp. 75-80, 2022. https://doi.org/10.1002/leap.1429
  4. H.S. Rzepa, and S. Kuhn, "A data‐oriented approach to making new molecules as a student experiment: artificial intelligence‐enabling FAIR publication of NMR data for organic esters", Magnetic Resonance in Chemistry, vol. 60, pp. 93-103, 2021. https://doi.org/10.1002/mrc.5186

Quantum chemistry interoperability (library): another step towards FAIR data.

Saturday, January 1st, 2022

To be FAIR, data has to be not only Findable and Accessible, but straightforwardly Interoperable. One of the best examples of interoperability in chemistry comes from the domain of quantum chemistry, which strives to describe a molecule by its electron density distribution, from which many interesting properties can then be computed. The process is split into two parts:

  1. Computation of the wavefunction. This can be a very compute-intensive process, which can take quite a few days even using 64 or more processors in parallel, and requires highly specialised programs.
  2. Analysis of the wavefunction. The range of properties that can be computed is impressively large, but again this requires specialised algorithms and programs.

So one can see that the need to pass the wavefunction data computed in process 1 into the analysis of process 2, i.e. to Interoperate, is crucial. This is normally achieved using intermediate data files, and clearly the semantics of the data in these files must be perfectly communicated between the two processes.

With this introduction over, my attention was drawn to a recent post on the CCL (Computational Chemistry List, http://www.ccl.net), a venerable resource that has been running for many decades and where many aspects of computational chemistry are discussed. One recent thread relates to quantum chemistry interoperability: http://www.ccl.net/cgi-bin/ccl/day-index.cgi?2021+12+30 where many interesting points were made. I highlight just a few here (but urge you to read the entire thread).

  1. The first, by Mike Frisch (http://www.ccl.net/cgi-bin/ccl/message-new?2021+12+30+003), introduces two interoperability formats (the binary array file, in raw and Fortran flavours) along with a library of routines in both Fortran and Python which facilitate interoperability between wavefunction-calculating and post-processing analysis programs. The advantages include “Like the fchk file, this is a self-defining file, but it is binary so that full precision can be retained and reading/writing the file is much faster”, and the format is described at https://gaussian.com/interfacing/. Output in this format is controlled by the keyword Output=MatrixElement or by use of environment variables. As a long-time user of an older interoperability mechanism, the so-called WFN and WFX formats for use with programs such as AIMALL and MultiWFN, I have often set this keyword to e.g. Output=wfn, and when generated, such files are routinely included in our FAIR data publications, which are often mentioned both in this blog and in the journal articles we write. If you read the post by Mike, you will understand both the deficiencies of these earlier formats and how the binary array file is an important advance.
    • I make one “user interface plea” here, in the hope that Gaussian might be able to do something about it. By default, the output keyword is not set and so no wavefunction data is produced other than a binary .CHK file. This in turn requires an extra step to convert it into the interoperable non-binary .FCHK file. When needing a WFN file, very often I forget to set the output keyword to a value and have to re-run the program to obtain it. So my plea is to consider setting the program defaults to write out some form of the binary array file when the job completes. There are additional flags that can be set for specialised applications, but assuming a default option would be practical, it would be good to have.
  2. The second email is a response to Mike’s post by Tian Lu, who is well known for his amazing “swiss army knife” program MultiWFN, which can compute a large variety of molecular properties using wavefunction files. He had in fact proposed his own interoperability format, called MWFN (documented here[1]), to eliminate many of the recognised issues with the older WFN, FCHK and WFX formats. Currently this particular format is not yet widely supported by wavefunction-computing programs such as e.g. Gaussian, but perhaps Output=mwfn will come one day!
  3. A later email describes the TREXIO project (https://trex-coe.github.io/trexio/ and specifically https://trex-coe.github.io/trexio/trex.html), in which a metadata group is specifically identified because “we need to give the possibility to the users to store some metadata inside the files.” In fact, metadata is also useful for registration with metadata agencies.

This increasing discussion of Interoperability in Quantum Chemistry has to be warmly welcomed. It directly feeds into FAIR data and may even set a trend for other areas of chemistry, such as e.g. NMR spectroscopy!


I have now learnt that inserting one of the environment variables below as per

export GAUSS_OMDEF=fortranbinaryarray.faf
or
export GAUSS_ORDEF=rawbinaryarray.baf

into job scripts will achieve this (proposed media types chemical/x-rawbinaryarray for .baf and chemical/x-fortranbinaryarray for .faf).

Currently doing both at the same time is not supported (G16 C.01), so the second file can instead be generated from a .chk file using post-processing commands appended to the job script:

formchk -raw mychk.chk rawbinaryarray.baf
or
formchk -mat mychk.chk fortranbinaryarray.faf


This post has DOI: 10.14469/hpc/10043


References

  1. T. Lu, and Q. Chen, "mwfn: A Strict, Concise and Extensible Format for Electronic Wavefunction Storage and Exchange", 2021. https://doi.org/10.26434/chemrxiv-2021-lt04f-v5

First came Molnupiravir – now there is Paxlovid as a SARS-CoV-2 protease inhibitor. An NCI analysis of the ligand.

Saturday, November 13th, 2021

Earlier this year, Molnupiravir hit the headlines as a promising antiviral drug. It is now followed by Paxlovid, the first small molecule aimed by design at the SARS-CoV-2 main protease, and which is reported as greatly reducing the risk of hospitalization or death when given within three days of symptoms appearing in high-risk patients.

The Wikipedia page (first created in 2021) will display a pretty good JSmol 3D model of this, the coordinates being generated automatically on the fly from a SMILES string, which specifies only which atoms are connected in the structure by bonds. Given that the structure of this molecule as embedded in the SARS-CoV-2 main protease[1] has been determined (and can be viewed here), I thought I might display those coordinates as an alternative to the Wikipedia/JSmol generated structure.
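For anyone curious what “coordinates generated automatically on the fly from a SMILES string” involves, below is a minimal sketch using RDKit. The actual toolchain behind the Wikipedia/JSmol page is not specified there, and aspirin is used as a stand-in molecule rather than risking a mistyped Paxlovid SMILES:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# SMILES encodes connectivity only; 3D coordinates must be generated from it.
# Aspirin stands in for Paxlovid here (an illustrative choice only).
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
mol = Chem.AddHs(mol)                      # explicit hydrogens for 3D
AllChem.EmbedMolecule(mol, randomSeed=42)  # distance-geometry 3D embedding
AllChem.MMFFOptimizeMolecule(mol)          # quick force-field clean-up
print(Chem.MolToMolBlock(mol))             # coordinates, viewable in e.g. JSmol
```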


I extracted the ligand from the PDB file and then added hydrogens manually to obtain the above result. There are three noteworthy points about these representations:

  1. A mystery concerns the nominal C≡N group on the top right, which displays an angle at the carbon of 117°. A cyano group is of course linear (180°). This is not a defect of the crystal structure determination, but an indication of a rather stronger interaction occurring (as indeed noted[1]). The distance between the carbon of the cyano group and an adjacent sulfur is 1.814Å, which indicates that a covalent bond has formed to the cyano carbon. The nitrogen of the erstwhile cyano group is 3.013Å away from an adjacent NH group, which suggests it is stabilised by a hydrogen bond.
  2. Crystal structure searching for units with S…C…N in which the N has only one bond reveals zero hits, but searches for S…C…NH reveal nine hits, with S…C distances in the range 1.74-1.80Å and C…N distances in the region 1.25-1.27Å. The reported C…N distance here is 1.251Å, confirming that when bound to the protein, the cyano group is replaced by an S-C=NH group, and hence is clearly an important component of the mode of action of Paxlovid.
  3. The conformation of Paxlovid is in one respect not fully represented by the Wikipedia diagram, shown below, which implies that the t-butyl group (on the left) is well separated from the pyrrolidinone ring system at the right of the molecule.

    In fact the two groups are adjacent, held in that conformation probably by a combination of weak dispersion forces and a contribution from the surrounding protein in the crystal structure. This is shown more graphically by the NCI (non-covalent-interaction) analysis below (DOI: 10.14469/hpc/9964), where the green areas in the region between the two groups (ringed in red) represent stabilising interactions between them. You might also spot other green/cyan regions indicating additional weak hydrogen bonds between C-H groups and oxygen!

PAXLOVID NCI analysis

There are only a small number of crystal structures of small molecules containing the S-C=NH motif. I will try to find out how common this is in protein-ligand structures.


There are many tools for performing this operation; I used the following procedure. I downloaded the structure file from the PDB (https://files.rcsb.org/download/7vh8.cif), opened it in CSD Mercury, selected the ligand (by identifying the CF3 group and clicking on one atom), inverted the selection so that everything but the ligand was selected, and using edit/structure deleted the selected atoms, leaving only the ligand.

Postscript

The cyanopyrrolidine group such as in Paxlovid is well known as a specific probe.[2],[3],[4] CovalentInDB is a comprehensive database facilitating the discovery of such covalent inhibitors[5] and is available here. There is also a program called DataWarrior that is potentially able to find such probes.

References

  1. Y. Zhao, C. Fang, Q. Zhang, R. Zhang, X. Zhao, Y. Duan, H. Wang, Y. Zhu, L. Feng, J. Zhao, M. Shao, X. Yang, L. Zhang, C. Peng, K. Yang, D. Ma, Z. Rao, and H. Yang, "Crystal structure of SARS-CoV-2 main protease in complex with protease inhibitor PF-07321332", Protein & Cell, vol. 13, pp. 689-693, 2021. https://doi.org/10.1007/s13238-021-00883-2
  2. N. Panyain, A. Godinat, A.R. Thawani, S. Lachiondo-Ortega, K. Mason, S. Elkhalifa, L.M. Smith, J.A. Harrigan, and E.W. Tate, "Activity-based protein profiling reveals deubiquitinase and aldehyde dehydrogenase targets of a cyanopyrrolidine probe", RSC Medicinal Chemistry, vol. 12, pp. 1935-1943, 2021. https://doi.org/10.1039/d1md00218j
  3. N. Panyain, A. Godinat, T. Lanyon-Hogg, S. Lachiondo-Ortega, E.J. Will, C. Soudy, M. Mondal, K. Mason, S. Elkhalifa, L.M. Smith, J.A. Harrigan, and E.W. Tate, "Discovery of a Potent and Selective Covalent Inhibitor and Activity-Based Probe for the Deubiquitylating Enzyme UCHL1, with Antifibrotic Activity", Journal of the American Chemical Society, vol. 142, pp. 12020-12026, 2020. https://doi.org/10.1021/jacs.0c04527
  4. C. Bashore, P. Jaishankar, N.J. Skelton, J. Fuhrmann, B.R. Hearn, P.S. Liu, A.R. Renslo, and E.C. Dueber, "Cyanopyrrolidine Inhibitors of Ubiquitin Specific Protease 7 Mediate Desulfhydration of the Active-Site Cysteine", ACS Chemical Biology, vol. 15, pp. 1392-1400, 2020. https://doi.org/10.1021/acschembio.0c00031
  5. H. Du, J. Gao, G. Weng, J. Ding, X. Chai, J. Pang, Y. Kang, D. Li, D. Cao, and T. Hou, "CovalentInDB: a comprehensive database facilitating the discovery of covalent inhibitors", Nucleic Acids Research, vol. 49, pp. D1122-D1129, 2020. https://doi.org/10.1093/nar/gkaa876

A comparison of searches based on metadata records from three (update: five) research repositories.

Tuesday, September 28th, 2021

In the previous blog post, I looked at the metadata records registered with DataCite for some chemical computational modelling files as published in three different repositories. Here I take it one stage further, by looking at how searches of the DataCite metadata store for three particular values of the metadata associated with this dataset compare.

Search 1: The metadata value of -1705.490787 is actually the Gibbs free energy computed for the molecule associated with the dataset, a molecule which featured in this blog post. https://commons.datacite.org/?query=*\-170* is an un-fielded search for the truncated string -170* (where * is a wildcard character and \ is said to “escape” the minus sign, since on its own a minus can also indicate a Boolean NOT operator), resulting in 70,918 works matching the query. From what we know about the dataset in question, this is a vast number of false positives. How can we reduce them?

Search 1a: https://commons.datacite.org/?query=subjects.subject:\-170* is a fielded search, specifying that the string must occur in the subject field (62 works) but this still has 57 false positives.

Search 1b: https://commons.datacite.org/?query=subjects.subject:\-1705.490787* (in fact a precision of -1705.4* is also sufficient) removes all the false positives (5 works). But are there any false negatives? In fact, for other reasons, we know that there are two works in the Figshare repository where the value of -1705.490787 appears in the keyword items on the landing page of e.g. 10.6084/m9.figshare.16685497 and is indexed and searchable locally, but does not appear in the registered metadata and hence is not included in the results of the above searches.

Search 2: A further, formally much stronger, constraint on the search is https://commons.datacite.org/?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:\-1705.490787* whereby a subjectScheme is added to search 1b, constrained to the value Gibbs_Energy. This now returns 3 works, two fewer than search 1b. There are two further false negatives because, as noted previously, the subjectScheme term is not defined in the Zenodo repository metadata record, where the missing two items are located.

Search 2a: https://commons.datacite.org/?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:*1705.490787*+AND+subjects.schemeUri:*goldbook* is even further constrained, to specify a Gibbs_Energy according to the IUPAC Gold Book definition.

Search 2b: https://commons.datacite.org/?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:*1705.490787*+AND+subjects.schemeUri:*goldbook*+AND+subjects.valueUri:*gaussian* is the highest level of constraint, implying not only that the term Gibbs_Energy is specified by the IUPAC Gold Book definition, but that its value is that determined by (in this example) the Gaussian implementation.

So to summarise what we have thus far established, we can successfully eliminate false positives by specifying a fielded search with a requirement that the field specifically relates to Gibbs_Energy. But because of omissions in the metadata records, we also have four false negatives resulting from doing this.

Search 3: https://commons.datacite.org/?query=subjects.subject:VELNVPXNOKVVTC-VJKZSTDTSA-N searches for another subject term, the InChI key for the molecule relating to the data (5 works). Here again, however, context for the string VELNVPXNOKVVTC-VJKZSTDTSA-N is missing, although the string is long enough to ensure it is unique. But we could go one step further.

Search 4: https://commons.datacite.org/?query=subjects.subjectScheme:inchikey+AND+subjects.subject:VELNVPXNOKVVTC-VJKZSTDTSA-N constrains the subject term to only those strings describing an InChI key (3 works). The loss of two works is again due to Zenodo not specifying the subjectScheme and to Figshare not even containing the InChI key in its metadata record.

Search 4a: https://commons.datacite.org/?query=subjects.subjectScheme:inchikey+AND+subjects.schemeUri:*inchi-trust*+AND+subjects.subject:VELNVPXNOKVVTC-VJKZSTDTSA-N constrains the inchikey further by specifying the authority for the scheme definition as the InChI Trust. 

Search 5: https://commons.datacite.org/?query=subjects.subject:InChI=1S/C25H39NO9* is search 3, but using the InChI string rather than the InChI key, and with the same results as before (5 works). Here, the string is deliberately truncated so as to match only on the molecular formula of the molecule.

Search 5a: https://commons.datacite.org/?query=subjects.subjectScheme:inchi+AND+subjects.subject:InChI=1S/C25H39NO9* is search 4, with the subjectScheme changed to inchi and the subject matching only the molecular formula component of an InChI (3 works).

Search 5b: https://commons.datacite.org/?query=subjects.subject:InChI=1S/C25H39NO9/c1-6-26-20-24-13-9-12-14\(31-2\)10-23\(29,16\(13\)17\(12\)33-4\)25\(26,30* truncates much less of the InChI string, extending the match into the molecular connection table. Notice how characters such as ( or ) have been escaped with a \ prefix. Such characters are used for grouping in the search query and so must be escaped to be included in it.
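A small sketch of automating this escaping is shown below. The function name and structure are my own; the characters escaped (parentheses, and a leading minus) follow the behaviour seen in searches 1 and 5b above:

```python
# Build a fielded DataCite Commons query from a truncated InChI string,
# escaping the characters the query parser would treat as syntax.
def escape_query_chars(text: str) -> str:
    escaped = "".join("\\" + c if c in "()" else c for c in text)
    if escaped.startswith("-"):          # a leading minus means Boolean NOT
        escaped = "\\" + escaped
    return escaped

inchi_fragment = ("InChI=1S/C25H39NO9/c1-6-26-20-24-13-9-12-14(31-2)"
                  "10-23(29,16(13)17(12)33-4)25(26,30")
query = "subjects.subject:" + escape_query_chars(inchi_fragment) + "*"
print("https://commons.datacite.org/?query=" + query)
```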

Search 5c: https://commons.datacite.org/?query=subjects.subject:InChI=1S/C25H39NO9/c1-6-26-20-24-13-9-12-14\(31-2\)10-23\(29,16\(13\)17\(12\)33-4\)25\(26,30\)19\(34-5\)18\(24\)22\(11-27,21\(28\)35-20\)8-7-15\(24\)32-3\/h12-20,27,29-30H,6-11H2,1-5H3* For a string of this length (and InChI strings can get very long!) an unidentified error can occur, suggesting that the full InChI string is best not used for such searches.

Search 6: 

From these experiments, we learn that the quality and completeness/richness of the metadata record is vital to ensure that no false positives or negatives are returned by a search. Ensuring such metadata richness is something that a repository should do, and it is interesting that two of the best known repositories both currently have failings in this regard. I might try one or two other popular repositories to see how they behave, and will report back if I find anything interesting.


Thus https://commons.datacite.org/doi.org?query=subjects.subjectScheme:*inchikey* reveals all entries that specify an InChI key in the subject metadata (185,414 works), but https://commons.datacite.org/doi.org?query=subjects.subjectScheme:*inchikey*+AND+subjects.schemeUri:*inchi-trust* reveals that only 1748 of these further specify the InChI Trust as the authority. Two more repositories, Mendeley Data and Harvard Dataverse, have been populated with the same data; see here.


This post has DOI: 10.14469/hpc/9162

A comparison of descriptive metadata across different data repositories.

Tuesday, September 28th, 2021

The number of repositories which accept research data across a wide spectrum of disciplines is on the up. Here I report the results of an experiment in which chemical modelling data was deposited in three such repositories (since updated to five), comparing the richness of the metadata describing the essential properties of the depositions.

The three repositories are as follows:

  1. Figshare as a repository dates from 2012. The computational chemistry dataset used was manually uploaded. Most of the metadata was entered manually by copy/paste operations and included three keywords which comprised the InChI key for the molecule, the corresponding InChI string and the calculated Gibbs Energy obtained from the computed vibrational frequencies.
  2. Zenodo started in 2013 and has been updated several times since then. The same data and metadata were used as for Figshare, including the same keywords, but with the difference that the upload was not manual but automated, using the Zenodo API as implemented in the new computational portal described in the previous post (DOI: 10.14469/hpc/9010). Publication here was a simple button click and so is a much shorter process than that for Figshare.
  3. The original 2006 version of the Imperial College data repository was based on DSpace, and was updated to version 2 in 2016 with entirely new code. It too is populated by publication from the same portal as used for Zenodo.
  4. Mendeley data:
  5. Harvard Dataverse:

Each deposition results in the generation of a DOI, and these, together with the link that allows access to the associated metadata can be seen in the table below.

| Repository | Dataset DOI | Dataset metadata |
| --- | --- | --- |
| Figshare | 10.6084/m9.figshare.16685497 | XML, JSON |
| Zenodo | 10.5281/zenodo.5511966 | XML, JSON |
| Imperial College | 10.14469/hpc/9031 | XML, JSON |
| Harvard Dataverse | 10.7910/DVN/4BWOYK | XML, XML, Codebook |
| Mendeley Data | 10.17632/dgtvds3xn5.1 | XML, JSON |

I would note that manual deposition can be rather dependent on how fastidious the depositor is and on how they interpret the descriptive keywords that Figshare and Zenodo accept. Automated deposition is a more controlled process, in which the required keywords are a property programmed into the submitting portal tool. Such a process also allows metadata to describe relationships between different datasets, such as a dataset collection, inherited from the project descriptor on the portal. Additionally, the automated process can be augmented by manual editing of the metadata record; for example, the DOI for this descriptive post can be added to the metadata records retrospectively. In the case of e.g. Zenodo, retrospective changes to the metadata record require a new DOI to be generated to reflect the changes.

You can inspect the results of these depositions yourself by downloading the respective metadata records and viewing the downloaded files using a simple text or XML editor.

  1. All three repositories contain the ORCID of the depositor, as e.g. from Figshare:
    <creator> 
    <creatorName>Rzepa, Henry S.</creatorName> 
    <givenName>Henry S.</givenName> 
    <familyName>Rzepa</familyName> 
    <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="https://orcid.org">
    https://orcid.org/0000-0002-8635-8390
    </nameIdentifier>
    </creator>

    The widespread addition of the unique ORCID researcher identifier is very welcome.

  2. The more interesting component is keyword metadata, populated manually in Figshare and using the automated API in the other two repositories.
    1. Below is the Figshare metadata entry, which displays the assigned categories (from a controlled list) in the <subject> container:
      <subjects>
          <subject>Computational Chemistry</subject>
          <subject>Organic Chemistry</subject>
          <subject subjectScheme="Fields of Science and Technology (FOS)" schemeURI="http://www.oecd.org/science/inno/38235147.pdf">FOS: Chemical sciences</subject>
        </subjects>

      The context of these keywords is clearly defined by the value of the subjectScheme (chemical sciences), but this term is very broad and does not relate very specifically to the deposited data. The more chemically specific keywords are only displayed on the landing page for the entry, as shown below, and are not expressed in any metadata container, which means that they are not indexed and hence not searchable using the DataCite metadata store.

    2. Zenodo interpret this differently, with the keywords now included in the <subject> container.
      <subjects>
          <subject>-1705.490787</subject>
          <subject>InChI=1S/C25H39NO9/c1-6-26-20-24-13-9-12-14(31-2)10-23(29,16(13)17(12)33-4)25(26,30)19(34-5)18(24)22(11-27,21(28)35-20)8-7-15(24)32-3/h12-20,27,29-30H,6-11H2,1-5H3/t12-,13-,14+,15+,16-,17+,18-,19+,20+,22+,23-,24+,25+/m1/s1</subject>
          <subject>VELNVPXNOKVVTC-VJKZSTDTSA-N</subject>
        </subjects>

      However, you might be wondering what the keyword -1705.490787 is all about. Put simply, in this form of expression it has absolutely no context. I previously explained why it might be useful if context is added, it being a persistent identifier for (some) quantum chemical calculations in the form of a computed total energy corrected thermally into a Gibbs energy. The persistence in this case is acquired not by registration with an agency but by generation via an algorithm. That algorithm in turn would require additional metadata for its specification, but that is something I will not address in this post. At any rate, because it is part of the metadata record, it is search-enabled in the Zenodo version.

    3. Imperial follows the Zenodo approach, with further addition of context:
      <subjects>
          <subject subjectScheme="Gibbs_Energy" schemeURI="https://doi.org/10.1351/goldbook.G02629" valueURI="http://gaussian.com/thermo/">-1705.490787</subject>
          <subject subjectScheme="inchi" schemeURI="http://www.inchi-trust.org/">InChI=1S/C25H39NO9/c1-6-26-20-24-13-9-12-14(31-2)10-23(29,16(13)17(12)33-4)25(26,30)19(34-5)18(24)22(11-27,21(28)35-20)8-7-15(24)32-3/h12-20,27,29-30H,6-11H2,1-5H3/t12-,13-,14+,15+,16-,17+,18-,19+,20+,22+,23-,24+,25+/m1/s1</subject>
          <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">VELNVPXNOKVVTC-VJKZSTDTSA-N</subject>
      </subjects>

      The context is added by the attributes subjectScheme, schemeURI and valueURI. The top-level context is the definition provided by the IUPAC Gold Book, and the actual implementation of the algorithm is described on the Gaussian site (although the algorithm there is not explicit in a machine-implementable sense). These additions allow an indexed search not only of the numerical value (as a simple string, not as a floating point number) but one which can be constrained by specifying the value of e.g. the subjectScheme, so that any other random number specified as a keyword which lacks this attribute is excluded. This also allows a search where the floating point number is replaced by wild-cards (*), which would then retrieve ANY reported Gibbs energy, and which could in turn be constrained by, say, the nature of the molecule as expressed using InChI.

  3. The final aspect of the metadata analysed here is the relatedIdentifier record. This is increasingly recognised as a crucial component for the construction of so-called PID graphs, which are generated to reveal connections between entities in the research landscape such as data, people, organisations, funders, publications and any other object that is assigned a registered PID (such as perhaps in the future connecting data to its origins from a large instrument). So here are these records for the three repositories:
    1. Although the landing page for the Figshare record has three such entries, including pointers to the other two depositions being discussed here, they are not propagated to the metadata record and so cannot participate in any generated PID graph.
    2. Zenodo has the following record
      <relatedIdentifiers>
          <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.5511965</relatedIdentifier>
        </relatedIdentifiers>

      which relates to an earlier version of the metadata for this entry.

    3. The Imperial record is:
      <relatedIdentifiers>
          <relatedIdentifier relatedIdentifierType="URL" relationType="HasMetadata">https://data.hpc.imperial.ac.uk/resolve/?ore=9031</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://data.hpc.imperial.ac.uk/resolve/?doi=9031&file=1</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://data.hpc.imperial.ac.uk/resolve/?doi=9031&file=2</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://data.hpc.imperial.ac.uk/resolve/?doi=9031&file=3</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://data.hpc.imperial.ac.uk/resolve/?doi=9031&file=4</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="DOI" relationType="References">10.5281/zenodo.5511966</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="DOI" relationType="References">10.6084/m9.figshare.16685497</relatedIdentifier>
          <relatedIdentifier relatedIdentifierType="DOI" relationType="IsPartOf">10.14469/hpc/9158</relatedIdentifier>
        </relatedIdentifiers>

      where a large number of related PIDs would result in a rich PID graph. These entries include relationType=”HasMetadata”, a pointer to additional metadata expressed using a different schema (ORE), which provides a machine-actionable manifest for the files present, specifying the Media Type of each file and a machine method of accessing them. relationType=”HasPart” provides an access URL for each specific item in the fileset. relationType=”References” is the analogue of the Figshare entries above, citing the other two repositories we are discussing here. Finally, relationType=”IsPartOf” indicates that the deposition is part of a larger collection (in this case the collection generated for this blog), which could also correspond to e.g. a project comprising multiple researchers at multiple institutions, or say a PhD dissertation containing multiple chapters. The extensive nature of this list of identifiers means that the PID graph would reveal many connections.

I have only covered three repositories here in detail; many more could be added to the list and analysed for their metadata records. The bottom line is that, generally, the more metadata that is added, the richer the resulting services and analyses based on PIDs can become. It can only be hoped that this aspect of the operation of repositories continues to improve over time, and that eventually most will broadcast very rich metadata, enriching the research landscape especially at the finely grained subject level.

In the next post, I will analyse the results of searches enabled by this metadata.


Figshare also has an available API, which has not been implemented in the current version of this portal.

Policies regarding the editing of metadata vary. Some repositories edit updates to the record held by DataCite against the existing DOI; others require the generation of a new DOI for each new version of the metadata, no matter how small the change (e.g. spelling mistakes in the title etc).

An unsolved problem in DataCite metadata is datatypes and units. This entry is a floating point data type, with units of Hartree. How this information can be added is still being discussed.

HPC Access and Metadata Portal (CHAMP).

Monday, September 13th, 2021

You might have noticed, if you have read any of my posts here, that many of them have been accompanied since 2006 by supporting calculations, normally based on density functional theory (DFT), and that these calculations are accompanied by a persistent identifier pointing to a data repository publication. I have hitherto not gone into detail here about the infrastructures required to do this sort of thing, but recently one of the two components has been updated to V2, after being at V1 for some fourteen years,[1] and this provides a timely opportunity to describe the system a little more.

The original design was based on what we called a portal to access the high performance computing (HPC) resources available centrally. These are controlled by a commercial package called PBS, which provides a command-line-driven interface to batch queues. Whilst powerful, PBS can also be complex, and for everyday routine use it seemed more convenient to package up this interface into a Web-accessed portal, which also included the ability to specify the resources needed (such as memory, number of CPUs, etc.) to run the desired compute program, in our case the Gaussian 16 package, and to complete things by adding a simple interface to a data repository for use when the calculation was completed.

The process of using this tool, which functions in essence as an electronic laboratory notebook (ELN) for computational chemistry, can be summarised as a workflow, which runs horizontally in the screenshot of V1 above. Each job is assigned an internal ID, which is associated with a pre-configured project and given a searchable description. Its status in the PBS-controlled queues is indicated and, when finished, the associated input and output files become available for download, with an option to delete these if they are not in fact needed, and a final option to publish to the accompanying tool, which is a data repository. V1 of this portal was in fact written in the PHP scripting language and controlled behind the scenes using a MySQL database, which allows the entries to be filtered by search terms such as the assigned project or the description. This proved particularly useful when the number of entries grew large (> 100,000 eventually) and meant that even 15-year-old entries could be easily found and inspected!

Although this workflow proved highly robust, the underlying PHP system and associated code became increasingly unmaintainable, and in 2021 we decided to refactor it for greater sustainability. We had noticed that in 2018 another group had taken the basic concept we used in 2006 and written a more flexible and portable open-source toolkit for building such a portal, calling it “Open OnDemand: a web-based client portal for HPC centers”, and published a description.[2] In effect, a lot of the maintenance work is now divested to a separate group, and accordingly our software engineering group here at Imperial was far happier using such a tool. So now enter V2 of our own portal, which we now call the HPC Access and Metadata Portal, or CHAMP.

The workflow is very much the same as before, but with added flexibility that allows custom resources to be selected, which might include e.g. special grant-funded priority queues. Additionally, a new directory tool, provided by the Open OnDemand package, allows inspection of any job inputs or outputs, which greatly facilitates minute-to-minute management/inspection of jobs to ensure the outputs are those expected for a properly functioning job.

If the job is deemed suitable for sharing, the publish button is pressed. This induces a workflow which, inter alia, converts the system-specific checkpoint file to a formatted version which can be used on any system, and generates a number of extra files needed for publication of the job.

Also of interest is the METADATA file, which carries calculation-specific metadata suitable for injection into the data repository. Currently, this includes the InChI string and key for the molecule calculated and the Gibbs_Energy, the purpose of which was described in this post. In the future we plan to make this metadata even richer with further information. This calculation-specific metadata will later be combined with generic metadata for the final publication on the actual repository. That full metadata record includes information about the person who ran the job (their ORCID etc.), the institution they are at, the data licensing etc., garnered in part from the profile entry for that user on the CHAMP portal. A hypothetical sketch of such metadata follows below.
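As a purely hypothetical illustration (the internal layout of the METADATA file is not reproduced here), the calculation-specific portion might map onto DataCite-style subjects like those shown in the earlier metadata comparison post; the values below are those of the example molecule discussed there:

```python
# Hypothetical sketch of calculation-specific metadata expressed as
# DataCite-style "subjects", mirroring the registered records shown in the
# earlier post on repository metadata.
subjects = [
    {
        "subject": "-1705.490787",
        "subjectScheme": "Gibbs_Energy",
        "schemeURI": "https://doi.org/10.1351/goldbook.G02629",
        "valueURI": "http://gaussian.com/thermo/",
    },
    {
        "subject": "VELNVPXNOKVVTC-VJKZSTDTSA-N",
        "subjectScheme": "inchikey",
        "schemeURI": "http://www.inchi-trust.org/",
    },
]
for s in subjects:
    print(s["subjectScheme"], "->", s["subject"])
```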

After publication, the CHAMP entry for the job is updated to include the DOI for the data publication, and hyperlinked to allow immediate access to this entry in the repository.

An information page about the job also includes a link to the final full published metadata record(s).

CHAMP currently includes workflows to publish to the Imperial College repository. Zenodo has now also been added, and other repositories may follow as demand requires.
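For Zenodo, such a publication workflow maps naturally onto its public REST deposit API. The following is a minimal sketch under stated assumptions (a placeholder token, file name and metadata), not CHAMP's actual implementation:

```python
import requests

ZENODO = "https://zenodo.org/api"
token = {"access_token": "YOUR-TOKEN"}  # placeholder

# 1. Create an empty deposition
dep = requests.post(f"{ZENODO}/deposit/depositions", params=token, json={}).json()

# 2. Upload a file into the deposition's file bucket
with open("job.fchk", "rb") as fh:
    requests.put(f"{dep['links']['bucket']}/job.fchk", params=token, data=fh)

# 3. Attach minimal descriptive metadata
meta = {"metadata": {"title": "A Gaussian 16 calculation",
                     "upload_type": "dataset",
                     "description": "Published from a compute portal",
                     "creators": [{"name": "Doe, Jane"}]}}
requests.put(f"{ZENODO}/deposit/depositions/{dep['id']}", params=token, json=meta)

# 4. Publish, which mints the DOI that is then recorded back in the portal
requests.post(f"{ZENODO}/deposit/depositions/{dep['id']}/actions/publish",
              params=token)
```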

I have described here how an ELN was originally designed from scratch to control quantum calculations, and how a data repository was considered an essential symbiotic partner to this resource from the outset, even back in 2006. Now the first of these resources has been refactored into modern form, and no doubt the repository end will be too in the future. The code is available for anyone to create a similar compute portal for themselves.

A different version of this description, including more details of the software engineering, will shortly be submitted to the Journal of Open Source Software, along with source code suitable for use with Open OnDemand at https://github.com/ImperialCollegeLondon/hpc_portal/.


Originally in the form of a Handle, which was later replaced by a DOI. The DOI for this post itself is 10.14469/hpc/9010.

References

  1. M.J. Harvey, N.J. Mason, and H.S. Rzepa, "Digital Data Repositories in Chemistry and Their Integration with Journals and Electronic Notebooks", Journal of Chemical Information and Modeling, vol. 54, pp. 2627-2635, 2014. https://doi.org/10.1021/ci500302p
  2. D. Hudak, D. Johnson, A. Chalker, J. Nicklas, E. Franz, T. Dockendorf, and B. McMichael, "Open OnDemand: A web-based client portal for HPC centers", Journal of Open Source Software, vol. 3, pp. 622, 2018. https://doi.org/10.21105/joss.00622

Octopus publishing: dis-assembling the research article into eight components.

Friday, August 13th, 2021

In 2011, I suggested that the standard monolith that is the conventional scientific article could be broken down into two separate but interlinked components: the story or narrative of the article and the data on which the story is based. Later, in 2018, the bibliography in the form of open citations was added as a distinct third component.[1] Here I discuss an approach that takes this even further, breaking the article down into as many as eight components, described as "Octopus publishing" for obvious reasons. These are:

  1. The problem being addressed
  2. An original hypothesis/theoretical rationale for the problem
  3. A method or protocol for testing the hypothesis in the form of experiments or modelling
  4. The data resulting from these experiments
  5. Analysis of this data
  6. Interpretation of the analysis in terms of the original hypothesis
  7. Translation/application to a real-world problem (an extrapolation, if you like, of the original problem)
  8. A review of any of the previous seven items.

Items 1-3, 5-6 and probably 7 are the dis-assembled components of the standard concept of a scientific publication, and item 4 is the data component I refer to in my introduction above. Interestingly, the article bibliography is not separated out into these components, and is presumably distributed throughout the resulting fragments. The essential concept behind this dis-assembly is that each component can stand on its own, provided it is contextually and bi-directionally linked into the others. The author(s) of any individual component will get credit and recognition for that component. A conventional mapping would be that the same author set is responsible for all the individual items, whilst recognising that each component could in fact have its own separate authorship. Thus one might get credit just for suggesting a problem, or for suggesting a protocol for testing it, for acquiring just the data, or for proposing an original analysis or interpretation. Any author's reputation would then be established by the integrated whole of their contributions across the whole range of article components (leaving aside how the weightings of each individual contribution would be decided).

I decided to give it a go via a prototype at https://science-octopus.org/publish, which as you can see is one of the eight above. The process starts with the prospective author providing their ORCID credentials, against which unique metadata is presumably generated.

The next stage, however, is more interesting: unlike any other publication platform, Octopus requires all publications to be linked to others that already exist. At the current stage of development, the prototype has only a few entries, linked to COVID-19 as a topic, which I did not feel able to add to. Selecting one of those at random, one is asked to link any associated data via a DOI, followed by the optional indication of appropriate keywords and then standard questions about funding sources, conflicts of interest and licence declaration. I wanted to use CC0, but this was not an option.

Finally at this stage comes a question about any other related publications, with the list known to Octopus being offered as suggestions. Next came the main opportunity to insert prose related to the information already provided (this part constituting the conventional article component). This is also the only opportunity to add a bibliography; the citations would be part of that document, albeit not identified as citations for the inner workings of the Octopus system. Then comes a Publish now button. No persistent identifier (DOI) is generated in this prototype system, but one will be in the production system. A final screen has four options, including to write a review (of one's own work?) or to Red Flag the item. The latter can arise from e.g. plagiarism or any other expression of concern.

Many questions arise about this new approach and I will note only three. The first relates to "Why would I want to publish in Octopus?" in the FAQ section. I quote: "The traditional system is not only slow and expensive, but the concept of 'papers' is not a good way of disseminating scientific work in the 21st century" and "Publish it now and establish priority – once it's out in Octopus it's yours". There are many experiments that strive to address this generic issue that the conventional research paper is no longer entirely fit for purpose. One I can cite here is that of preprint servers such as ChemRxiv, where one significant motivation is also to "establish priority".

An aspect of interest to me is how the metadata for all these eight components will be expressed. Presumably when mature, all eight components will have their own DOI or persistent identifier (PID) and hence all eight will also have a metadata record. These records will formally establish the relationships between the components, and ultimately could be used to construct a PID graph not only between those components but out to the rest of the PID "multiverse" of articles, data, citations and other research objects. Will Octopus join this PID graph world?
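One can already prototype such graph-building against the DataCite REST API, which exposes each DOI's relatedIdentifiers. A minimal sketch, using this post's own DOI as the example and assuming its record carries such relations:

```python
import requests

def related_dois(doi: str):
    """Return (doi, relationType, relatedIdentifier) edges from the
    DataCite metadata record for one DOI."""
    rec = requests.get(f"https://api.datacite.org/dois/{doi}").json()
    rels = rec["data"]["attributes"].get("relatedIdentifiers") or []
    return [(doi, r["relationType"], r["relatedIdentifier"])
            for r in rels if r.get("relatedIdentifierType") == "DOI"]

# Walking such edges outward from each Octopus component would
# assemble the PID graph node by node
edges = related_dois("10.14469/hpc/9010")
```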

And finally, the greatest challenge to a new paradigm such as Octopus is how quickly the established culture of "publish a blockbuster article in one of the top ten (chemistry) journals" to build your career will evolve into the dis-assembled approach described here. It has taken preprint servers the best part of 25 years to really get going in e.g. chemistry, where there are now around 9600 preprints on a variety of topics. I suspect some subject disciplines may be harder to crack than others (and chemistry may well be amongst them!).


Around 2009 or so I ran a student experiment using a wiki which had some aspects of this approach. Students were asked to do a project on the topic of either a molecule from a suggested list of around 30, or a molecule entirely of their own choosing. The students spontaneously split themselves into three groups. The first were students who wrote the story entirely by themselves, submitted it for credit and did not welcome others as co-authors. The second (largest) group were those who contributed to multiple topics, very much in the manner of Wikipedia itself; their credit was the sum of their contributions. The final group chose not to tell a story about a molecule, but to help everyone else with the infrastructure of doing so (the protocol, if you like) by writing templates which simplified authoring, correcting errors in existing stories, etc.

Arguably, one measure of the impact of these 9600 preprints is how many of them have eventually appeared in fully peer-reviewed form in journals; that statistic may not be known. Also of interest would be an analysis, for those that did end up in journals, of how they evolved between e.g. V1 of the preprint and the final "version of record".

Metadata is structured according to a specified schema. Currently, journal publishers use the CrossRef schema for this purpose, which contains a full description of e.g. the bibliography of an article. Data publishers use the DataCite Schema, which has less focus on bibliography. It will be of interest to see how such schemas are applied to the eight components of an Octopus scientific record.
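The contrast is easy to see by fetching records from the two public APIs; a small sketch, using the Open OnDemand article cited above and this post's DOI as examples:

```python
import requests

# A CrossRef record: rich bibliographic detail, including the reference list
xref = requests.get("https://api.crossref.org/works/10.21105/joss.00622").json()
msg = xref["message"]
print(msg["title"], len(msg.get("reference", [])), "references")

# A DataCite record: relations between research objects rather than a bibliography
dc = requests.get("https://api.datacite.org/dois/10.14469/hpc/9010").json()
print(dc["data"]["attributes"].get("relatedIdentifiers"))
```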


References

  1. D. Shotton, "Funders should mandate open citations", Nature, vol. 553, pp. 129-129, 2018. https://doi.org/10.1038/d41586-018-00104-7

Room-temperature superconductivity in a carbonaceous sulfur hydride!

Saturday, October 17th, 2020

The title of this post indicates the exciting prospect that a method of producing a room-temperature superconductor has finally been achieved.[1] This is only possible at enormous pressures, however: >267 gigapascals (GPa), or 2,635,023 atmospheres.

The system is made by milling a mixture of elemental carbon and sulfur, followed by adding hydrogen gas, compression to 4 GPa and finally laser-induced photolysis at 532 nm for several hours. The result is the production of three entirely unexotic molecules, H2S, CH4 and H2, in approximately stoichiometric quantities, which at this pressure form a complex bound by van der Waals attractions. Since in this blog I am particularly interested in molecular structures, my eye was drawn to "Extended data Figure 6, A DFT-optimized structure for (H2S)(CH4)H2 (variant 2) at 4 GPa". This structure was produced by DFT optimisation modelled at 4 GPa using the PBE functional and, importantly, the now standard Grimme dispersion correction (often indicated as GD3+BJ, and used frequently on this blog). Since this complex is bound by dispersion attractions, it is tempting to conclude that the intermolecular features of this structure originate in part from the Grimme dispersion model, as well as from possible hydrogen bonding of quantum-mechanical origin.
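For readers wanting to try this style of dispersion-corrected optimisation (at ambient pressure, in the molecular Gaussian code used on this blog, rather than the authors' periodic setup), a sketch of an input follows. A lone H2S monomer with an approximate geometry stands in for the full complex, since the actual coordinates are precisely the missing extended data; the def2-SVP basis set is my assumption.

```python
from textwrap import dedent

# One H2S monomer with an approximate geometry; the real (H2S)(CH4)H2
# coordinates are not available (see the discussion below)
gjf = dedent("""\
    %chk=h2s.chk
    # PBEPBE/def2SVP EmpiricalDispersion=GD3BJ Opt

    PBE optimisation with Grimme D3(BJ) dispersion (sketch)

    0 1
    S   0.000   0.000   0.000
    H   0.000   0.960   0.930
    H   0.000  -0.960   0.930

""")
with open("h2s.gjf", "w") as f:
    f.write(gjf)
```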

I would love to be able to play with this structure, to e.g. measure properties such as hydrogen-bond lengths or perform a QTAIM analysis, but have not yet acquired the "extended data" of Figure 6 in the form of coordinates.‡ I have italicised the term extended data, being unsure what the journal means by it. If the figure relates to the three-dimensional extended structure of the crystal form of this complex, then one might imagine that any extended data associated with it would indeed be the numerical coordinates. Since the authors express the hope that "chemical tuning" of this system might enable complexes exhibiting superconductivity at lower pressures, I fancy that these coordinates might help provide insight into how to achieve such tuning. This closing paragraph of mine arose because I still frequently see even prestigious journals doing very little to encourage FAIR data associated with articles. In this instance, FAIR, at least to my mind, is more than just a figure (with or without extended data); it is genuinely interoperable (I) and re-usable (R) data, such as indeed coordinates are. To this end, I am unconvinced that this "extended data figure" is properly FAIR.


‡ I have requested these from the authors, and hope to make them available in the form of a 3D rotatable model here on the blog, as was done here for Na2He. It would also be interesting to know whether this dispersion model has been tested at the enormous pressures of this experiment; standard dispersion models pertain to normal pressures.

References

  1. E. Snider, N. Dasenbrock-Gammon, R. McBride, M. Debessai, H. Vindana, K. Vencatasamy, K.V. Lawler, A. Salamat, and R.P. Dias, "RETRACTED ARTICLE: Room-temperature superconductivity in a carbonaceous sulfur hydride", Nature, vol. 586, pp. 373-377, 2020. https://doi.org/10.1038/s41586-020-2801-z