Archive for the ‘Chemical IT’ Category

Five things you did not know about (fork) handles.

Tuesday, March 18th, 2014

OK, you have to be British to understand the pun in the title, a famous comedy skit about four candles. Back to science, and my mention of some crystal data now having a DOI in the previous post. I thought it might be fun to replicate the contents of one of my ACS slides here.

Firstly, a DOI is one implementation of a more generic (and quite old) concept known as a Handle, which is one form of persistent digital identifier. Article DOIs have been in common use for at least ten years now, and even new chemistry students know about them! So a DOI points to an article in a journal? Not quite as it happens; a DOI can lead to a whole lot more than that. Let me explain by showing you five examples:

  1. doi.org/10042/26065 resolves to a landing page. Crucially, this is NOT the article itself, which may remain obstinately behind a paywall to which you have no access.
  2. doi.org/10042/26065?locatt=filename:input.gjf resolves to a file input.gjf that may be present off the landing page, and hence allows a machine action to retrieve it.
  3. doi.org/10042/26065?locatt=mimetype:chemical/x-gaussian-input resolves to the first file matching that MIME type off the landing page, and hence allows a machine action to retrieve it.
  4. doi.org/10042/26065?locatt=id:1 resolves to the first file matching ID=1 off the landing page, and hence allows a machine action to retrieve it.
  5. doi.org/api/10042/26065 will return the JSON-encoded full handle record for processing in Javascript, so that a machine now has access to all the information it might need to perform a machine action.
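As a sketch of how a machine might use these, here is how the five URL forms can be constructed mechanically from a bare handle. The helper name is hypothetical, and remember that the ?locatt= and /api/ forms only work on servers configured to support them:

```python
# Sketch: mechanically constructing the five resolution URLs shown above
# from a bare handle. The helper name is hypothetical; the ?locatt= and
# /api/ forms resolve only on servers configured to support them.

def handle_urls(handle):
    """Return the five URL forms for a handle, keyed by what they resolve to."""
    base = "https://doi.org/" + handle
    return {
        "landing":  base,                                                 # 1
        "filename": base + "?locatt=filename:input.gjf",                  # 2
        "mimetype": base + "?locatt=mimetype:chemical/x-gaussian-input",  # 3
        "id":       base + "?locatt=id:1",                                # 4
        "json":     "https://doi.org/api/" + handle,                      # 5
    }

urls = handle_urls("10042/26065")
print(urls["json"])  # https://doi.org/api/10042/26065
```

A machine action then amounts to fetching one of these URLs and processing whatever comes back (a file, or the JSON handle record).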

Now, items 2-5 are not generally available; they work only on our servers. We have placed them there to show how item 6 of the Amsterdam Manifesto could be made to work. There are other ways of course. But you can see them in action here[1] (the article is open access, so you should not get any paywall behaviour from the landing page).


Postscript. A few days ago, I asked my group of 1st year undergraduate students how they might go about tracking down a journal article from its authors, the journal name and the page numbers. The most common reply was “Google it”. Next came “go to the library and find it on the shelves”. One replied “from its DOI” (that student had done an internship in a pharma company before joining us). I used to teach a chemical information course here[2] between 1996 and 2010 where this sort of stuff was a staple. That course is no longer taught. Hence the aforementioned replies!

References

  1. A. Armstrong, R.A. Boto, P. Dingwall, J. Contreras-García, M.J. Harvey, N.J. Mason, and H.S. Rzepa, "The Houk–List transition states for organocatalytic mechanisms revisited", Chem. Sci., vol. 5, pp. 2057-2071, 2014. https://doi.org/10.1039/c3sc53416b
  2. "It:lectures-2011 - ChemWiki", 2019. http://doi.org/10042/a3v06

The Amsterdam Manifesto and crystal structures.

Tuesday, March 18th, 2014

I have mentioned the Amsterdam manifesto before on these pages. It is worth repeating the eight simple principles:

  1. Data should be considered citable products of research.
  2. Such data should be held in persistent public repositories.
  3. If a publication is based on data not included with the article, those data should be cited in the publication.
  4. A data citation in a publication should resemble a bibliographic citation and be located in the publication’s reference list.
  5. Such a data citation should include a unique persistent identifier (a DataCite DOI recommended, or other persistent identifiers already in use within the community).
  6. The identifier should resolve to a page that either provides direct access to the data or information concerning its accessibility. Ideally, that landing page should be machine-actionable to promote interoperability of the data.
  7. If the data are available in different versions, the identifier should provide a method to access the previous or related versions.
  8. Data citation should facilitate attribution of credit to all contributors.

I just gave a talk at the ACS meeting in Dallas which touched upon the need to emancipate data according to these principles. My talk, in case you are interested, focused particularly upon item 6 above.[1]

Just after my talk I heard that crystallographic data was about to be emancipated (my phrase), and so I was interested to find out what this might mean, and how many of the above principles were being adhered to. Indeed, it is an interesting test to apply to any chemistry data that you might find out there. Thus 10.5517/cc10ftfp[2] is the DOI of a recently published crystal structure dataset. This adheres to points 1-3 and 5 above, and probably also 8. As I have already noted, 6 is the interesting one! So let’s go to the landing page and see what we find.

[Screenshot: the DOI landing page for 10.5517/cc10ftfp]

Firstly, note that you do not need any sort of access code to get to this page; it is open to all. But it is after all just a landing page, not the actual data. Next, click on the Download button, and you get asked to identify yourself by providing a name, email address and affiliation as mandatory fields, as well as agreeing to conditions of use. I reproduce these conditions here:

“Individual CIF data sets are provided freely by the CCDC on the understanding that they are used for bona fide research purposes only. They may contain copyright material of the CCDC or of third parties, and may not be copied or further disseminated in any form, whether machine-readable or not, except for the purpose of generating routine backup copies on your local computer system.”

As with most such conditions, it is what one cannot do that is most interesting.

  1. Teach with it, for example by incorporating the data into lecture notes.
  2. Make a copy, e.g. to place into this blog (is this for research purposes?).
  3. Indeed, do bona fide research purposes in fact allow a copy to be made, or does the second sentence over-ride the first in this regard, since it lists exclusions and research copying is not among them?
  4. Judging from the landing page, it is pretty much impossible for any machine action to take place (item 6 in the Amsterdam manifesto). Even though the data itself is machine actionable, the landing page pretty much prevents this from happening.

What did cause my eyebrows to shoot up was that I had to reveal my full identity and affiliation (which appears not to be actually checked) in order to get the data. Think about this. Do journals ask for this information when you download an article from them? (OK, they probably know your affiliation.) Which scientist is reading which article (or viewing which data) could be construed as sensitive information after all. So why, in order to acquire crystal data, do you have to provide personal information? Surely, looking at data should be a private process if one wants it to be?

[Screenshot: the CCDC download form]

The release of crystal data in this manner, with a decent partial adherence to the Amsterdam Manifesto is an excellent start; this data after all is well curated and of high value. But I must call upon CCDC to rethink that landing page, the conditions of use and the mandatory gathering of personal information. Not quite there yet!

References

  1. "Digital data repositories in chemistry and their integration with journals and electronic laboratory notebooks", 2014. http://doi.org/10042/a3uza
  2. M. Sowa, K. Ślepokura, and E. Matczak-Jon, "CCDC 936802: Experimental Crystal Structure Determination", 2014. https://doi.org/10.5517/cc10ftfp

Chemistry data round-tripping. Has there been ANY progress?

Monday, December 2nd, 2013

This is one of those topics that seems to crop up every three years or so. Since the last outing we have had new versions of operating systems, new versions of programs, the rise of mobile devices, and perhaps some progress?

Right, I will briefly recapitulate. Chemical structure diagrams are special; they contain chemical semantics (what an atom is, what a bond is, stereochemistry, charges, etc.). One needs special programs to represent this. Take two well-known ones. ChemBioDraw V13 is the latest in a long line dating back to 1985 or so. A newcomer is ChemDoodle, just updated to version 6. The idea is that you express your molecule, capturing some of its semantics, using one of these programs. And then you paste the data into another venerable program, the word processor Word (also dating back to around 1984). Then you send the Word document to a colleague, who might want to copy the structure back out, put it back into ChemBioDraw/ChemDoodle, and put those semantics to good use by editing or re-purposing the information. This is round-tripping the data. It's been almost 30 years; surely the process should be seamless by now? Wrong!

One problem is that the “exchange-particle” is the clipboard, yet another ancient and presumed mature technology. It's invisible of course; we rarely get to see it. And it is very operating-system specific! So what is the current state of play? Round-tripping ChemBioDraw structures within a single operating system might be expected to work. Well, it currently does for just one of the two most common desktop operating systems (remember, Word is provided by the originator of one of these operating systems). The other program, ChemDoodle, round-trips within both operating systems.

But, and here is the key point, not across operating systems. Paste either a ChemBioDraw or a ChemDoodle structure into Word on one of these OSs, and try re-editing that diagram in the version of Word on the other OS. The data is lost unless you have the “right” operating system.

An experiment I have not tried, but regarding which I would welcome any feedback, is to factor in the two newest operating systems, this time for mobile devices such as tablets and phones. Let's not even worry whether different flavours of one of these mobile OSs are compatible. Apps for drawing chemical structures are available for both of them. Here, the amazing clipboard still exists. One now has four OSs to consider: four homogeneous permutations and a minimum of six heterogeneous round trips the data could try to take for any given app. We do not even consider app-to-app transfers not involving discrete intermediate documents. I would predict that only a few of these permutations preserve round-tripped data and its semantics.
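The counting above can be sketched quickly (the OS labels are illustrative placeholders only; deliberately, no real names):

```python
from itertools import combinations

# Four OSs give four same-OS (homogeneous) round trips and
# C(4,2) = 6 cross-OS (heterogeneous) pairs for the data to survive.
oses = ["desktop-A", "desktop-B", "mobile-A", "mobile-B"]

homogeneous = [(o, o) for o in oses]
heterogeneous = list(combinations(oses, 2))

print(len(homogeneous), len(heterogeneous))  # 4 6
```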

Perhaps we need to look at it in a different way? One simply avoids putting data from one program into another. Chemical data is kept in its own files, never mixed with data from other programs, but always kept/sent separately. Pre-1984, before the clipboard, this might have made sense. But in an era when XML was invented, around 17 years ago, to allow data to fully retain semantic information in any environment it finds itself in, it seems surprising that we still have this situation.
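To sketch why an XML carrier preserves semantics across a copy/paste, here is a minimal, purely illustrative CML-like fragment (not claimed to be valid against any CML schema) round-tripped with Python's standard library; the atom and bond semantics survive serialisation intact:

```python
import xml.etree.ElementTree as ET

# Build a minimal CML-like fragment for methanal (illustrative only).
mol = ET.Element("molecule", id="methanal")
atoms = ET.SubElement(mol, "atomArray")
ET.SubElement(atoms, "atom", id="a1", elementType="C")
ET.SubElement(atoms, "atom", id="a2", elementType="O")
bonds = ET.SubElement(mol, "bondArray")
ET.SubElement(bonds, "bond", atomRefs2="a1 a2", order="2")

# Serialise ("copy"), then parse back ("paste"): nothing is lost.
serialised = ET.tostring(mol, encoding="unicode")
roundtripped = ET.fromstring(serialised)
print(roundtripped.find("bondArray/bond").get("order"))  # 2
```

This is exactly what the clipboard fails to guarantee when the payload is an opaque, application-specific binary blob.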

I mention all of this since there is a current refocusing on the importance of data; “emancipating data” is now important. But the reality is that much current software destroys the semantics in data at almost every turn. Thirty years of no progress then. But what of Chem4Word, a combination of differently namespaced XML in which the chemistry is expressed in CML (it is only available for a single operating system!)? I will perhaps devote a separate post to that one; first I have to try a few experiments!

Blasts from the past and present: altmetrics.

Sunday, October 13th, 2013

I reminisced about the wonderfully naive but exciting Web period of 1993-1994. This introduced server-log analysis to us for the first time, and with it hits-on-a-web-page. One of our first attempts at crowd-sourcing and analysis was to run an electronic conference in heterocyclic chemistry and to look at how the attendees visited the individual posters and presentations by analysing the server logs.

[Graph: accesses to the conference posters and presentations, from the server logs]

You can read all about that analysis here. One interesting graphic, showing the 24-hour distribution of accesses, appears below. Remember, this was before Google and its analytics even existed (and yes, we were also doing Google-like searches before they did).

[Graph: 24-hour distribution of accesses]

But let me get to the actual point of this post. A decade or so ago, all universities in the UK were asked to undertake a quality review exercise of their research outputs. One of the metrics of such outputs is the scientific publication, and each research group leader had to collect their most important four articles published in the previous few years and submit them (as paper) to a review panel. This poor panel was faced with a mountain of paperwork (literally!) when they arrived to do their job. It was soon decided that a better (electronic) system had to be devised. So now we have a product called Symplectic (which as it happens originated in the physics department here at Imperial College), which tirelessly gathers such outputs. More accurately, it gathers the meta-data for research publications, since most publishers do not allow actual reprints to be so harvested! And when it finds a new article, it informs its author, and asks them to check that the meta-data is accurate.

So it was a few days ago that I received such an alert. I checked the meta-data (adding in fact some which associates the scientific work with a particular resource, our High-Performance-Computing unit, and also the NMR systems here) but then the following thumbnail caught my eye. The wonderful Symplectic system had computed this for me. 

[Thumbnail: the altmetric score computed by Symplectic]

This I had to see. Expanded, it shows as follows. An altmetric measures attention. And attention (however transient) is apparently itself measured by tweets, Facebook, news outlets, science blogs, Mendeley and CiteULike.

[Expanded view of the altmetric panel]

Well, things have certainly moved on from the days of analysing server-logs! Now, would an aspiring tenure-track young scientist, presenting an altmetric score of 28 to their head of department expect to get their tenure on this basis? Of course, we are back to the old hoary chestnut. Is attention necessarily good? You cannot tell from the above if we have indeed produced worthy science, or science to be scorned.

Well, the above represents a 20 year period in the evolution of science and how it is communicated. Whether this represents positive progress I leave you to decide. And if one of your altmetric scores is > 28, you have done better than us!


Does the icon look familiar? See here.

Internet Archaeology: Blasts from the past.

Friday, October 11th, 2013

In 1993-1994, when the Web (synonymous in most minds now with the Internet) was still young, the pace of progress was so rapid that some wag worked out that one “web-year” was like a dog-year, worth about 7 years of normal human time. So in this respect, 1994 is now some 133 web-years ago. Long enough for an archaeological excavation.

And so it was that I came across two Web-pages that have suddenly acquired a topical significance:

  1. http://www.ariadne.ac.uk/issue1/clic
  2. http://doi.org/10.1080/13614579509516846[1]

Their topicality in part arises from e.g. http://www.rsc.org/AboutUs/News/PressReleases/2013/RSC-announces-chemical-sciences-repository.asp where the RSC seeks community support to help curate the data we as scientists produce.

Some of my recent posts (this one on dual-publisher models and this one on publishing procedures) also pertain to this and Peter Murray-Rust is constantly blogging on the topic (see this for the latest).

Perhaps 2013 will indeed be the year of data! 

References

  1. D. James, B.J. Whitaker, C. Hildyard, H.S. Rzepa, O. Casher, J.M. Goodman, D. Riddick, and P. Murray‐Rust, "The case for content integrity in electronic chemistry journals: The CLIC project", New Review of Information Networking, vol. 1, pp. 61-69, 1995. https://doi.org/10.1080/13614579509516846

Publishing a procedure with a doi.

Wednesday, October 2nd, 2013

In the two-publisher model I proposed a post or so back, I showed an example of how data can be incorporated (transcluded) into the story narrative of a scientific article, with both that story and the data each having their own independently citable reference (using a doi for the citation). Here I take it a step further, by publishing a functional procedure in a digital repository[1] and assigning it its own doi: 10.6084/m9.figshare.811862.

The following HTML

<iframe src="http://wl.figshare.com/articles/811862/embed?show_title=1" height="443" width="500" frameborder="0"></iframe>

can then be incorporated into any Web page, including this post, to invoke the service. What does this do? It takes a pre-prepared Gaussian-style cube file containing values of the electron density of a molecule, and converts this into non-covalent-interaction (NCI) isosurfaces[2] (as described here). Two new files, a .xyz coordinate file and a .jvxl isosurface file (see here for an example of its application), are written to the user’s local file space. These files in turn can be integrated into an interactive data presentation, and this new object can have a doi.
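The Gaussian-style cube file mentioned above has a simple plain-text header. As a hedged sketch (a minimal reader that ignores units, the origin convention, and the negative-atom-count flag used for MO cubes), the atom count and grid dimensions can be pulled out of it like this:

```python
def read_cube_header(text):
    """Minimal Gaussian-cube header reader: returns (natoms, (nx, ny, nz)).

    Sketch only: ignores units, the origin, and the negative-natoms
    convention used for molecular-orbital cubes."""
    lines = text.splitlines()
    natoms = int(lines[2].split()[0])          # line 3: atom count + origin
    nx, ny, nz = (int(lines[i].split()[0])     # lines 4-6: grid dimensions
                  for i in (3, 4, 5))
    return abs(natoms), (nx, ny, nz)

sample = """\
comment line 1
comment line 2
    2    0.000000    0.000000    0.000000
   40    0.283459    0.000000    0.000000
   40    0.000000    0.283459    0.000000
   40    0.000000    0.000000    0.283459
"""
print(read_cube_header(sample))  # (2, (40, 40, 40))
```

The grid dimensions explain the file sizes quoted later in this post: a 300+ Mbyte cube is simply a very fine grid of density values.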

So now we see how unique identifiers can be used with a digital repository to:

  1. Publish a data calculation and assign it a doi.
  2. A script or procedure (as a Web Service) to convert the preceding data can itself be published and assigned a doi.
  3. Step two is then invoked using that doi, and the output(s) can also be deposited raw into a digital repository, or wrapped beforehand in some manner to produce a visual presentation of this new data, before being assigned a doi.
  4. All three components, if needed, can now be cited in a narrative article describing the science, and this too of course may (after peer review) also receive its own doi.
  5. The first three components can, if needed, be transcluded into the fourth to create the final composite appearing in the journal (or blog post as here).

So below is this service. You can either use it here, or simply resolve the doi above into a separate web page. This version uses Java, and so you have to be prepared to answer questions about security etc. An alternative version not using Java (based on JSmol) is probably too slow; sometimes the procedure has to convert 300+ Mbytes of Gaussian cube, taking about 30 seconds to do so.

At any rate, if you have read any of my posts which show NCI isosurfaces, and wondered how to do it for yourself, here is your chance!

References

  1. H.S. Rzepa, "Script for creating an NCI surface as a JVXL compressed file from a (Gaussian) cube of total electron density", 2013. https://doi.org/10.6084/m9.figshare.811862
  2. E.R. Johnson, S. Keinan, P. Mori-Sánchez, J. Contreras-García, A.J. Cohen, and W. Yang, "Revealing Noncovalent Interactions", Journal of the American Chemical Society, vol. 132, pp. 6498-6506, 2010. https://doi.org/10.1021/ja100936w

A two-publisher model for the scientific article: narrative+shared data.

Sunday, September 15th, 2013

I do go on rather a lot about enabling or hyper-activating[1] data. So do others[2]. Why is sharing data important?

  1. Reproducibility is a cornerstone in science.
  2. To achieve this, it is important that scientific research be open and transparent.
  3. Openly available research data is central to achieving this. It is estimated that less than 20% of the data collected in chemistry is made available in any open manner.
  4. RCUK (the UK research councils) wish for increased transparency of publicly funded research and availability of its outputs.

But it’s not all hot air, honestly. Peter Murray-Rust and I had started out on a journey to improve reproducibility, openness and transparency in (inter alia) scientific publishing in 1994. In 2001 we published an example of a data-rich article[3] based on CML, and by 2004 the concept had evolved into something Peter termed a datument[4]. Some forty such have now been crafted.[5]

In 2009, the journal Nature Chemistry was starting up, and I approached them with the idea of an interactive data exploratorium on the premise that a new journal might be receptive to new ways of presenting science. It was accepted and published[6] and was followed in 2010 by a second variation.[7] In both cases, these activated-figures were sent to the journal as part of the submission process, and hosted by them (they still are). You can even access them without a subscription to the journal!

Move on to 2012, when David Scheschkewitz had some very exciting silicon chemistry to report; we collaborated on some computational modelling and sent the resulting article to Nature Chemistry for publication. This included the usual interactive table reporting the modelling and its data. However, it transpired that the production workflows for Nature Chemistry had been streamlined, and I was informed that interactive tables could no longer be accepted. This time, we (i.e. the authors) would have to solve the issue of how to host and present the data ourselves.

I was very keen that this table be treated with equal weight to the article itself (citable in its own right) and that it not be downgraded to supporting information (ESI). My objection to ESI is that it is often poorly structured by authors, i.e. it is not prepared in a form which allows the data to be re-used, either by a perceptive human or a logical machine. As a result it is often given little attention by referees (although bloggers seem to do a far better job), and furthermore it can end up being lost behind a paywall (the two Nature Chem interactive objects noted above can be openly accessed, but only if you know that they exist). So I determined that:

  1. The table should be immediately accessible by non-experts, but not through any convoluted processes of downloading a file, expanding it and finding the correct document within the resulting fileset to view in the correct program, which is how normal ESI is handled.
  2. The table and the data it contained within should be capable of acting as a scientific tool, forming what could be the starting point for a new investigation if appropriate.

To solve this issue, some lateral and quick thinking was needed. The solution was a two-component model in which the original article is treated as a “narrative”, intertwingled with a second, but nevertheless distinct, component: the “data”. This data would follow the principles of the Amsterdam Manifesto; it would itself be citable. The two components would become symbiotes (a datument). The narrative[8] could cite this data and the data could back-link to the narrative. The data would inherit trust (i.e. peer review) from that applied to the narrative, and the latter would inherit a date stamp and integrity from the data host (in this case Figshare[9]).*

The data itself can have two layers: a presentation layer,[9] using a combination of software (Jmol or JSmol for chemistry) to invoke the “raw” data. That raw data is itself citable[10] (this is just a single example, resident as it happens on a different repository). The reader can choose to use just the presentation layer or the underlying data.

The data object can be embedded in other pages; here it is below. The data sources for this table are themselves citable[11].



What are the advantages of such an approach? (the “what’s in it for me” question often asked by research students and their supervisors)

  1. Each of the components is held in an environment optimised for it and so can be presented to full advantage.
  2. The conventional narrative publisher does not necessarily also have to develop their own infrastructures for handling the data. They can choose to devolve that task to a “data publisher”.
  3. The data publisher (Figshare in this case) makes the data open. One does not need an institutional subscription to access it.
  4. “Added value” for each component can be done separately. Thus most narrative publishers would not necessarily wish to develop infrastructures for validating data or subsequently mining such “big data”. Indeed, data mining of journals is prohibited by many publishers; it is simply either not possible or rendered so administratively difficult as to be impractical.
  5. Whilst a narrative article must clearly exist as a single instance (otherwise the authors would be accused of plagiarism), data can have multiple instances. Indeed, there exist protocols (SWORD) for moving data from one repository to another as the need arises. Publishing the same data in two or more locations is not currently considered plagiarism!
  6. The data component can be published as part of an article or say as part of a PhD thesis. This way, the creator of the data gets the advantages not of a date stamp associated with a narrative citation but of a much earlier stamp associated more closely with the actual creation of the data. That could easily and usefully resolve many disputes about who discovered what first, leaving the other issue of who interpreted what first to the narrative. I should mention that it is perfectly possible to “embargo” the data deposition so that it only becomes public when the narrative does (although you may choose not to do this).
  7. A data deposition cannot be modified, but a new version (which bidirectionally links back to the old one) can be published if say more data is collected at a future date.
  8. A whole infrastructure devoted just to enhancing the cited data can evolve; one that is unlikely to do so if the narrative publishers are the only stakeholders. For example, synthetic procedural data can be tagged using the excellent chemical tagger.
  9. It is relatively simple (=cheap) to build a pre-processor for publishing data, which for a research student can act as an electronic laboratory notebook, holding meta-data about the deposited/published data and the handles (doi) associated with each deposition. I have been using such an environment now for about seven years as the e-notebook for this blog for example. Thus the task of preparing figures and tables for a publication (or a blog post) is greatly facilitated. The same system is also used by research students and undergraduates for their lab work.
  10. I have noted previously how e.g. Google Scholar identifies data citations along with article citations in constructing an individual research profile. A researcher could become known for their published data as well as their published narratives. Indeed, it seems likely that the person who acquires and publishes the data, i.e. the research student, would then get accolades directly rather than them all accruing to their supervisor.

But what can you, gentle reader of this blog, do to help? Well, ask if your institution already has, or plans to create a data repository. It can be local (we use DSpace) or “in-the-cloud” (e.g. Figshare). If not, ask why not! And if you are planning to submit an article for publication in the near future, ponder how you might better share its data.


As first circulated on 28 April 2011. See http://www.epsrc.ac.uk/about/standards/researchdata/Pages/policyframework.aspx

The example given at the start of this post[8] contains only one table processed in this manner; the actual synthetic procedures are still held in more conventional SI.

*This blog uses the excellent Kcite plugin to manage citations.

The good folks at Figshare were extremely helpful in converting this deposition into an interactive presentation. Thanks guys!


References

  1. O. Casher, G.K. Chandramohan, M.J. Hargreaves, C. Leach, P. Murray-Rust, H.S. Rzepa, R. Sayle, and B.J. Whitaker, "Hyperactive molecules and the World-Wide-Web information system", Journal of the Chemical Society, Perkin Transactions 2, pp. 7, 1995. https://doi.org/10.1039/p29950000007
  2. R. Van Noorden, "Data-sharing: Everything on display", Nature, vol. 500, pp. 243-245, 2013. https://doi.org/10.1038/nj7461-243a
  3. P. Murray-Rust, H.S. Rzepa, and M. Wright, "Development of chemical markup language (CML) as a system for handling complex chemical content", New Journal of Chemistry, vol. 25, pp. 618-634, 2001. https://doi.org/10.1039/b008780g
  4. H.S. Rzepa, "Chemical datuments as scientific enablers", Journal of Cheminformatics, vol. 5, 2013. https://doi.org/10.1186/1758-2946-5-6
  5. H.S. Rzepa, "Transclusions of data into articles", 2013. https://doi.org/10.6084/m9.figshare.797481
  6. H.S. Rzepa, "The importance of being bonded", Nature Chemistry, vol. 1, pp. 510-512, 2009. https://doi.org/10.1038/nchem.373
  7. H.S. Rzepa, "The rational design of helium bonds", Nature Chemistry, vol. 2, pp. 390-393, 2010. https://doi.org/10.1038/nchem.596
  8. M.J. Cowley, V. Huch, H.S. Rzepa, and D. Scheschkewitz, "Equilibrium between a cyclotrisilene and an isolable base adduct of a disilenyl silylene", Nature Chemistry, vol. 5, pp. 876-879, 2013. https://doi.org/10.1038/nchem.1751
  9. D. Scheschkewitz, M.J. Cowley, V. Huch, and H.S. Rzepa, "The Vinylcarbene – Cyclopropene Equilibrium of Silicon: an Isolable Disilenyl Silylene", 2013. https://doi.org/10.6084/m9.figshare.744825
  10. H.S. Rzepa, "Gaussian Job Archive for C60H92Si3", 2012. https://doi.org/10.6084/m9.figshare.96410

The Amsterdam Manifesto on Data Citation Principles

Wednesday, July 31st, 2013

The Amsterdam manifesto espouses the principles of citable open data. It is a short document, and it is worth re-stating its eight points here:

  1. Data should be considered citable products of research.
  2. Such data should be held in persistent public repositories.
  3. If a publication is based on data not included with the article, those data should be cited in the publication.
  4. A data citation in a publication should resemble a bibliographic citation and be located in the publication’s reference list.
  5. Such a data citation should include a unique persistent identifier (a DataCite DOI recommended, or other persistent identifiers already in use within the community).
  6. The identifier should resolve to a page that either provides direct access to the data or information concerning its accessibility. Ideally, that landing page should be machine-actionable to promote interoperability of the data.
  7. If the data are available in different versions, the identifier should provide a method to access the previous or related versions.
  8. Data citation should facilitate attribution of credit to all contributors.

The manifesto itself is dated 20 March 2013, but the principles above go back far earlier, and most of the points above have been implemented on this blog for a little while now. But it's best illustrated with an example[1]. Here I have used the excellent WordPress extension Kcite to adhere to points 1-5 and 8 above. Point 6 is the most interesting perhaps, and it is illustrated below. If you click on the graphic, it will load the log file associated with the calculation described in the previous post to convert the data to a rotatable 3D model, using Jmol.

[Image: click to load the 3D model directly from Figshare]

This actually represents a departure from how I normally invoke data on this blog. I have hitherto done it by uploading the data to the WordPress uploads directory (it is thus in effect a local copy of the data). Here in this post I have not done that; it’s coming directly from the citable data repository, using e.g.

onclick="jmolInitialize('../Jmol/',true);jmolSetAppletColor('white');jmolApplet([500,500],'load http://files.figshare.com/1134372/logfile.log;frame 27;vectors on;vectors 4;vectors scale 5.0; color vectors orange; vibration 10;animation mode loop;');"

Well, almost. The URL of the actual dataset (http://files.figshare.com/1134372/logfile.log) is derived from the doi (10.6084/m9.figshare.757728) by an internal process which has no exposed algorithm. As it happens, the DSpace digital repository[2] is a bit better in this regard.

[Image: click to load the 3D model directly from DSpace]

This is loaded using

https://spectradspace.lib.imperial.ac.uk:8443/dspace/bitstream/handle/10042/24916/logfile.log;

which is directly derived from the handle itself (although again that algorithm has to be worked out by a human).
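The derivation is simple enough to sketch in code. A minimal sketch in Python, assuming the URL pattern of the SPECTRa DSpace server quoted above (the pattern is inferred from this one example, not a documented DSpace guarantee):

```python
# Sketch: derive a DSpace bitstream URL from a handle. The base URL and
# path layout are copied from the example above; they are an assumption
# about this particular server, not an official DSpace API.
BASE = "https://spectradspace.lib.imperial.ac.uk:8443/dspace/bitstream/handle"

def bitstream_url(handle: str, filename: str) -> str:
    """Build the direct-download URL for a named file under a handle."""
    return f"{BASE}/{handle}/{filename}"

print(bitstream_url("10042/24916", "logfile.log"))
# https://spectradspace.lib.imperial.ac.uk:8443/dspace/bitstream/handle/10042/24916/logfile.log
```

The point being that the handle itself is the only non-guessable part: a machine holding just the handle can construct the download URL, which is exactly what the Figshare scheme does not allow.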

Point 7 above is also implemented (sort of). If you look at the Figshare repository, you will notice an additional link there pointing to the DSpace repository. As it happens, the two datasets are identical (they are not different versions), and these semantics are currently NOT handled well. You could probably infer this from the date stamps of the two depositions (Published on 29 Jul 2013 – 05:58 (GMT) for Figshare and 2013-07-29T05:58:11Z for DSpace).
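A quick way to confirm that the two deposition date stamps refer to the same instant is to parse and compare them; a sketch in Python (the Figshare stamp omits the seconds, so the comparison is only to minute precision, and the strings are re-typed here with a plain hyphen):

```python
from datetime import datetime, timezone

# The two deposition timestamps, as quoted by Figshare and DSpace.
figshare_raw = "29 Jul 2013 - 05:58"       # quoted as GMT, no seconds
dspace_raw = "2013-07-29T05:58:11Z"        # ISO 8601, UTC

figshare_dt = datetime.strptime(figshare_raw, "%d %b %Y - %H:%M").replace(tzinfo=timezone.utc)
dspace_dt = datetime.strptime(dspace_raw, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

# Figshare drops the seconds, so zero them on the DSpace side before comparing.
same_minute = figshare_dt == dspace_dt.replace(second=0)
print(same_minute)  # True
```

That the two stamps coincide is, of course, only circumstantial evidence that the depositions are copies rather than versions; proper versioning semantics would make this check unnecessary.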

Whereas the Amsterdam manifesto is here implemented in a blog post, the grander aspiration is surely that the principles be followed in ALL scientific publications. We are some way away from achieving this yet. But watch this space for an upcoming example!

So can I urge all scientists who care about data to promulgate the principles of the Amsterdam manifesto, and wherever possible to practise what they preach!

References

  1. H.S. Rzepa, "Gaussian Job Archive for ClF3", 2013. https://doi.org/10.6084/m9.figshare.757728
  2. "Cl 1 F 3", 2013. http://hdl.handle.net/10042/24916

150,000,000 DFT calculations on 2,300,000 compounds!

Friday, July 5th, 2013

The title of this post summarises the contents of a new molecular database: www.molecularspace.org[1]. I picked up on it by following a post by Jan Jensen at www.compchemhighlights.org (a wonderful overlay journal that tracks recent interesting articles). More formally, the molecularspace project is called “The Harvard Clean Energy Project: Large-scale computational screening and design of organic photovoltaics on the world community grid“. It reminds me of a 2005 project by Peter Murray-Rust et al. based on the same sort of concept[2] (the World-Wide-Molecular-Matrix, or WWMM[3]), although the new scale is certainly impressive. Here I report my initial experiences of looking through molecularspace.org.

The 150,000,000 calculations are released under the CC-BY license, which is an encouraging (open) start. One does, however, need to log in to the site, which I was able to do using my Google credentials. Shown below is a screenshot of a typical search result (a search on power conversion efficiency in my case).

CEPDB1

It comes in two parts, the first being the structure (given as a SMILES and a 2D layout), with the principal predicted energy levels and predicted photovoltaic performance listed below that. This is followed by what might be called an annotation: further computed/predicted properties using the algorithms applied by Chemicalize.org. This idea that a data set could accrete via semantically powerful annotations using other tools was also very much part of the concept of the WWMM (the matrix had a molecule along one dimension and a property, measured or computed, along the other; the matrix is of course very sparse, which is why it needs annotation!).
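That sparse molecule-versus-property matrix at the heart of the WWMM can be sketched as a simple data structure. This is an illustration of the concept only, not actual WWMM code, and the molecule and property entries below are invented:

```python
from collections import defaultdict

# Sketch of the WWMM idea: a sparse matrix with molecules on one axis and
# properties (measured or computed) on the other, growing by annotation.
# Invented entries; not WWMM code.
class SparseMolecularMatrix:
    def __init__(self):
        self._cells = defaultdict(dict)   # molecule -> {property: value}

    def annotate(self, molecule, prop, value):
        """Add (or overwrite) one property of one molecule."""
        self._cells[molecule][prop] = value

    def get(self, molecule, prop):
        # Most cells are empty: None marks a gap awaiting annotation.
        return self._cells[molecule].get(prop)

m = SparseMolecularMatrix()
m.annotate("c1ccccc1", "homo_lumo_gap_eV", 6.7)   # hypothetical value
print(m.get("c1ccccc1", "homo_lumo_gap_eV"))      # 6.7
print(m.get("c1ccccc1", "logP"))                  # None: a sparse cell
```

The design choice worth noting is that any tool (Chemicalize.org, a quantum chemistry code, a human) can fill a cell independently of the others; that is what "accreting via annotation" means in practice.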

It was at this point, however, that I started to wonder how I might add other annotations, based perhaps on other types of calculation. But thus far at least, I have not found anything I could immediately use for my own calculations; specifically, 3D coordinates. The HOMO-LUMO energy gap is the key property that makes molecularspace unique and valuable (to someone working in the field of photovoltaics), but HOMO/LUMO gaps can be calculated in many different ways, and it is always valuable to calibrate/validate the reported values against other methods. Perhaps if I continue to look, I might find these 3D coordinates (which, for 2,300,000 molecules, would be a very valuable resource). Certainly, should I wish to do so, I could not at the moment readily replicate the calculation for any specific entry on the molecularspace site (which can be regarded as an essential component of scientific validation). When I use the first person, I mean of course either myself as a human or a software agent acting on my behalf (the latter having the endurance to repeat its procedures millions of times if necessary).

The reader of this blog may have noticed that whenever I report a calculation here, I like to cite its DOI (more formally, its handle), which links to a digital repository. In my case, the repository certainly carries the 3D coordinates, and also the full wavefunction, should the reader wish other properties to be derived from it. If molecularspace is able to provide that in the fullness of time, it truly would be an impressive resource.

But the important take-home message from molecularspace is that archiving (under a CC-BY license) the “big” data from any given research in a manner which makes it readily re-usable by others (perhaps from quite different fields of science) is now an essential requisite of doing science. And it is really nice to see good examples of this in practice!


Generally, the calculations I perform for this blog are published in a DSpace repository (the original one, started in 2006[4]), and more recently in Chempound (a project by Peter Murray-Rust and colleagues which emerged out of the WWMM experiments) as well as in Figshare[5]. The first and the third assign unique handles (i.e. a DOI) to the data; Chempound does not (and neither does molecularspace).

References

  1. J. Hachmann, R. Olivares-Amaya, S. Atahan-Evrenk, C. Amador-Bedolla, R.S. Sánchez-Carrera, A. Gold-Parker, L. Vogt, A.M. Brockway, and A. Aspuru-Guzik, "The Harvard Clean Energy Project: Large-Scale Computational Screening and Design of Organic Photovoltaics on the World Community Grid", The Journal of Physical Chemistry Letters, vol. 2, pp. 2241-2251, 2011. https://doi.org/10.1021/jz200866s
  2. P. Murray-Rust, H.S. Rzepa, J.J.P. Stewart, and Y. Zhang, "A global resource for computational chemistry", Journal of Molecular Modeling, vol. 11, pp. 532-541, 2005. https://doi.org/10.1007/s00894-005-0278-1
  3. P. Murray-Rust, S.E. Adams, J. Downing, J.A. Townsend, and Y. Zhang, "The semantic architecture of the World-Wide Molecular Matrix (WWMM)", Journal of Cheminformatics, vol. 3, 2011. https://doi.org/10.1186/1758-2946-3-42
  4. J. Downing, P. Murray-Rust, A.P. Tonge, P. Morgan, H.S. Rzepa, F. Cotterill, N. Day, and M.J. Harvey, "SPECTRa: The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories", Journal of Chemical Information and Modeling, vol. 48, pp. 1571-1581, 2008. https://doi.org/10.1021/ci7004737
  5. H.S. Rzepa, "Gaussian Job Archive for CLi6", 2013. https://doi.org/10.6084/m9.figshare.739310

Research data and the "h-index".

Monday, June 24th, 2013

The blog post by Rich Apodaca entitled “The Horrifying Future of Scientific Communication” is very thought provoking and well worth reading. He takes us through disruptive innovation, and how it might impact upon how scientists communicate their knowledge. One solution floated for us to ponder is that “supporting Information, combined with data mining tools, could eliminate most of the need for manuscripts in the first place“. I am going to juxtapose that suggestion on something else I recently discovered. 

Someone encouraged me to take a look at Google Scholar. It is one of those resources that, amongst other features, computes an individual’s h-index and i10-index (the former, having gone through its purple patch, is now apparently at the end of the road, at least for chemists). One reason perhaps why proper curation of research data is not high on most chemists’ list of priorities is that it contributes neither to one’s h-index nor, by extension, to one’s prospects of a successful research career. Thus “supporting information (data)” is one of those things, like styling the citations in a research article, that most people probably prepare through gritted teeth (a rather annoying ritual without which a research article cannot be published). So when I inspected my own Google Scholar profile (you can do the same here), I was rather surprised to find, appended to all the regular research articles, a long list of data citations (sic!). Because I have placed much of my own data into a digital repository, this has opened it up to Google (where don’t they get to nowadays?) for listing (if not actually mining). These citations do not currently contribute to e.g. the h-index, since the entries are not yet attracting citations from others. And that of course is because doing so is not yet an accepted part of the ritual of preparing a scientific article.
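For the record, the two Google Scholar metrics mentioned above are easily computed from a list of per-paper citation counts; a sketch in Python (the counts below are made up for illustration):

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(counts, start=1):
        if c >= rank:
            h = rank          # this paper still clears the bar
        else:
            break             # sorted descending, so no later paper can
    return h

def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for c in citations if c >= 10)

cites = [42, 17, 11, 9, 4, 0]   # invented citation counts
print(h_index(cites))    # 4
print(i10_index(cites))  # 3
```

One can see immediately why data citations currently change nothing: until the data entries themselves accumulate citations, they simply add zeros to the list.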

Most scientists must now be pondering what the future holds in terms of how they can bring themselves to the attention of others (in a good way) and hence progress their careers. So I will take Rich’s suggestion one step further. Those scientists who create new data in the process called research should firstly curate that data properly (via e.g. a digital repository) and then expect to promote their activity by garnering citations not only for their published narratives (= articles) but also for the associated published data. Their success as researchers would be judged (in part) by both. Who knows: as well as famous published narratives, perhaps we will also rank famous published datasets!


I do the same for the data I use to support many of the posts for this blog.