Posts Tagged ‘opendata’

A two-publisher model for the scientific article: narrative+shared data.

Sunday, September 15th, 2013

I do go on rather a lot about enabling or hyper-activating[1] data. So do others[2]. Why is sharing data important?

  1. Reproducibility is a cornerstone of science.
  2. To achieve it, scientific research must be open and transparent.
  3. Openly available research data is central to achieving this. It is estimated that less than 20% of the data collected in chemistry is made available in any open manner.
  4. RCUK (the UK research councils) wishes to see increased transparency of publicly funded research and availability of its outputs.

But it’s not all hot air, honestly. Peter Murray-Rust and I had started out on a journey to improve reproducibility, openness and transparency in (inter alia) scientific publishing in 1994. In 2001 we published an example of a data-rich article[3] based on CML, and by 2004 the concept had evolved into something Peter termed a datument[4]. Some forty such have now been crafted.[5]

In 2009, the journal Nature Chemistry was starting up, and I approached them with the idea of an interactive data exploratorium, on the premise that a new journal might be receptive to new ways of presenting science. It was accepted and published[6], and was followed in 2010 by a second variation.[7] In both cases, these activated figures were sent to the journal as part of the submission process and hosted by them (they still are). You can even access them without a subscription to the journal!

Move on to 2012: David Scheschkewitz had some very exciting silicon chemistry to report; we collaborated on some computational modelling and sent the resulting article to Nature Chemistry for publication. This included the usual interactive table reporting the modelling and its data. However, it transpired that the production workflows for Nature Chemistry had been streamlined, and I was informed that interactive tables could no longer be accepted. This time, we (i.e. the authors) would have to solve the problem of how to host and present the data ourselves.

I was very keen that this table be treated with equal weight to the article itself (citable in its own right) and that it not be downgraded to supporting information (ESI). My objection to ESI is that it is often poorly structured by authors, i.e. it is not prepared in a form which allows the data to be re-used, either by a perceptive human, or a logical machine. As a result it is often given little attention by referees (although bloggers seem to do a far better job) and furthermore can end up being lost behind a pay wall (the two Nature Chem interactive objects noted above can be openly accessed, but only if you know that they exist). So I determined that:

  1. The table should be immediately accessible by non-experts, but not through any convoluted processes of downloading a file, expanding it and finding the correct document within the resulting fileset to view in the correct program, which is how normal ESI is handled.
  2. The table and the data contained within it should be capable of acting as a scientific tool, forming the starting point for a new investigation if appropriate.

To solve this issue, some lateral and quick thinking was needed. The solution was a two-component model in which the original article is treated as a “narrative”, intertwingled with a second, but nevertheless distinct, component: the “data”. This data would follow the principles of the Amsterdam Manifesto; it would itself be citable. The two components would become symbiotes (a datument). The narrative[8] could cite this data, and the data could back-link to the narrative. The data would inherit trust (i.e. peer review) from that applied to the narrative, and the narrative in turn would inherit a date stamp and integrity from the data host (in this case Figshare[9]).*

The data itself has two layers: a presentation layer[9], which uses software (Jmol or JSmol for chemistry) to invoke the “raw” data, and the raw data itself, which is separately citable[10] (this is just a single example, resident as it happens in a different repository). The reader can choose to use just the presentation layer or the underlying data.

The data object can be embedded in other pages; here it is below. The data sources for this table are themselves citable[11].



What are the advantages of such an approach? (This is the “what’s in it for me?” question often asked by research students and their supervisors.)

  1. Each of the components is held in an environment optimised for it and so can be presented to full advantage.
  2. The conventional narrative publisher does not necessarily also have to develop their own infrastructures for handling the data. They can choose to devolve that task to a “data publisher”.
  3. The data publisher (Figshare in this case) makes the data open. One does not need an institutional subscription to access it.
  4. “Added value” for each component can be added separately. Thus most narrative publishers would not necessarily wish to develop infrastructures for validating such “big data” or subsequently mining it. Indeed, data mining of journals is prohibited by many publishers; it is either simply not possible or rendered so administratively difficult as to be impractical.
  5. Whilst a narrative article must clearly exist as a single instance (otherwise the authors would be accused of plagiarism), data can have multiple instances. Indeed, there exist protocols (SWORD) for moving data from one repository to another as the need arises. Publishing the same data in two or more locations is not currently considered plagiarism!
  6. The data component can be published as part of an article or say as part of a PhD thesis. This way, the creator of the data gets the advantages not of a date stamp associated with a narrative citation but of a much earlier stamp associated more closely with the actual creation of the data. That could easily and usefully resolve many disputes about who discovered what first, leaving the other issue of who interpreted what first to the narrative. I should mention that it is perfectly possible to “embargo” the data deposition so that it only becomes public when the narrative does (although you may choose not to do this).
  7. A data deposition cannot be modified, but a new version (which bidirectionally links back to the old one) can be published if say more data is collected at a future date.
  8. A whole infrastructure devoted just to enhancing the cited data can evolve; one that is unlikely to do so if the narrative publishers are the only stakeholders. For example, synthetic procedural data can be tagged using the excellent chemical tagger.
  9. It is relatively simple (= cheap) to build a pre-processor for publishing data, which for a research student can act as an electronic laboratory notebook, holding metadata about the deposited/published data and the handles (DOIs) associated with each deposition. I have been using such an environment for about seven years now as the e-notebook for this blog, for example. Thus the task of preparing figures and tables for a publication (or a blog post) is greatly facilitated. The same system is also used by research students and undergraduates for their lab work.
  10. I have noted previously how e.g. Google Scholar identifies data citations along with article citations in constructing an individual research profile. A researcher could become known for their published data as well as their published narratives. Indeed, it seems likely that the person who acquires and publishes the data, i.e. the research student, would then get accolades directly, rather than them all accruing to their supervisor.

But what can you, gentle reader of this blog, do to help? Well, ask if your institution already has, or plans to create a data repository. It can be local (we use DSpace) or “in-the-cloud” (e.g. Figshare). If not, ask why not! And if you are planning to submit an article for publication in the near future, ponder how you might better share its data.


As first circulated on 28 April, 2011. See 
http://www.epsrc.ac.uk/about/standards/researchdata/Pages/policyframework.aspx

The example given at the start of this post[8] contains only one table processed in this manner; the actual synthetic procedures are still held in more conventional SI.

*This blog uses the excellent Kcite plugin to manage citations.

The good folks at Figshare were extremely helpful in converting this deposition into an interactive presentation. Thanks guys!


References

  1. O. Casher, G.K. Chandramohan, M.J. Hargreaves, C. Leach, P. Murray-Rust, H.S. Rzepa, R. Sayle, and B.J. Whitaker, "Hyperactive molecules and the World-Wide-Web information system", Journal of the Chemical Society, Perkin Transactions 2, pp. 7, 1995. https://doi.org/10.1039/p29950000007
  2. R. Van Noorden, "Data-sharing: Everything on display", Nature, vol. 500, pp. 243-245, 2013. https://doi.org/10.1038/nj7461-243a
  3. P. Murray-Rust, H.S. Rzepa, and M. Wright, "Development of chemical markup language (CML) as a system for handling complex chemical content", New Journal of Chemistry, vol. 25, pp. 618-634, 2001. https://doi.org/10.1039/b008780g
  4. H.S. Rzepa, "Chemical datuments as scientific enablers", Journal of Cheminformatics, vol. 5, 2013. https://doi.org/10.1186/1758-2946-5-6
  5. H.S. Rzepa, "Transclusions of data into articles", 2013. https://doi.org/10.6084/m9.figshare.797481
  6. H.S. Rzepa, "The importance of being bonded", Nature Chemistry, vol. 1, pp. 510-512, 2009. https://doi.org/10.1038/nchem.373
  7. H.S. Rzepa, "The rational design of helium bonds", Nature Chemistry, vol. 2, pp. 390-393, 2010. https://doi.org/10.1038/nchem.596
  8. M.J. Cowley, V. Huch, H.S. Rzepa, and D. Scheschkewitz, "Equilibrium between a cyclotrisilene and an isolable base adduct of a disilenyl silylene", Nature Chemistry, vol. 5, pp. 876-879, 2013. https://doi.org/10.1038/nchem.1751
  9. D. Scheschkewitz, M.J. Cowley, V. Huch, and H.S. Rzepa, "The Vinylcarbene – Cyclopropene Equilibrium of Silicon: an Isolable Disilenyl Silylene", 2013. https://doi.org/10.6084/m9.figshare.744825
  10. H.S. Rzepa, "Gaussian Job Archive for C60H92Si3", 2012. https://doi.org/10.6084/m9.figshare.96410

The Amsterdam Manifesto on Data Citation Principles

Wednesday, July 31st, 2013

The Amsterdam manifesto espouses the principles of citable open data. It is a short document, and it is worth re-stating its eight points here:

  1. Data should be considered citable products of research.
  2. Such data should be held in persistent public repositories.
  3. If a publication is based on data not included with the article, those data should be cited in the publication.
  4. A data citation in a publication should resemble a bibliographic citation and be located in the publication’s reference list.
  5. Such a data citation should include a unique persistent identifier (a DataCite DOI recommended, or other persistent identifiers already in use within the community).
  6. The identifier should resolve to a page that either provides direct access to the data or information concerning its accessibility. Ideally, that landing page should be machine-actionable to promote interoperability of the data.
  7. If the data are available in different versions, the identifier should provide a method to access the previous or related versions.
  8. Data citation should facilitate attribution of credit to all contributors.
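Points 4 and 5 can be made concrete with a small sketch: a hypothetical helper (the function name and arguments are mine, not part of any manifesto tooling) that renders a dataset as a reference-list entry carrying its persistent DOI, in the style used for the data citations in these posts:

```python
def format_data_citation(authors, title, year, doi):
    """Render a data citation that resembles a bibliographic one
    (manifesto point 4) and carries a persistent identifier
    (point 5).  Hypothetical helper for illustration only."""
    return '%s, "%s", %s. https://doi.org/%s' % (
        ", ".join(authors), title, year, doi)

# Reproduce the style of reference [1] below.
print(format_data_citation(
    ["H.S. Rzepa"],
    "Gaussian Job Archive for ClF3",
    2013,
    "10.6084/m9.figshare.757728"))
```

Listing all depositors as authors also serves point 8, attribution of credit to all contributors.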

The manifesto itself is dated 20 March 2013, but the principles above go back far earlier, and most of them have been implemented on this blog for a little while now. It is best illustrated with an example[1]. Here I have used the excellent WordPress plugin Kcite to adhere to points 1-5 and 8 above. Point 6 is the most interesting perhaps, and it is illustrated below. If you click on the graphic, it will load the log file associated with the calculation described in the previous post and convert the data to a rotatable 3D model, using Jmol.

[Interactive figure: click to load the 3D model from Figshare]

This actually represents a departure from how I normally invoke data on this blog. I have hitherto done it by uploading the data to the WordPress uploads directory (it is thus in effect a local copy of the data). Here in this post I have not done that; it’s coming directly from the citable data repository, using e.g.

onclick="jmolInitialize('../Jmol/',true);
         jmolSetAppletColor('white');
         jmolApplet([500,500], 'load http://files.figshare.com/1134372/logfile.log;frame 27;vectors on;vectors 4;vectors scale 5.0; color vectors orange; vibration 10;animation mode loop;');"

Well, almost. The URL of the actual dataset (http://files.figshare.com/1134372/logfile.log) is derived from the DOI (10.6084/m9.figshare.757728) by an internal process with no exposed algorithm. As it happens, the DSpace digital repository[2] is a bit better in this regard.

[Interactive figure: click to load the 3D model from DSpace]

This is loaded using

https://spectradspace.lib.imperial.ac.uk:8443/dspace/bitstream/handle/10042/24916/logfile.log;

which is directly derived from the handle itself (although again that algorithm has to be worked out by a human).
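That human-derivable algorithm can be sketched as follows. The path pattern is simply read off the example URL above; the function name, argument names and base-URL default are my own illustrative choices, not an official DSpace API:

```python
def dspace_bitstream_url(handle, filename,
                         base="https://spectradspace.lib.imperial.ac.uk:8443/dspace"):
    """Build a direct data URL from a DSpace handle.

    Illustrative only: the bitstream path pattern is inferred from the
    example above, not from any documented DSpace contract.
    """
    return "%s/bitstream/handle/%s/%s" % (base, handle, filename)

# Reconstruct the URL used to load the ClF3 log file above.
print(dspace_bitstream_url("10042/24916", "logfile.log"))
```

The point is that, unlike the Figshare case, a reader (or a script acting for one) can go from the citable handle to the raw data without consulting any hidden lookup table.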

Point 7 above is also implemented (sort of). If you look at the Figshare repository, you will notice an additional link there which points to the DSpace repository. As it happens, the two datasets are identical (they are not different versions), and these semantics are currently NOT handled well. You could probably work this out from the date stamps of the two depositions, which are “Published on 29 Jul 2013 – 05:58 (GMT)” for Figshare and “2013-07-29T05:58:11Z” for DSpace.

Whereas the Amsterdam manifesto is here implemented on a blog post, I think the grander aspiration is that the principles are to be followed in ALL scientific publications. We are some way away yet from achieving this. But watch this space for an upcoming example! 

So can I urge all scientists who care about data to promulgate the principles of the Amsterdam manifesto, and wherever possible to practise what they preach!

References

  1. H.S. Rzepa, "Gaussian Job Archive for ClF3", 2013. https://doi.org/10.6084/m9.figshare.757728
  2. "Cl 1 F 3", 2013. http://hdl.handle.net/10042/24916

150,000,000 DFT calculations on 2,300,000 compounds!

Friday, July 5th, 2013

The title of this post summarises the contents of a new molecular database, www.molecularspace.org[1], which I picked up on by following a post by Jan Jensen at www.compchemhighlights.org (a wonderful overlay journal that tracks recent interesting articles). The molecularspace project is more formally called “The Harvard Clean Energy Project: Large-scale computational screening and design of organic photovoltaics on the world community grid”. It reminds me of a 2005 project by Peter Murray-Rust et al. exploring the same sort of concept[2] (the World-Wide-Molecular-Matrix, or WWMM[3]), although the new scale is certainly impressive. Here I report my initial experiences looking through molecularspace.org.

The 150,000,000 calculations are released under the CC-BY license, which is an encouraging (open) start. One does, however, need to log in to the site, which I was able to do using my Google credentials. Shown below is a screenshot of a typical result in a search (of power conversion efficiency in my case).

[Screenshot: a typical search result in the Clean Energy Project database (CEPDB)]

It comes in two parts, the first being the structure (given as a SMILES string and a 2D layout) with the principal predicted energy levels and predicted photovoltaic performance listed below that. This is then followed by what might be called an annotation, with further computed/predicted properties using the algorithms applied by Chemicalize.org. This idea that a data set could accrete via semantically powerful annotations using other tools was also very much part of the concept of the WWMM (the matrix had at its heart a molecule in one dimension and a property, measured or computed, in the other; the matrix is of course very sparse, which is why it needs annotation!).

It was at this point however that I started to wonder how I might add other annotations, based perhaps on other types of calculations. But thus far at least, I have not found any trace of something I could immediately use for my own calculations: 3D coordinates specifically. Thus, the HOMO-LUMO energy gap is the key property which makes molecularspace unique and valuable (to someone working in the field of photovoltaics). But HOMO-LUMO gaps can be calculated in many different ways, and it can always be valuable to calibrate/validate the reported values against other methods. Perhaps if I continue to look, I might find these 3D coordinates (which, for 2,300,000 molecules, would be a very valuable resource). Certainly, should I wish to do so, I could not at the moment readily replicate the calculation for any specific entry on the molecularspace site (which can be regarded as an essential component of scientific validation). When I use the first person, I mean of course either myself as a human or a software agent acting on my behalf (the latter having the endurance to repeat its procedures millions of times if necessary).

The reader of this blog may have noticed that whenever I report a calculation here, I like to cite its DOI (more formally, its handle), which links to a digital repository. In my case, the repository certainly carries the 3D coordinates, and also the full wavefunction, provided in case the reader wishes other properties to be derived from it. If molecularspace is able to provide that in the fullness of time, it truly would be an impressive resource.

But the important take-home message from molecularspace is that archiving (under a CC-BY license) the “big” data from any given research in a manner which makes it readily re-usable by others (perhaps from quite different fields of science) is now an essential requisite of doing science. And it is really nice to see good examples of this in practice!


Generally, the calculations I perform for this blog are published in a DSpace repository (the original one, started in 2006[4]), and more recently in Chempound (a project by Peter Murray-Rust and colleagues which emerged out of the WWMM experiments) as well as Figshare[5]. The first and the third assign unique handles (i.e. DOIs) to the data; Chempound does not (and neither does molecularspace).

References

  1. J. Hachmann, R. Olivares-Amaya, S. Atahan-Evrenk, C. Amador-Bedolla, R.S. Sánchez-Carrera, A. Gold-Parker, L. Vogt, A.M. Brockway, and A. Aspuru-Guzik, "The Harvard Clean Energy Project: Large-Scale Computational Screening and Design of Organic Photovoltaics on the World Community Grid", The Journal of Physical Chemistry Letters, vol. 2, pp. 2241-2251, 2011. https://doi.org/10.1021/jz200866s
  2. P. Murray-Rust, H.S. Rzepa, J.J.P. Stewart, and Y. Zhang, "A global resource for computational chemistry", Journal of Molecular Modeling, vol. 11, pp. 532-541, 2005. https://doi.org/10.1007/s00894-005-0278-1
  3. P. Murray-Rust, S.E. Adams, J. Downing, J.A. Townsend, and Y. Zhang, "The semantic architecture of the World-Wide Molecular Matrix (WWMM)", Journal of Cheminformatics, vol. 3, 2011. https://doi.org/10.1186/1758-2946-3-42
  4. J. Downing, P. Murray-Rust, A.P. Tonge, P. Morgan, H.S. Rzepa, F. Cotterill, N. Day, and M.J. Harvey, "SPECTRa: The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories", Journal of Chemical Information and Modeling, vol. 48, pp. 1571-1581, 2008. https://doi.org/10.1021/ci7004737
  5. H.S. Rzepa, "Gaussian Job Archive for CLi6", 2013. https://doi.org/10.6084/m9.figshare.739310

Research data and the "h-index".

Monday, June 24th, 2013

The blog post by Rich Apodaca entitled “The Horrifying Future of Scientific Communication” is very thought provoking and well worth reading. He takes us through disruptive innovation, and how it might impact upon how scientists communicate their knowledge. One solution floated for us to ponder is that “supporting Information, combined with data mining tools, could eliminate most of the need for manuscripts in the first place“. I am going to juxtapose that suggestion on something else I recently discovered. 

Someone encouraged me to take a look at Google Scholar. It is one of those resources that, amongst other features, computes an individual’s h-index and i10-index (the former, having gone through its purple patch, is now apparently at the end of the road, at least for chemists). One reason perhaps why proper curation of research data is not high on most chemists’ list of priorities is that it does not contribute to one’s h-index, and hence to one’s prospects of a successful research career. Thus “supporting information (data)” is one of those things, like styling the citations in a research article, that most people probably prepare through gritted teeth (a rather annoying ritual without which a research article cannot be published). So when I inspected my own Google Scholar profile (you can do the same here) I was rather surprised to find, appended to all the regular research articles, a long list of data citations (sic!). Because I have placed much of my own data into a digital repository, this has opened it up to Google (where don’t they get to nowadays?) for listing (if not actually mining). These citations do not themselves (currently?) contribute to e.g. the h-index, since these entries are not attracting citations by others. And that of course is because doing so is not yet an accepted part of the ritual of preparing a scientific article.

Most scientists must now be pondering what the future holds in terms of how they can bring themselves to the attention of others (in a good way) and hence progress their careers. So I will take Rich’s suggestion one step further. Those scientists who create new data in the process called research should first curate this data properly (via e.g. a digital repository) and then expect to promote their activity by garnering citations not only for the published narratives (= articles) but also for the associated published data. Their success as researchers would be (in part) judged by both. Who knows, as well as famous published narratives, perhaps we will also rank famous published datasets!


I do the same for the data I use to support many of the posts for this blog.

The demographics of a blog readership.

Sunday, January 20th, 2013

With metrics in science publishing controversial to say the least, I pondered whether to write about the impact/influence a science-based blog might have (never mind whether it constitutes any measure of esteem). These are all terms that feature large when an (academic) organisation undertakes a survey of its researchers’ effectiveness. WordPress (the organisation that provides the software used for this blog) recently enhanced the stats it offers for its users, and one of these caught my eye.

[World map: demographic breakdown of this blog’s readership]

The above represents the demographics for the readership of this blog over the last ten months. In no particular order, I noted the following aspects:

  1. The total number of countries listed was 144.
  2. The country coming third was India.
  3. China came 51st. I noted this since it was recently announced in the news that China leads the world by a significant margin in patents granted for e.g. graphene research, with South Korea third. It would also be fair to say that China heads the field for the number of chemists with a Ph.D. degree.
  4. There are some interesting gaps in the world map, from where not a single hit was recorded. I really must crack Greenland! 
  5. WordPress does not provide a demographic breakdown for individual posts, which would be an interesting one to see.

One does not get such statistics from conventional scientific publishers, where the number of citations of an article is considered far more important than the demographics of its readership. However, I cannot help but note that access to journals is largely controlled by paid-for subscriptions (the GOLD-OA model has yet to make a big impact in chemistry I fear), whereas blogs in contrast are almost entirely open (although access to them may be restricted in some countries). 


‡  I once listened to a talk by a manager whose mantra was the three e’s:  effective, efficient and economic.  

Digital repositories. An update to the update.

Monday, August 13th, 2012

A third digital repository has been added to the two I described before. Chempound is a free open-source repository which (unlike DSpace and Figshare) was developed specifically for chemistry.

It carries more semantic information (in the form of an RDF triple declaration), which allows SPARQL queries on the entry to be performed.

Our original DSpace repository is also being tweaked to allow additional information to be added to existing entries; in particular, if an entry is linked to in a journal publication, the DOI of that article is inserted into the DSpace description. It is also relatively simple to duplicate the information in one repository by re-depositing it into another. It thus becomes feasible to clone the information about the 600+ entries in our DSpace that have subsequently been published in peer-reviewed journal articles, adding a measure of confidence to their provenance.

To compare how the three repositories carry information about the same molecule, invoke any of the links below:

  1. DSpace
  2. Figshare
  3. Chempound 

Digital repositories. An update.

Saturday, July 21st, 2012

I blogged about this two years ago and thought a brief update might be in order now. To support the discussions here, I often perform calculations, and most of these are then deposited into a DSpace digital repository, along with metadata. Anyone wishing to have the full details of any calculation can retrieve these from the repository. Now in 2012, such repositories are more important than ever. 

In the UK, the main funding organisations are increasingly requiring researchers to deposit their primary data in such open archives, and some disciplines are better than others at this (chemistry, however, does not rank very highly in terms of data deposition). Our DSpace server is a local one running at Imperial College, but a few months back I became aware of Figshare, which aspires to operate on a much wider and more general scale. So I have injected one of the calculations reported in another post (the IRC for the sodium tolyl thiolate reaction with dichlorobutenone) into Figshare, making use of the API which has recently been developed for this purpose and implemented by Matt Harvey. As with DSpace, it issues a DOI, which can then be quoted wherever appropriate (and particularly in scientific articles). This particular deposition is 10.6084/m9.figshare.93096.
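To illustrate the sort of metadata one might assemble before calling such a deposition API (the field names below are hypothetical, chosen for illustration; they are not the actual schema of the Figshare API or of the script mentioned above):

```python
def build_deposit_metadata(title, authors, tags, files):
    """Assemble a minimal metadata record for a data deposition.

    Illustrative sketch only: the field names are hypothetical and do
    not correspond to any real repository's API schema.
    """
    return {
        "title": title,
        "authors": authors,   # creators to be credited on the issued DOI
        "tags": tags,         # discovery keywords, e.g. method, InChI
        "files": files,       # the raw calculation outputs to upload
        "license": "CC-BY",   # open license, so no subscription is needed
    }

record = build_deposit_metadata(
    "IRC for the sodium tolyl thiolate reaction with dichlorobutenone",
    ["H.S. Rzepa"],
    ["computational chemistry", "IRC"],
    ["logfile.log"])
print(sorted(record))
```

The repository then mints the DOI and date stamp; the depositor's job is only to supply a complete, re-usable record such as this.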

This repository is still undergoing a lot of development, but already one can see many interesting features, such as export to Endnote or Mendeley, and a QR barcode for devices with cameras. I would encourage anyone who regularly generates e.g. computational chemistry data, or knows a group that does, to encourage them to make use of such facilities.

Postscript: If you have a look at this deposition in Figshare you may already notice some of the developments I note above.  Matt Harvey (who, with Mark Hahnel of Figshare, developed our publish script) has added to the entry:

* A data descriptor document URL

* Wikipedia and PubChem links (automatically resolved from InChI/InChIKey searches)

* Links to ChemSpider searches

* Links to all other objects in the Spectra DSpace repository with a common InChIKey

Science publishers (and authors) please take note.

Monday, October 24th, 2011

I have for perhaps the last 25 years been urging publishers to recognise how science publishing could and should change. My latest thoughts are published in an article entitled “The past, present and future of Scientific discourse” (DOI: 10.1186/1758-2946-3-46). Here I take two articles, one published 58 years ago and one published last year, and attempt to reinvent some aspects. You can see the result for yourself (since this journal is laudably open access, and you will not need a subscription). The article is part of a special issue, arising from a one day symposium held in January 2011 entitled “Visions of a Semantic Molecular Future” in celebration of Peter Murray-Rust’s contributions over that period (go read all 15 articles on that theme in fact!).

Here I want to note just two features, which I have also striven to incorporate into many of the posts this blog (which in one small regard I have attempted to formulate as an experimental test-bed for publishing innovations). Scalable-Vector-Graphics (SVG) emerged around the turn of the millennium as a sort of HTML for images. To my knowledge, no science publisher has yet made it an intrinsic part of their publishing process (although gratifyingly all modern browsers support at least a sub-set of the format). Until now (perhaps). Thus 10.1186/1758-2946-3-46 contains diagrams in SVG, but you will need to avoid the Acrobat version, and go straight to the HTML version to see them. However, what sparked my noting all of this here was the recent announcement by Amazon that they are adopting a new format for their e-books, which they call Kindle Format 8 or KF8 (the successor to their Mobi7 format). To quote: “Technical and engineering books are created more efficiently with Cascading Style Sheet 3 formatting, nested tables, boxed elements and Scalable Vector Graphics“. This is wrapped in HTML5 to be able to provide (inter alia) a rich interactive experience for the reader. In fairness, there is also the more open epub3 which strives for the same. Other features of HTML5 include embedded chemistry using WebGL and the same mechanisms are being used for the construction of modern chemical structure drawing packages.

It remains to be seen how much of all of this will be adopted by mainstream chemistry publishers. Here, we do get into something of a cyclic argument. I suspect the publishers will argue that few of the authors that contribute to their journals will send them copy in any of these new formats and that it would be too expensive for them to re-engineer these articles with little or no help from such authors. The chemistry researchers who do the writing (perhaps composition might be a better word?) might argue there is little point in adopting innovative formats if the publishers do not accept them (I will point out that my injection of SVG into the above article did have some teething problems). For example, you will not find SVG noted in any of the “instructions for authors” in most “high impact journals” (or, come to that, HTML5).

If one looks at the 25-year period: in 1986 all chemistry journals were distributed exclusively on paper. My office shelves still show the scars of bearing the weight of all that paper. Move on 25 years, and almost without exception all journals are now distributed electronically. I suspect the outcome in many a reader’s hands is simply that they (rather than the publisher) now bear the printing costs themselves (despite, or perhaps because of, the introduction of electronic binders such as Mendeley). But it will only be when the article itself grows out of its printable constraints, and hops onto mobile devices such as Kindles and iPads in the promised (scientifically) interactive and data-rich form, that the true revolution will start taking place.

A final observation: you will not readily obtain the interactive features of 10.1186/1758-2946-3-46 on e.g. an iPad or Kindle, because the Java-based Jmol is not supported on either. But Jmol has now been ported to Android, and it’s certainly one to watch.