Posts Tagged ‘Google’

Five things you did not know about (fork) handles.

Tuesday, March 18th, 2014

OK, you have to be British to understand the pun in the title, a famous comedy skit about four candles. Back to science, and my mention of some crystal data now having a DOI in the previous post. I thought it might be fun to replicate the contents of one of my ACS slides here.

Firstly, a DOI is one implementation of a more generic (and quite old) concept known as a Handle, which is one form of persistent digital identifier. Article DOIs have been in common use for at least ten years now, and even new chemistry students know about them! So a DOI simply points to an article in a journal? Not quite, as it happens; a DOI can lead to a whole lot more than that. Let me explain by showing you five examples:

  1. doi.org/10042/26065 resolves to a landing page. Crucially, this is NOT the article itself, which may remain obstinately behind a paywall to which you have no access.
  2. doi.org/10042/26065?locatt=filename:input.gjf resolves to a file input.gjf that may be present off the landing page, and hence allowing a machine action to retrieve it.
  3. doi.org/10042/26065?locatt=mimetype:chemical/x-gaussian-input resolves to the first file matching the MIME type that may be present off the landing page, and hence allowing a machine action to retrieve it.
  4. doi.org/10042/26065?locatt=id:1 resolves to the first file matching ID=1 that may be present off the landing page, and hence allowing a machine action to retrieve it.
  5. doi.org/api/10042/26065 will return the JSON-encoded full handle record for processing in JavaScript, so that a machine now has access to all the information it might need to perform a machine action.
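The JSON handle record of item 5 needs only a few lines of code to process. Below is a minimal sketch that parses a simplified record in the typed-value style the Handle System uses; the field names follow that convention, but this particular record is invented for illustration, not fetched from doi.org.

```python
import json

# A simplified handle record, invented for illustration; real records
# returned by the api endpoint above carry more fields per value.
RECORD_JSON = """
{
  "handle": "10042/26065",
  "values": [
    {"index": 1, "type": "URL",
     "data": "https://example.org/landing/26065"},
    {"index": 2, "type": "FILENAME", "data": "input.gjf"}
  ]
}
"""

def first_value(record, wanted_type):
    """Return the data of the first value matching a given type, or None."""
    for value in record["values"]:
        if value["type"] == wanted_type:
            return value["data"]
    return None

record = json.loads(RECORD_JSON)
landing_page = first_value(record, "URL")
```

A machine agent doing this for thousands of handles never needs to see a landing page at all, which is rather the point.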

Now, items 2-5 are not generally available; they work only on our servers. We have placed them there to show how item 6 of the Amsterdam Manifesto could be made to work. There are other ways of course. But you can see them in action here[1] (the article is open access, so you should not get any paywall behaviour from the landing page).


Postscript. A few days ago, I asked my group of 1st year undergraduate students how they might go about tracking down a journal article from its authors, the journal name and the page numbers. The most common reply was “Google it”. Next came “go to the library and find it on the shelves”. One replied “from its DOI” (that student had done an internship in a pharma company before joining us). I used to teach a chemical information course here[2] from 1996 to 2010, where this sort of stuff was a staple. That course is no longer taught. Hence the aforementioned replies!

References

  1. A. Armstrong, R.A. Boto, P. Dingwall, J. Contreras-García, M.J. Harvey, N.J. Mason, and H.S. Rzepa, "The Houk–List transition states for organocatalytic mechanisms revisited", Chem. Sci., vol. 5, pp. 2057-2071, 2014. https://doi.org/10.1039/c3sc53416b
  2. "It:lectures-2011 - ChemWiki", 2019. http://doi.org/10042/a3v06

Blasts from the past and present: altmetrics.

Sunday, October 13th, 2013

I reminisced about the wonderfully naive but exciting Web-period of 1993-1994, which introduced us for the first time to server-log analysis and to hits on a web page. One of our first attempts at crowd-sourcing and analysis was to run an electronic conference in heterocyclic chemistry and to look at how the attendees visited the individual posters and presentations by analysing the server logs.

[Image: plot of all accesses during the conference]

You can read all about that analysis here. One interesting graphic, showing the 24-hour distribution of accesses, is reproduced below. Remember, this was before Google and its analytics even existed (and yes, we were also doing Google-like searches before they did).

[Image: 24-hour distribution of accesses]
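For readers curious what such an analysis actually involved, here is a minimal sketch of the idea: tally accesses per hour of day from web-server log lines in the Apache common log format. The sample lines below are invented, not taken from our 1994 logs.

```python
import re
from collections import Counter

# Invented sample lines in Apache "common log format".
LOG_LINES = [
    '192.0.2.1 - - [12/Jul/1994:09:15:31 +0000] "GET /poster1.html HTTP/1.0" 200 512',
    '192.0.2.2 - - [12/Jul/1994:09:47:02 +0000] "GET /poster2.html HTTP/1.0" 200 734',
    '192.0.2.3 - - [12/Jul/1994:21:03:11 +0000] "GET /poster1.html HTTP/1.0" 200 512',
]

# Pull the hour-of-day field out of the bracketed timestamp.
HOUR_RE = re.compile(r"\[\d{2}/\w{3}/\d{4}:(\d{2}):")

def hourly_counts(lines):
    """Count accesses per hour of day (0-23)."""
    counts = Counter()
    for line in lines:
        match = HOUR_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

counts = hourly_counts(LOG_LINES)
```

In 1994 this sort of thing was done with rather less elegant shell and Perl one-liners, but the principle was the same.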

But let me get to the actual point of this post. A decade or so ago, all universities in the UK were asked to undertake a quality review exercise of their research outputs. One of the metrics of such outputs is the scientific publication, and each research group leader had to collect their most important four articles published in the previous few years and submit them (as paper) to a review panel. This poor panel was faced with a mountain of paperwork (literally!) when they arrived to do their job. It was soon decided that a better (electronic) system had to be devised. So now we have a product called Symplectic (which as it happens originated in the physics department here at Imperial College), which tirelessly gathers such outputs. More accurately, it gathers the meta-data for research publications, since most publishers do not allow actual reprints to be so harvested! And when it finds a new article, it informs its author, and asks them to check that the meta-data is accurate.

So it was a few days ago that I received such an alert. I checked the meta-data (adding in fact some which associates the scientific work with a particular resource, our High-Performance-Computing unit, and also the NMR systems here) but then the following thumbnail caught my eye. The wonderful Symplectic system had computed this for me. 

[Image: altmetric thumbnail computed by Symplectic]

This I had to see. Expanded, it shows as follows. An altmetric measures attention, and attention (however transient) is apparently itself measured by tweets, Facebook, news outlets, science blogs, Mendeley and CiteULike.

[Image: expanded altmetric detail]

Well, things have certainly moved on from the days of analysing server-logs! Now, would an aspiring tenure-track young scientist, presenting an altmetric score of 28 to their head of department expect to get their tenure on this basis? Of course, we are back to the old hoary chestnut. Is attention necessarily good? You cannot tell from the above if we have indeed produced worthy science, or science to be scorned.

Well, the above represents a 20 year period in the evolution of science and how it is communicated. Whether this represents positive progress I leave you to decide. And if one of your altmetric scores is > 28, you have done better than us!


Does the icon look familiar? See here.

A two-publisher model for the scientific article: narrative+shared data.

Sunday, September 15th, 2013

I do go on rather a lot about enabling or hyper-activating[1] data. So do others[2]. Why is sharing data important?

  1. Reproducibility is a cornerstone of science.
  2. To achieve it, scientific research must be open and transparent.
  3. Openly available research data is central to achieving this; it is estimated that less than 20% of the data collected in chemistry is made available in any open manner.
  4. RCUK (the UK research councils) wish to see increased transparency of publicly funded research and availability of its outputs.

But it’s not all hot air, honestly. Peter Murray-Rust and I had started out on a journey to improve reproducibility, openness and transparency in (inter alia) scientific publishing in 1994. In 2001 we published an example of a data-rich article[3] based on CML, and by 2004 the concept had evolved into something Peter termed a datument[4]. Some forty such have now been crafted.[5]

In 2009, the journal Nature Chemistry was starting up, and I approached them with the idea of an interactive data exploratorium on the premise that a new journal might be receptive to new ways of presenting science. It was accepted and published[6] and was followed in 2010 by a second variation.[7] In both cases, these activated-figures were sent to the journal as part of the submission process, and hosted by them (they still are). You can even access them without a subscription to the journal!

Move on to 2012: David Scheschkewitz had some very exciting silicon chemistry to report, we collaborated on some computational modelling, and we sent the resulting article to Nature Chemistry for publication. This included the usual interactive table reporting the modelling and its data. However, it transpired that the production workflows for Nature Chemistry had been streamlined, and I was informed that interactive tables could no longer be accepted. This time, we (i.e. the authors) would have to solve the issue of how to host and present the data ourselves.

I was very keen that this table be treated with equal weight to the article itself (citable in its own right) and that it not be downgraded to supporting information (ESI). My objection to ESI is that it is often poorly structured by authors, i.e. it is not prepared in a form which allows the data to be re-used, either by a perceptive human or a logical machine. As a result it is often given little attention by referees (although bloggers seem to do a far better job), and furthermore it can end up being lost behind a paywall (the two Nature Chem interactive objects noted above can be openly accessed, but only if you know that they exist). So I determined that:

  1. The table should be immediately accessible by non-experts, but not through any convoluted processes of downloading a file, expanding it and finding the correct document within the resulting fileset to view in the correct program, which is how normal ESI is handled.
  2. The table and the data it contained within should be capable of acting as a scientific tool, forming what could be the starting point for a new investigation if appropriate.

To solve this issue, some lateral and quick thinking was needed. The solution was a two-component model in which the original article is treated as a “narrative“, intertwingled with a second, but nevertheless distinct component, the “data“. This data would follow the principles of the Amsterdam Manifesto; it would itself be citable. The two components would become symbiotes (a datument). The narrative[8] could cite this data and the data could back-link to the narrative. The data would inherit trust (i.e. peer review) from that applied to the narrative and the latter would inherit a date stamp and integrity from the data host (in this case Figshare[9]).*
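The cross-linking just described can be sketched as a toy model. The two DOIs below are those of the narrative[8] and the data deposition[9] cited in this post, but the record structure itself is invented purely to illustrate the bidirectional link.

```python
# Toy records for the two-component model: a narrative article and a
# data deposition, each citing the other by DOI. The dict layout is
# invented for illustration; the DOIs are real (refs 8 and 9 below).
narrative = {
    "doi": "10.1038/nchem.1751",
    "cites_data": ["10.6084/m9.figshare.744825"],
}
dataset = {
    "doi": "10.6084/m9.figshare.744825",
    "backlinks": ["10.1038/nchem.1751"],
}

def links_are_bidirectional(narr, data):
    """True if the narrative cites the data and the data links back."""
    return (data["doi"] in narr["cites_data"]
            and narr["doi"] in data["backlinks"])
```

A machine checking this property across a whole repository could flag any deposition whose narrative link had gone stale, which is exactly the sort of curation task the symbiosis enables.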

The data itself can have two layers: a presentation layer[9], using a combination of software (Jmol or JSmol for chemistry) to invoke the “raw” data, and that raw data itself, which is also citable[10] (this is just a single example, resident as it happens on a different repository). The reader can choose to use just the presentation layer or the underlying data.

The data object can be embedded in other pages; here it is below. The data sources for this table are themselves citable[11].

[Embedded interactive data table]
What are the advantages of such an approach? (the “what’s in it for me” question often asked by research students and their supervisors)

  1. Each of the components is held in an environment optimised for it and so can be presented to full advantage.
  2. The conventional narrative publisher does not necessarily also have to develop their own infrastructures for handling the data. They can choose to devolve that task to a “data publisher”.
  3. The data publisher (Figshare in this case) makes the data open. One does not need an institutional subscription to access it.
  4. “Added value” for each component can be done separately. Thus most narrative publishers would not necessarily wish to develop infrastructures for validating it or subsequently mining such “big data”. Indeed data mining of journals is prohibited by many publishers; it simply is either not possible or rendered so administratively difficult as to be impractical.
  5. Whilst a narrative article must clearly exist as a single instance (otherwise the authors would be accused of plagiarism), data can have multiple instances. Indeed, there exist protocols (SWORD) for moving data from one repository to another as the need arises. Publishing the same data in two or more locations is not currently considered plagiarism!
  6. The data component can be published as part of an article or say as part of a PhD thesis. This way, the creator of the data gets the advantages not of a date stamp associated with a narrative citation but of a much earlier stamp associated more closely with the actual creation of the data. That could easily and usefully resolve many disputes about who discovered what first, leaving the other issue of who interpreted what first to the narrative. I should mention that it is perfectly possible to “embargo” the data deposition so that it only becomes public when the narrative does (although you may choose not to do this).
  7. A data deposition cannot be modified, but a new version (which bidirectionally links back to the old one) can be published if say more data is collected at a future date.
  8. A whole infrastructure devoted just to enhancing the cited data can evolve; one that is unlikely to do so if the narrative publishers are the only stakeholders. For example, synthetic procedural data can be tagged using the excellent chemical tagger.
  9. It is relatively simple (=cheap) to build a pre-processor for publishing data, which for a research student can act as an electronic laboratory notebook, holding meta-data about the deposited/published data and the handles (doi) associated with each deposition. I have been using such an environment now for about seven years as the e-notebook for this blog for example. Thus the task of preparing figures and tables for a publication (or a blog post) is greatly facilitated. The same system is also used by research students and undergraduates for their lab work.
  10. I have noted previously how e.g. Google Scholar identifies data citations along with article citations in constructing an individual research profile. A researcher could become known for their published data as well as their published narratives. Indeed, it seems likely that the person who acquires and publishes the data, i.e. the research student, would then get accolades directly rather than them all accruing to their supervisor.

But what can you, gentle reader of this blog, do to help? Well, ask if your institution already has, or plans to create a data repository. It can be local (we use DSpace) or “in-the-cloud” (e.g. Figshare). If not, ask why not! And if you are planning to submit an article for publication in the near future, ponder how you might better share its data.


As first circulated on 28 April, 2011. See 
http://www.epsrc.ac.uk/about/standards/researchdata/Pages/policyframework.aspx

The example given at the start of this post[8] contains only one table processed in this manner; the actual synthetic procedures are still held in more conventional SI.

*This blog uses the excellent Kcite plugin to manage citations.

The good folks at Figshare were extremely helpful in converting this deposition into an interactive presentation. Thanks guys!


References

  1. O. Casher, G.K. Chandramohan, M.J. Hargreaves, C. Leach, P. Murray-Rust, H.S. Rzepa, R. Sayle, and B.J. Whitaker, "Hyperactive molecules and the World-Wide-Web information system", Journal of the Chemical Society, Perkin Transactions 2, pp. 7, 1995. https://doi.org/10.1039/p29950000007
  2. R. Van Noorden, "Data-sharing: Everything on display", Nature, vol. 500, pp. 243-245, 2013. https://doi.org/10.1038/nj7461-243a
  3. P. Murray-Rust, H.S. Rzepa, and M. Wright, "Development of chemical markup language (CML) as a system for handling complex chemical content", New Journal of Chemistry, vol. 25, pp. 618-634, 2001. https://doi.org/10.1039/b008780g
  4. H.S. Rzepa, "Chemical datuments as scientific enablers", Journal of Cheminformatics, vol. 5, 2013. https://doi.org/10.1186/1758-2946-5-6
  5. H.S. Rzepa, "Transclusions of data into articles", 2013. https://doi.org/10.6084/m9.figshare.797481
  6. H.S. Rzepa, "The importance of being bonded", Nature Chemistry, vol. 1, pp. 510-512, 2009. https://doi.org/10.1038/nchem.373
  7. H.S. Rzepa, "The rational design of helium bonds", Nature Chemistry, vol. 2, pp. 390-393, 2010. https://doi.org/10.1038/nchem.596
  8. M.J. Cowley, V. Huch, H.S. Rzepa, and D. Scheschkewitz, "Equilibrium between a cyclotrisilene and an isolable base adduct of a disilenyl silylene", Nature Chemistry, vol. 5, pp. 876-879, 2013. https://doi.org/10.1038/nchem.1751
  9. D. Scheschkewitz, M.J. Cowley, V. Huch, and H.S. Rzepa, "The Vinylcarbene – Cyclopropene Equilibrium of Silicon: an Isolable Disilenyl Silylene", 2013. https://doi.org/10.6084/m9.figshare.744825
  10. H.S. Rzepa, "Gaussian Job Archive for C60H92Si3", 2012. https://doi.org/10.6084/m9.figshare.96410

150,000,000 DFT calculations on 2,300,000 compounds!

Friday, July 5th, 2013

The title of this post summarises the contents of a new molecular database: www.molecularspace.org[1] and I picked up on it by following the post by Jan Jensen at www.compchemhighlights.org (a wonderful overlay journal that tracks recent interesting articles). The molecularspace project more formally is called “The Harvard Clean Energy Project: Large-scale computational screening and design of organic photovoltaics on the world community grid“. It reminds me of a 2005 project by Peter Murray-Rust et al. exploring the same sort of concept[2] (the World-Wide-Molecular-Matrix, or WWMM[3]), although the new scale is certainly impressive. Here I report my initial experiences looking through molecularspace.org.

The 150,000,000 calculations are released under the CC-BY license, which is an encouraging (open) start. One does however need to log in to the site, which I was able to do using my Google credentials. Shown below is a screenshot of a typical result in a search (for power conversion efficiency in my case).

[Image: screenshot of a molecularspace.org search result]

It comes in two parts, the first being the structure (given as a SMILES string and a 2D layout) with the principal predicted energy levels and predicted photovoltaic performance listed below that. This is then followed by what might be called an annotation, with further computed/predicted properties using the algorithms applied by Chemicalize.org. This idea that a data set could accrete via semantically powerful annotations using other tools was also very much part of the concept of the WWMM (the matrix had at its heart a molecule in one dimension and a property, measured or computed, in the other; the matrix is of course very sparse, which is why it needs annotation!).

It was at this point however that I started to wonder how I might add other annotations, based perhaps on other types of calculation. But thus far at least, I have not found any trace of something I could immediately use for my own calculations: 3D coordinates, specifically. The HOMO-LUMO energy gap is the key property which makes molecularspace unique and valuable (to someone working in the field of photovoltaics), but HOMO-LUMO gaps can be calculated in many different ways, and it can always be valuable to calibrate/validate the reported values against other methods. Perhaps if I continue to look, I might find these 3D coordinates (which, for 2,300,000 molecules, would be a very valuable resource). Certainly, for example, should I wish to do so, I could not at the moment readily replicate the calculation for any specific entry on the molecularspace site (and such replication can be regarded as an essential component of scientific validation). When I use the first person, I mean of course either myself as a human or a software agent acting on my behalf (the latter having the endurance to repeat its procedures millions of times if necessary).
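The sort of calibration I have in mind might look like the following sketch: compare the gap reported by a database entry with one recomputed by a different method, and tabulate the deviations. The numbers and identifiers below are invented placeholders, not real molecularspace entries.

```python
# Invented placeholder data: HOMO-LUMO gaps in eV, keyed by an
# (equally invented) molecule identifier.
reported_gap_ev = {"mol_A": 2.10, "mol_B": 1.85}
recomputed_gap_ev = {"mol_A": 2.02, "mol_B": 1.97}

def deviations(reported, recomputed):
    """Absolute difference (eV, 2 d.p.) for molecules present in both sets."""
    return {key: round(abs(reported[key] - recomputed[key]), 2)
            for key in reported if key in recomputed}

delta = deviations(reported_gap_ev, recomputed_gap_ev)
```

Of course, without the 3D coordinates one cannot actually perform the recomputation step; that is precisely the gap in the resource noted above.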

The reader of this blog may have noticed that whenever I report a calculation here, I like to cite its doi (more formally, its handle), which links to a digital repository. In my case, the repository certainly carries the 3D coordinates, and also the full wavefunction, should the reader wish other properties to be derived from it. Now if molecularspace is able to provide that in the fullness of time, it truly would be an impressive resource.

But the important take-home message from molecularspace is that archiving (under a CC-BY license) the “big” data from any given research in a manner which makes it readily re-usable by others (perhaps from quite different fields of science) is now an essential requisite of doing science. And it is really nice to see good examples of this in practice!


Generally, the calculations I perform for this blog are published in a DSpace repository (the original one, started in 2006[4]), and more recently in Chempound (a project by Peter Murray-Rust and colleagues which emerged out of the WWMM experiments) as well as Figshare[5]. The first and the third assign unique handles (i.e. a doi) to the data; chempound does not (and neither does molecularspace).

References

  1. J. Hachmann, R. Olivares-Amaya, S. Atahan-Evrenk, C. Amador-Bedolla, R.S. Sánchez-Carrera, A. Gold-Parker, L. Vogt, A.M. Brockway, and A. Aspuru-Guzik, "The Harvard Clean Energy Project: Large-Scale Computational Screening and Design of Organic Photovoltaics on the World Community Grid", The Journal of Physical Chemistry Letters, vol. 2, pp. 2241-2251, 2011. https://doi.org/10.1021/jz200866s
  2. P. Murray-Rust, H.S. Rzepa, J.J.P. Stewart, and Y. Zhang, "A global resource for computational chemistry", Journal of Molecular Modeling, vol. 11, pp. 532-541, 2005. https://doi.org/10.1007/s00894-005-0278-1
  3. P. Murray-Rust, S.E. Adams, J. Downing, J.A. Townsend, and Y. Zhang, "The semantic architecture of the World-Wide Molecular Matrix (WWMM)", Journal of Cheminformatics, vol. 3, 2011. https://doi.org/10.1186/1758-2946-3-42
  4. J. Downing, P. Murray-Rust, A.P. Tonge, P. Morgan, H.S. Rzepa, F. Cotterill, N. Day, and M.J. Harvey, "SPECTRa: The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories", Journal of Chemical Information and Modeling, vol. 48, pp. 1571-1581, 2008. https://doi.org/10.1021/ci7004737
  5. H.S. Rzepa, "Gaussian Job Archive for CLi6", 2013. https://doi.org/10.6084/m9.figshare.739310

Research data and the “h-index”.

Monday, June 24th, 2013

The blog post by Rich Apodaca entitled “The Horrifying Future of Scientific Communication” is very thought provoking and well worth reading. He takes us through disruptive innovation, and how it might impact upon how scientists communicate their knowledge. One solution floated for us to ponder is that “supporting Information, combined with data mining tools, could eliminate most of the need for manuscripts in the first place“. I am going to juxtapose that suggestion on something else I recently discovered. 

Someone encouraged me to take a look at Google Scholar. It is one of those resources that, amongst other features, computes an individual’s h-index and i10-index (the former, having gone through its purple patch, is now apparently at the end of the road, at least for chemists). One reason perhaps why proper curation of research data is not high on most chemists’ list of priorities is that it does not contribute to one’s h-index, and hence to one’s prospects of a successful research career. Thus “supporting information (data)” is one of those things, like styling the citations in a research article, that most people probably prepare through gritted teeth (a rather annoying ritual without which a research article cannot be published). So when I inspected my own Google Scholar profile (you can do the same here) I was rather surprised to find, appended to all the regular research articles, a long list of data citations (sic!). Because I have placed much of my own data into a digital repository, this has opened it up to Google (where don’t they get to nowadays?) for listing (if not actually mining). These citations do not themselves (currently?) contribute to e.g. the h-index, since these entries are not yet attracting citations by others. And that of course is because doing so is not yet an accepted part of the ritual of preparing a scientific article.
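For the record, the h-index that Google Scholar computes has a simple definition: the largest h such that h of one's items have at least h citations each. A minimal sketch (the citation counts in the usage below are invented):

```python
def h_index(citations):
    """Largest h such that h items each have at least h citations."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(counts, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# e.g. five items cited [10, 8, 5, 4, 3] times give h = 4:
# four items have at least 4 citations, but not five with at least 5.
example_h = h_index([10, 8, 5, 4, 3])
```

Were data citations ever to count, one would simply feed the combined list of narrative and data citation counts into the same function.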

Most scientists must now be pondering what the future holds in terms of how they can bring themselves to the attention of others (in a good way) and hence progress their careers. So I will take Rich’s suggestion one step further. Those scientists who create new data in a process called research should firstly curate this data properly (via e.g. a digital repository) and then expect to promote their activity by garnering citations not only for the published narratives (= articles) but also for the associated published data. Their success as a researcher would be (in part) judged by both. Who knows, as well as famous published narratives, perhaps we will also rank famous published datasets!


I do the same for the data I use to support many of the posts for this blog.


Computers 1967-2013: a personal perspective. Part 5. Network bandwidth.

Wednesday, June 5th, 2013

In a time of change, we often do not notice that Δ = ∫δ. Here I am thinking of network bandwidth, and my personal experience of it over a 46 year period.

I first encountered bandwidth in 1967 (although it was not called that then). I was writing Algol code to compute the value of π, using paper tape to send the code to the computer. Unfortunately, the paper tape punch was about 10 km from that computer. The round trip (by van) took about a week, the outcome being often merely to discover that the first line of the code contained a compilation error. I think I got to computing π after about six weeks. That is a bandwidth of about 18 characters (108 bits) in 3628800 seconds, or 0.00003 bits per second.

I did my undergraduate work in 1969, when the distance between the card punch and the computer had reduced to about 50m, and instant turnaround involved circulating in a loop between the punch and the line printer, hoping that neither suffered a paper-wreck. The bandwidth had certainly gone up. On a good day, you could make 20 or so circuits, which did leave one feeling faintly dizzy. 

The next improvement came in 1972, when I was solving non-linear equations for kinetic rate constants using a teletypewriter running at 110 bits per second (baud), or ~18 characters per second on the 6-bit-character computers of that era. This was about 50m from the lab where the kinetic measurements were made (using, if you are interested, a scintillation counter. Yes, I was mildly radioactive for most of my PhD, but I do not believe I glowed in the dark). This bandwidth was in fact fine for uploading kinetic data, and for receiving the computed rate constant and its standard error. You might note however that this teletypewriter was the only one in the building I occupied, and yet demand for it was small (I was pretty much its only user).

The next increment occurred in Texas 1974-1977, where I was now doing quantum chemical calculations. Back in time to the card punch and the line printer (Texas is big, and so now the distance between them was a 10 minute walk). But in my last year there, a state-of-the-art 300 baud teletypewriter was installed! This was now fast enough to play a computer game (something to do with Dragons and Dungeons I think), and so now there was competition to use it. Particularly from one of my friends, who shall be called George, and who on one occasion spent about 48 virtually continuous hours trying to get to the last level. The rest of us returned to the card punch to submit the calculations. It was also during this period that the first emails started to be exchanged, but only really as a curiosity: “it would never catch on” was the opinion of most.

Back in the UK by 1977, I was overwhelmed by the speed of the 9.6 kbaud graphics terminal I now had access to, 32 times faster. And the rate continued to multiply, by a further 1000 to attain 10 Mbaud in 1987. But another change occurred during this period. The previous eras had involved transmitting the data no more than ~200m, from one point in the campus to another. But by 1986, if one tried hard enough, one could reach ARPANET. And that was 5000 km away! My first use of such distances was to reach California and download Apple’s system 5.0 for the Macs in the department (I have described elsewhere the role the Mac’s printer port played in this). From then on, we always did have the latest operating system installed on most of the machines (although not always did this subterfuge address the intended issue, which was to stop the computer crashing as often).

These speeds however did not reach beyond the university. Back home, around 1983, I was back to using a 300 baud modem, with an acoustic coupler to the land line. Our young daughter, aged 3 at the time, joined in the data transmission with gusto. Her joyful shrieks were invariably picked up by the acoustic coupler, and translated into a jumble of characters, which were then interleaved into the numbers coming back from quantum calculations. It was sometimes difficult to tell them apart! These domestic modems gradually got faster, probably attaining 9.6 kbaud by about 1993 (during the course of which the acoustic component was replaced by electronics, and oddly, our daughter stopped shrieking in quite the same way). 

Back in the university in 1993, the first 100 megabits per second (100Mbps ≅100 Mbaud) ethernet lines and switches were being installed, but the national and international backbones were still a lot slower. It was in this year that I was approached to be part of a SuperJanet project. We were going to do a molecular videoconference from London to Cambridge and Leeds; a three-way connection, and this needed ~ 20Mbps to transmit the signal from the video camera as well as the 3D images of molecules in real-time (compression techniques were not so advanced in those days). Because BT was sponsoring the project, they naturally wanted some publicity, and so we even got to appear on the national television news that night. But we came within about 1 minute of a disaster. Our 20Mbps connection went through the SuperJanet national backbone, the capacity of which was, you guessed, ~ 20 Mbps. The network operators (located at the Rutherford-Appleton laboratories), who we had not had the foresight to pre-warn, came within 1 minute of isolating Imperial College from the national network because of our bandwidth hogging. I met them a month or so later, and they told me this. I feel I was lucky to escape with my life and body intact from that meeting (or to put it another way, they were not happy bunnies). 

By about 2000, I had achieved 1 Gbps to my desktop computer (and there it has stayed for the past 13 years). What about home? Well, to cut the story short, I recently benchmarked the domestic WiFi connection between a laptop and “the world” at about 65 Mbps (download) and 18 Mbps (upload): a little less than 1 million times faster than 30 years earlier, and some 12 orders of magnitude faster than in 1967. I gather however that some lucky inhabitants of Austin, Texas (the scene of my 1974-1977 experiments) can, courtesy of Google, get 1 Gbps!

I will end by quoting Samuel Butler, writing in 1863: “I venture to suggest that … the general development of the human race to be well and effectually completed when all men, in all places, without any loss of time, at a low rate of charge, are cognizant through their senses, of all that they desire to be cognizant of in all other places. … This is the grand annihilation of time and place which we are all striving for, and which in one small part we have been permitted to see actually realised” (quoted in George Dyson, “Darwin amongst the Machines: The Evolution of Global Intelligence”, Addison-Wesley, N.Y., 1997, ISBN 0-201-40649-7).


I just benchmarked my office computer (using only solid-state memory and that 1 Gbps connection) and got 58 Mbps (download) / 75 Mbps (upload).

The standard program was NCSA Telnet, if I remember correctly. You made a connection from the computer (using its printer port) to the ARPANET node at University College London (not a widely advertised service), and thence to an Apple FTP site, where one could initiate an anonymous file transfer back to one’s computer. System 5 was about half a Mbyte then, and this took about 1-2 hours to retrieve (unless the connection went down, in which case one started again).

Mobile-friendly solutions for viewing (WordPress) Blogs with embedded 3D molecular coordinates.

Sunday, December 11th, 2011

My very first post on this blog, in 2008, was to describe how Jmol could be used to illustrate chemical themes by adding 3D models to posts. Many of my subsequent efforts have indeed invoked Jmol. I thought I might review progress since then, with a particular focus on using the new generations of mobile device that have subsequently emerged.

  1. Jmol is based on Java, which has been adopted by Google’s Android mobile operating system, but not by Apple’s iOS.
    • An Android version of Jmol was recently released, to rave reviews! I do not know however whether the Jmol on these posts can be viewed via Android. Perhaps someone can post a comment here on that aspect?
    • HP has just announced it will open source WebOS, but it seems Java will not be supported so probably no Jmol there then.
    • Windows 8 Mobile (Metro) seems unlikely to support it either.
  2. Apple has been prominent in touting HTML5 as a Java replacement. In practice, this means that any molecular viewer would be based on a combination of Javascript and WebGL technologies. Whereas Java is a compiled language, Javascript is interpreted on-the-fly by the browser. Its viability has been greatly increased by large improvements in the speed at which browsers now interpret Javascript, although this speed is unlikely ever to match that of Java. The real issue is whether that matters. The other difference is that whereas a signed Java applet allows data to escape from the security sandbox (and into eg a file system), Javascript is likely to be much more restrictive. These two properties mean that Javascript/HTML5 implementations make a lot of use of server-side functionality; in other words, a lot of bytes may have to flow between server and mobile device to achieve a desired effect (and the user may have to pay for these bytes via their data plan).
    • One early adopter of the Javascript/WebGL HTML5 model has been ChemDoodle, which I illustrated on this blog about a year ago. I have tidied up the recipe for invoking it since then, and it is given below for anyone interested in implementing it. As of this moment, one essential component, WebGL, is only available to developers of Apple’s iOS system, but I expect it to become generally available soon. When that happens, the ChemDoodle components on this blog will start working.
    • A new entrant is GLmol, an open-source molecular viewer for Apple’s iOS. A version is also available for Android. I may try embedding this into the blog.
It seems that the 3D molecular viewing options are certainly increasing, but at the moment there is some uncertainty about performance, compatibility and the ability to extract molecular data from the “sandboxes”. This last comment relates to the re-usability of data, which I particularly value.

Although this post has focussed on embedding and rendering molecular data in a blog post, the same principle in fact applies to other expressions. Perhaps the most interesting is the epub3 e-book format, which also supports Javascript/HTML5, and which seems likely to be adopted for future interactive e-books. Indeed, it should be possible to convert an interactive blog created using this technology fully to an e-book with relatively little effort. I have also illustrated here how lecture notes can be so converted.

If you get the impression that the task of a modern communicator of science and chemistry is not merely that of penning well chosen words to describe their topic, but of having to program their effort, then you may not be mistaken.


Procedure for creating a 3D model in a WordPress blog post using ChemDoodle.

  1. As administrator, go to
    wp-content/themes/default

    (or whatever theme you use) and in the file header.php, paste the following

    <link rel="stylesheet" href="../ChemDoodle/ChemDoodleWeb.css" type="text/css">
    <script type="text/javascript" src="../ChemDoodle/ChemDoodleWeb-libs.js"></script>
    <script type="text/javascript" src="../ChemDoodle/ChemDoodleWeb.js"></script>
    <script type="text/javascript">
    // Synchronous GET: fetch a URL (here, an uploaded molfile) and return its text.
    function httpGet(theUrl) {
        var xmlHttp = new XMLHttpRequest();
        xmlHttp.open("GET", theUrl, false);
        xmlHttp.send();
        return xmlHttp.responseText;
    }
    </script>
  2. From here, get the ChemDoodle components and put them into the directory immediately above the WordPress installation. They are there referenced by the path ../ChemDoodle as in the script above. You can put the folder elsewhere if you modify the path in the script accordingly.
  3. Invoke an instance of a molecule thus;
    <script type="text/javascript">// <![CDATA[
    var transformBallAndStick2 = new ChemDoodle.TransformCanvas3D('transformBallAndStick2', 190, 190);transformBallAndStick2.specs.set3DRepresentation('Ball and Stick');transformBallAndStick2.specs.backgroundColor = 'white';var molFile = httpGet('wp-content/uploads/2011/12/85-trans.mol');var molecule = ChemDoodle.readMOL(molFile, 2);transformBallAndStick2.loadMolecule(molecule);
    // ]]></script>
  4. The key requirement is that the body of the script (starting with var) must not contain any line breaks; it must be a single wide line. So that you can see the whole line here, I show it in wrapped form (which you must not use);
    var transformBallAndStick2 = new
    ChemDoodle.TransformCanvas3D(
    'transformBallAndStick2', 190, 190);
    transformBallAndStick2.specs.
    set3DRepresentation('Ball and Stick');
    transformBallAndStick2.specs.
    backgroundColor = 'white';
    var molFile = httpGet(
    'wp-content/uploads/2011/12/85-trans.mol');
    var molecule = ChemDoodle.readMOL(molFile, 2);
    transformBallAndStick2.loadMolecule(molecule);
  5. The key data will be located in the path wp-content/uploads/2011/12/85-trans.mol which you should upload. Note that only the MDL molfile is supported in this mode (which makes no server-side requests). One can use eg CML, but this must be as a server request.
  6. If you want multiple instances, then you must change each occurrence of the name of the variable, e.g. transformBallAndStick2 to be unique for each.
  7. If you want to annotate the resulting display, server-side requests are again needed. I do not illustrate these here, but there is an excellent tutorial.
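For anyone preparing several such embeds, the one-line constraint of step 4 and the unique-name constraint of step 6 are easy to get wrong by hand, so they can be automated. The sketch below is my own illustration, not part of ChemDoodle: the helper buildEmbedScript and the second molfile name (86-cis.mol) are hypothetical, but the string it emits follows the recipe above exactly.

```javascript
// Hypothetical helper (not part of ChemDoodle): builds the single-line
// script body required by step 4, for a given unique instance name
// (step 6), canvas size and uploaded molfile path.
function buildEmbedScript(varName, size, molPath) {
    return "var " + varName + " = new ChemDoodle.TransformCanvas3D('" +
        varName + "', " + size + ", " + size + ");" +
        varName + ".specs.set3DRepresentation('Ball and Stick');" +
        varName + ".specs.backgroundColor = 'white';" +
        "var molFile = httpGet('" + molPath + "');" +
        "var molecule = ChemDoodle.readMOL(molFile, 2);" +
        varName + ".loadMolecule(molecule);";
}

// Two instances on one page need two distinct variable names (step 6);
// the second molfile name here is purely illustrative.
var first  = buildEmbedScript('transformBallAndStick2', 190,
    'wp-content/uploads/2011/12/85-trans.mol');
var second = buildEmbedScript('transformBallAndStick3', 190,
    'wp-content/uploads/2011/12/86-cis.mol');
```

Each returned string contains no line breaks, so it can be pasted directly between the script tags of step 3.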

A Digital chemical repository – is it being used?

Tuesday, May 4th, 2010

In this previous blog post I wrote about one way in which we have enhanced the journal article. Associated with that enhancement, and also sprinkled liberally throughout this blog, are links to a Digital Repository (if you want to read all about it, see DOI: 10.1021/ci7004737). It is a fairly specific repository for chemistry, with about 5000 entries. These are mostly the results of quantum mechanical calculations on molecules (together with a much smaller number of spectra, crystal structures and general document depositions). Today, with some help (thanks Matt!), I decided to take a look at how much use the repository was receiving.

  1. The first entry in the log dates from 2008-02-05.
  2. The repository now receives about 1200 accesses via handle resolutions each day, comprising
  3. ~150 unique client IPs, and
  4. ~900 unique handles accessed daily

Whilst most of the hits come from web spiders by auto-discovery, a fair number (perhaps ~300) of the 5000 entries have also been linked to from journal articles, and of course this blog, and some hits may be presumed to result from non-random ping-backs. A breakdown of a typical day (2010-02-10), when 839 unique handles were accessed, shows access by, amongst others, five universities, Google/Yahoo, several other information corporations and Microsoft. I had no idea Microsoft was interested in calculations on molecules! You saw that here first!!
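The counting behind these daily figures can be sketched as follows. The log format here is hypothetical (one line per handle resolution, holding a timestamp, a client IP and a handle; the real server logs differ in detail), as are the example IPs and handle suffixes, but the tallying of unique clients and unique handles is the same idea.

```javascript
// Sketch of the daily log analysis (hypothetical log format):
// each line is "timestamp clientIP handle".
function dailyStats(logLines) {
    var ips = {}, handles = {}, total = 0;
    logLines.forEach(function (line) {
        var fields = line.trim().split(/\s+/);
        if (fields.length < 3) return;   // skip malformed lines
        total += 1;
        ips[fields[1]] = true;           // unique client IPs
        handles[fields[2]] = true;       // unique handles resolved
    });
    return {
        accesses: total,
        uniqueIPs: Object.keys(ips).length,
        uniqueHandles: Object.keys(handles).length
    };
}

// Three resolutions, from two clients, of two handles (illustrative data):
var stats = dailyStats([
    "2010-02-10T09:00:01 155.198.1.1 10042/example-1",
    "2010-02-10T09:00:05 155.198.1.1 10042/example-2",
    "2010-02-10T10:12:44 128.232.0.2 10042/example-1"
]);
```

Running the same tally over a full day of logs yields the access, unique-IP and unique-handle counts quoted above.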

Other anecdotal feedback regarding the repository: I often use it to exchange calculations with collaborators, sending them the handle instead of a vast checkpoint or log file. Some collaborators, it has to be said, are baffled by the interface presented to them (which was designed in large measure by DSpace, not by us).
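Sending a collaborator a handle works because a handle resolves from anywhere. As a sketch (assuming the repository's prefix is resolvable via the global Handle System proxy at hdl.handle.net, which also exposes a JSON view of the handle record; the helper function is my own, and 10042/26065 is just an example handle on our prefix):

```javascript
// Build resolution URLs for a repository handle such as "10042/26065".
// Assumption: the prefix is registered with the global Handle System proxy.
function handleUrls(handle) {
    return {
        // Human-readable landing page for the deposition
        landing: "https://hdl.handle.net/" + handle,
        // JSON-encoded handle record, suitable for machine processing
        record: "https://hdl.handle.net/api/handles/" + handle
    };
}

var urls = handleUrls("10042/26065");
```

A collaborator given just the handle string can therefore reach either the landing page or, programmatically, the full handle record.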

It is early days in many ways, and being pretty much the only standards-compliant digital repository operating in chemistry in this manner means that awareness is still low. If anyone reading this blog knows of significant others, please comment.