programmer « Henry Rzepa's blog

Posts Tagged ‘programmer’

OpenCon (2016)

Friday, November 25th, 2016

Another conference, a Cambridge satellite meeting of OpenCon, and I quote here its mission: “OpenCon is a platform for the next generation to learn about Open Access, Open Education, and Open Data, develop critical skills, and catalyze action toward a more open system of research and education” targeted at students and early career academic professionals. But they do allow a few “late career” professionals to attend as well!

I could only attend the morning session, for which the keynote speaker was Erin McKiernan. The presentation was entitled How open science helps researchers succeed, presented as an exploration of an article written by Erin and colleagues with the same name and published in eLife.[1] Erin has created a support page at http://whyopenresearch.org to augment the presentation and it’s well worth a visit.

One striking point made was the assertion that Open publications get more citations!

As with many metrics of the impacts of the science publication processes, a citation itself lacks the context of why it was made (see this post for further discussion), but the expectation is that a citation is “good”. From my perspective as a chemist, I did wonder why molecular science was missing from the graphic above. Do open chemistry publications also get more citations?

Which brings me to another point made during the talk, the increasingly controversial aspect of (journal) impact factors and the pressure placed on early career researchers to publish only in those with “high” impact factors, and for their careers to be assessed at least in part based on these and the anticipated “h-index”. The audience was indeed encouraged to go visit http://www.ascb.org/Dora/ (Declaration on Research Assessment, or Putting science into the assessment of research). Have you signed it yet?

Another manifestation of the modern trend to analyse impact metrics is the site Impactstory.org. This is a scripted resource that starts from your ORCID identifier and (optionally) your Twitter account (yes, apparently Tweets matter!) to derive a more complex alternative metric of an individual’s impacts. I had not tried this one before and so I submitted my ORCID and my Twitter account, and watched as the system went off to http://orcid.scopusfeedback.com (Scopus is an Elsevier product) to attempt to create my profile. It ground for quite a while, reporting initially that I had no publications! This was followed by an unexpected error; I did not get my impact back! But this experiment served to highlight one aspect that was discussed at the meeting: data and other research objects. The graphic above refers only to the citation of journal articles; it does not yet include the citation of data. However, ORCID DOES include data and research objects as works. And because the granularity of my data and research objects is very fine (one molecule = one work), I have quite a few. In fact ~200,000! ORCID gets to about 8000 before it gives up. I suspect http://orcid.scopusfeedback.com queries ORCID, gets back ~8000 entries and crashes. No doubt the programmer tasked with implementing this resource did not anticipate that any individual could accumulate 8000+ entries! Nor, probably, did they factor in that the vast majority of these would of course not be journal articles but data. If the site gets back to me about the crash I experienced, I will update here.
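To make the point about unanticipated record sizes concrete, here is a minimal sketch of how a consumer of such a works list might count entries by type without assuming the list is short or complete. It is purely illustrative and is not the code behind Impactstory or the Scopus service, which is not public here; the function name `summarise_works`, the `type` key and the labels `data-set` and `journal-article` are assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, Mapping, Optional


def summarise_works(works: Iterable[Mapping[str, str]],
                    expected_total: Optional[int] = None) -> dict:
    """Count works by type without assuming the list is small or complete."""
    counts = Counter()
    seen = 0
    for work in works:  # stream through the entries one at a time
        counts[work.get("type", "unknown")] += 1
        seen += 1
    # If the source claims more entries than were delivered, flag it
    # instead of silently treating the partial list as the whole record.
    truncated = expected_total is not None and seen < expected_total
    return {"seen": seen, "by_type": dict(counts), "truncated": truncated}


# Toy data standing in for a fine-grained record: mostly data, few articles.
sample = [{"type": "data-set"}] * 7990 + [{"type": "journal-article"}] * 10
summary = summarise_works(sample, expected_total=200_000)
print(summary["seen"], summary["by_type"]["journal-article"], summary["truncated"])
```

The design choice worth noting is the explicit `truncated` flag: a profile built from 8000 of ~200,000 works can then be reported as partial rather than failing outright or pretending to be complete.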

Simon Deakin was the next speaker, with (open) data as the focus, along with the worries many researchers have about being scooped by others who re-use their open data without proper attribution. The discussion teased out that if data is properly deposited, it will indeed have full associated metadata and in particular a date stamp that could help protect an author’s interests.

It was really good to meet so many early career researchers who espouse the open ethos. Perhaps, in 20 years’ time, another graphic akin to the one above might demonstrate that open researchers get more promotions!

References

E.C. McKiernan, P.E. Bourne, C.T. Brown, S. Buck, A. Kenall, J. Lin, D. McDougall, B.A. Nosek, K. Ram, C.K. Soderberg, J.R. Spies, K. Thaney, A. Updegrove, K.H. Woo, and T. Yarkoni, "How open science helps researchers succeed", eLife, vol. 5, 2016. https://doi.org/10.7554/elife.16800


Metametadata: data about data about (chemical) data.

Saturday, April 16th, 2016

Scientists are familiar with the term data, at least in a scientific or chemical context, but appreciating metadata (the prefix meta meaning "after" or "beyond") is slightly more subtle, in the sense of using it to mean data about data. The challenge lies in clarifying where the boundary between data and its metadata lies and in specifying and controlling the vocabulary used for these metadata descriptions. Items in a chemical metadata dictionary might include e.g. subject classifications such as Organic Molecular Chemistry or identifiers such as InChIKey. But what could metametadata be? Here I briefly show some examples by way of illustration.

Let me start by defining a data repository as a store of both data and the metadata describing it. The metadata is to be exposed in a standard manner which allows it to be aggregated by other agencies. Nowadays, it is becoming common to identify such a data object together with its metadata using a persistent identifier, or DOI. But to decide if any particular repository and the data objects contained therein are generally useful to you, you need information about the metadata itself. Technically, this is defined using a schema[1] describing the metadata (which might e.g. identify any dictionaries used); hence metametadata. Now you need to store the metametadata and so I introduce the concept of a registry which does this. This metametadata object is itself assigned a DOI^‡ and here I list these DOIs for a personal selection of some chemically oriented examples, in this case deriving from the largest registry of research data repositories, re3data.org. You can search for your own entry at their site: http://service.re3data.org/search.

Data repository	Repository metametadata DOI^♣
Figshare	10.17616/R3PK5R[2]
Zenodo	10.17616/R3QP53[3]
Cambridge Structural Database	10.17616/R36011[4]
Crystallography Open Database	10.17616/R37S31[5]
Oxford University Research Archive	10.17616/R3Q056[6]
Open Notebook Science	10.17616/R3859D[7]
UsefulChem	10.17616/R3Z89N[8]
Chemotion	10.17616/R34P5T[9]
ChemSpider	10.17616/R38P4P[10]
Chemical Database Service	10.17616/R36P42[11]
Imperial College HPC data repository	r3d100011965[12],[13]
Imperial College SPECTRa repository[14]	10.17616/R30316[15]
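The identifiers in the table can be turned into resolvable links mechanically. The following is a minimal sketch, assuming a deliberately simple DOI pattern (prefix `10.` followed by a registrant code, a slash and a suffix) and relying on DOIs being case-insensitive; the function name `doi_to_url` is hypothetical. An entry such as `r3d100011965`, which is a registry identifier and not a DOI, is returned as `None` instead of being guessed at.

```python
import re
from typing import Optional

# Simplified pattern for illustration; real-world DOI suffixes can be messier.
DOI_PATTERN = re.compile(r"^10\.[0-9]{4,9}/\S+$")


def doi_to_url(identifier: str) -> Optional[str]:
    """Return the https://doi.org/ link for a DOI, or None if it is not one."""
    cleaned = identifier.strip()
    if not DOI_PATTERN.match(cleaned):
        return None
    # DOIs are case-insensitive, so lower-casing gives one canonical form.
    return "https://doi.org/" + cleaned.lower()


print(doi_to_url("10.17616/R3PK5R"))
print(doi_to_url("r3d100011965"))
```

Lower-casing reproduces the form in which the same DOIs appear in the reference list at the foot of this post.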

Not all of the repositories listed in the table above assign formal DOIs to their data collections, meaning that the metadata for their entries cannot be aggregated in a searchable manner using e.g. search.datacite.org/ui (or search.datacite.org/api for the machine version). Currently, the metametadata does not fully carry this information, an aspect which I gather will be rectified in a future revision of the re3data schema.[1]

Importantly, both metadata and (repository) metametadata can be searched using APIs (application programmer interfaces), ensuring that the entire flow of meta information can be subject to automated software analysis rather than just visual inspection by a human. This should allow a rich and open infrastructure for handling research objects or data to be built up using hierarchical metadata. The examples above indeed show that the chemical space is already the largest component of the Natural Sciences space.
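As a hedged illustration of what "searched using APIs" can look like in practice, the sketch below builds a metadata query URL and pulls the DOIs out of a response. It assumes the DataCite REST endpoint `https://api.datacite.org/dois` with `query` and `page[size]` parameters and a JSON response whose `data` items carry `attributes.doi`; this is not the `search.datacite.org/api` address quoted above, and the helper names are hypothetical. No network call is made: the response is a hand-written sample so that the parsing step can be checked offline.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint; check the current DataCite documentation before relying on it.
DATACITE_DOIS = "https://api.datacite.org/dois"


def build_search_url(query: str, page_size: int = 5) -> str:
    """Compose a search URL for DOI metadata matching a free-text query."""
    return DATACITE_DOIS + "?" + urlencode({"query": query, "page[size]": page_size})


def extract_dois(response_text: str) -> list:
    """Pull the DOI strings out of a JSON response of the assumed shape."""
    payload = json.loads(response_text)
    return [item["attributes"]["doi"] for item in payload.get("data", [])]


# Hand-written stand-in for a server response (not real query output).
sample_response = json.dumps({"data": [{"attributes": {"doi": "10.14469/hpc/382"}}]})

url = build_search_url("metadata schema")
dois = extract_dois(sample_response)
print(url)
print(dois)
```

Keeping URL construction and response parsing as separate functions means either half can be swapped if the service or its response format changes.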

Although the edifice is still largely in its infancy, already I think we can start to see an alternative open approach emerging to "Googling" for data, or the even older traditional bespoke (i.e. non-open) services offered by commercial human-based abstractors of chemical metadata.

^‡This DOI is information about the metametadata, and hence it is metametametadata, or m3data. Sorry!
^♣The citations at the foot of this post are generated entirely automatically (by a WordPress plugin called Kcite) from the m3data associated with each entry, i.e. the DOI listed. Were the persistent identifier for the entry ever to be changed, this would propagate automatically to the citation, unlike the static entries in the table.

References

J. Rücknagel, P. Vierkant, R. Ulrich, G. Kloska, E. Schnepf, D. Fichtmüller, E. Reuter, A. Semrau, M. Kindling, H. Pampel, M. Witt, F. Fritze, S. Van De Sandt, J. Klump, H. Goebelbecker, M. Skarupianski, R. Bertelmann, P. Schirmbacher, F. Scholze, C. Kramer, C. Fuchs, S. Spier, and A. Kirchhoff, "Metadata Schema for the Description of Research Data Repositories", 2015. https://doi.org/10.2312/re3.008
Re3data.Org., "figshare", 2012. https://doi.org/10.17616/r3pk5r
Re3data.Org., "Zenodo", 2013. https://doi.org/10.17616/r3qp53
Re3data.Org., "The Cambridge Structural Database", 2013. https://doi.org/10.17616/r36011
Re3data.Org., "Crystallography Open Database", 2013. https://doi.org/10.17616/r37s31
Re3data.Org., "Oxford University Research Archive", 2014. https://doi.org/10.17616/r3q056
Re3data.Org., "ONSchallenge", 2013. https://doi.org/10.17616/r3859d
Re3data.Org., "UsefulChem", 2014. https://doi.org/10.17616/r3z89n
Re3data.Org., "chemotion", 2013. https://doi.org/10.17616/r34p5t
Re3data.Org., "ChemSpider", 2013. https://doi.org/10.17616/r38p4p
Re3data.Org., "Chemical Database Service", 2012. https://doi.org/10.17616/r36p42
https://doi.org/
H. Rzepa, "Imperial College High Performance Computing Service Data Repository Metadata Schema", 2016. https://doi.org/10.14469/hpc/382
J. Downing, P. Murray-Rust, A.P. Tonge, P. Morgan, H.S. Rzepa, F. Cotterill, N. Day, and M.J. Harvey, "SPECTRa: The Deposition and Validation of Primary Chemistry Research Data in Digital Repositories", Journal of Chemical Information and Modeling, vol. 48, pp. 1571-1581, 2008. https://doi.org/10.1021/ci7004737
Re3data.Org., "SPECTRa Project", 2013. https://doi.org/10.17616/r30316


Disambiguation/provenance of claimed scientific opinion and research.

Monday, May 5th, 2014

My name is displayed pretty prominently on this blog, but it is not always easy to find out who the real person is behind many a blog. In science, I am troubled by such anonymity. Well, a new era is about to hit us. When you come across an Internet resource, or an opinion/review of some scientific topic, I argue here that you should immediately ask: “what is its provenance?”

In the 350-year history of scientific dissemination[1], provenance has almost always been provided by publishers. Arguably, that was their most important role (along with arranging anonymous peer review). Not that they ever met with their authors or always established that a real person or a real group actually existed! But with the explosion of vanity publication and a host of horror stories about articles for sale to authors keen to have a publication to their name, perhaps the role of provenance needs rethinking.

ORCiD is a project that seems to be gaining serious momentum in achieving a mechanism for disambiguation and provenance of researchers. Thus Brian Kelly (who has played an important role in the modern internet in the UK since 1993 or earlier) encourages all researchers to sign up (although I cannot help noting, rather cheekily, that he does not add his own ORCiD as provenance for his blog). ResearcherID was in fact an earlier organisation to offer such a service, but it is run by a commercial publisher and it is hosted at a “.com”. ORCiD at least claims to be an open (.org)anisation, and carries an open source license. It seems that some UK universities (home to some researchers) have decided to sign up to ORCiD and most, I suspect, are planning to deploy these resources amongst their researchers, and quite possibly their students as well (postgraduate initially, maybe even undergraduate eventually).

I jumped the gun somewhat, getting mine more than a year ago. Better the devil you know, etc etc! It is orcid.org/0000-0002-8635-8390. What happened next? Well, I publish data@Figshare, who themselves signed up to be an early member of ORCiD. This gives them access to the API (application programmer interface), and so by supplying my ORCiD to Figshare, I can gain access by proxy to the ORCiD features on offer. The most immediate impact is that ORCiD lists all the data-objects I have published at Figshare, thus establishing a trust between them and my ORCiD identity. Mind you, no-one at ORCiD has ever met me, or checked on who I am. I think that task is going to be delegated eventually to e.g. my university (I am not absolutely certain how the linkage between my ORCiD and my employer, who clearly know me since they pay my salary, will be formalised). Because my employer has also now become an ORCiD member, we will be adding ORCiD API access to our own SPECTRa-DSpace data repository shortly, so that the data held there will also be added to my ORCiD lists.
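An ORCiD iD carries a small amount of built-in protection against mistyping: its final character is a check digit computed from the preceding fifteen digits using the ISO 7064 MOD 11-2 scheme. The sketch below checks only that internal consistency; it says nothing about who is behind the identifier, which is exactly the provenance gap discussed above. The function names are hypothetical.

```python
import re


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the format and check digit of an iD such as 0000-0002-8635-8390."""
    compact = orcid.replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", compact):
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


print(is_valid_orcid("0000-0002-8635-8390"))
print(is_valid_orcid("0000-0002-8635-8391"))
```

A check of this kind catches a mistyped digit before any API call is made, but a well-formed iD can still belong to someone else entirely.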

And as the major journal publishers start to do the same, a formal linkage between my identity (perhaps as verified by my employer), journal-published articles (narratives) and my data publications (via the identifiers known as DOIs) will come into being.

How, you might reasonably ask, is this in the least useful? In truth, I am not sure anyone really knows exactly where this is heading. For example, impactstory.org/about is one added value site which attempts to gather altmetrics about the impact your research is having. But hey, although the preceding link tells you who founded this organisation, you do not get the kind of provenance I am describing above; none of the founders cite their ORCiDs! You do get their @Twitter accounts though; I wonder what that tells us about the modern interpretation of provenance? Well, my impact can be seen here; in truth it’s not quite the impact I imagined my scientific career was having, but I suppose this is early days. What I am pleased to tell you is that ImpactStory does tell you not only about the impact that the articles I have published have had, but also that of the data. Two data sets are described as both discussed and highly viewed. Although as usual, you do not get to learn why the data is being discussed!

Where next? Well, to go back to the start of this post; blogs. It would be nice to formally link this blog to my ORCiD ID (this is not done simply by quoting it here, but via the ORCiD API). If/when I work out how to do this, I will no doubt post the event!

References

H. Oldenburg, "Epistle dedicatory", Philosophical Transactions of the Royal Society of London, vol. 1, pp. i-ii, 1665. https://doi.org/10.1098/rstl.1665.0001


Computers 1967-2011: a personal perspective. Part 1. 1967-1985.

Thursday, July 7th, 2011

Computers and I go back a while (44 years to be precise), and it struck me (with some horror) that I have been around them for ~62% of the modern computing era (Babbage notwithstanding, ~1940 is normally taken as the start of the modern computing era). So indulge me whilst I record this perspective from the viewpoint of the computers I have used over this 62% of the computing era.

1967: I encountered (but that term has to be qualified) my first computer, suggested to me as an alternative to running quarter marathons on Wimbledon Common at school by an obviously enlightened teacher! I wrote a program (in Algol) on paper tape, put the tape in an envelope, and sent it off to Imperial College (by van) to run, on an IBM 7094. A week later, printed output showed you had made a mistake on line 1 of the program. As I recollect, after about eight weeks of this, I got the program to run (and calculated π to 5 decimal places).
1970: By now I was a student (again at Imperial College), and was introduced to Fortran, then a radical new innovation to a chemistry degree. The delightfully named pufft compiler combined with the 7094 again, but this time with punched Hollerith cards as input and line printer output. I cannot remember what we were asked to program. I do remember that the punched cards were produced by a pool of punch card operators, working from code pages written by the programmer. Some students (not me!) thought it great fun to give their Fortran variables naughty names (which the punch card operators then refused to punch, thus causing the student to fail the course!).
1971: I really liked this programming lark, so when instant-turnaround was introduced that year, I decided to do a proper program. It was called NLADAD (yes, I was no good at names, even then), which stood for non-linear-analysis of donor-acceptor complexes. The idea was to take recorded NMR chemical shifts, and fit them to an equilibrium A+B ⇔ AB+B ⇔ AB₂ using non-linear regression analysis. It must have been all of 200 lines of code (OK, I did not write the matrix inversion routine myself)! Instant turnaround was also great: you got to punch your own cards this time, and had the great excitement of feeding them into a card reader yourself. You then walked about 5 yards to the line printer and waited agog. No waiting one week; this was less than a minute. Or it would have been if the line printer did not paper-wreck every two minutes! (I might add that I have a dim recollection of a member of the computer centre staff standing by to recover these paper wrecks. He, by the way, is now the director of the ICT division here!).
1972: I am now doing a PhD (yes, boringly, yet again at Imperial College). I had found the one and only teletypewriter in the chemistry department. The crystallographers had secreted it away in their empire, but were very dismayed to find me occupying it constantly. Instant was now even more instant. I was now connecting to a time-sharing CDC 6400 computer, at the dazzling speed of 110 baud (roughly 110 bits, or about ten characters, per second). The bytes were small ones by the way, since the CDC used 6 bits per byte. The result was that one did everything in UPPER CASE, since a 6-bit byte only allows 64 characters! My (still Fortran) programs reached probably 1000 lines of code now, and I was engrossed in deriving non-linear analyses of steady state chemical kinetics (about four different kinds of rate equation as I recollect). Ah, the joys of covariance analysis, and propagation of errors (I was in a kinetics lab, and all the other students plotted graphs on graph paper, and if pressed, plotted gradients of graphs, the so-called Guggenheim plots. I thought this the dark ages, but no-one volunteered to join me in this single teletypewriter room. Not even the attractive girls in the group. I was the geek of my time, no doubt about that. My kinetic analysis did however have one upside. It’s how I met my wife-to-be a few years later!).
1974: PhD completed, I was now ready to go to Texas, where everything is bigger (and in terms of computers, slightly better, a CDC 6600 now and a 300 baud teletypewriter!). I had been computing now for seven years, and finally I actually got to SEE the device for the very first time. My mentor, Michael Dewar, had a sort of special relationship with the university. His students (and possibly only his students) were allowed to go into the depths of the machine room, where behind plate glass you could see the CDC 6600. I soon learnt how to get even closer. It was not particularly exciting however. I was more entranced with the CALCOMP flatbed plotter, which was located next to the 6600. Pictures at last (you probably do not want to know that to convert my kinetics from 1972 above to pictures, I got quite expert in using a French curve. Look it up before you jump to conclusions). Part of the pact I negotiated was that I was only allowed into the inner sanctum at 03:00 in the morning (sic!). Still a geek then! Oddly, I was one of the few students in Dewar’s group using the CALCOMP, but at least we now had pictures of the molecules I was now calculating (using MINDO/3). To put the computing power into context, in 1975, Paul Weiner, another group member, announced that he had completed a full geometry optimisation of LSD, this having taken about 4 days to do on that over-worked 6600. The entire group went out to celebrate. Many pitchers of beer were drunk that night.
Computer graphics from 1976.
1977: Back to Imperial, where we might have also now had a CDC 6600. And a Tektronix terminal running at the dizzying (hardwired end-to-end) speed of 9600 baud. I learnt to word process on this device (using a word processor, written in Fortran, although not by me) and I wrote three review articles by this means, using a fancy phototypesetter as the printer. My next program, STEK, probably ran to about 5000 lines of code, and it persuaded the Tektronix to plot all sorts of things: ball-and-stick diagrams, isometric potential surfaces, molecular orbitals, and the like (and jumping ahead, my experience with this program eventually led to CML, and Peter Murray-Rust, but that is indeed jumping ahead). I think I also managed to gain access to the Imperial machine room, that inner sanctum, yet again. But for reasons I will not go into, it was not as interesting as the Texan machine room.
Chemistry Computer graphics, circa 1977-85.
1979: I encountered a Cray 1 computer, and probably also 8-bit bytes (and yes, lower case printer outputs) for the first time at the University of London Computing Centre.
1980: Remember that teletypewriter, encountered earlier? Well, these were now running at 2400 baud and I started to organise the deployment of a chemistry department computer network to sprinkle several such terminals around the department. The controller was a PAD, and in that year, we introduced STN ONLINE using this network. It was the first time we could search CAS online ourselves (previously, it was a service offered by the library). Literature searching has not been the same since.
1980: I finally again encountered a real computer, which one could happily listen to without creeping into machine rooms in the middle of the night. It was the data system on a Bruker Spectrospin 250 MHz superconducting NMR spectrometer. I had many adventures on this system. It was installed, by the way, on more or less the same day as the birth of my first daughter Joana. It had a hard drive (5 Mbytes as I recollect, and cost an absolute fortune, around £10,000 if I remember correctly).
Combining Quantum mechanics and NMR.

Computer graphics 1982, from NMR spectrometer.
1982: More networks, this time a curious computer known as the Corvus Concept, using a networked hard drive (possibly as big as 20 Mbytes by now), and a large screen.
1985: Enter the Mac (OK, the IBM PC came a little earlier, but it was not entrancing). Now one really had a tactile computer that made noises (not always nice), produced smoke signals occasionally, and ejected its floppy disk incessantly. Yet another revolution to cope with. As I type this, I look down on that Mac, which is still underneath my desk. Wonder if it’s worth anything on eBay?

Well, a second consecutive blog, with (almost) no pictures or molecules. And I have only gotten to the half way stage of my story. Better break off then.

