Posts Tagged ‘Amsterdam’

FAIR Data in Amsterdam – FAIR data points.

Wednesday, July 18th, 2018

FAIR is one of those acronyms that spreads rapidly, acquires a life of its own and can mean many things to different groups. A two-day event has just been held in Amsterdam to bring some of those groups from the chemical sciences together to better understand FAIR. Here I note a few items that caught my attention.

  1. Fairsharing.org was the basis for several presentations. It serves as “a curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies.” It promotes establishing metrics which strive to quantify how FAIR any given resource is.[1] Any site which achieves a good FAIR metric can be described as a FAIR data point (a term new to me) and can serve as an exemplar of what FAIR data aspires to.
  2. Intrigued, I offered this page and hope to establish its FAIR metric in the near future, if only to understand how to raise its “score” and so improve future pages. It is based on the following Figure[2] which appeared in a recent article and appears to be a publishing “first” inasmuch as the figure contains hyperlinks directly to the data sources upon which it is based. The putative FAIR data point takes this a little further by wrapping the figure with visualisation tools which take the FAIR data and convert it to interactive models with the help of an added toolbox.
  3. Another topic for discussion was spectroscopy and a venerable file format for its distribution, JCAMP-DX. One emerging theme is the idea of promoting two types of spectral distribution. The first is the use of a common standard format (JCAMP-DX) which strives to eliminate much of the proprietary character associated with data emerging from instruments. At the other extreme is to offer readers the raw instrumental data,[3] which avoids the inevitable loss of information on conversion to a standard format. The downside is that it can almost always be processed only with proprietary software provided by the instrument vendor. One way of avoiding this is Mpublish (the topic of an earlier blog) and we heard interesting updates on progress from MestreLabs, the originators of this procedure. It is still my hope that more vendors (both of instruments and of software) will adopt such a model.
  4. A further topic was metadata, which is at the heart of FAIR: each of its terms (F = findable, A = accessible, I = interoperable and R = re-usable) is defined at least in part by the metadata associated with an item. The state of metadata associated with research data is often dire, and too little resource has often been assigned to its improvement. I presented an example of how richer metadata might be injected. Below is a snippet of the metadata associated with one entry in a data repository (download the metadata here and open the file with a text editor). An advantage of doing this is that rich searches against these terms become enabled.
  5. Finally, I note that events such as Harnessing FAIR data are starting to spring up. This one is at Queen Mary University of London on 3rd September 2018, for which “PhDs and Post Docs from a range of disciplines” are welcomed, they of course being the pre-eminent generators of data and often the ones in charge of making it “FAIR”.
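To make the point in item 4 about rich searches a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Record` class, the `subjects` field, the example DOIs and the subject terms are all invented and do not correspond to any real repository schema or entry.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A hypothetical repository entry; not a real metadata schema."""
    doi: str
    title: str
    # Maps a subject scheme (e.g. "technique") to its value.
    subjects: dict[str, str] = field(default_factory=dict)


# Invented placeholder records, purely for illustration.
records = [
    Record("10.0000/example.1", "Hypothetical NMR data set",
           {"technique": "NMR", "format": "JCAMP-DX"}),
    Record("10.0000/example.2", "Hypothetical crystal structure",
           {"technique": "X-ray diffraction", "format": "CIF"}),
    # A record with no subject terms at all: nothing to search against.
    Record("10.0000/example.3", "Hypothetical sparse record"),
]


def search(entries, **terms):
    """Return the entries whose subjects match every requested term exactly."""
    return [
        entry for entry in entries
        if all(entry.subjects.get(scheme) == value
               for scheme, value in terms.items())
    ]


nmr_hits = search(records, technique="NMR")
cif_hits = search(records, format="CIF")
print([r.doi for r in nmr_hits], [r.doi for r in cif_hits])
```

The third record is the interesting one: with no subject terms attached, no targeted search will ever find it, which is roughly the situation that sparse metadata creates.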

References

  1. M.D. Wilkinson, S. Sansone, E. Schultes, P. Doorn, L.O. Bonino da Silva Santos, and M. Dumontier, "A design framework and exemplar metrics for FAIRness", Scientific Data, vol. 5, 2018. https://doi.org/10.1038/sdata.2018.118
  2. S. Arkhipenko, M.T. Sabatini, A.S. Batsanov, V. Karaluka, T.D. Sheppard, H.S. Rzepa, and A. Whiting, "Mechanistic insights into boron-catalysed direct amidation reactions", Chemical Science, vol. 9, pp. 1058-1072, 2018. https://doi.org/10.1039/c7sc03595k
  3. J.B. McAlpine, S. Chen, A. Kutateladze, J.B. MacMillan, G. Appendino, A. Barison, M.A. Beniddir, M.W. Biavatti, S. Bluml, A. Boufridi, M.S. Butler, R.J. Capon, Y.H. Choi, D. Coppage, P. Crews, M.T. Crimmins, M. Csete, P. Dewapriya, J.M. Egan, M.J. Garson, G. Genta-Jouve, W.H. Gerwick, H. Gross, M.K. Harper, P. Hermanto, J.M. Hook, L. Hunter, D. Jeannerat, N. Ji, T.A. Johnson, D.G.I. Kingston, H. Koshino, H. Lee, G. Lewin, J. Li, R.G. Linington, M. Liu, K.L. McPhail, T.F. Molinski, B.S. Moore, J. Nam, R.P. Neupane, M. Niemitz, J. Nuzillard, N.H. Oberlies, F.M.M. Ocampos, G. Pan, R.J. Quinn, D.S. Reddy, J. Renault, J. Rivera-Chávez, W. Robien, C.M. Saunders, T.J. Schmidt, C. Seger, B. Shen, C. Steinbeck, H. Stuppner, S. Sturm, O. Taglialatela-Scafati, D.J. Tantillo, R. Verpoorte, B. Wang, C.M. Williams, P.G. Williams, J. Wist, J. Yue, C. Zhang, Z. Xu, C. Simmler, D.C. Lankin, J. Bisson, and G.F. Pauli, "The value of universally available raw NMR data for transparency, reproducibility, and integrity in natural product research", Natural Product Reports, vol. 36, pp. 35-107, 2019. https://doi.org/10.1039/c7np00064b

The Amsterdam Manifesto and crystal structures.

Tuesday, March 18th, 2014

I have mentioned the Amsterdam manifesto before on these pages. It is worth repeating the eight simple principles:

  1. Data should be considered citable products of research.
  2. Such data should be held in persistent public repositories.
  3. If a publication is based on data not included with the article, those data should be cited in the publication.
  4. A data citation in a publication should resemble a bibliographic citation and be located in the publication’s reference list.
  5. Such a data citation should include a unique persistent identifier (a DataCite DOI recommended, or other persistent identifiers already in use within the community).
  6. The identifier should resolve to a page that either provides direct access to the data or information concerning its accessibility. Ideally, that landing page should be machine-actionable to promote interoperability of the data.
  7. If the data are available in different versions, the identifier should provide a method to access the previous or related versions.
  8. Data citation should facilitate attribution of credit to all contributors.

I just gave a talk at the ACS meeting in Dallas which touched upon the need to emancipate data according to these principles. My talk, in case you are interested, focused particularly upon item 6 above.[1]

Just after my talk I heard that crystallographic data was about to be emancipated (my phrase) and so I was interested to find out what this might mean, and how many of the above principles were being adhered to. Indeed, it is an interesting test to apply to any chemistry data that you might find out there. Thus 10.5517/cc10ftfp[2] is the DOI of a recently published crystal structure data set. This adheres to points 1-3 and 5 above, and probably also 8. As I have already noted, 6 is the interesting one! So let’s go to the landing page and see what we find.

[Figure: the landing page for DOI 10.5517/cc10ftfp]

Firstly, note that you do not need any sort of access code to get to this page; it is open to all. But it is after all just a landing page, not actual data. Next, click on the Download button, and you are asked to identify yourself by providing a name, email address and affiliation as mandatory fields, as well as agreeing to conditions of use. I reproduce these conditions here:

“Individual CIF data sets are provided freely by the CCDC on the understanding that they are used for bona fide research purposes only. They may contain copyright material of the CCDC or of third parties, and may not be copied or further disseminated in any form, whether machine-readable or not, except for the purpose of generating routine backup copies on your local computer system.”

As with most such conditions, it is what one cannot do that is most interesting.

  1. Teach, as for example incorporating the data into lecture notes
  2. Make a copy, e.g. to place into this blog (is this for research purposes?)
  3. Do bona fide research purposes in fact allow a copy to be made, or does the second sentence override the first in this regard, since it lists the exceptions and copying for research is not among them?
  4. Judging from the landing page, it is pretty much impossible for any machine action to take place (item 6 in the Amsterdam manifesto). Even though the data is machine actionable, the landing page pretty much prevents this from happening.
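As a rough sketch of what machine action on a DOI can look like, the Python below builds a content-negotiation request, which asks the DOI resolver for metadata instead of the human-readable landing page. The media type is the one DataCite documents for its JSON metadata; the function names are my own, and I am assuming rather than asserting that this particular DOI answers such a request. Nothing here implies that the metadata returned would include a direct link to the CIF file itself.

```python
from urllib.request import Request, urlopen

# Media type documented by DataCite for JSON metadata via content negotiation.
DATACITE_JSON = "application/vnd.datacite.datacite+json"


def metadata_request(doi: str, media_type: str = DATACITE_JSON) -> Request:
    """Build (but do not send) a request asking the resolver for metadata."""
    return Request(f"https://doi.org/{doi}", headers={"Accept": media_type})


def fetch_metadata(doi: str) -> bytes:
    """Send the request. Needs network access, so it is not called here."""
    with urlopen(metadata_request(doi), timeout=30) as response:
        return response.read()


req = metadata_request("10.5517/cc10ftfp")
print(req.full_url, req.get_header("Accept"))
```

Calling `fetch_metadata("10.5517/cc10ftfp")` on a networked machine would show what, if anything, a program can learn about the data set without filling in a form.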

What did cause my eyebrows to shoot up was that I have to reveal my full identity and affiliation (which appears not to be actually checked) in order to get the data. Think about this. Do journals ask for this information when you download an article from them? (OK, they probably know your affiliation). Which scientist is reading which article (or viewing which data) could be construed as sensitive information after all. So why in order to acquire crystal data do you have to provide personal information? Surely, looking at data should be a private process if one wants it to be?

The release of crystal data in this manner, with a decent partial adherence to the Amsterdam Manifesto, is an excellent start; this data after all is well curated and of high value. But I must call upon CCDC to rethink that landing page, the conditions of use and the mandatory gathering of personal information. Not quite there yet!

References

  1. "Digital data repositories in chemistry and their integration with journals and electronic laboratory notebooks", 2014. http://doi.org/10042/a3uza
  2. M. Sowa, K. Ślepokura, and E. Matczak-Jon, "CCDC 936802: Experimental Crystal Structure Determination", 2014. https://doi.org/10.5517/cc10ftfp

The Amsterdam Manifesto on Data Citation Principles

Wednesday, July 31st, 2013

The Amsterdam manifesto espouses the principles of citable open data. It is a short document, and it is worth re-stating its eight points here:

  1. Data should be considered citable products of research.
  2. Such data should be held in persistent public repositories.
  3. If a publication is based on data not included with the article, those data should be cited in the publication.
  4. A data citation in a publication should resemble a bibliographic citation and be located in the publication’s reference list.
  5. Such a data citation should include a unique persistent identifier (a DataCite DOI recommended, or other persistent identifiers already in use within the community).
  6. The identifier should resolve to a page that either provides direct access to the data or information concerning its accessibility. Ideally, that landing page should be machine-actionable to promote interoperability of the data.
  7. If the data are available in different versions, the identifier should provide a method to access the previous or related versions.
  8. Data citation should facilitate attribution of credit to all contributors.

The manifesto itself is dated 20 March 2013, but the principles above go back much earlier, and most of the points above have been implemented in this blog for a little while now. But it’s best illustrated with an example[1]. Here I have used the excellent WordPress extension Kcite to adhere to points 1-5 and 8 above. Point 6 is perhaps the most interesting, and it is illustrated below. If you click on the graphic, it will load the log file associated with the calculation described in the previous post and convert the data to a rotatable 3D model using Jmol.

[Interactive figure: click to load the 3D model from Figshare]

This actually represents a departure from how I normally invoke data on this blog. I have hitherto done it by uploading the data to the WordPress uploads directory (it is thus in effect a local copy of the data). Here in this post I have not done that; it’s coming directly from the citable data repository, using e.g.

onclick="jmolInitialize('../Jmol/',true);jmolSetAppletColor('white');jmolApplet([500,500],'load http://files.figshare.com/1134372/logfile.log;frame 27;vectors on;vectors 4;vectors scale 5.0; color vectors orange; vibration 10;animation mode loop;');"

Well, almost. The URL of the actual dataset (http://files.figshare.com/1134372/logfile.log) is derived from the DOI (10.6084/m9.figshare.757728) by an internal process which has no exposed algorithm. As it happens, the DSpace digital repository[2] is a bit better in this regard.

[Interactive figure: click to load the 3D model from DSpace]

This is loaded using

https://spectradspace.lib.imperial.ac.uk:8443/dspace/bitstream/handle/10042/24916/logfile.log;

which is directly derived from the handle itself (although again that algorithm has to be worked out by a human).
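The pattern a human has to work out can be written down as a short Python sketch. The base URL and path layout are simply read off the one example above; the function name is mine, and I am assuming (not asserting) that other handles and file names on this repository follow the same layout.

```python
from urllib.parse import quote

# Base URL taken from the single example in this post.
DSPACE_BASE = "https://spectradspace.lib.imperial.ac.uk:8443/dspace"


def bitstream_url(handle: str, filename: str, base: str = DSPACE_BASE) -> str:
    """Guess the file URL for a handle such as '10042/24916'.

    The layout base/bitstream/handle/<prefix>/<suffix>/<filename> is
    inferred from one observed URL and is an assumption, not a documented rule.
    """
    prefix, _, suffix = handle.partition("/")
    if not prefix or not suffix:
        raise ValueError(f"expected a handle of the form prefix/suffix, got {handle!r}")
    return f"{base}/bitstream/handle/{quote(prefix)}/{quote(suffix)}/{quote(filename)}"


url = bitstream_url("10042/24916", "logfile.log")
print(url)
```

Note that the file name still has to be known in advance; the handle alone does not reveal it, which is part of why the landing page is not yet fully machine-actionable.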

Point 7 above is also implemented (sort of). If you look at the Figshare repository, you will notice an additional link there which points to the DSpace repository. As it happens, the two datasets are identical (they are not different versions) and these semantics are currently NOT handled well. You could probably work this out from the date stamps of the two depositions (Published on 29 Jul 2013 – 05:58 (GMT) for Figshare and 2013-07-29T05:58:11Z for DSpace).
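Comparing the two date stamps by machine is slightly awkward because the repositories format them differently. A minimal Python sketch, assuming only the two formats quoted above (the function names are my own, and matching to the minute is my choice because the Figshare stamp carries no seconds):

```python
from datetime import datetime, timezone


def parse_figshare(stamp: str) -> datetime:
    """Parse a stamp like 'Published on 29 Jul 2013 \u2013 05:58 (GMT)'."""
    cleaned = (stamp.replace("Published on", " ")
                    .replace("\u2013", " ")   # the en dash between date and time
                    .replace("(GMT)", " "))
    cleaned = " ".join(cleaned.split())       # -> '29 Jul 2013 05:58'
    # %b expects English month abbreviations (the default C locale).
    return datetime.strptime(cleaned, "%d %b %Y %H:%M").replace(tzinfo=timezone.utc)


def parse_dspace(stamp: str) -> datetime:
    """Parse an ISO 8601 UTC stamp like '2013-07-29T05:58:11Z'."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def same_minute(a: datetime, b: datetime) -> bool:
    """Compare two times after discarding seconds and microseconds."""
    return a.replace(second=0, microsecond=0) == b.replace(second=0, microsecond=0)


figshare_time = parse_figshare("Published on 29 Jul 2013 \u2013 05:58 (GMT)")
dspace_time = parse_dspace("2013-07-29T05:58:11Z")
print(same_minute(figshare_time, dspace_time))
```

Matching date stamps are of course only a hint that the two depositions are the same dataset; they say nothing formal about the relationship between them.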

Whereas the Amsterdam manifesto is here implemented on a blog post, I think the grander aspiration is that the principles are to be followed in ALL scientific publications. We are some way away yet from achieving this. But watch this space for an upcoming example! 

So can I urge all scientists who care about data to promulgate the principles of the Amsterdam manifesto, and wherever possible to practice what they preach! 

References

  1. H.S. Rzepa, "Gaussian Job Archive for ClF3", 2013. https://doi.org/10.6084/m9.figshare.757728
  2. "Cl 1 F 3", 2013. http://hdl.handle.net/10042/24916