Posts Tagged ‘Singular spectrum analysis’

Data-free research data management? Not an oxymoron.

Tuesday, May 24th, 2016

I occasionally post about "RDM" (research data management), an activity that has recently become a formalised, essential part of the research process. I say recently formalised, since researchers have of course kept notebooks recording their activities and their data since the dawn of science, but not always in an open and transparent manner. The desirability of doing so was revealed by the 2009 "Climategate" events. In the UK, Climategate was apparently the catalyst which persuaded the funding councils (such as the EPSRC, the Royal Society, etc.) to formulate policies requiring all their funded researchers to adopt the principles of RDM by May 2015 and in their future research. An early career researcher here, anxious to conform to the funding body instructions, sent me an email a few days ago asking about one aspect of RDM, which got me thinking.

The question related to the divide between data as a separate research object (which therefore has to be managed) and data as an inseparable part of the article narrative, which is of course ostensibly managed by the journal publication processes. Such data may often be the description of a process rather than simply tables of numbers or graphs. In chemistry it may include chemical names and chemical terms as part of an experimental procedure. For one nice illustration of such embedded data, see the chemical tagger page. Here the data blends with the semantics, and the two are not easily separated. So, when such separation is not easily achieved, should the specific processes required by RDM, as illustrated in the five points below, actually be followed?

  1. Specify a data management plan to be followed, as for example points 2-5 below.
  2. Decide upon a location for your data, separated into one for "live" or working data (the purpose simply being to ensure it is properly backed up) and the other for a sub-set of formally "published data" which has to be available for at least ten years after its publication.
  3. Use 2 to gather metadata (see 6-14 below) and in return get a DOI representing the location of the combined metadata + data, from a suitable registration authority such as DataCite.
  4. Quote this DOI(s) in the article describing the results of analysing the data and presenting hypotheses, and conversely once the article itself is allocated its own DOI from a registration authority such as CrossRef, update the metadata in item 3 so as to achieve a bidirectional link between the data and its narrative (and we assume that DataCite and CrossRef will also increasingly exchange the metadata they each hold about the items).
  5. Add both the data and the article DOIs to any institutional CRIS or current research information system (parenthetically, I regard this last stipulation as rather redundant if items 3 and 4 are working effectively, but it's a good interim measure whilst the overall system matures).

So, should step 2 be included if the data itself is inextricably intertwined with the narrative and cannot be separated? My slightly surprising advice is yes! The reason is that it IS possible to generate metadata (data about the, possibly entwined, data) which CAN be processed in such a step. What forms would such metadata take?

  6. Identification of the researcher(s) involved. This would nowadays take the form of an ORCID (Open Researcher and Contributor ID).
  7. Identification of the hosting institution where the data has been produced. There is currently no equivalent to an ORCID for institutions, but it is very likely to come in the future.
  8. A date stamp formalising when the (meta)data is actually deposited.
  9. A title for the project being described. Here we see a blurring between the narrative/article and the data; a title is the shortest possible description of the narrative/article, and it may also apply to the data object(s), or the data could have its own title.
  10. A slightly fuller abstract of the project being described. Here we see further blurring between the narrative and data objects.
  11. One can include "related identifiers", in particular the DOIs of any other relevant published articles which may expand the context of the data, and also the DOIs of any other relevant datasets which may have been allocated in step 2 above.
  12. It is also beneficial to include "chemical identifiers". These can take the form of InChI strings and InChI keys, which allow the discretely defined molecular objects that were the subject of the research to be tracked, and which relate to both the narrative and any other data objects.
  13. If specific software has been used to analyse data, it too can be included as a "related identifier" (e.g. [1]).
  14. Potentially at least, if a well-defined instrument has been involved, it too could be included with its own "related identifier". With both 13-14, other issues may need addressing, such as versioning etc., but this no doubt will be sorted in due course.
  15. etc.

So items such as 6-14 can be collected and sent to e.g. DataCite with a DOI received in return as part of item 2 of the RDM processes. No "pure" data need be involved, only metadata. Nonetheless such metadata can only increase the visibility and discoverability of the research, as illustrated in how such metadata can be searched for the components described above.
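As a rough illustration of how little is needed for such a data-free deposit, the metadata items listed above can be gathered into a single record. This is only a sketch: the field names are loosely modelled on properties of the DataCite Metadata Schema and should be checked against the schema before any real deposit, the helper names `related` and `build_record` are my own, and the ORCID, article DOI and InChIKey in the usage lines are placeholders rather than real identifiers. Only the software DOI is taken from reference [1].

```python
import json
from datetime import date


def related(doi, relation):
    """One related-identifier entry pointing at another DOI."""
    return {
        "relatedIdentifier": doi,
        "relatedIdentifierType": "DOI",
        "relationType": relation,
    }


def build_record(creator, orcid, institution, title, abstract,
                 article_dois=(), software_dois=(), inchi_keys=(),
                 deposited=None):
    """Assemble a metadata-only deposit; no data files are involved."""
    deposited = deposited or date.today()
    return {
        # Researcher (with ORCID) and hosting institution.
        "creators": [{
            "name": creator,
            "nameIdentifiers": [
                {"nameIdentifier": orcid, "nameIdentifierScheme": "ORCID"}
            ],
            "affiliation": [institution],
        }],
        # Date stamp for the deposition of the (meta)data.
        "dates": [{"date": deposited.isoformat(), "dateType": "Submitted"}],
        # Title and abstract, shared with or distinct from the article.
        "titles": [{"title": title}],
        "descriptions": [
            {"description": abstract, "descriptionType": "Abstract"}
        ],
        # Related identifiers: the article narrative and any software used.
        "relatedIdentifiers": (
            [related(d, "IsSupplementTo") for d in article_dois]
            + [related(d, "IsCompiledBy") for d in software_dois]
        ),
        # Chemical identifiers carried as subject terms.
        "subjects": [
            {"subject": key, "subjectScheme": "InChIKey"} for key in inchi_keys
        ],
    }


record = build_record(
    creator="A. Researcher",                      # placeholder name
    orcid="0000-0000-0000-0000",                  # placeholder ORCID
    institution="Example University",             # placeholder institution
    title="Kinetic isotope effects for an example reaction",
    abstract="Procedure and results are embedded in the article narrative.",
    article_dois=["10.1234/example-article"],     # placeholder article DOI
    software_dois=["10.5281/zenodo.19272"],       # software from reference [1]
    inchi_keys=["XXXXXXXXXXXXXX-XXXXXXXXXX-X"],   # placeholder InChIKey
    deposited=date(2016, 5, 24),
)
print(json.dumps(record, indent=2))
```

Once the article receives its own DOI, adding it to `article_dois` and re-submitting the record is what would close the bidirectional link described in step 4.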

References

  1. H.S. Rzepa, "KINISOT. A basic program to calculate kinetic isotope effects using normal coordinate analysis of transition state and reactants.", 2015. https://doi.org/10.5281/zenodo.19272

Discovery based research experiences: gauche effects in group 16 elements.

Wednesday, March 2nd, 2016

The upcoming ACS national meeting in San Diego has a CHED (chemical education division) session entitled Implementing Discovery-Based Research Experiences in Undergraduate Chemistry Courses. I had previously explored what I called extreme gauche effects in the molecule F-S-S-F. Here I take this a bit further to see what else can be discovered about molecules containing bonds between group 16 elements (QA = O, S, Se, Te).

[Figure: search definition, OO-SQ]

The search definition is shown above, with DIST1 being the QA-QA bond length, the QA-QA bond being acyclic, each QA bearing only two bonded atoms and NM being any non-metal. The first result shown is for QA=S.

[Figure: S-S]

  1. The first discovery is that the most common torsion (red-hot spot) is about 90°, but there appears to be a statistically significant distortion towards longer S-S distances as the torsion deviates from this angle. For those who are so inclined, it would perhaps be worth replacing my term "appears to be" with a more formal numerical analysis of the distribution shown above and its significance. Any offers?
  2. The other discovery worth exploring is the number of occurrences with an angle of 180°. With F-S-S-F itself (not a solid), I had previously noted that this angle actually represented a transition state in the torsion! So what might be inferred from these examples?
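One possible starting point for such a numerical analysis is a permutation test of the correlation between the S-S distance and the deviation of the torsion from 90°. The sketch below is not an analysis of the crystal-structure hits themselves: the torsions and distances are synthetic numbers generated purely to exercise the code, and the function names are my own. With real search results, the two lists would simply be replaced by the exported torsion and DIST1 values.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def permutation_test(torsions, distances, n_perm=1000, seed=0):
    """Return (r, p) for distance versus |torsion| deviation from 90 degrees.

    p is one-sided: the fraction of random re-pairings of distances with
    torsions that give a correlation at least as large as the observed one.
    """
    rng = random.Random(seed)
    deviation = [abs(abs(t) - 90.0) for t in torsions]
    observed = pearson(deviation, distances)
    shuffled = list(distances)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pearson(deviation, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# SYNTHETIC placeholder data with a built-in "V" shape; not search results.
gen = random.Random(1)
torsions = [gen.uniform(60.0, 180.0) for _ in range(200)]
distances = [2.03 + 0.0004 * abs(t - 90.0) + gen.gauss(0.0, 0.01)
             for t in torsions]

r, p = permutation_test(torsions, distances)
print(f"r = {r:.2f}, one-sided p = {p:.4f}")
```

A permutation test is used here because it makes no assumption about the shape of the distance distribution; a small p would indicate that the "V" shape is unlikely to arise from random pairing of torsions and distances.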

The next search includes a further constraint that the temperature at which the data was recorded be below 140 K. This reduces vibrational "noise" and so should increase the significance.

[Figure: S-S-140]

  1. Here we discover the same "V"-shaped distribution as before, possibly more significant statistically than the previous search. Again, a proper statistical analysis of the significance of this result is desirable.

The next search is for QA = Se or Te.

[Figure: X-X]

  1. The Se and Te distributions can clearly be distinguished, with a weak "V-shape" visible for Se, but absent for Te. Again, those hits at 180°!
  2. There are a few instances "in-between" the two distributions, which appear to be Se-Te systems.

Finally, QA = O.

[Figure: O-O]

  1. The discovery here is the apparent absence of any "V-shaped" distribution.
  2. The hot spot now occurs at 180°, but with a tail down to 60° or less. The definition of "NM" as any non-metal needs to be explored further for specific instances, to see what influence the nature of NM has. NM could, for example, be another O, which might be a severe perturbation.

So here I have tried to tease out seven directions for further discovery. I am attending/presenting at the session I noted at the top and will report back on any interesting observations.