Metadata. Why?

I have had some interesting discussions recently regarding metadata. What emerges is that it can be quite a broadly defined concept and it is clear that a variety of answers might be obtained when asking the simple question “what is it useful for?” Here I set out some of my answers to that question.

  1. Metadata vs Data. Where does the boundary lie on the continuum between data and metadata, and is the metadata fine-grained or more broadly grained?
  2. What is its ultimate destination? Should metadata reside inside a complete package or container of data, serving the purpose of succinctly describing what to expect in that package? Or should it reside entirely separately from the data package in some sort of metadata store (MDS)?
  3. Are there issues of trust or provenance? How was the metadata created, by a person or a process, and when? Has it been changed since it was created? If so, what are the revisions? Does the metadata adhere to a specified structure, and has it been validated against that structure?

Some context needs to be applied before answering such questions (context is perhaps a synonym for metadata!).

  1. Firstly, I am going to use metadata here in the context of describing data itself (i.e. rather than other research objects such as journal articles). This would include answers to questions such as:
    1. who created both the data and its metadata.
    2. when were both created and perhaps modified.
    3. where the data is stored.
    4. what are its defined internal structures (sometimes also called Media types).
    5. who its “publisher” is (the organisation where the data was produced or is curated).
    6. what are the access and re-use rights associated with the data.

    These are broad-grained provenance if you like.

  2. Next, metadata describing the specific context of the data, e.g. in my case the chemistry associated with it.
    1. Is it about a molecule?
    2. If so, what is the nature of the molecule?
    3. Is it computational data about a molecule?
    4. If so, what software was used for the computations, and what were its parameters, inputs and outputs?
    5. Might it be instrumental data recorded for a molecule?
    6. If the latter, does it record the instrument and its settings?

    We are now moving into fine-grained metadata, and perhaps even crossing the boundary into data itself, since the parameters for either software or instruments can be large and complex and are often so heavily mixed into the data itself that their extrication may be a challenge.

  3. Finally, what is the purpose of creating and storing such metadata.
    1. Here the context is that of “discoverability” (of the data itself) and perhaps also
    2. “Reusability” and/or “Interoperability” (of the data itself).
    3. These attributes are nicely summarised by the acronym FAIR, where discoverability is specified by both Findability and Accessibility.
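
The broad-grained and fine-grained layers listed above can be pictured as one record. What follows is only a minimal sketch: the class and field names are my own illustrative choices and are not taken from the DataCite schema, and the example values are invented purely to show the shape of such a record. The one idea borrowed from the searches later in this post is that fine-grained descriptors are held as scheme/value pairs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Subject:
    """One fine-grained descriptor: a scheme name plus its value."""
    scheme: str  # e.g. "inchikey" or "NMR_Nucleus"
    value: str


@dataclass
class DatasetMetadata:
    """Hypothetical record layout; all field names are illustrative."""
    # Broad-grained provenance: who, when, where, what form, which rights.
    creators: List[str]
    created: str                  # date the data and metadata were created
    location: str                 # where the data is stored
    media_types: List[str]        # defined internal structures of the data
    publisher: str                # organisation producing or curating the data
    rights: str                   # access and re-use rights
    modified: Optional[str] = None
    # Fine-grained context, here the chemistry associated with the data.
    subjects: List[Subject] = field(default_factory=list)

    def values_for(self, scheme: str) -> List[str]:
        """Return every subject value recorded under the given scheme."""
        return [s.value for s in self.subjects if s.scheme == scheme]


# Invented example values, used only to exercise the structure.
record = DatasetMetadata(
    creators=["A. Researcher"],
    created="2019-01-01",
    location="https://example.org/dataset/1",
    media_types=["chemical/x-gaussian"],
    publisher="Example Repository",
    rights="Creative Commons Public Domain Dedication (CC0 1.0)",
    subjects=[Subject("inchikey", "XZYDALXOGPZGNV-UHFFFAOYSA-M")],
)
print(record.values_for("inchikey"))
```

Keeping the fine-grained descriptors as an open-ended list of scheme/value pairs means new kinds of chemistry-specific metadata can be added without changing the broad-grained part of the record.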

Before introducing examples based on metadata with the focus on discoverability, I want to distinguish between locally packaged metadata and separated metadata (Qu. 2 above). The examples below relate purely to the latter, which has been created as a separate entity by registration with an agency such as DataCite. Such registration also addresses Qu. 3 above about trust. This external agency adds trust by recording the identity of the person (or a process or workflow initiated by a person) registering the metadata together with the registration date (the Datestamp), and it also monitors any changes to the metadata (which are allowed) by keeping its version history. Interestingly, there seems to be no mechanism to record any processes or workflows used to create metadata so as to learn how the metadata itself was assembled. Nor have I seen much discussion of this aspect; one for the future I fancy.
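
To make the trust aspect concrete, here is a minimal sketch of a version history in which every registration keeps the registrant, a datestamp and a copy of the metadata. It is an assumption-laden illustration of the idea only, not a description of how DataCite stores its records; the class names, method names and example values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Set


@dataclass
class Revision:
    """One registered state of the metadata (hypothetical layout)."""
    registered_by: str        # a person, or a process initiated by a person
    datestamp: datetime       # when this state was registered
    metadata: Dict[str, str]


@dataclass
class RegisteredRecord:
    """Append-only history: changes are allowed, but nothing is overwritten."""
    revisions: List[Revision] = field(default_factory=list)

    def register(self, who: str, metadata: Dict[str, str], when: datetime) -> None:
        # Store a copy so later edits by the caller cannot alter the history.
        self.revisions.append(Revision(who, when, dict(metadata)))

    def changed_fields(self, older: int, newer: int) -> Set[str]:
        """Names of the fields that differ between two revisions."""
        a = self.revisions[older].metadata
        b = self.revisions[newer].metadata
        return {key for key in a.keys() | b.keys() if a.get(key) != b.get(key)}


# Invented example: the same record registered twice with a changed licence.
history = RegisteredRecord()
history.register(
    "A. Researcher",
    {"title": "Example dataset", "rights": "All rights reserved"},
    datetime(2019, 1, 1, tzinfo=timezone.utc),
)
history.register(
    "A. Researcher",
    {"title": "Example dataset", "rights": "CC0 1.0"},
    datetime(2019, 2, 1, tzinfo=timezone.utc),
)
print(history.changed_fields(0, 1))
```

Such a history answers who registered each state and when, and what changed between states. It does not capture the process or workflow that assembled the metadata in the first place, which is the gap noted above.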

I now introduce some examples of discoverability. The descriptions are quite short and are meant to be used in conjunction with a “reverse-engineering” of the (somewhat) human readable search query. These queries are also deposited as “data”, at DOI: 10.14469/hpc/5920

Entry | Description | Elasticsearch query
1 | Media (MIME) type | *
2 | Combining Media with the DataCite Subject | *+AND+subjects.subjectScheme:inchikey+AND+subjects.subject:XZYDALXOGPZGNV-UHFFFAOYSA-M+AND+media.media_type:chemical/x-gaussian*
3 | Combining ORCID with Media | *0000-0002-8635-8390+AND+media.media_type:chemical/x-mnpub*
4 | Exploiting Subject | "-39.946176"
5 | Exploiting Subject with range query | [\-649.1 TO \-649.8]
6 | Nested search with two Subjects | "-1082.980914")+AND+(subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:KTOSDSJYNBIDCN-UHFFFAOYSA-N)
  | Nested search with two Subjects transposed | "-1082.980914")
7 | Two different Media types | *+AND+media.media_type:chemical/x-mnpub*
8 | License type | "Creative Commons Public Domain Dedication (CC0 1.0)"
9 | Exploiting subjectscheme | *+AND+subjects.subjectScheme:NMR_Nucleus+AND+subjects.subject:1H
10 | Exploiting subjectscheme | *+AND+subjects.subjectScheme:NMR_Pulse+AND+subjects.subject:1D
11 | Simple PID query | *10.14469/hpc*
12 | Combining ORCID with PID query | *0000-0002-8635-8390)+AND+(identifier:*10.14469/hpc*)
13 | Combining researcher name with PID query | *10.14469/hpc*)+AND+(contributors.contributor.contributorName:Henry+Rzepa)
14 | Entries in specific repository (Imperial) referencing specific Journal | *)+AND+(identifier:*10.14469/hpc*)
15 | Entries in specific repository (Cambridge) referencing specific Journal | *)+AND+(identifier:*10.17863/cam*)
18 | Entries in specific repository (Cambridge) referencing all publisher journals | *)+AND+(identifier:*10.17863/cam*)
16 | Entries in all repositories except one referencing specific Journal | *)+NOT+(identifier:*10.5517*)
17 | Entries in specific repository referencing one publisher | *)+AND+(identifier:*10.5517*)
19 | Entries in all publisher journals, excluding one data repository | *)+NOT+(identifier:*10.5517*)
20 | Entries in Institutional repository referencing datasets | *10.14469/spiral*)+AND+(identifier:*)+AND+(types.resourceTypeGeneral:Dataset)
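
The queries in the table are assembled from a small number of building blocks: field:value clauses, +AND+ and +NOT+ joins (the + signs appear to stand for spaces in the URL), bracketed groups, wildcards, and a [low TO high] range in which each minus sign is escaped with a backslash. The sketch below composes such strings in Python. The helper names are hypothetical; the field names are the ones visible in the table; nothing is sent to any server, and the third query is an illustrative combination rather than one of the deposited examples.

```python
def term(field_name: str, value: str) -> str:
    """One field:value clause, e.g. media.media_type:chemical/x-mnpub*."""
    return f"{field_name}:{value}"


def escape_minus(number: str) -> str:
    """Backslash-escape minus signs, as in the range query of entry 5."""
    return number.replace("-", "\\-")


def value_range(low: str, high: str) -> str:
    """Inclusive range clause of the form [low TO high]."""
    return f"[{escape_minus(low)} TO {escape_minus(high)}]"


def all_of(*clauses: str) -> str:
    """Join clauses so that every one of them must match."""
    return "+AND+".join(clauses)


def group(clause: str) -> str:
    """Wrap a clause in brackets so it is evaluated as a unit."""
    return f"({clause})"


def excluding(clause: str, unwanted: str) -> str:
    """Keep matches of clause that do not also match unwanted."""
    return f"{clause}+NOT+{unwanted}"


# A molecule (by InChIKey) that has Gaussian files attached, as in entry 2.
molecule_with_gaussian = all_of(
    term("subjects.subjectScheme", "inchikey"),
    term("subjects.subject", "XZYDALXOGPZGNV-UHFFFAOYSA-M"),
    term("media.media_type", "chemical/x-gaussian*"),
)

# The numeric window used in entry 5.
energy_window = value_range("-649.1", "-649.8")

# Illustrative only: one identifier prefix, excluding another.
not_one_repository = excluding(
    group(term("identifier", "*10.14469/hpc*")),
    group(term("identifier", "*10.5517*")),
)

print(molecule_with_gaussian)
print(energy_window)
print(not_one_repository)
```

Building the clauses from named helpers makes it easier to see which part of a long query is responsible when a search returns nothing, which is where most of the de-bugging effort tends to go.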

The examples above reveal a syntax that is not entirely human-friendly; each of them needed some effort at “de-bugging” to make it work. I gather from the PIDForum that a more friendly GUI to achieve this is on their radar. As I develop or discover more examples of such searches I will add them to the list above at DOI: 10.14469/hpc/5920. Meanwhile, if you want to use any of the above as a template for your own searches, do please explore.


2 Responses to “Metadata. Why?”

  1. Mike Turner says:

    Derek Lowe’s blog yesterday:
    This highlights the potential of machine processing properly curated information in natural language (journal abstracts in this case) to provide useful inputs to research. If metadata could routinely stitch the two together, computers would suddenly become much more useful.

  2. Henry Rzepa says:

    ContentMine has been doing this for a little while. Natural language (in chemistry) is at best around 95% accurate, and a fair bit more has to be done to render the results more reliable.

    I agree good metadata combined with natural (trained) language searching has lots of potential. Interestingly, whereas the introduction of Google has revolutionised how humans search for information, new generations of search engine such as Elasticsearch are leading the way for embedding into AI-engines. I note that the metadata for FAIR data is indexed by DataCite using Elasticsearch. So we may well expect some revolutionary stuff based on natural language in combination with Elastic metadata to emerge in the next few years.
