Why do metadata standards matter?
Undoubtedly, data plays an important role in eliminating poverty. It can help us track whether girls and boys have equal access to education, or observe trends in child and maternal mortality in developing countries, for example. In doing so, it can help us to target interventions for maximum impact and to ensure that no one is left behind.
However, data without context – such as who published that data, where it is published and in what format, and when or how the data was collected – is incomplete and can be misleading. This context is added by supplying metadata: structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use and manage data. For this context to be read by machines as well as humans, metadata needs to be standardised across datasets and data portals. In other words, metadata needs to speak the same language (common metadata standards).
In the era of digital data, where machine-readable files make it possible to complete in seconds tasks that would once have been arduous and manual, the layer of machine- and human-readable metadata must go hand in hand with the data itself. Why? Because such metadata standards increase the discoverability of datasets and make it easier for users to find what they are looking for across multiple platforms.
The Development Data Hub is Development Initiatives (DI)’s flagship online resource for the discovery of financial and resource flow data. The tool brings together multiple datasets and through interactive visualisations allows the user to understand how resources designated for development are spent.
With a click of a button it is possible to visualise where the poorest 20% of people are in the world or unbundle official development assistance (ODA) data by sector, recipient, donor, form or channel. The Data Hub allows the user to compare ODA across sectors, countries, and channels. How? The tool collates data from a variety of sources and combines it to produce dynamic visualisations. These complex visualisations are accompanied by a dynamically created downloadable dataset.
As with the vast majority of databases like this, the Data Hub would benefit from an additional machine-readable layer of context that could direct the user to the relevant information on the data source behind each dynamic visualisation. Since the Data Hub is dynamic, the metadata standard that provides information on when the data was published, who published it, when it was last updated, where it can be downloaded and how it was generated would, in an ideal world, be dynamic too.
The World Wide Web Consortium (W3C)’s ‘Data on the Web Best Practices’ states that providing metadata is a ‘fundamental requirement when publishing data on the Web’ and advises that this metadata should be provided in both human- and machine-readable formats. Machine readability is a crucial requirement, as it allows computer applications to process the metadata.
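As an illustration of what such a record can look like – a minimal sketch, not DI’s actual metadata – a dataset description following the W3C DCAT vocabulary can be serialised as JSON-LD, so the same record is legible to humans and processable by machines. All dataset details below are hypothetical:

```python
import json

# A minimal dataset description using the W3C DCAT vocabulary.
# All titles, dates and URLs here are hypothetical, for illustration only.
record = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@type": "dcat:Dataset",
    "dct:title": "ODA disbursements by sector",
    "dct:publisher": "Development Initiatives",
    "dct:issued": "2016-11-30",
    "dct:modified": "2016-12-01",
    "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:downloadURL": "https://example.org/oda-by-sector.csv",
        "dcat:mediaType": "text/csv",
    },
}

# The JSON-LD serialisation is machine-readable, yet still readable by a person.
print(json.dumps(record, indent=2))
```

A machine can now answer questions like “who published this?” or “where can it be downloaded?” by inspecting the record, while a person can read the same file directly.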
Development Initiatives is a leading curator of value-added, joined-up data and is committed to improving the interoperability of all development-related and humanitarian data. Can DI adhere to the W3C principle of publishing metadata in a human- and machine-readable format?
DI discovers and collects empirical and processed data from a variety of sources. These range from global datasets – maintained by institutions such as the World Bank, International Monetary Fund, UN Statistics Division and Organisation for Economic Co-operation and Development – to national statistics and new collections of emerging data manually curated by its analysts.
The relevant data from these sources are loaded into the Development Data Warehouse, which uses a collection of generic data models to integrate, where possible, data from these disparate sources into a standardised, joined-up database schema.
This schema is used to create datamarts containing purpose-built, joined-up datasets – each potentially derived from a range of sources – that drive context-specific visualisations designed by DI’s analysts for the Development Data Hub.
Figure 1 presents a simplified architecture of the system described here.
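The flow from sources through the warehouse to a datamart can be sketched in code. This is a deliberately simplified, hypothetical model – the function names, record fields and input values are all invented for illustration – but it shows the key point: if each transformation keeps a record of where its inputs came from, any downstream dataset can be traced back to its original sources.

```python
# Hypothetical, simplified sketch of the pipeline in Figure 1.
# Each transformation carries provenance forward with the data.

def load_to_warehouse(source_records):
    """Integrate disparate source records into one standardised schema."""
    warehouse = []
    for rec in source_records:
        warehouse.append({
            "indicator": rec["indicator"],
            "value": rec["value"],
            "provenance": [rec["source"]],  # keep the source with the data
        })
    return warehouse

def build_datamart(warehouse, indicators):
    """Select and join warehouse rows into a purpose-built dataset."""
    rows = [r for r in warehouse if r["indicator"] in indicators]
    return {
        "rows": rows,
        # The datamart's metadata lists every source it derives from.
        "sources": sorted({s for r in rows for s in r["provenance"]}),
    }

# Hypothetical inputs from two of the institutions named above.
sources = [
    {"source": "World Bank", "indicator": "poverty_headcount", "value": 0.2},
    {"source": "OECD", "indicator": "oda_total", "value": 131.6},
]
mart = build_datamart(load_to_warehouse(sources), {"oda_total"})
print(mart["sources"])  # → ['OECD']
```

The design choice worth noting is that provenance is attached to the data as it moves, rather than reconstructed afterwards – which is exactly what a metadata standard would formalise.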
This growing complex of interconnected data serving a range of digital products poses three problems:
- Firstly, how does DI, and how do Data Hub users, keep track of what data is available and whether it is up to date?
- Secondly, the intellectual credibility of DI’s work depends on metadata that explains the provenance and methodology of its analysts’ calculations. In paper reports you can find this in small-print footnotes, but how do you replicate this for dynamically produced datasets that have been generated from an interactive visualisation?
- Thirdly, the joined-up ‘raw’ databases in the warehouse will, in future, become a public good with an open API. How will third-party developers wanting to make use of this repository access the metadata they will need to accompany the data they extract?
What’s the answer?
Why do we need metadata standards to handle this problem? The transitions from data source to data warehouse, through datamart, and finally to dataset cannot be comprehensible only from a human point of view. The logic needs to be encoded in a machine-readable way, so that machines can point a data user back to the original source of the transformed data and support the discoverability and searchability of related datasets. At its heart, this is a joined-up (meta)data standards problem. Our latest discussion paper explores how this problem could be addressed.
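A minimal sketch of that machine-readable logic, under the assumption (hypothetical here) that each dataset’s metadata records which dataset it was derived from: a machine can then walk the chain of links back to the original source. The catalogue entries and identifiers below are invented for illustration.

```python
# Hypothetical catalogue: each dataset's metadata links to the dataset
# it was derived from (None marks an original source).
catalogue = {
    "oda-by-sector": {"derived_from": "warehouse-oda", "publisher": "DI"},
    "warehouse-oda": {"derived_from": "oecd-dac-crs", "publisher": "DI"},
    "oecd-dac-crs": {"derived_from": None, "publisher": "OECD"},
}

def original_source(dataset_id):
    """Follow derived_from links until the original dataset is reached."""
    while catalogue[dataset_id]["derived_from"] is not None:
        dataset_id = catalogue[dataset_id]["derived_from"]
    return dataset_id

print(original_source("oda-by-sector"))  # → oecd-dac-crs
```

In practice this lineage would be expressed in a shared vocabulary such as W3C PROV rather than an ad hoc dictionary, so that any third-party tool – not just DI’s own – could follow the chain.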
Joined-up Data Standards will present this technical discussion paper at the W3C- and VRE4EIC-organised workshop on ‘Smart Descriptions & Smarter Vocabularies’ in Amsterdam on 30 November–1 December 2016.