In the Netherlands, Open Data is increasingly prevalent. At the time of writing, December 2020, over 15,000 datasets are listed in the government data register. Additionally, various providers in industries such as energy and healthcare publish open datasets regularly. Many of these adhere to the FAIR principles, meaning they are Findable, Accessible, Interoperable and Reusable.
The FAIR principles, formalized in 2016, are a case in point for a shift in focus across many open movements, including Open Government and Open Science: a shift from having the data theoretically available to making it practically useful.
The NL Open Data project is built with that same focus in mind, by creating a bridge between the existing open datasets and the domain experts who do not possess the technical skills necessary to take advantage of them.
Currently, utilizing the existing open datasets and traversing the steps from data, through information, into knowledge and hopefully wisdom requires a wide technical skillset, along with a relatively heavy investment of time or funds (and often both). This creates a nearly insurmountable barrier that keeps too many domain experts from capitalizing on the already available open data. At the same time, these domain experts are often best situated to ‘ask the right questions’ of these datasets in order to eventually create a real impact in our society.
To put it differently, imagine you want to investigate, say, the relationships per neighborhood between employment status, percentage of widowed households and the need for mental care. That is a relatively simple task if you have sufficient resources to look for possible data sources and perform the data integration, prior to actually addressing the question through data analysis. These steps of data understanding and data preparation require specific skills (programming, data engineering) as well as the hardware resources to perform them. If you are, however, a policy advisor (in Dutch: ‘beleidsmedewerker’) supporting a group of general practitioners, that task becomes substantially more complicated, and is often rendered practically impossible.
The NL Open Data project dramatically lowers the bar to meaningfully interact with open datasets up to a point where no designated hardware is needed, and only basic SQL knowledge is required. It does so by uploading open datasets from various sources into the Google Cloud Platform (GCP) where the data can be queried using standard SQL.
This presents a single access point for anyone interested in querying data from various (open) data sources. It helps fulfill the promise of Open Data: giving domain experts the ability to mix and match different data sources in order to answer meaningful questions and positively impact society.
Use case - analysis of trends for general practitioners
Wielinq is a consultancy group operating across several industries, among them healthcare. They help organizations address their complex challenges, and identify new opportunities and creative ways of working. In that spirit, they have experimented with external open data with the goal of sharing valuable insights with their organizations, looking to surprise them as well as challenge them to see the value of (open) data. However, Wielinq currently has no designated personnel or hardware to generate and interpret such insights.
They are now looking to investigate the relations between the density of General Practitioners (GPs) and care expenses in various areas in the Netherlands. To explore these relations the consultant needs to, at the very least, collect data regarding each region’s population, GP employment and care expenditure, bring it to a common geographical grain and then combine it. In order to get started, the consultant does some research and locates several possibly relevant datasets, from several sources, which contain information that might be useful.
Without NL Open Data, the following steps would include downloading and properly deciphering the relevant datasets, storing them somewhere, combining them appropriately into one or several data warehouses, and attempting to interact with them using a variety of tools, according to the source format. For someone with no programming or data engineering experience, this is likely to end in one of three ways: a costly success, in which an expensive, unstable setup is created for a one-time answer; a benign failure of simply giving up; or even a harmful failure, producing erroneous results without realizing it.
With NL Open Data, these different source datasets would be available for query or download from a single source, using the same system and the same language: SQL. Anyone with basic SQL knowledge could access this information and make simple JOINs across different datasets. Downloading from the original source has already been taken care of, and the data is periodically updated. The geographical comparison is handled automatically, as are other issues at various stages, such as flattening of normalized tables. Depending on the problem at hand and the technical knowledge of the consultant, they could:
- Use a pre-existing datamart created around a topic
- Create their own SQL query from scratch
- Make use of NL Open Data’s query generator to help them.
Additionally, NL Open Data might contain datasets that the consultant was not aware of, or could not retrieve, providing a richer datasphere to engage with.
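To make the 'simple JOIN' concrete, here is a minimal sketch of the kind of query the consultant might write. It runs against an in-memory SQLite database so that it is self-contained; in NL Open Data the same style of SQL would be run in BigQuery. The table names, column names and figures are all invented for illustration and are not the project's actual schema or real statistics.

```python
import sqlite3

# Hypothetical tables and figures, invented purely for illustration.
# They do not reflect the actual NL Open Data schema or real statistics.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE population (municipality TEXT PRIMARY KEY, inhabitants INTEGER);
    CREATE TABLE gp_employment (municipality TEXT PRIMARY KEY, gp_count INTEGER);
    CREATE TABLE care_expenses (municipality TEXT PRIMARY KEY, total_eur REAL);

    INSERT INTO population VALUES ('A', 100000), ('B', 50000);
    INSERT INTO gp_employment VALUES ('A', 50), ('B', 20);
    INSERT INTO care_expenses VALUES ('A', 250000000.0), ('B', 150000000.0);
""")

# All three tables share the same geographical grain (municipality),
# so combining them is a plain JOIN on that key.
QUERY = """
    SELECT p.municipality,
           10000.0 * g.gp_count / p.inhabitants AS gps_per_10k,
           e.total_eur / p.inhabitants AS care_eur_per_inhabitant
    FROM population AS p
    JOIN gp_employment AS g ON g.municipality = p.municipality
    JOIN care_expenses AS e ON e.municipality = p.municipality
    ORDER BY p.municipality
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The query only works this simply because the three sources already share one geographical key, which is exactly the preparation work the project aims to take off the consultant's hands.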
Note — this is a brief overview — a more in-depth technical description is in the making.
The technical inspiration for this project is a combination of two main concepts: Functional Data Engineering, and Cloud Based Data Warehousing. Working in tandem, these two allow NL Open Data to be an inexpensive, easy to use, reliable source of information.
The latter concept, also known (somewhat deceptively) as Serverless Data Warehousing, determines where and how the data is ultimately stored and can be accessed. In the current implementation, Google Cloud Platform (GCP) is used, and can abstractly be understood using the following diagram from Google’s documentation (note: while the current implementation is based on GCP, other cloud providers offer comparable solutions):
The left side, Source Systems, represents the various sources (e.g. CBS). That is where open data currently resides, and where it is already available through varying methods. The middle part represents a staging area, which could be one or more local or virtual machines. That is where the actions take place to get the data from the source, make any strictly necessary alterations (format change, compression, etc.) and upload it to GCP, the third, right side of the drawing. This is where the data is accumulated (Cloud Storage) and where it can be accessed using simple SQL queries (BigQuery). Within BigQuery, various data marts can be created according to the user’s needs; that is the point of contact with the domain experts.
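As a rough illustration of what a 'strictly necessary alteration' in the staging area could look like, the sketch below converts a semicolon-delimited text extract into gzip-compressed, newline-delimited JSON, a format BigQuery can load. The choice of formats, the function name and the sample data are assumptions made for this example, not a description of the project's actual pipeline. The upload to Cloud Storage and the load into BigQuery are left out because they require cloud credentials.

```python
import csv
import gzip
import io
import json


def csv_to_ndjson_gz(csv_text: str, delimiter: str = ";") -> bytes:
    """Convert delimited text into gzip-compressed newline-delimited JSON.

    Hypothetical staging helper; values are kept as strings, no typing is applied.
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    lines = [json.dumps(row, sort_keys=True) for row in reader]
    payload = "\n".join(lines).encode("utf-8")
    # mtime=0 keeps timestamps out of the gzip header, so the same
    # input produces the same bytes on every run.
    return gzip.compress(payload, mtime=0)


# Invented sample extract standing in for a file fetched from a source system.
raw = "region;gp_count\nA;50\nB;20\n"
blob = csv_to_ndjson_gz(raw)

# Decompress again to check that nothing was lost in the conversion.
restored = gzip.decompress(blob).decode("utf-8").splitlines()
for line in restored:
    print(line)
```

Keeping this step limited to a format change and compression matches the idea that the staging area does no more than is needed to get the data into the warehouse, leaving the analysis to SQL.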
If Cloud-Based Data Warehousing determines where everything should be and what the relationships between all the pieces are, Functional Data Engineering sets the way in which this is achieved. All the previously discussed actions and more (getting the data from a source, naming it appropriately and linking its metadata, compressing, matching grains, etc.) are done with a functional approach in mind. That means they are all designed to be deterministic and idempotent. In plain words: given the same input they always produce the same output, and running them again leaves the result unchanged instead of duplicating it. This approach presents several advantages, one of them being that it lends itself quite naturally to automation and parallelization, for which NL Open Data uses Prefect.
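The two properties can be shown with a small sketch. The transformation below is a pure function, so its output depends only on its input (deterministic), and the load step overwrites a whole partition instead of appending to it, so re-running it changes nothing (idempotent). The function names, directory layout and sample records are hypothetical, and a temporary local directory stands in for cloud storage. Prefect is deliberately left out to keep the example self-contained.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def normalize(records):
    """Pure transformation: the output depends only on the input records."""
    cleaned = (
        {"region": r["region"].strip().upper(), "gp_count": int(r["gp_count"])}
        for r in records
    )
    return sorted(cleaned, key=lambda r: r["region"])


def write_partition(root: Path, dataset: str, period: str, records) -> Path:
    """Idempotent load: overwrite the whole partition rather than append to it."""
    target = root / dataset / f"period={period}" / "data.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(normalize(records), sort_keys=True), encoding="utf-8")
    return target


def checksum(path: Path) -> str:
    """Fingerprint of the stored file, used to compare runs."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


# Invented sample input; a temporary directory stands in for cloud storage.
source = [{"region": " b ", "gp_count": "20"}, {"region": "a", "gp_count": "50"}]
with tempfile.TemporaryDirectory() as tmp:
    first = checksum(write_partition(Path(tmp), "gp_employment", "2020", source))
    second = checksum(write_partition(Path(tmp), "gp_employment", "2020", source))
    files = sorted(p.name for p in Path(tmp).rglob("*.json"))

# Running the task twice yields identical content and still a single file.
print(first == second, files)
```

Because a failed or repeated run can simply be executed again without side effects, tasks written this way are easy to retry, schedule and run in parallel.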
In our current data landscape, where the cutting edge is continuously pushed further into the horizon and understandably draws attention and funding, we sometimes overlook a simple truth we all know. A significant social impact is within grasp through simplifying and making accessible existing tools for the right audience.
NL Open Data aims to do just that. By utilizing modern, open-source infrastructure alongside user-friendly, cloud-based solutions, it creates the next step in Open Data: a fully managed, serverless data warehouse built with its users in mind, the domain experts of the social and public domain.