Aleksey Bilogur

What next for open data?

06/17/2018

I’ve been doing some thinking lately about open data, and specifically about open data as a product.

Open data, if you’re not familiar with it, is a recent phenomenon in government and academia encouraging institutions and municipalities to publish the data that they use. After all, these institutions exist for the public benefit, so their releasing the data that they collect (subject to restrictions on things like privacy and security) makes tons of sense.

The radical openness of open data, early proponents promised us, would feed the next generation of tech products solving real problems our communities. So-called "citizen data scientists" would come out of the woodwork, seize onto the data, and carry it off into their communities. It would be used to hold local and federal governments accountable, and to inform decision making by smaller local stakeholders (like nonprofits and community groups).

In this post I opine on the two generations of open data platforms on the market right now, and offer one vision of what open data platforms might look like in the future.

First-generation open data products

The first generation of open data products were, as you might expect, a mess.

Some vocabulary. An open data producer is an entity that creates and publicly publishes novel datasets, e.g. datasets that never existed or had never been public before. Most open data publishers are cities, municipalities, or other government entities (for example, New York City Open Data), but some institutions and other organizations also publish (for example, Harvard Open Data).

To provide their data publicly, open data producers need a piece of software known as an open data portal. Open data portal technologies compete on their selection of user-facing features: things like search, filtering, and overall look and feel. Some data publishers set up their own open-source portals, using open source software like CKAN, but at least in the United States most license the software from open data portal vendors.

In the United States the market leader portal (and vendor) is Socrata. I will use Socrata to illustrate the problems with these first-generation products, as it is the product I am most familiar with.

Socrata suffers from two different classes of issues that badly hamper its usability. The first set of issues are technical.

Socrata serves portal data from a central location, with each Socrata portal merely providing a view on the data contained therein. Thus San Francisco, New York City, Chicago, and Houston are all serving data from the same monolith. This architecture is very easy to set up and administrate, and hence attractive to a start-up, but it is extremely suboptimal performance-wise. True to form, downloads from Socrata are really slow.

More important are product issues in the web tier. Searching for something on Socrata is a crapshoot because not only is data badly indexed, it’s also of highly uneven quality; best you can do is enter the right keywords and hope that at least one of the results you get hits the mark. An early product focus around built-in browser-based data visualization was a dead end, as data quality is usually too poor to support meaningful browser-based interaction without prior cleaning. And other issues of a similar nature abound.

The second set of problems are policy-based.

Once you have a portal, you need to populate it with data, and that data has to come from somewhere. The configuration of open data publishing incentives and resources varies from publisher to publisher. To take an example, if you study taxi administrators in New York City and in Chicago you will find very similarly organized regulatory agencies. Taxi management are not a new problem, so cities have established a consensus on how best to service them.

Open data is about a decade old. No such consensus exists yet. As a result, every data publisher does things differently.

This makes life much harder for vendors. Socrata relies on its contract with New York City and other cities like it to keep the lights on. Every publisher comes to the table with its own goals and opinions, and hence its own list of demands; and because of how different these organizations are, the things they want tend to be very different as well.

To satisfy all of these demands, Socrata has to pull its resources and its product in a hundred different directions, resulting in a product that does everything crappily instead of a few things well. Vendors are disincentivized to focus on things invisible at the negotiating table, like performance and availability, in favor of shiny toy features.

To move faster, either Socrata or one of its competitors has to pull a Steve Jobs and force a well thought-out consensus on what actually matters in an open data portal onto its customers, or it needs its customers need to organically reach a consensus on this matter themselves. Neither of these things will happen anytime soon.

Second-generation open data products

First-generation open data portals are not the best product-wise, and it almost doesn’t matter.

The past couple of years have seen the emergence of what, for the purposes of this blog post, I will call open data aggregators. The two archetypal examples are Kaggle Datasets and data.world. These portals do not, as a rule, post their own newly created data; instead they rely on users uploading datasets themselves.

This simple change removes dataset producers from the equation. The developers managing the aggregator platforms are not held down by contracts and support lines, making it vastly easier to do what’s right for the product. Kaggle Datasets and data.world can afford to break things that vendors are too embattled to even touch.

As a result, even though Socrata and its peers are now around ten years old, and Kaggle Datasets and data.world are both around two, the latter are very obviously the technically superior product. They have made strides on solving many of the issues that still plague first-generation portals. Features like topic tags, manual curation, and vibrant forums have made searching for high-quality datasets easier. Integrations with fully-featured data science tools have led to more effective and communicative exploratory data visualization. User dataset uploads are treated as first-class citizens (obviously) and not awkward tack-ons. And so on.

However, technically better though they may be, a replacement for first-generation open data portals they are not.

For one, these platforms still rely on open data publishers implicitly. If the publishers stop opening up new datasets, aggregators will have no raw input to work with.

Second of all, open data publishers are stakeholders in a particular arena—New York City, say, or Harvard University, or something else. They care about providing comprehensive resources in that arena. The users that aggregators rely on are more mercenary. They will upload only the datasets that they are interested in, and they will not invest time in keeping said datasets maintained once they lose interest in it. Thus aggregators tend to accumulate multiple copies of the same dataset published at different times by different users. Preventing this from happening requires meaningfully shifting user behavior, a hard problem indeed!

Finally, and most importantly, there is a difference in audience. Data publishers want to see their data end up in the hands of local communities and organizations. They want to see their data used by constituents to make informed decisions about their lives.

Kaggle Datasets and data.world, meanwhile, are used primarily by an international community of technology enthusiasts. Most users there are learning or practicing data science skills or creating portfolio pieces. The focus is very much on the code; the data is mostly an incidental input to portfolio projects. Very little of it ends up back in the community.

This critical difference is a source of quiet friction between these second-generation open data platforms and data producers. I’ve heard it said that these products "poach" open data for their own goals, without providing anything back for the dataset producers. I’ve also heard them described as a "gray market" for data.

These are fair criticisms.

But these platforms have created a huge amount of exposure for open data, and they’ve done so internationally. These platforms have successfully evangelized open data beyond hardcore enthusiasts and put it into the hands of more incidental practitioners. More people doing more things can only lead to strictly better outcomes downstream. DataKind gets a fatter input pipe. More people show up to Obama Foundation events. And so on. The returns are not immediate, but they’re there, and they’re very worth it.

And they have finally created a space where user-created and "small producer" datasets are a first-class citizen. First-generation portal platforms either don’t allow users to upload data or don’t treat them on equal footing as producer-published data. This has given small and individual data publishers a place to share their data that’s "better than GitHub": an invaluable resource.

That being said, personally I would love to see more collaboration between data producers and second-generation open data platforms. Both sides are interested, and it is starting to happen—witness Kaggle’s Data Science for Good Challenges, now in their second inning. But it’s challenging. The producers and platforms are very different, and no one’s found a working model of collaboration yet.

Ultimately I think that in today’s world we need both first-generation and second-generation open data portals. But the further we go into the future, I foresee more of the conversation happening on the latter platforms than the former.

The next generation

Looking forward, what will the next generation of open data products look like?

A fundamental weakness of portals as a product concept is that they by definition partition data access. Each provider obviously has to store the data they host, and then has to maintain code for accessing that data quickly when needed. If a dataset gets posted to the New York City open data portal, then reposted to Kaggle and data.world, you now have the same file living in three different locations.

Architecturally speaking, portal front-end and portal data services are strongly coupled. But the front-end is where all of the features lie. If the canonical copy of the dataset gets updated, these three file copies will eventually go out of sync. In my experience, more often than not this is exactly what happens!

But what if there is a common underlying architecture? What if datasets were some kind of network resource? This is appealing because it solves a swatch of problems in today's ecosystem.

For one, open data aggregators provide a limited subset of the total data space: just the datasets that users have deemed most useful. This works to the exclusion of a lot of data. In particular, "helper" datasets that are not interesting on their own but useful in combination with other datasets are severely under-provisioned.

It also creates a versioning problem. For all of their feature weight, when it comes to datasets that continue to receive updates, open data aggregators do not, as a rule, keep up with data producers. Users will upload a version of the dataset that was recent at the timestamp they were working on it, but will generally not invest in maintaining it past the end date of their project.

In the short term, portal providers are moving to make data pipelines more robust, and data maintenance more user-friendly and less chore-like. But user behavior is hard to change. In my opinion, an architectural change is needed.

Suppose that instead of posting datasets to their own servers, dataset producers seed their datasets onto a set of servers maintained by someone else. These servers act as a network resource. Data publishers can subscribe to particular datasets, and when a download request comes in from a user, the publisher can reach into the network and forward that resource to the user.

Decoupling the storage layer from the presentational layer in this way eliminates the versioning problem. There is now just one canonical copy, and every application or portal that wants to use that database can subscribe to that and use it as its source. Publishing a new dataset would now mean pushing a new file into the common network. Updating the dataset would require no work whatsoever on the portal provider’s part!

In this world open data portals could now feasibly provide as much data as they want, not just a manually curated selection thereof, and the data that they provide could always be the latest version, not a backdated and under-maintained copy. The portals themselves become web interfaces on top of this common resource, which was always the part that mattered to users anyway.

There are several different technologies on the market right now that could be used in this way. Quilt is one that I like, and DataHub.io is another example. These services are optimized around fast download speeds and well-designed metadata schemas. They're architecturally a good fit for the job of being the open data world's backing store. Of course, the hard part is getting there!

Another, more ambitious attempt at a solution is datBase. datBase is a very ambitious asynchronous dataset distribution model, one based on an experimental new protocol called dat. datBase is part of the asynchronous web movement, which is still in its infancy, so datBase isn't even in alpha yet. But it's a terrifically interesting new technology that I think is worth watching.

Leaf-reading new technology is a notoriously hard thing to do. Regardless of what happens, there’s a lot of innovation going on in this space right now. I’m excited to see what comes next.