Adventures in data warehousing

Data warehousing is exactly what it sounds like: create a central storage space for large amounts of data, so that it can be accessed by many different people and applications. I had an opportunity recently to work on a data warehouse, so I thought I’d write up a bit about the experience.

Here are three practical principles for data warehousing that everyone already knows.

#1: Model and plan, because if you build a warehouse for data that doesn’t exist (or connectors for sources that have changed), the project will fail.

#2: Talk to end-users, because if you build a warehouse that no one will use, the project will fail.

#3: Involve your sponsors and stakeholders, because if you don’t have the money to finish the warehouse, the project will fail.

Side note: compare these principles to the main technical challenge of data integration: brokering the data so that it can be consumed by the maximum number of downstream applications.

According to Gartner, over 50% of data warehousing projects fail or achieve only limited acceptance. A lot of them will fail even if you enact these best practices (and resolve the technical challenge). Running a warehousing project is a bit like crossing the Atlantic in a wooden boat; you can stock your larders, swab your decks, and keep fresh lookouts, but at any time an angry, CEO-faced storm could blow you into the sea. Whether you’re a data warehousing veteran or a poor schmuck impressed into a warehousing team, best practices can only go so far. So what do you do?

A useful exercise?

Best practices are all well and good, but it’s hard, in the middle of a project, to see their relevance. So let’s go through and try to apply those three principles above in an actual example.

Currently, I’m part of a group that is installing an Internet-of-Things (IoT) testbed for the city of Boston, and my job is to create a database for the data (and metadata) we’ll be receiving from the sensors as well as any applications that will be running off the sensor data. In theory, that means I should

#1: Model and plan, but… I have no idea what sensors I’ll have, where the sensors will be located, what kind of co-dependencies (foreign keys) there might be, or what kind of data they’re going to provide. The sensors could range anywhere from light-field cameras to soil-moisture sensors. Maybe the data will be incredibly simple: one table, no foreign keys, just a big global relation. In that case, we don’t really need to worry about the structure of the warehouse too much, just the access portion [update: this was in fact the case]. On the plus side, these are sensors (as opposed to transactional database systems) so I shouldn’t have to perform complex procedural transformations on the data values before loading them into the warehouse. On the other hand, I might have to warehouse a bunch of application-generated data [update: this was not the case].

#2: Talk to end-users, but… I don’t know who’s going to be using this data; the list of potential users is extremely broad. The flip side of this question is that I don’t know if this data is useful at all, since I don’t have a list of potential users! To be fair, the main goal of this project is to figure out what that list of potential users would be. Maybe some of these “users” will be mash-ups and visualizations meant to be displayed at the testbed itself. Some of my co-organizers want the data warehouse, or some version of it, to be opened up to the public;but why does that need a warehouse, as opposed to a big flat file dump? Others want it to be tailored to “city planners”. Still other have in mind startups, or the local business community.

#3: Involve your sponsors and stakeholders, but… we’re doing data exploration, so it’s unclear who actually “has a stake” (again, part of the project is to find a list of potential sponsors and stakeholders) besides the City of Boston. But the City of Boston doesn’t know what they want to do with the data—they seem more interested in things like best-practices (e.g., things like this list!).

Going through the exercise, it seems pretty clear that what I should be worrying about is not the technical aspects of building and structuring the data warehouse, but the lack of clearly-specified users!