By Jordan Kellerstrass
For most of its history, international development has been an inexact science. Claims that a development intervention improved health or economic outcomes generally went unvalidated. With the increasing reliance on data to prove that a program is effective, however, the field is entering a new era. Data-centric evidence is becoming the lead arbiter of whether an intervention is renewed or scaled.
The continual improvement of technology and the digitization of vast amounts of survey, sensor, network, and other numeric and textual data promise more reliable, timely information for development actors. At the same time, the ability to use this data well is hampered by a lack of consistent tools and methodologies for end-to-end data management. The Mezuri team, which includes researchers from UC Berkeley’s Technology and Infrastructure for Emerging Regions (TIER) group, University of Washington, University of Michigan, and Portland State University, is addressing this issue. Together, they are building a cloud-based data management platform to support the entire pipeline, from raw data collection to meaningful, actionable results in the form of visualizations and statistics.
Eric Brewer is a professor of computer science at UC Berkeley, where he leads the TIER group and is Mezuri’s principal investigator; he is also vice president of infrastructure at Google. To provide a deeper sense of the story behind the development of the Mezuri Data Platform, we asked him the following questions.
1. What inspired the idea for Mezuri?
The economic development space has a checkered history, which has led to more intensive efforts to “prove” that an intervention is effective and thus should be scaled up — these broadly fall under the phrase “measurement and evaluation,” or M&E for short. This has led to techniques like randomized controlled trials (RCTs) that really aim to meet this bar of proof. However, such an approach implies a significant data management problem: How do you collect, manage, and protect the data you need?
The tools have been rudimentary, typically just laptops and Excel. The consequences are that it is easy to lose data, hard to share it, and even hard to know what exactly was done to the data since you got it.
Making data management and analysis easy and accurate is the essence of Mezuri.
2. How does your background help guide Mezuri development?
A key aspect of Mezuri is leveraging Cloud computing, which is an area of long-time interest for me (and roughly what I do at Google). The Cloud brings reliable data storage with access control, unlimited processing power, and the ability to share not only data but also best practices. Finally, done well, it provides ease of use, as users only need web access to participate and not their own servers or even data centers.
3. Who is developing Mezuri? How did they come together?
I tried to pull together the best groups I could from around the country to form the whole solution. University of Washington has done a great job with Open Data Kit, which is the basis for the survey aspects of Mezuri. Evan Thomas at Portland State has done the most work with real users (e.g., economists and NGOs) and real data with his SweetSense data collection system; his field experience is particularly valuable. Colleagues at University of Michigan [Lab11] are experts in novel sensors and high-volume, real-time data collection. Bringing these elements together is not easy, but I love the team we have.
4. Why is this possible now? How is this project part of the story of computer science?
The two big enablers for us are the Cloud for scalable data storage and computing, and the rise of mobile phones, especially smartphones, which enable high-quality surveys and data collection pretty much anywhere in the world. These two together will not only change the practice of “development”; they are also one of the greatest shifts, not just in computing but in the history of the world. The impact of phones has already been remarkable, but I think the Cloud will be bigger (unless they are viewed as one transition in the end).
5. What is Docker and what role does it play?
Docker, at its core, is a change in the level of abstraction of Cloud computing. Traditionally, the basic unit of abstraction was the “virtual machine” — it is as though you have a raw server and you need to install an OS and applications, and then maintain the OS. This is a high burden, even for computer scientists. The new abstraction is the “container,” which is a bundle of applications that fit together well and that can be created once and then reused easily by many. The details of the machine and OS matter less, and users can pick containers that have the software they need (and already know how to use).
6. Who will benefit from a system like Mezuri?
The immediate target audience is social scientists and economists who need to manage data well as part of their research. This should lead initially to better data management and later to more aggressive evaluations that include more kinds of data and that mix surveys and sensors well. However, the real goal is to benefit those in developing regions by enabling better investment decisions based on better data.
7. How does Mezuri enable or encourage data sharing?
First, just having data safely in the Cloud is a good start; that is the easiest path to sharing, similar in spirit to Dropbox and Google Docs. Because some of this data has important privacy risks, the key to sharing is actually being able to limit the sharing to just those who should have access. We can also share data at different levels of processing and summarization: often we can share the anonymized or aggregated data, but not the raw data. Finally, it is equally important to share the workflows, that is, the template of processing steps necessary to convert raw data into knowledge. Sharing workflows enables formal review of processes, sharing of best practices, and repeatability.
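As a rough illustration of sharing data at different levels of processing, the following Python sketch treats a workflow as an ordered list of named steps and keeps the output of every step. It is not Mezuri's actual API: the record fields, the salt, and the functions `anonymize`, `aggregate`, and `run_workflow` are all hypothetical, and a salted hash is only a simple pseudonymization, not a guarantee of anonymity.

```python
import hashlib
from collections import defaultdict
from statistics import mean

# Hypothetical raw survey records; the field names are illustrative only.
RAW = [
    {"name": "A. Respondent", "village": "V1", "income": 120.0},
    {"name": "B. Respondent", "village": "V1", "income": 80.0},
    {"name": "C. Respondent", "village": "V2", "income": 200.0},
]


def anonymize(records, salt="example-salt"):
    """Replace the direct identifier with a truncated salted hash."""
    out = []
    for r in records:
        digest = hashlib.sha256((salt + r["name"]).encode("utf-8")).hexdigest()
        row = {k: v for k, v in r.items() if k != "name"}
        row["respondent_id"] = digest[:12]
        out.append(row)
    return out


def aggregate(records):
    """Summarize to one row per village: respondent count and mean income."""
    groups = defaultdict(list)
    for r in records:
        groups[r["village"]].append(r["income"])
    return [
        {"village": v, "n": len(xs), "mean_income": mean(xs)}
        for v, xs in sorted(groups.items())
    ]


# The workflow is an ordered list of named steps, so it can be shared and
# reviewed separately from the data it is applied to.
WORKFLOW = [("anonymize", anonymize), ("aggregate", aggregate)]


def run_workflow(records, workflow):
    """Apply each step in order and keep the output of every level."""
    levels = {"raw": records}
    current = records
    for name, step in workflow:
        current = step(current)
        levels[name] = current
    return levels


levels = run_workflow(RAW, WORKFLOW)
print(levels["aggregate"])
```

In this sketch, `levels["raw"]` would stay with the research team, `levels["anonymize"]` could go to approved collaborators, and `levels["aggregate"]` could be shared more widely. Rerunning the same `WORKFLOW` on the same raw data reproduces each level, which is the repeatability point made above.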
8. What are the short- and long-term visions for Mezuri?
In the short term, we need to get real users to collect, manage, and analyze data using the system. This will teach us a great deal about what the real requirements are and what we need to do to make the system sufficiently useful in practice. Long term, I hope to see Mezuri emerge as the de facto way to do data management for the social sciences and a key part of better decisions around how to best spend the world’s development money.
Jordan Kellerstrass is a PhD student in computer science, a researcher in the Technology and Infrastructure for Emerging Regions (TIER) group, and a member of the DIL Idea Team.