Journalism in the Age of Data: Q&A with Dhrumil Mehta


  • Maisie O'Brien

Ash Center Technology and Democracy Fellow Dhrumil Mehta discusses his role as a database journalist at the data-driven news site FiveThirtyEight. Dhrumil uses an impressive digital toolkit to turn a wealth of harvested public information into usable data for data-driven stories on politics. In this interview, Dhrumil speaks to the connections and tensions between data analysis and content creation, and emphasizes the importance of transparency and data availability for database journalists.

Q: Tell us a little about yourself and your day job with Nate Silver.

A: I’m a database journalist at FiveThirtyEight, which is a data-driven news site.  We started out as a political blog and polling aggregation site run by Nate Silver and housed at the New York Times.  Since then, we’ve expanded into areas like economics, science, lifestyle, politics, and sports, and now we’re owned by ESPN.  Even if we’re covering the latest movies or winning sports team or whatever it may be, there’s some data-driven element to our reporting— that’s really the unique aspect of FiveThirtyEight.

In my role, I build and maintain the databases that underpin our political reporting, and I curate FiveThirtyEight’s polling database that we use for data analysis and predictions.  Along with that, I write articles based on insights that I find in our databases.

Q: Journalism and programming are two very distinct skill sets.  How did you get interested in journalism and data scraping?  Which came first?

A: I studied philosophy and computer science at Northwestern University.  As a student, I was really interested in language analysis and I took a course with a professor who was studying metaphor, which led me to undertake a project trying to determine whether computers can understand metaphors.  I needed a language data set for my project and I found that the Sunlight Foundation, which focuses on transparency-related issues in DC, had parsed the Congressional Record into an XML format, making it a really easy-to-use natural language data set.  Analyzing this data set is what sparked my interest in politics.
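The Sunlight Foundation's XML version of the Congressional Record made it easy to treat floor speeches as a natural language data set. The schema below is hypothetical (the foundation's actual format differed), but the workflow it sketches is the same: load the XML, walk the speech elements, and collect plain text per speaker.

```python
# Hedged sketch: the tags and attributes here are invented for
# illustration; only the parse-then-group workflow is the point.
import xml.etree.ElementTree as ET

sample = """
<record date="2013-06-12">
  <speech speaker="Ms. Smith">
    The rising tide of debt lifts no boats.
  </speech>
  <speech speaker="Mr. Jones">
    We must build bridges, not walls.
  </speech>
</record>
"""

def speeches_by_speaker(xml_text):
    """Return {speaker: [speech text, ...]} from a record document."""
    root = ET.fromstring(xml_text)
    out = {}
    for node in root.iter("speech"):
        speaker = node.get("speaker")
        out.setdefault(speaker, []).append(node.text.strip())
    return out

corpus = speeches_by_speaker(sample)
print(corpus["Ms. Smith"][0])
```

Once speeches are grouped this way, any text-analysis step (metaphor detection included) can run over clean per-speaker corpora instead of raw markup.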

After graduating from college, I became a software developer at Amazon and eventually found this position at FiveThirtyEight.  I’d also hung around the Knight lab at Northwestern, which is an innovation lab in journalism.  That’s how I met a lot of journalists and realized that there are people doing real-time data analysis for the public good, which was definitely inspiring to me.

Q: What is it like working at FiveThirtyEight?  Describe the grind of producing both data analysis and content.

A: Coming from a software development background, the deadlines at FiveThirtyEight are much faster.  In software, there’s this idea that you build an MVP, a minimum viable product, and you iterate on it.  You make it better and better over time.  In journalism, you have a piece of news, you do your analysis, you publish it, and you move onto the next thing.  You’re not really iterating on a story.  You have to get it right the first time because it stops being news at the end of the day.

At FiveThirtyEight, accuracy is always our first priority.  The content has to be accurate, then it has to be fast, then everything else.  And ‘everything else,’ in the coding world, means that your functions are clean and that there aren’t huge blocks of code that are illegible.  In my case, I’m willing to sacrifice architectural flourish for code that is accurate and timely.

Q: At FiveThirtyEight, does the story drive the data or does the data drive the story?  What do you seek out first?

A: I think we take both approaches.  We often find ourselves looking at things going on in the world and asking, “What data can we access to shed light on these events?”  But sometimes we find ourselves doing the opposite.  If a huge data set drops, we’re trying to determine what we can understand from it.

For example, we recently did a project on Uber drivers in Manhattan that combined both approaches.  One of our reporters said, “Hey, I wonder what Uber’s impact on Manhattan is?”  He submitted a freedom of information request to the New York City government and we got back a giant dump of files: 90 million rows of Uber data.  Then, we had a host of new questions driven by that information.  It wasn’t just how can we answer our initial question, but what other questions can we explore based on the data that we find.

The Uber data was great because it generated a set of stories.  It required more sophisticated analysis than we usually do.  A lot of the projects we take on are one-offs.  You write it up and move on.  This was something where we had to learn how to analyze geospatial data sets and integrate them into a database. There was a large amount of data that went into our Uber reporting.
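A common first step with a trip dump of that scale is spatial aggregation: snap each pickup coordinate to a coarse grid and count trips per cell. The sketch below illustrates that idea only; the field names and coordinates are invented, and a real pipeline over 90 million rows would stream from disk or run inside a database rather than hold everything in memory.

```python
# Minimal sketch of grid-binning pickups; not FiveThirtyEight's code.
from collections import Counter

def grid_cell(lat, lon, cell_size=0.01):
    """Snap a coordinate to the corner of its ~1 km grid cell."""
    return (round(lat // cell_size * cell_size, 4),
            round(lon // cell_size * cell_size, 4))

def pickups_per_cell(trips, cell_size=0.01):
    """Count trips per grid cell; trips are (lat, lon) pairs."""
    counts = Counter()
    for lat, lon in trips:
        counts[grid_cell(lat, lon, cell_size)] += 1
    return counts

# Two pickups near Times Square, one near JFK (illustrative points).
trips = [(40.7581, -73.9855), (40.7584, -73.9851), (40.6413, -73.7781)]
counts = pickups_per_cell(trips)
print(counts.most_common(1))  # busiest cell first
```

Counts like these are what feed heat maps and neighborhood-level comparisons; finer questions (trips per hour, per borough) are just different group-by keys over the same rows.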

Q: Can you describe other memorable stories?

A: A really interesting piece that I wrote recently was on Bobby Jindal’s campaign financing.  I spoke with a professor who created an algorithm to classify the ethnicity of names based on phonetics with some accuracy.  Using this algorithm, we tried to track whether South Asians have stopped contributing to Bobby Jindal’s campaigns over the years.
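The interview doesn't describe the professor's algorithm, so as a stand-in, here is Soundex, a classic phonetic encoding that maps similar-sounding names to the same short code. It only illustrates what a phonetic feature looks like; a real name-classification model would be far more sophisticated than this.

```python
# Classic American Soundex -- shown only as an example of a phonetic
# encoding, NOT the algorithm used in the Jindal story.
def soundex(name):
    """Encode a name as its Soundex code: first letter + three digits."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    first, rest = name[0], name[1:]
    encoded = [codes.get(first, "")]
    prev = codes.get(first, "")
    for ch in rest:
        code = codes.get(ch, "")
        if ch in "HW":          # H and W do not break a run of like codes
            continue
        if code and code != prev:
            encoded.append(code)
        prev = code
    digits = "".join(encoded[1:])
    return first + (digits + "000")[:3]

print(soundex("Robert"), soundex("Rupert"))  # both encode to R163
```

Because names that sound alike share a code, features like this let a model generalize from labeled names to unseen but phonetically similar ones.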

The broad conclusion that I drew from it is that the GOP can’t just put up minority candidates and expect to get minority votes as a result.  It was a really cool story because it wasn’t solely about the data analysis and the methodology.  It was about how the data analysis could contribute to a complex, bigger picture discussion.

Q: What challenge to democratic governments do you wish more people were thinking about or working on?

A: I don’t think it’s sexy, but just getting the data is so important.  There’s this weird cat-and-mouse game that happens between the government and journalists where journalists are always inventing new pieces of software to parse some weird format that the government has decided to use.

Almost every year, I go to a conference organized by the National Institute for Computer-Assisted Reporting (NICAR).  And every time I go, I learn about a new tool to use machine learning to parse data out of PDFs.  There are all these strange things that we have to invent to make sense of what is coming out of government.  I think a lot of it is just the importance of making things clear and transparent.  If you have an old C program that works for your government agency, that doesn’t necessarily mean that it’s serving a public purpose.  Making data available and accessible to journalists and citizens is vitally important to the health of democracy.