Tradeoffs Between Parallel Database Systems, Hadoop, and HadoopDB as Platforms for Petabyte-Scale An

Seminar:

Applied Mathematics

Event time:

Tuesday, November 6, 2012 - 11:00am to 12:00pm

Location:

AKW 200

Speaker:

Daniel Abadi

Speaker affiliation:

Yale University

Event description:

As the demand for analyzing data sets of increasing variety and scale continues to explode, the software options for performing this analysis are beginning to proliferate. No fewer than a dozen new parallel database systems have been created in the last decade to meet this demand. At the same time, MapReduce-based options, such as the open-source Hadoop framework, are becoming increasingly popular, and there has been a plethora of research publications in the past five years that demonstrate how MapReduce can be used to accelerate and scale various data analysis tasks.

Both parallel databases and MapReduce-based options have strengths and weaknesses that a practitioner must be aware of before selecting an analytical data management platform. In this talk, I will describe some experiences in using these systems, along with the advantages and disadvantages of their popular implementations. I will then discuss a hybrid system that we are building in my lab, called HadoopDB, that attempts to combine the advantages of both types of platforms.