Tradeoffs Between Parallel Database Systems, Hadoop, and HadoopDB as Platforms for Petabyte-Scale An

Seminar: Applied Mathematics
Event time: Tuesday, November 6, 2012 - 11:00am to 12:00pm
Location: AKW 200
Speaker: Daniel Abadi
Speaker affiliation: Yale University
Event description:

As the demand for analyzing data sets of increasing variety and scale continues to explode, the software options for performing this analysis are beginning to proliferate. No fewer than a dozen new parallel database systems have been created in the last decade to meet this demand. At the same time, MapReduce-based options, such as the open-source Hadoop framework, are becoming increasingly popular, and a plethora of research publications in the past five years demonstrate how MapReduce can be used to accelerate and scale various data analysis tasks.

Both parallel databases and MapReduce-based options have strengths and weaknesses that a practitioner must be aware of before selecting an analytical data management platform. In this talk, I will describe some experiences in using these systems, and the advantages and disadvantages of their popular implementations. I will then discuss a hybrid system that we are building in my lab, called HadoopDB, that attempts to combine the advantages of both types of platforms.