07 May The History of Hadoop – The Engine Which Drives Big Data
Depending on how one defines its birth, Hadoop is now 10 years old. In that decade, Hadoop has gone from being the hopeful answer to Yahoo’s search-engine woes to a general-purpose computing platform that’s poised to be the foundation for the next generation of data-based applications.
Alone, Hadoop is a software market that IDC predicts will be worth $813 million in 2016 (although that number is likely very low), but it’s also driving a big data market the research firm predicts will hit more than $23 billion by 2016. Since Cloudera launched in 2008, Hadoop has spawned dozens of startups and spurred hundreds of millions of dollars in venture capital investment.
Almost everywhere you go online now, Hadoop is there in some capacity. Facebook, eBay, Etsy, Yelp, Twitter, Salesforce.com — you name a popular web site or service, and the chances are it’s using Hadoop to analyze the mountains of data it’s generating about user behavior and even its own operations. Even in the physical world, forward-thinking companies in fields ranging from entertainment to energy management to satellite imagery are using Hadoop to analyze the unique types of data they’re collecting and generating.
Everyone involved with information technology at least knows what it is. Hadoop even serves as the foundation for new-school graph and NoSQL databases, as well as bigger, badder versions of relational databases that have been around for decades. But it wasn’t always this way, and today’s uses are a long way off from the original vision of what Hadoop could be.
When the seeds of Hadoop were first planted in 2002, the world just wanted a better open-source search engine. So then-Internet Archive search director Doug Cutting and University of Washington graduate student Mike Cafarella set out to build it. They called their project Nutch and it was designed with that era’s web in mind.
Looking back on it today, early iterations of Nutch were kind of laughable. About a year into their work on it, Cutting and Cafarella thought things were going pretty well because Nutch was already able to crawl and index hundreds of millions of pages. “At the time, when we started, we were sort of thinking that a web search engine was around a billion pages,” Cutting explained to me, “so we were getting up there.” There are now about 700 million web sites and, according to Wired’s Kevin Kelly, well over a trillion web pages. But getting Nutch to work wasn’t easy. It could only run across a handful of machines, and someone had to watch it around the clock to make sure it didn’t fall down.
“I remember working on it for several months, being quite proud of what we had been doing, and then the Google File System paper came out and I realized ‘Oh, that’s a much better way of doing it. We should do it that way,’” reminisced Cafarella. “Then, by the time we had a first working version, the MapReduce paper came out and that seemed like a pretty good idea, too.”
Google released the Google File System paper in October 2003 and the MapReduce paper in December 2004. The latter would prove especially revelatory to the two engineers building Nutch. “What they spent a lot of time doing was generalizing this into a framework that automated all these steps that we were doing manually,” Cutting explained.
Raymie Stata, founder and CEO of Hadoop startup VertiCloud (and former Yahoo CTO), calls MapReduce “a fantastic kind of abstraction” over the distributed computing methods and algorithms most search companies were already using: “Everyone had something that pretty much was like MapReduce because we were all solving the same problems. We were trying to handle literally billions of web pages on machines that are probably, if you go back and check, epsilon more powerful than today’s cell phones. … So there was no option but to latch hundreds to thousands of machines together to build the index. So it was out of desperation that MapReduce was invented.”
[Image: Parallel processing in MapReduce, from the Google paper]
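To make the abstraction concrete, here is a minimal single-machine sketch of the canonical word-count example from the MapReduce paper, written in plain Java (the language Cutting and Cafarella later chose for Hadoop). The class and method names are illustrative, not Hadoop’s actual API, and a real Hadoop job would distribute the map, shuffle, and reduce phases across a cluster rather than run them in one process.

```java
import java.util.*;

public class MiniMapReduce {
    // Map phase: emit a (word, 1) pair for each word in an input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key (the framework does
    // this automatically), then reduce phase: sum the counts per word.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int v : entry.getValue()) sum += v;
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = run(List.of("the quick brown fox", "the lazy dog"));
        System.out.println(counts);
        // prints {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

The appeal Cutting describes is visible even in this toy: the programmer writes only the map and reduce logic, while the framework handles the grouping, scheduling, and fault tolerance that Nutch’s authors had been doing by hand.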
Over the course of a few months, Cutting and Cafarella built up the underlying file systems and processing framework that would become Hadoop (in Java, notably, whereas Google’s MapReduce used C++) and ported Nutch on top of it. Now, instead of having one guy watch a handful of machines all day long, Cutting explained, they could just set it running on between 20 and 40 machines that he and Cafarella were able to scrape together from their employers.
Anyone vaguely familiar with the history of Hadoop can guess what happens next: In 2006, Cutting went to work at Yahoo, which was equally impressed by the Google File System and MapReduce papers and wanted to build open-source technologies based on them. The storage and processing parts of Nutch were spun out to form Hadoop (named after Cutting’s son’s stuffed elephant) as an open-source Apache Software Foundation project, while the Nutch web crawler remained its own separate project.
“This seemed like a perfect fit because I was looking for more people to work on it, and people who had thousands of computers to run it on,” Cutting said.
Cafarella, now an associate professor at the University of Michigan, opted to forgo a career in corporate IT and focus on his education. He’s happy as a professor — and currently working on a Hadoop-complementary project called RecordBreaker — but, he joked, “My dad calls me the Pete Best of the big data world.”
Ironically, though, the 2006-era Hadoop was nowhere near ready to handle production search workloads at webscale — the very task it was created to do. “The thing you gotta remember,” explained Hortonworks Co-founder and CEO Eric Baldeschwieler (who was previously VP of Hadoop software development at Yahoo), “is at the time we started adopting it, the aspiration was definitely to rebuild Yahoo’s web search infrastructure, but Hadoop only really worked on 5 to 20 nodes at that point, and it wasn’t very performant, either.”
Stata recalls a “slow march” of horizontal scalability, growing Hadoop’s capabilities from the single digits of nodes into the tens of nodes and ultimately into the thousands. “It was just an ongoing slog … every factor of 2 or 1.5 even was serious engineering work,” he said. But Yahoo was determined to scale Hadoop as far as it needed to go, and it continued investing heavy resources into the project. It actually took years for Yahoo to move its web index onto Hadoop, but in the meantime the company made what would be a fortuitous decision to set up what it called a “research grid” for the company’s data scientists, to use today’s parlance. It started with dozens of nodes and ultimately grew to hundreds as they added more and more data and Hadoop’s technology matured. What began life as a proof of concept fast became a whole lot more.
“This very quickly kind of exploded and became our core mission,” Baldeschwieler said, “because what happened is the data scientists not only got interesting research results — what we had anticipated — but they also prototyped new applications and demonstrated that those applications could substantially improve Yahoo’s search relevance or Yahoo’s advertising revenue.” Shortly thereafter, Yahoo began rolling out Hadoop to power analytics for various production applications. Eventually, Stata explained, Hadoop had proven so effective that Yahoo merged its search and advertising into one unit so that Yahoo’s bread-and-butter sponsored search business could benefit from the new technology.
And that’s exactly what happened, because although data scientists didn’t need things like service-level agreements, business leaders did. So, Stata said, Yahoo implemented some scheduling changes within Hadoop. And although data scientists didn’t need security, Securities and Exchange Commission requirements mandated a certain level of security when Yahoo moved its sponsored search data onto it. “That drove a certain level of maturity,” Stata said. “… We ran all the money in Yahoo through it, eventually.”
The transformation was pretty much complete by 2008, Baldeschwieler said: Hadoop was “behind every click” (or every batch process, technically) at Yahoo, doing everything from those line-of-business applications to spam filtering to personalized display decisions on the Yahoo front page. By the time Yahoo spun out Hortonworks into a separate, Hadoop-focused software company in 2011, Yahoo’s Hadoop infrastructure consisted of 42,000 nodes and hundreds of petabytes of storage.
However, although Yahoo was responsible for the vast majority of development during its formative years, Hadoop didn’t exist in a bubble inside Yahoo’s headquarters. It was a full-on Apache project that attracted users and contributors from around the world — guys like Tom White, a Welshman who wrote O’Reilly Media’s book Hadoop: The Definitive Guide despite being, as Cutting describes him, a guy who just liked software and played with Hadoop at night.
Up in Seattle in 2006, a young Google engineer named Christophe Bisciglia was using his 20 percent time to teach a computer science course at the University of Washington. Google wanted to hire new employees with experience working on webscale data, but its MapReduce code was proprietary, so it bought a rack of servers and used Hadoop as a proxy.
Explained Bisciglia: “It was somewhat challenging as an interviewer when you’re talking to these undergrads who are really smart kids and you ask them to come up with an algorithm or do some data structures and then you say, ‘Well, what would you do with a thousand times as much data?’ and they just go blank. And it’s not because they’re not smart, it’s just, well, what context do they really have to think about it at that scale?”
Only, the course didn’t really scale because it wasn’t feasible to deploy Hadoop clusters at universities across the country. So Google teamed with IBM and the National Science Foundation to buy a soon-to-be decommissioned data center, install 2,000 Hadoop nodes in it, and offer up grants to researchers and universities instead. Managing this cluster made Bisciglia realize how hard it was to manage Hadoop at any real scale, and how much he wished there was someone he could call to help.
“That was kind of when the light went off that that company didn’t exist and it needed to be started,” he said.
With Google’s blessing, Bisciglia spent time thinking about his idea and incorporated a company called Cloudera in March 2008. He reconnected with open-source acquaintance and now-Cloudera CEO Mike Olson shortly thereafter and the two took the idea to Accel Partners, where Ping Li connected them with former Facebook data engineer Jeff Hammerbacher and former Yahoo engineering VP Amr Awadallah, both of whom were doing entrepreneur-in-residence stints. The four of them officially founded Cloudera in August 2008 and it closed its first funding round in April 2009.
Collaborating to Serve the Enterprise
A group of Hadoop and big data application vendors, system integrators and end users is forming the Open Data Platform association to create a common core big data kernel and eliminate fragmentation in the space.
The Open Data Platform (ODP) will work directly with specific Apache projects while adhering to the Apache Software Foundation’s guidelines for contributions. The members note that ODP will help them collaborate across various Apache projects and other open source-licensed big data projects to meet enterprise-class requirements.
“The best way to accelerate innovation and adoption of platform technologies like Hadoop is through an open source model,” says Shaun Connolly, vice president of Corporate Strategy at Hortonworks, a Platinum member of ODP.
“The Open Data Platform initiative will rally both enterprise end users and vendors around a well-defined common core platform against which big data solutions can be qualified. This will free up the broader big data ecosystem to focus on data-driven applications that deliver proactive insights for business,” Connolly says.
The initial Platinum members of ODP include GE, Hortonworks, IBM, Infosys, Pivotal and SAS. Initial Gold members include Altiscale, Capgemini, EMC, Verizon Enterprise Solutions and VMware. Pivotal’s Madra says he expects membership to grow rapidly.
“Infosys is seeing rapid adoption of open source software in the world’s largest enterprises across all major industry segments,” says Navin Budhiraja, head of Architecture and Technology at Infosys.
“As all businesses strive to become digital, they see an increasing need for a platform that can support real-time and actionable insights, self-service exploration and fluid data schemas to quickly adapt to the dynamic business needs,” Budhiraja says. “This will require them to deploy new web-scale architectures and the adoption of these modern architectures can be greatly accelerated if they are based on open standards and easy access to trained talent. Open Data Platform will create such an ecosystem, preserving the rapid innovation cycles of open source software while still providing the benefits of broad vendor support and interoperability.”