Sessions


Tuesday, November 8


8:30am — 10:00am

General Session — Keynote Speakers

Michael Olson, Cloudera
Larry Feinsmith, JPMorgan Chase & Co
Hugh Williams, eBay

In compliance with JPMorgan Chase & Co. policy, Larry Feinsmith’s keynote presentation will not be made available as a Hadoop World 2011 resource.

Hugh Williams, eBay Keynote
Hugh Williams will discuss building Cassini, a new search engine at eBay that processes over 250 million search queries and serves more than 2 billion page views each day. Hugh will trace the genesis and building of Cassini as well as highlight and demonstrate the key features of this new search platform. He will discuss some of the challenges in scaling arguably the world’s largest real-time search problem, including the unique considerations associated with e-commerce and eBay’s domain, and how Hadoop and HBase are used to solve these problems.

Larry Feinsmith Keynote
Larry Feinsmith will talk about the challenges and successes of embracing Hadoop in a large-scale global banking enterprise. Larry will map the path taken to introduce Hadoop to the firm’s technology environment and the obstacles that lie ahead for big enterprises.

Mike Olson Keynote
Now in its fifth year, Apache Hadoop has firmly established itself as the platform of choice for organizations that need to efficiently store, organize, analyze, and harvest valuable insight from the flood of data that they interact with. Since its inception as an early, promising technology that inspired curiosity, Hadoop has evolved into a widely embraced, proven solution used in production to solve a growing number of business problems that were previously impossible to address. In his opening keynote, Mike will reflect on the growth of the Hadoop platform due to the innovative work of a vibrant developer community and on the rapid adoption of the platform among large enterprises. He will highlight how enterprises have transformed themselves into data-driven organizations, illustrating compelling use cases across vertical markets. He will also discuss Cloudera’s plans to stay at the forefront of Hadoop innovation and its role as the trusted solution provider for Hadoop in the enterprise. He will share Cloudera’s view of the road ahead for Hadoop and Big Data and discuss the vital roles for the key constituents across the Hadoop community, ecosystem and enterprises.



10:15am — 11:05am

Building Web Analytics Processing on Hadoop at CBS Interactive

Michael Sun, CBS Interactive

We successfully adopted Hadoop as the web analytics platform at CBS Interactive, processing one billion weblogs daily from hundreds of web site properties. After introducing Lumberjack, the extraction, transformation and loading (ETL) framework we built on Python and Hadoop Streaming, which is under review for open-source release, I will talk about web metrics processing on Hadoop, focusing on weblog harvesting, parsing, dimension look-up, sessionization, and loading into a database. Since migrating processing from a proprietary platform to Hadoop, we have achieved robustness, fault tolerance and scalability, and a significant reduction in processing time to reach SLA (over six hours so far).
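
For readers unfamiliar with the approach the abstract describes, here is a minimal, illustrative sketch (not Lumberjack code) of a Hadoop Streaming reducer in Python that sessionizes weblog hits, assuming an upstream mapper emits tab-separated records keyed by cookie ID:

    #!/usr/bin/env python
    # Hypothetical Hadoop Streaming reducer: groups weblog hits by cookie ID and
    # splits them into sessions separated by 30 minutes of inactivity.
    # Assumes the mapper emitted well-formed lines: <cookie_id>\t<epoch_seconds>\t<url>
    import sys

    SESSION_GAP = 30 * 60  # seconds of inactivity that closes a session

    def flush(cookie, hits):
        if not cookie:
            return
        hits.sort()                      # order this visitor's hits by time
        session_id, last_ts = 0, None
        for ts, url in hits:
            if last_ts is not None and ts - last_ts > SESSION_GAP:
                session_id += 1          # gap too long: start a new session
            print("%s\t%d\t%d\t%s" % (cookie, session_id, ts, url))
            last_ts = ts

    current, hits = None, []
    for line in sys.stdin:
        cookie, ts, url = line.rstrip("\n").split("\t", 2)
        if cookie != current:            # Streaming delivers keys in sorted order
            flush(current, hits)
            current, hits = cookie, []
        hits.append((int(ts), url))
    flush(current, hits)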


10:15am — 11:05am

Hadoop’s Life in Enterprise Systems

Y Masatani, NTT DATA

NTT DATA has been providing Hadoop professional services to enterprise customers for years. In this talk we will categorize Hadoop integration cases based on our experience and illustrate archetypal design practices for how Hadoop clusters are deployed into existing infrastructure and services. We will also present enhancement cases motivated by customer demand, including GPUs for heavy computation, HDFS-capable storage systems, and more.


10:15am — 11:05am

Building Realtime Big Data Services at Facebook with Hadoop and HBase

Jonathan Gray, Facebook

Facebook has one of the largest Apache Hadoop data warehouses in the world, primarily queried through Apache Hive for offline data processing and analytics. However, the need for realtime analytics and end-user access has led to the development of several new systems built using Apache HBase. This talk will cover specific use cases and the work done at Facebook around building large scale, low latency and high throughput realtime services with Hadoop and HBase. This includes several significant contributions to existing projects as well as the release of new open source projects.


10:15am — 11:05am

Completing the Big Data Picture: Understanding Why and Not Just What

Sid Probstein, Attivio

It’s increasingly clear that Big Data is not just about volume – but also the variety, complexity and velocity of enterprise information. Integrating data with insights from unstructured information such as documents, call logs, and web content is essential to driving sustainable business value. Aggregating and analyzing unstructured content is challenging because human expression is diverse, varies by location, and changes over time. To understand the causes of data trends, you need advanced text analytic capabilities. Furthermore, you need a system that provides direct, real-time access to discover hidden insights. In this session, you will learn how unified information access (UIA) uniquely completes the picture by integrating Big Data directly with unstructured content and advanced text analytics, and making it directly accessible to business users.


10:15am — 11:05am

Hadoop in a Mission Critical Environment

Jim Haas, CBS Interactive

Our need for better scalability in processing weblogs is illustrated by the change in requirements: processing 250 million vs. 1 billion web events a day (and growing). The Data Warehouse group at CBSi has been transitioning core processes to re-architected Hadoop processes for two years. We will cover strategies used for successfully transitioning core ETL processes to big data capabilities and present a how-to guide for re-architecting a mission-critical data warehouse environment while it’s running.


11:15am — 12:05pm

The Blind Men and the Elephant

Matthew Aslett, The 451 Group

Who is contributing to the Hadoop ecosystem, what are they contributing, and why? Who are the vendors that are supplying Hadoop-related products and services and what do they want from Hadoop? How is the expanding ecosystem benefiting or damaging the Apache Hadoop project? What are the emerging alternatives to Hadoop and what chance do they have? In this session, the 451 Group will seek to answer these questions based on their latest research and present their perspective of where Hadoop fits in the total data management landscape.


11:15am — 12:05pm

Storing and Indexing Social Media Content in the Hadoop Ecosystem

Lance Riedel, Jive Software

Jive is using Flume to deliver the content of a social web (250M messages/day) to HDFS and HBase. Flume’s flexible architecture allows us to stream data to our production data center as well as to the Amazon Web Services data center. We periodically build and merge Lucene indices with Hadoop jobs and deploy these to Katta to provide near-real-time search results. This talk will explore our infrastructure and the decisions we’ve made to handle a fast-growing set of real-time data feeds. We will further explore other uses for Flume throughout Jive, including log collection and our distributed event bus.


11:15am — 12:05pm

Building Relational Event History Model with Hadoop

Josh Lospinoso, University of Oxford

In this session we will look at Reveal, a statistical network analysis library built on Hadoop that uses relational event history analysis to grapple with the complexity, temporal causality, and uncertainty associated with dynamically evolving, growing, and changing networks. There are a broad range of applications for this work, from finance to social network analysis to network security.


11:15am — 12:05pm

Hadoop Troubleshooting 101

Kate Ting, Cloudera

Attend this session and walk away armed with solutions to the most common customer problems. Learn proactive configuration tweaks and best practices to keep your cluster free of fetch failures, job tracker hangs, and other common issues.


11:15am — 12:05pm

The Hadoop Stack – Then, Now and In The Future

Charles Zedlewski, Cloudera
Eli Collins, Cloudera

Many people refer to Apache Hadoop as their system of choice for big data management but few actually use just Apache Hadoop. Hadoop has become a proxy for a much larger system which has HDFS storage at its core. The Apache Hadoop based “big data stack” has changed dramatically over the past 24 months and will change even more over the next 24 months. This session will explore the trends in the evolution of the Hadoop stack, change in architecture and changes in the kinds of use cases that are supported. It will also review the role of interoperability and cohesion in the Apache Hadoop stack and the role of Apache Bigtop in this regard.


1:15pm — 2:05pm

Raptor – Real-time Analytics on Hadoop

Soundar Velu, Sungard

Raptor combines Hadoop & HBase with machine learning models for adaptive data segmentation, partitioning, bucketing, and filtering to enable ad-hoc queries and real-time analytics.
Raptor has intelligent optimization algorithms that switch query execution between HBase and MapReduce. Raptor can create per-block dynamic bloom filters for adaptive filtering. A policy manager allows optimized indexing and autosharding.
This session will address how Raptor has been used in prototype systems for predictive trading, time-series analytics, smart customer care solutions, and a generalized analytics solution that can be hosted on the cloud.
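
As an aside for readers new to the idea, the toy Python sketch below illustrates the general per-block Bloom filter technique the abstract mentions; it is not Raptor’s implementation, and the data layout is invented for illustration:

    # Toy illustration of per-block Bloom filtering (not Raptor's actual code):
    # each data block carries a small bit set; a lookup only scans blocks
    # whose filter might contain the key.
    import hashlib

    class BloomFilter(object):
        def __init__(self, size_bits=1024, num_hashes=3):
            self.size, self.k, self.bits = size_bits, num_hashes, 0

        def _positions(self, key):
            for i in range(self.k):
                digest = hashlib.md5(("%d:%s" % (i, key)).encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits |= (1 << pos)

        def might_contain(self, key):
            return all(self.bits & (1 << pos) for pos in self._positions(key))

    def scan(blocks, key):
        # blocks is a list of (BloomFilter, rows) pairs; skip blocks whose
        # filter rules the key out (false positives possible, negatives not).
        for block_filter, rows in blocks:
            if block_filter.might_contain(key):
                for row in rows:
                    if row[0] == key:
                        yield row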


1:15pm — 2:05pm

Hadoop Trends & Predictions

Vanessa Alverez, Forrester

Hadoop is making its way into the enterprise as organizations look to extract valuable information and intelligence from the mountains of data in their storage environments. The way in which this data is analyzed and stored is changing, and Hadoop has become a critical part of this transformation. In this session, Vanessa will cover the trends we are seeing in enterprise Hadoop adoption and how it’s being used, as well as predictions on where Hadoop, and Big Data in general, are going as we enter 2012.


1:15pm — 2:05pm

Unlocking the Value of Big Data with Oracle

Jean-Pierre Dijcks, Oracle

Analyzing new and diverse digital data streams can reveal new sources of economic value, provide fresh insights into customer behavior and identify market trends early on. But this influx of new data can create challenges for IT departments. To derive real business value from Big Data, you need the right tools to capture and organize a wide variety of data types from different sources, and to be able to easily analyze it within the context of all your enterprise data. Attend this session to learn how Oracle’s end-to-end value chain for Big Data can help you unlock the value of Big Data.


1:15pm — 2:05pm

Lily: Smart Data at Scale, Made Easy

Steven Noels, Outerthought

Lily is a repository made for the age of Data, combining CDH, HBase and Solr into a powerful, high-level, developer-friendly backing store for content-centric applications with the ambition to scale. In this session, we highlight why we chose HBase as the foundation for Lily, and how Lily allows users not only to store, index and search vast quantities of data, but also to track audience behaviour and generate recommendations, all in real time.


1:15pm — 2:05pm

Security Considerations for Hadoop Deployments

Jeremy Glesner, Berico Technologies
Richard Clayton, Berico Technologies

Security in a distributed environment is a growing concern for most industries. Few face security challenges like the Defense Community, who must balance complex security constraints with timeliness and accuracy. We propose to briefly discuss the security paradigms defined in DCID 6/3 by NSA for secure storage and access of data (the “Protection Level” system). In addition, we will describe the implications of each level on the Hadoop architecture and various patterns organizations can implement to meet these requirements within the Hadoop ecosystem. We conclude with our “wish list” of features essential to meet the federal security requirements.


2:15pm — 3:05pm

The State of Big Data Adoption in the Enterprise

Tony Baer, Ovum IT Software

As Big Data has captured attention as one of “the next big things” in enterprise IT, most of the spotlight has focused on early adopters. But what is the state of Big Data adoption across the enterprise mainstream? Ovum recently surveyed 150 global organizations in a variety of vertical industries, each with revenue of $500 million or more and managing large enterprise data warehouses. We will share the findings from that research in this session and reveal the similarities in awareness, readiness, and business drivers that emerge when these organizations are compared.


2:15pm — 3:05pm

Building a Model of Organic Link Traffic

Brian David Eoff, Bit.ly

At bitly we study behaviour on the internet by capturing clicks on shortened URLs. This link traffic comes in many forms yet, when studying human behaviour, we’re only interested in using ‘organic’ traffic: the traffic patterns caused by actual humans clicking on links that have been shared on the social web. To extract these patterns, we employ Python/NumPy, Hadoop Streaming and some machine learning to create a model of organic traffic patterns based on bitly’s click logs. This model lets us extract the traffic we’re interested in from the variety of patterns generated by inorganic entities following bitly links.
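
To make the idea concrete, here is a toy Python sketch, not bitly’s actual model, of scoring a click record with a handful of hand-picked features and a logistic link; the feature names, weights and JSON fields are illustrative assumptions:

    # Toy sketch (not bitly's model): score a click record with a few simple
    # features and a logistic model, emitting only likely-organic clicks.
    import json
    import math
    import sys

    # Hypothetical weights; in practice these would be learned offline on Hadoop.
    WEIGHTS = {"known_browser_ua": 1.2, "has_referrer": 0.8, "burst_rate": -2.5}
    BIAS = -0.3

    def organic_probability(click):
        features = {
            "known_browser_ua": 1.0 if "Mozilla" in click.get("ua", "") else 0.0,
            "has_referrer": 1.0 if click.get("referrer") else 0.0,
            "burst_rate": min(click.get("clicks_last_minute", 0) / 60.0, 1.0),
        }
        score = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
        return 1.0 / (1.0 + math.exp(-score))   # logistic link

    if __name__ == "__main__":
        # Usable as a Hadoop Streaming mapper over JSON click logs.
        for line in sys.stdin:
            click = json.loads(line)
            if organic_probability(click) > 0.5:
                sys.stdout.write(line)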


2:15pm — 3:05pm

Life in Hadoop Ops – Tales From the Trenches

Eric Sammer, Cloudera
Gregory Baker, AT&T Interactive
Karthik Ranganathan, Facebook
Nicholas Evans, AOL Advertising

This session will be a panel discussion with experienced Hadoop operations practitioners from several different organizations. We’ll discuss the role, the challenges, and how both will change in the coming years.


2:15pm — 3:05pm

HDFS Name Node High Availability

Aaron Myers, Cloudera
Suresh Srinivas, Hortonworks

HDFS HA has been a highly sought after feature for years. Through collaboration between Cloudera, Facebook, Yahoo!, and others, a high availability system for the HDFS Name Node is actively being worked on. This talk will discuss the architecture and setup of this system.


2:15pm — 3:05pm

Hadoop Network and Compute Architecture Considerations

Jacob Rapp, Cisco

Hadoop is a popular framework for web 2.0 and enterprise businesses that are challenged to store, process and analyze large amounts of data as part of their business requirements. Hadoop’s framework brings a new set of challenges related to the compute infrastructure and underlying network architectures. This session reviews the state of Hadoop enterprise environments, discusses fundamental and advanced Hadoop concepts, and reviews benchmarking analysis and projections for big data growth as they relate to data center and cluster designs. The session also discusses network architecture trade-offs and the advantages of close integration between compute and networking.


3:30pm — 4:20pm

Data Mining in Hadoop, Making Sense Of It in Mahout!

Michael Cutler, British Sky Broadcasting

Much of Hadoop adoption thus far has been for use cases such as processing log files, text mining, and storing masses of file data — all very necessary, but largely not exciting. In this presentation, Michael Cutler presents a selection of methodologies, primarily using Mahout, that will enable you to derive real insight into your data (mined in Hadoop) and build a recommendation engine focused on the implicit data collected from your users.


3:30pm — 4:20pm

WibiData: Building Personalized Applications with HBase

Aaron Kimball, Odiago
Garrett Wu, Odiago

WibiData is a collaborative data mining and predictive modeling platform for large-scale, multi-structured, user-centric data. It leverages HBase to combine batch analysis and real time access within the same system, and integrates with existing BI, reporting and analysis tools. WibiData offers a set of libraries for common user-centric analytic tasks, and more advanced data mining libraries for personalization, recommendation, and other predictive modeling applications. Developers can write re-usable libraries that are also accessible to data scientists and analysts alongside the WibiData libraries. In this talk, we will provide a technical overview of WibiData, and show how we used it to build FoneDoktor, a mobile app that collects data about device performance and app resource usage to offer personalized battery and performance improvement recommendations directly to users.


3:30pm — 4:20pm

Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools

Guy Harrison, Quest Software
Arvind Prabhakar, Cloudera

As Hadoop graduates from pilot project to a mission critical component of the enterprise IT infrastructure, integrating information held in Hadoop and in Enterprise RDBMS becomes imperative.

We’ll look at key scenarios driving Hadoop and RDBMS integration and review technical options. In particular, we’ll take a deep dive into the Apache Sqoop project, which expedites data movement between Hadoop and any JDBC database, as well as providing a framework that allows developers and vendors to create connectors optimized for specific targets such as Oracle, Netezza, etc.
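
As a flavor of the basic workflow, the sketch below drives a Sqoop import and export from Python; the connection URL, credentials, table names and paths are placeholders, and it assumes the sqoop binary and a suitable JDBC driver are installed:

    # Illustrative only: import one RDBMS table into HDFS with Sqoop, then
    # export aggregates back out. All names and URLs are placeholders.
    import subprocess

    subprocess.check_call([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/sales",   # any JDBC URL works
        "--username", "etl",
        "--table", "orders",
        "--target-dir", "/warehouse/orders",
        "--num-mappers", "8",                       # parallel slices of the table
    ])

    subprocess.check_call([
        "sqoop", "export",
        "--connect", "jdbc:mysql://dbhost/sales",
        "--username", "etl",
        "--table", "order_aggregates",
        "--export-dir", "/warehouse/order_aggregates",
    ])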


3:30pm — 4:20pm

Hadoop and Graph Data Management: Challenges and Opportunities

Daniel Abadi, Yale University + Hadapt

As Hadoop rapidly becomes the universal standard for scalable data analysis and processing, it is increasingly important to understand its strengths and weaknesses for particular application scenarios in order to avoid inefficiency pitfalls. For example, Hadoop has great potential to perform scalable graph analysis if it is used correctly. Recent benchmarking has shown that simple implementations can be 1300 times less efficient than a more optimal Hadoop-centered implementation. In this talk, Daniel Abadi will give an overview of a recent research project at Yale University that investigates how to perform sub-graph pattern matching within a Hadoop-centered system that is three orders of magnitude faster than a simpler approach. In his talk Daniel will highlight how the cleaning, transforming, and parallel processing strengths of Hadoop are combined with storage optimized for graph data analysis. He will then discuss further changes that are needed in the core Hadoop framework to take performance to the next level.


3:30pm — 4:20pm

The Hadoop Award for Government Excellence

Bob Gourley, Crucial Point LLC

Federal, state and local governments and the development community surrounding them are busy creating solutions leveraging Apache Hadoop capabilities. This session will highlight the top five picks from an all-star panel of judges. Who will take home the coveted Government Big Data Solutions Award for 2011? This presentation will also highlight key Big Data mission needs in the federal space and provide other insights which can fuel solutions in the sector.


4:30pm — 5:20pm

I Want to Be BIG – Lessons Learned at Scale

David "Sunny" Sundstrom, SGI

SGI has been a leading commercial vendor of Hadoop clusters since 2008. Leveraging SGI’s experience with high performance clusters at scale, SGI has delivered individual Hadoop clusters of up to 4000 nodes. In this presentation, through the discussion of representative customer use cases, you’ll explore major design considerations for performance and power optimization, how integrated Hadoop solutions leveraging CDH, SGI Rackable clusters, and SGI Management Center best meet customer needs, and how SGI envisions the needs of enterprise customers evolving as Hadoop continues to move into mainstream adoption.


4:30pm — 5:20pm

Next Generation Apache Hadoop MapReduce

Mahadev Konar, Hortonworks

The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on larger clusters. The new architecture will also increase innovation, agility and hardware utilization. We will present the architecture and design of the next generation of MapReduce and delve into the details of the architecture that make it much easier to innovate. We will also present large-scale and small-scale comparisons on some benchmarks with MRv1.


4:30pm — 5:20pm

Data Mining for Product Search Ranking

Aaron Beppu, Etsy

How can you rank product search results when you have very little data about how past shoppers have interacted with the products? Through large-scale analysis of its clickstream data, Etsy is automatically discovering product attributes (things like materials, prices, or text features) which signal that a search result is particularly relevant (or irrelevant) to a given query. This attribute-level approach makes it possible to appropriately rank products in search results, even if those products are brand new and one-of-a-kind. This presentation discusses Etsy’s efforts to predict relevance in product search, in which Hadoop is a central component.


4:30pm — 5:20pm

Hadoop and Netezza Deployment Models and Case Study

Krishnan Parasuraman, Netezza
Greg Rokita, Edmunds

Hadoop has rapidly emerged as a viable platform for Big Data analytics. Many experts believe Hadoop will subsume many of the data warehousing tasks presently done by traditional relational systems. In this session, you will learn about the similarities and differences of Hadoop and parallel data warehouses, and typical best practices. Edmunds will discuss how they increased delivery speed, reduced risk, and achieved faster reporting by combining ELT and ETL. For example, Edmunds ingests raw data into Hadoop and HBase and then reprocesses the raw data in Netezza. You will also learn how Edmunds uses prototyping to work on nearly raw data with the company’s analytics team using Netezza.


4:30pm — 5:20pm

From Big Data to Lives Saved: HBase in HealthCare

Doug Meil, Explorys
Charlie Lougheed, Explorys

Explorys, founded in 2009 in partnership with the Cleveland Clinic, is one of the largest clinical repositories in the United States with 10 million lives under contract.

HBase and Hadoop are at the center of Explorys. The Explorys healthcare platform is based upon a massively parallel computing model that enables subscribers to search and analyze patient populations, treatment protocols, and clinical outcomes. Already spanning billions of anonymized clinical records, Explorys provides uniquely powerful and HIPAA compliant solutions for accelerating life saving discovery.

Wednesday, November 9


8:30am — 9:45am

General Session — Keynote Speakers

Doug Cutting, Cloudera
James Markarian, Informatica

James Markarian Keynote: The Future of the Data Management Market, and What that Means to You
James Markarian will discuss historical trends and technology shifts in data management and how the data deluge has contributed to the emergence of Apache Hadoop.  James will showcase examples of how forward-looking organizations are leveraging Hadoop to maximize their Return on Data to improve insight and operations. By sharing his perspective about this next major analytics platform, James will discuss why Hadoop is poised to change the face of analytics and data management.  Finally he will challenge the Hadoop ecosystem to work together to close the remaining technology gaps in Hadoop.

Doug Cutting Keynote
Doug Cutting, co-creator of Apache Hadoop, will discuss the current state of the Apache Hadoop open source ecosystem and its trajectory for the future. Doug will highlight trends and changes happening throughout the platform, including new additions and important improvements. He will address the implications of how different components of the stack work together to form a coherent and efficient platform. He will draw particular attention to Bigtop, a project initiated by Cloudera to build a community around the packaging and interoperability testing of Hadoop-related projects with the goal of providing a consistent and interoperable framework. Cutting will also discuss the latest additions in CDH4 and the platform roadmap for CDH.



10:00am — 10:50am

Big Data Analytics – Data Professionals: the New Enterprise Rock Stars

Martin Hall, Karmasphere

In this session, we will explore how Hadoop and Big Data are re-inventing enterprise workflows and the pivotal role of the Data Analyst. We will examine the changing face of analytics and the streamlining of iterative queries through evolved user interfaces. We will explain how combining Hadoop and SQL-based analytics help companies discover emergent trends hidden in unstructured data, without having to retrain or hire data miners. We will illustrate how analysts can now connect to Big Data platforms, assemble working data sets from disparate sources, and analyze and mine that data for actionable insight. These analysts can then publish the results and feed their reporting tools, and use the results in company workflows – all without touching the command line.


10:00am — 10:50am

Architecting a Business-Critical Application in Hadoop

Stephen Daniel, NetApp

NetApp is in the process of moving a petabyte-scale database of customer support information from a traditional relational data warehouse to a Hadoop-based application stack. This talk will explore the application requirements and the resulting hardware and software architecture. Particular attention will be paid to trade-offs in the storage stack, along with data on the various approaches considered and benchmarked and the resulting final architecture. Attendees will learn about the range of architectures available when contemplating a large Hadoop project and the process NetApp used to choose among the alternatives.


10:00am — 10:50am

Hadoop and Performance

Todd Lipcon, Cloudera
Yanpei Chen, Cloudera

Performance is a thing that you can never have too much of. But performance is a nebulous concept in Hadoop. Unlike databases, there is no equivalent in Hadoop to TPC, and different use cases experience performance differently. This talk will discuss advances on how Hadoop performance is measured and will also talk about recent and future advances in performance in different areas of the Hadoop stack.


10:00am — 10:50am

Preview of the New Cloudera Management Suite

Henry Robinson, Cloudera
Phil Zeyliger, Cloudera
Vinithra Varadharajan, Cloudera

This session will preview what is new in the latest release of the Cloudera Management Suite. We will cover the common problems we’ve seen in Hadoop management and will do a demonstration of several new features designed to address these problems.


10:00am — 10:50am

BI on Hadoop in Financial Services

Stefan Groschupf, Datameer

This session is designed for banking and other financial services managers with technical experience and for engineers. It will discuss business intelligence platform deployments on Hadoop, including cost performance, customer analytics, value-at-risk analytics and IT SLAs.


11:00am — 11:50am

Hadoop vs. RDBMS for Big Data Analytics… Why Choose?

Mingsheng Hong, HP Vertica

When working with structured, semi-structured, and unstructured data, there is often a tendency to try and force one tool – either Hadoop or a traditional DBMS – to do all the work. At Vertica, we’ve found that there are reasons to use Hadoop for some analytics projects, and Vertica for others, and the magic comes in knowing when to use which tool and how these two tools can work together. Join us as we walk through some of the customer use cases for using Hadoop with a purpose-built analytics platform for an effective, combined analytics solution.


11:00am — 11:50am

Advancing Disney’s Data Infrastructure with Hadoop

Matt Estes, The Walt Disney Company

This is the story of why and how Hadoop was integrated into the Disney data infrastructure. Providing data infrastructure for Disney’s, ABC’s and ESPN’s Internet presences is challenging. Doing so requires cost-effective, performant, scalable and highly available solutions. Information requirements from the business add the need for these solutions to work together, providing consistent acquisition, storage and access to data. Burdened with a heavily laden commercial RDBMS infrastructure, Disney saw in Hadoop an opportunity to solve some challenging use cases. The deployment of Hadoop helped Disney address growing costs, scalability, and data availability. In addition, it provides our businesses with new data-driven business-to-consumer opportunities.


11:00am — 11:50am

Big Data Architecture: Integrating Hadoop with Other Enterprise Analytic Systems

Tasso Argyros, Aster Data

Recent research has pointed out the complementary nature of Hadoop and other data management solutions and the importance of leveraging existing systems, SQL, engineering, and operational skills, as well as incorporating novel uses of MapReduce to improve analytic processing. Come to this session to learn how companies optimize the use of Hadoop with other enterprise systems to improve overall analytical throughput and build new data-driven products. This session covers: ways to achieve high-performance integration between Hadoop and relational-based systems; Hadoop+NoSQL vs Hadoop+SQL architectures; high-speed, massively parallel data transfer to analytical platforms that can aggregate web log data with granular fact data; and strategies for freeing up capacity for more explorative, iterative analytics and ad hoc queries.


11:00am — 11:50am

Leveraging Hadoop to Transform Raw Data into Rich Features at LinkedIn

Abhishek Gupta, LinkedIn

This presentation focuses on the design and evolution of the LinkedIn recommendations platform. It currently computes more than 100 billion personalized recommendations every week, powering an ever growing assortment of products, including Jobs You May be Interested In, Groups You May Like, News Relevance, and Ad Targeting.
We will describe how we leverage Hadoop to transform raw data to rich features using knowledge aggregated from LinkedIn’s 100 million member base, how we use Lucene to do real-time recommendations, and how we marshal Lucene on Hadoop to bridge offline analysis with user-facing services.


11:00am — 11:50am

HDFS Federation

Suresh Srinivas, Hortonworks

Scalability of the NameNode has been a key issue for HDFS clusters. Because the entire file system metadata is stored in memory on a single NameNode, and all metadata operations are processed on this single system, the NameNode both limits the growth in size of the cluster and makes the NameService a bottleneck for the MapReduce framework as demand increases. HDFS Federation horizontally scales the NameService using multiple federated NameNodes/namespaces. The federated NameNodes share the DataNodes in the cluster as a common storage layer. HDFS Federation also adds client-side namespaces to provide a unified view of the file system. This presentation will describe the features and implementation of HDFS Federation scheduled for release with Hadoop-0.23.


1:00pm — 1:50pm

HBase Roadmap

Jonathan Gray, Facebook

This technical session will provide a quick review of the Apache HBase project, looking at it from the past to the future. It will cover the imminent HBase 0.92 release as well as what is slated for 0.94 and beyond. A number of companies and use cases will be used as examples to describe the overall direction of the HBase community and project.


1:00pm — 1:50pm

Replacing RDB/DW with Hadoop and Hive for Telco Big Data

Jason Han, NexR
Ja-Hyung Koo, Korea Telecom

This session will focus on the challenges of replacing existing relational database and data warehouse technologies with open source components. Jason Han will base his presentation on his experience migrating Korea Telecom’s (KT) CDR data from Oracle to Hadoop, which required converting many Oracle SQL queries to Hive HQL queries. He will cover the differences between SQL and HQL; the implementation of Oracle’s basic/analytic functions with MapReduce; the use of Sqoop for bulk-loading RDB data into Hadoop; and the use of Apache Flume for collecting fast-streamed CDR data. He’ll also discuss Lucene and ElasticSearch for near-real-time distributed indexing and searching. You’ll learn tips for migrating existing enterprise big data to open source, and gain insight into whether this strategy is suitable for your own data.


1:00pm — 1:50pm

Proven Tools to Simplify Hadoop Environments

Joey Jablonski, Dell
Vin Sharma, Intel

Do you see great potential in Hadoop but you also have questions or challenges to overcome? Come to this session to get answers to your questions and advice. Dell Big Data Architect Joey Jablonski and Intel Enterprise Software Strategist Vin Sharma will answer frequently asked questions about Hadoop, and share proven ways you can overcome challenges in deploying, managing, and tuning Hadoop environments. The discussion topics will include Hadoop operations, configuration management, upgrades and lifecycle management, monitoring and managing power and heat, and Hadoop performance tuning, testing, and optimization. The presenters will also discuss how rapid Hadoop deployment makes life easier for administrators, and talk about Crowbar, an open source Operations Framework.


1:00pm — 1:50pm

Radoop: A Graphical Analytics Tool for Big Data

Gábor Makrai, Radoop

Hadoop is an excellent environment for analyzing large data sets, but it lacks an easy-to-use graphical interface for building data pipelines and performing advanced analytics. RapidMiner is an excellent open-source tool for data analytics, but is limited to running on a single machine. In this presentation, we will introduce Radoop, an extension to RapidMiner that lets users interact with a Hadoop cluster. Radoop combines the strengths of both projects and provides a user-friendly interface for editing and running ETL, analytics, and machine learning processes on Hadoop. We will also discuss lessons learned while integrating HDFS, Hive, and Mahout with RapidMiner.


1:00pm — 1:50pm

How Hadoop is Revolutionizing Business Intelligence and Advanced Data Analytics

Dr. Amr Awadallah, Cloudera

The introduction of Apache Hadoop is changing the business intelligence data stack. In this presentation, Dr. Amr Awadallah, chief technology officer at Cloudera, will discuss how the architecture is evolving and the advanced capabilities it lends to solving key business challenges. Awadallah will illustrate how enterprises can leverage Hadoop to derive complete value from both unstructured and structured data, gaining the ability to ask and get answers to previously unaddressable big questions. He will also explain how Hadoop and relational databases complement each other, enabling organizations to access the latent information in all their data under a variety of operational and economic constraints.


2:00pm — 2:50pm

Data Ingestion, Egression, and Preparation for Hadoop

Sanjay Kaluskar, Informatica
David Teniente, Rackspace

One of the first challenges Hadoop developers face is accessing all the data they need and getting it into Hadoop for analysis. Informatica PowerExchange accesses a variety of data types and structures at different latencies (e.g. batch, real-time, or near real-time) and ingests data directly into Hadoop. The next step is to parse the data in preparation for analysis in Hadoop. Informatica provides a visual IDE to deploy pre-built parsers or design specific parsers for complex data formats and deploy them on Hadoop. Once the analysis is complete, Informatica PowerExchange delivers the resulting output to other information management systems such as a data warehouse. Learn in this session, from Informatica and Rackspace, how to get all the data you need into Hadoop, parse a variety of data formats and structures, and egress the resultant output to other systems.


2:00pm — 2:50pm

Advanced HBase Schema Design

Lars George, Cloudera

While running a simple key/value solution on HBase usually requires an equally simple schema, it is less trivial to design a schema for an application that must insert thousands of records per second.
This talk will address the architectural challenges of designing for either read or write performance in HBase. It will include examples of real-world use cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns.
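
For illustration only (not an example from the talk), the Python sketch below shows one common write-oriented row key pattern, a hash salt prefix plus a reversed timestamp, with invented field names:

    # A minimal sketch of a write-friendly HBase row key: a 1-byte hash "salt"
    # spreads sequential writes across regions, and a reversed timestamp makes
    # each user's newest events sort first. Illustrative, not from the session.
    import hashlib
    import struct

    NUM_SALT_BUCKETS = 16
    MAX_LONG = 2 ** 63 - 1

    def event_row_key(user_id, timestamp_ms):
        salt = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_SALT_BUCKETS
        reverse_ts = MAX_LONG - timestamp_ms          # newest-first within a user
        return (struct.pack(">B", salt)               # 1-byte salt bucket
                + user_id.encode()
                + struct.pack(">q", reverse_ts))      # big-endian, sorts correctly

    # A read of one user's recent events scans a single short key range (the salt
    # is recomputed from the user ID), while a write-heavy event stream fans out
    # over NUM_SALT_BUCKETS regions instead of hammering one.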


2:00pm — 2:50pm

Large Scale Log Data Analysis for Marketing in NTT Communications

Kenji Hara, NTT Communications

NTT Communications built a log analysis system for marketing using Hadoop, which explores internet users’ interests in, and feedback about, specified products or themes from access logs, query/click logs and CGM data.
Our system provides three features: sentiment analysis, co-occurring keyword extraction, and user interest estimation. For large-scale analysis, we use Hadoop with customized functions that reduce the shuffle size by doing more of the processing on the map side. We will also describe the features of our Hadoop cluster.
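
The map-side idea generalizes; as a minimal illustration (not NTT’s code), the Hadoop Streaming mapper below aggregates keyword counts in memory so far fewer records reach the shuffle:

    # Illustrative Hadoop Streaming mapper with in-mapper aggregation, a common
    # way to push work to the map side and shrink the shuffle: keyword counts
    # are accumulated in memory and emitted once per mapper, not once per line.
    import sys
    from collections import defaultdict

    counts = defaultdict(int)
    for line in sys.stdin:
        for keyword in line.split():
            counts[keyword] += 1        # aggregate locally; nothing emitted yet

    for keyword, count in counts.items():
        print("%s\t%d" % (keyword, count))   # far fewer records reach the shuffle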


2:00pm — 2:50pm

Practical Knowledge for Your First Hadoop Project

Mark Slusar, NAVTEQ
Boris Lublinsky, NAVTEQ
Mike Segel, NAVTEQ

A collection of guidelines and advice to help a technologist successfully complete their first Hadoop project. This presentation is based on our experiences initiating and executing several successful Hadoop projects. Part 1 focuses on tactics to “sell” Hadoop to stakeholders and senior management, including understanding what Hadoop is and what its “sweet spots” are, alignment of goals, picking the right project, and level-setting expectations. Part 2 provides recommendations on running a successful Hadoop development project. Topics covered include preparation and planning activities, training and preparing development teams, development and test activities, and deployment and operations activities. Also included are talking points to help with educating stakeholders.


2:00pm — 2:50pm

Apache Hadoop 0.23

Arun Murthy, Hortonworks

Apache Hadoop is the de-facto Big Data platform for data storage and processing. The current stable, production release of Hadoop is “hadoop-0.20”. The Apache Hadoop community is preparing to release “hadoop-0.23” with several major improvements including HDFS Federation and NextGen MapReduce. In this session, Arun Murthy, who is the Apache Hadoop Release Master for “hadoop.next”, will discuss the details of the major improvements in “hadoop-0.23”.


3:20pm — 4:10pm

Practical HBase

Ravi Veeramchaneni, Informatica

Many developers have experience working with relational databases using SQL. The transition to NoSQL data stores, however, is challenging and often confusing. This session will share experiences of using HBase, from hardware selection and deployment to design, implementation and tuning. At the end of the session, the audience will be in a better position to make the right choices on hardware selection, schema design and tuning HBase to their needs.


3:20pm — 4:10pm

Leveraging Hadoop for Legacy Systems

Mathias Herberts, Credit Mutuel Arkea

Since many companies in the financial sector still rely on legacy systems for their daily operations, Hadoop can only be truly useful in those environments if it can fit nicely among COBOL, VSAM, MVS and other legacy technologies. In this session, we will detail how Crédit Mutuel Arkéa solved this challenge and successfully mixed the mainframe and Hadoop.


3:20pm — 4:10pm

Changing Company Culture with Hadoop

Amy O'Connor, Nokia

We are living in a time of tremendous convergence, convergence of mobile, cloud and social… This convergence is forcing companies to change. At Nokia, we are changing the way we make decisions, from a manufacturing model to a data driven one. Yet making cultural changes is one of the hardest things to accomplish. In this talk, Amy O’Connor will highlight the journey Nokia is taking to evolve its culture – from building a platform for cultural evolution on top of Hadoop, to the administration of Nokia’s data, to how the company conducts the analysis that is enabling Nokia to compete with data.


3:20pm — 4:10pm

Leveraging Big Data in the Fight Against Spam and Other Security Threats

Wade Chambers, Proofpoint

In 2004, Bill Gates told a select group of participants at the World Economic Forum that “two years from now, the spam issue will be solved.” Eight years later, the spam problem is only getting worse, with no sign of relief. Big Data technologies such as Hadoop, MapReduce, Cassandra, and real-time stream processing can be leveraged to develop new approaches that fight spam, phishing, and other email-borne threats more effectively than ever before. This session will focus on the development of radical new “spam anomalytics” techniques whereby billions of messages and message-related events are analyzed daily to find statistical norms, and identify deviations from those norms, in order to better detect and defend against email threats as they emerge.
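
As a toy illustration of the norm-and-deviation idea (not Proofpoint’s system), the Python sketch below flags a sender whose current message volume is many standard deviations away from its history:

    # Toy "anomalytics" sketch: model per-sender message volume as roughly
    # normal and flag senders whose current volume deviates strongly from
    # their historical norm. Thresholds and data are invented.
    import math

    def zscore(current, history):
        mean = sum(history) / float(len(history))
        variance = sum((x - mean) ** 2 for x in history) / float(len(history))
        std = math.sqrt(variance) or 1.0       # avoid dividing by zero
        return (current - mean) / std

    def is_anomalous(current, history, threshold=4.0):
        return abs(zscore(current, history)) > threshold

    # e.g. a sender that usually sends ~100 msgs/hour suddenly sending 5,000:
    print(is_anomalous(5000, [95, 110, 102, 98, 105]))   # -> True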


3:20pm — 4:10pm

SHERPASURFING – Open Source Cyber Security Solution

Wayne Wheeles, Novii Design

Every day, billions of packets, both benign and malicious, flow in and out of networks. Every day, it is an essential task for the modern defensive cyber security organization to reliably survive the sheer volume of data: bring the NETFLOW data to rest, enrich it, correlate it and analyze it. SHERPASURFING is an open source platform built on the proven Cloudera Distribution for Apache Hadoop that enables organizations to perform the cyber security mission at scale and at an affordable price point. This session will include an overview of the solution and components, followed by a demonstration of analytics.


4:20pm — 5:10pm

Indexing the Earth – Large Scale Satellite Image Processing Using Hadoop

Oliver Guinan, Skybox Imaging

Skybox Imaging is using Hadoop as the engine of its satellite image processing system. Using CDH to store and process vast quantities of raw satellite image data enables Skybox to create a system that scales as it launches larger numbers of ever more complex satellites. Skybox has developed a CDH-based framework that allows image processing specialists to develop complex processing algorithms using native code and then publish those algorithms into the highly scalable Hadoop MapReduce interface. This session will provide an overview of how we use HDFS, HBase and MapReduce to process raw camera data into high-resolution satellite images.
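
The native-code-inside-Streaming pattern can be sketched briefly; the Python mapper below is hypothetical, and the binary name and flags are placeholders rather than Skybox tooling:

    # Hypothetical Hadoop Streaming mapper that hands each raw-image path to a
    # native processing binary. Hadoop handles scheduling, data movement and
    # retries; the native code does the heavy pixel work.
    import subprocess
    import sys

    for line in sys.stdin:
        raw_image_path = line.strip()
        if not raw_image_path:
            continue
        returncode = subprocess.call(
            ["./process_image", "--input", raw_image_path, "--level", "ortho"])
        status = "ok" if returncode == 0 else "failed"
        print("%s\t%s" % (raw_image_path, status))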


4:20pm — 5:10pm

Hadoop as a Service in Cloud

Junping Du, VMware
Richard McDougall, VMware

The Hadoop framework was originally designed to run in native environments on commodity hardware. However, with the growing adoption of cloud computing, there is a stronger requirement to build Hadoop clusters on a public/private cloud so that customers can benefit from virtualization and multi-tenancy. This session discusses how to address some of the challenges of providing Hadoop as a service on a virtualization platform, such as performance, rack awareness, job scheduling, memory overcommitment, etc., and proposes some solutions.
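
One of the items above, rack awareness, is commonly handled with a topology script; the Python sketch below is an invented example of the kind of script Hadoop can be configured to invoke, mapping node addresses to rack paths:

    #!/usr/bin/env python
    # Illustrative rack-awareness topology script: given node IPs/hostnames as
    # arguments, print one rack path per node. The mapping scheme here (third
    # octet of the IP as the "rack") is a made-up example; on a virtualized
    # cluster this is where hypervisor/host placement would be encoded instead.
    import sys

    def rack_for(node):
        parts = node.split(".")
        if len(parts) == 4 and all(p.isdigit() for p in parts):
            return "/rack-%s" % parts[2]     # e.g. 10.1.17.42 -> /rack-17
        return "/default-rack"               # hostnames we cannot parse

    for node in sys.argv[1:]:
        print(rack_for(node))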


4:20pm — 5:10pm

Extending the Enterprise Data Warehouse with Hadoop

Jonathan Seidman, Orbitz Worldwide
Rob Lancaster, Orbitz Worldwide

Hadoop provides the ability to extract business intelligence from extremely large, heterogeneous data sets that were previously impractical to store and process in traditional data warehouses. The challenge now is in bridging the gap between the data warehouse and Hadoop. In this talk we’ll discuss some steps that Orbitz has taken to bridge this gap, including examples of how Hadoop and Hive are used to aggregate data from large data sets, and how that data can be combined with relational data to create new reports that provide actionable intelligence to business users.


4:20pm — 5:10pm

The Powerful Marriage of R and Hadoop

David Champagne, Revolution Analytics

When two of the most powerful innovations in modern analytics come together, the result is revolutionary.

This session will cover:

  • An overview of R, the open source programming language specifically developed for statistical analysis and data visualization, used by more than 2 million people.
  • The ways that R and Hadoop have been integrated.
  • A use case that provides real-world experience.
  • A look at how enterprises can take advantage of both of these industry-leading technologies.

4:20pm — 5:10pm

Gateway: Cluster Virtualization Framework

Konstantin Shvachko, eBay

Access to Hadoop clusters through dedicated portal nodes (typically located behind firewalls and performing user authentication and authorization) can have several drawbacks — as shared multitenant resources they can create contention among users and increase the maintenance overhead for cluster administrators. This session will discuss the Gateway system, a cluster virtualization framework that provides multiple benefits: seamless access from users’ workplace computers through corporate firewalls; the ability to failover to active clusters for scheduled or unscheduled downtime, as well as the ability to redirect traffic to other clusters during upgrades; and user access to clusters running different versions of Hadoop.