Data Eng Weekly

Hadoop Weekly Issue #115

05 April 2015

This issue marks a new milestone for Hadoop Weekly: it's the first issue sent out to over 5,000 subscribers! Thanks to everyone who has helped get the word out, and to all of those who have published the content covered in this newsletter. The theme of this week's issue is the ecosystem's shift to streaming and real-time processing, with coverage of Kafka and Spark, new releases of Samza and Drill, and articles on the Databricks Cloud and the demand for real-time processing.


These slides cover two presentations from the recent Kafka meetup at LinkedIn. The first is about offset management in Kafka consumers: how it is implemented with ZooKeeper and with Kafka, as well as the trade-offs between the two. The second covers the data pipeline at Netflix, which is changing to make heavier use of Kafka. Specifically, the talk covers their move from Chukwa to Kafka, how they split up Kafka clusters for different retention/durability requirements, how they've implemented resilience, and how they are running Kafka in AWS (provisioning, health checks, and challenges).

Databricks has a post describing two areas of improvement in the Kafka integration with Spark Streaming in Spark 1.3. First, there's a new "Direct API for Kafka," which ensures exactly-once semantics by mapping each RDD to discrete ranges of offsets in Kafka. Second, there are improvements to the Python API for Kafka, which make the API much simpler.
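To make the offset-range idea concrete, here is a minimal sketch in plain Python. It is not the Spark or Kafka API: `OffsetRange`, `plan_batch`, and `process_batch` are hypothetical names, and the "log" is just an in-memory dict of lists. It only illustrates why tying each batch to a fixed range of offsets makes a replayed batch produce the same result instead of a duplicate.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical description of one partition's slice of a batch."""
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def plan_batch(committed: Dict[int, int], latest: Dict[int, int],
               topic: str) -> List[OffsetRange]:
    """Build one range per partition covering records not yet processed."""
    ranges = []
    for partition in sorted(latest):
        start = committed.get(partition, 0)
        if latest[partition] > start:
            ranges.append(OffsetRange(topic, partition, start, latest[partition]))
    return ranges


def process_batch(log: Dict[int, List[int]], ranges: List[OffsetRange],
                  committed: Dict[int, int],
                  results: Dict[Tuple[int, int, int], int]) -> None:
    """Compute a per-range sum, then advance the committed offsets.

    Results are keyed by the range itself, so replaying the same ranges
    after a failure overwrites the earlier output rather than adding to it.
    """
    for r in ranges:
        records = log[r.partition][r.from_offset:r.until_offset]
        results[(r.partition, r.from_offset, r.until_offset)] = sum(records)
    for r in ranges:
        committed[r.partition] = r.until_offset


# Toy data standing in for a two-partition topic (assumed values).
log = {0: [1, 2, 3], 1: [10, 20]}
committed: Dict[int, int] = {}
results: Dict[Tuple[int, int, int], int] = {}

ranges = plan_batch(committed, {0: 3, 1: 2}, "events")
process_batch(log, ranges, committed, results)
process_batch(log, ranges, committed, results)  # simulated replay of the same batch
```

Because the batch is defined by its offsets rather than by whatever happens to arrive, the replay reads exactly the same records, which is the property the sketch is meant to show.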

The Cloudera blog has the second part of a series on tuning Apache Spark jobs. This post focuses on tuning resource allocation (particularly for Spark on YARN), parallelism (also describing how defaults are typically chosen by the framework), serialization, and data formats.
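As a rough illustration of the kind of arithmetic involved in resource allocation on YARN, the sketch below sizes executors for a hypothetical cluster. The function name, the reserved core and memory per node, the five cores per executor, and the 7% memory overhead are all assumptions chosen for the example, not values taken from the post; check the documentation for your Spark version before relying on any of them.

```python
def size_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=5, reserved_cores=1,
                   reserved_mem_gb=1, overhead_fraction=0.07):
    """Estimate spark-submit settings for a homogeneous YARN cluster.

    All defaults are illustrative assumptions:
    - one core and 1 GB per node are left for the OS and Hadoop daemons
    - each container needs heap plus a fractional off-heap overhead
    - one executor slot is given up for the YARN application master
    """
    usable_cores = cores_per_node - reserved_cores
    usable_mem_gb = mem_gb_per_node - reserved_mem_gb

    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node == 0:
        raise ValueError("not enough cores per node for one executor")

    # Split node memory evenly, then carve the overhead out of each share.
    container_mem_gb = usable_mem_gb / executors_per_node
    heap_gb = int(container_mem_gb / (1 + overhead_fraction))

    return {
        "--num-executors": nodes * executors_per_node - 1,
        "--executor-cores": cores_per_executor,
        "--executor-memory": f"{heap_gb}G",
    }


# Hypothetical cluster: 6 worker nodes, 16 cores and 64 GB each.
settings = size_executors(nodes=6, cores_per_node=16, mem_gb_per_node=64)
```

The returned keys mirror the corresponding `spark-submit` flags so the result can be read as a starting point for a command line; real workloads usually need further adjustment after measuring.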

The SequenceIQ blog has a post on the alerts feature in the soon-to-be-released Apache Ambari 2.0. With the new alert framework, Ambari can alert on any metric exposed by components of the Hadoop stack. The post describes how to build an alert, how the metrics are collected and alerts triggered, and how Ambari alerts will be integrated with Periscope (SequenceIQ's auto-scaling system for Hadoop).

The upcoming 3.8 release of Hue brings a new editor for Oozie workflows. A post on the Hue blog contains a video of the new feature in action and a walkthrough (with screenshots) of the main features. The new editor also adds support for HiveServer2 and Spark actions.


Barron's has an interview with this year's ACM Turing Award winner, Michael Stonebraker. Stonebraker is a database researcher at MIT and has built a number of companies around his research. In the interview, he spoke about the history of databases, the rise of in-memory databases, the future of NoSQL and Hadoop, and more. He notes that NoSQL databases are rapidly attempting to add full SQL support and that Hadoop is likely to merge with data warehousing (as SQL becomes more important there, too).

Spark just celebrated its fifth birthday as an open-source project, having been open-sourced in March of 2010. A post on the Databricks blog celebrates the growth of the community, discusses some of the philosophies of the project ("keep the Spark engine small and compact," "focus on simple, stable APIs"), and mentions growth in Spark's standard libraries (Spark Streaming, MLlib, etc.).

Alation came out of stealth with a product to help add context to how people are using data within a database or data warehouse. Their tool includes support for Hadoop (HDFS and Hive) as well as traditional data warehouses and RDBMSs. It uses machine learning and a software agent to monitor access logs and metadata to build context about how data is being accessed and used.

Hortonworks has announced availability of their HDP Certified Developer exam. The hands-on exam requires participants to execute tasks across data ingestion, data transformation, and data analysis on a three-node cluster.

Amazon Web Services announced the new D2 instance type this week. This is the first in the new generation of AWS instances with a large amount of SATA storage. SiliconANGLE notes that these instances are a good fit for the Hadoop-on-AWS market, and the article covers some of the features of the new instances.

HBaseCon is in just over a month, and the Cloudera blog has a preview of several talks at the conference. These include presenters from Salesforce, Hortonworks, Cloudera, Google, Pinterest, and Bloomberg.

The O'Reilly Radar blog has a post describing several real-time data processing tools. It mentions a number of items from the Hadoop stack (Flume, Kafka, Spark, HBase) as well as tools from cloud providers and database vendors. If nothing else, the post shows that there's no shortage of options when it comes to picking real-time software.

InfoWorld has an article describing Databricks' flagship product, Databricks Cloud. The author relays first-hand experience with the product (noting that it's still immature) from a recent training session. The post also describes how Databricks Cloud fits into the company's business model and the risk of challengers building a similar product if it's shown to be in demand.

Datanami has a rather bearish article on the Hadoop industry. It notes that vendors continue to lose money, and that Hadoop's slice of the big data market will hover near 1% for the next few years. In trying to investigate why Hadoop isn't generating more revenue, the author points to several surveys that note Hadoop has seen slow adoption and is often augmenting existing systems rather than replacing them.


OptioPay has open-sourced their Go client for Kafka. The GitHub project has more information on how to use it.

IBM has released version 4.0 of their distribution, IBM Open Platform with Apache Hadoop (BigInsights). The distribution bundles several open-source projects found in most distributions, and there are a few add-on packages aimed at business analysts, data scientists, and more.

Apache Drill 0.8 was released this week. The blog has an overview of the new features in the release. Highlights include support for large records (>128KB), support for many new SQL features, and improved performance and reliability.

MapR has announced support for Drill 0.8 as part of their distribution.

Apache Samza version 0.9.0 was released this week. Highlights of the release include improved RocksDB performance, a switch to Kafka, and integration of container logs with ELK. A post on the Apache blog has details on all the new features, the progress of the community, and features planned for future releases.


Curated by Datadog



SF Spark Hackers: Kick-off Meeting (San Francisco) - Monday, April 6

Self-Service Data Exploration and Nested Data Analytics on Hadoop (Montecito) - Tuesday, April 7

April SF Hadoop Users Meetup (San Francisco) - Wednesday, April 8

Business Need => Tableau + R + MapR + DW = Amazing Outcomes (Milpitas) - Thursday, April 9


Spark after Dark: Advanced Analytics, Streaming Data, Machine Learning (Chicago) - Tuesday, April 7


Stream Processing and Data Acquisition in Real-Time with Apache Storm (Ann Arbor) - Saturday, April 11


All about the Mesos: YARN, Spark, and Streaming (Baltimore) - Thursday, April 9

New York

SQL and Machine Learning on Hadoop using HAWQ (New York) - Tuesday, April 7

War Stories from the Hadoop Trenches (New York) - Wednesday, April 8

Apache Drill Workshop (New York) - Wednesday, April 8


April Presentation Night (Cambridge) - Wednesday, April 8


Scalable Big Graph Data Processing in Spark (Vancouver) - Monday, April 6


Hadoop Research at KTH: Hops and Flink Streaming (Stockholm) - Wednesday, April 8


Indexing 3-Dimensional Trajectories: Apache Spark and Cassandra Integration (Barcelona) - Wednesday, April 8


Pivotal Open Source Night (Singapore) - Thursday, April 9


Spark DataFrames for Large-Scale Data Science with Spark Machine Learning (Melbourne) - Tuesday, April 7