Data Eng Weekly

Hadoop Weekly Issue #161

13 March 2016

This week's issue is short and sweet (although there sure are a lot of events!). In terms of long reads, there are very interesting posts on Kafka Streams and Flafka at Vodafone. On the release front, MapR, Apache Flink, and Apache Phoenix all had big releases. Congrats to the Flink team on achieving version 1.0!


The Azure Data Lake blog has a post demonstrating Scala implicits with Spark. Using the example of adding a saveToAzureSql method to a DataFrame, the post shows how to write an implicit conversion method along with the necessary JDBC code.

Spark's GraphX is a graph processing library that extends Spark RDDs. This introductory post gives some basic examples of the API and dives into some more advanced features (such as PageRank and Pregel-like calculations). The post is full of example code, which should be enough to get a new user started.
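For readers new to PageRank, here is a minimal sketch of the underlying computation in plain Python. It does not use GraphX or Spark, and it is not taken from the linked post; the function name `pagerank`, the toy edge list, and the damping factor of 0.85 are all assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Power-iteration PageRank over a list of (src, dst) edges.

    Illustrative only: this is plain Python, not the GraphX API.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Each node splits its damped rank equally among its out-links.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical toy graph: a -> b -> c -> a, plus d -> a.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
ranks = pagerank(edges)
```

In this toy graph, node "d" has no inbound links, so it ends up with only the baseline share of rank, while the ranks as a whole continue to sum to 1.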

This presentation describes a cable company's migration from an Oracle Exadata-based data warehouse to a Hadoop-based system for handling petabytes of data. During the transition, they tried out Phoenix, Impala, and Titan. Drawing on the experience of rolling out and productionizing Titan atop HBase, the presentation describes several lessons learned.

The Confluent blog has a post about Kafka Streams, a feature of the upcoming Kafka 0.10 (and also in a preview release of the Confluent Platform). Kafka Streams is a lightweight, "hipster" processing framework built to fill a gap identified by the LinkedIn team that built Apache Samza. It provides a lot of out-of-the-box support (such as joins and stateful processing) with a simple API and without requiring a separate distributed computing framework like YARN. The post dives pretty deep into why Kafka Streams is important and what types of use cases it's built to solve.
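As a rough illustration of what "stateful processing" means here, the sketch below keeps a running count per key as events arrive. It is plain Python, not the Kafka Streams API (which is a Java library), and the names `count_by_key` and `clicks` are hypothetical.

```python
from collections import defaultdict


def count_by_key(events):
    """Yield (key, running_count) after each (key, value) event.

    The per-key counts are the operator's state. In this toy version
    the state is just a local dictionary held in memory.
    """
    counts = defaultdict(int)
    for key, _value in events:
        counts[key] += 1
        yield key, counts[key]


# Hypothetical click events keyed by user.
clicks = [("alice", "/home"), ("bob", "/cart"), ("alice", "/checkout")]
updates = list(count_by_key(clicks))
```

Each incoming record produces an updated result for its key, which is the general shape of a stateful streaming aggregation.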

The Altiscale blog has a post with tips for running scheduled Hadoop jobs from cron, and it motivates using Apache Oozie as a better alternative. Given that Oozie is built specifically for the Hadoop ecosystem, it supports features like Kerberos and also has data dependency support via coordinator actions.

The Cloudera blog has a post about how Vodafone UK uses Flume with Kafka for event transport in their data infrastructure. The post describes their multi-datacenter architecture and several types of performance tuning that they performed. Using a three-node Kafka cluster and two Flume agents, they're able to process over 1 million events/sec (end-to-end).


Dell and BlueData, makers of the EPIC software for provisioning Docker-based Hadoop clusters, announced a partnership this week.


MapR 5.1 shipped this week. It includes Hadoop, Spark, MapR Streams (general availability), and more. MapR touts first-class support for JSON across real-time event streaming, MapR-DB, and other parts of the system. Other features include security enhancements (access control expressions and selective auditing), SSD optimizations, and improved Docker support. The MapR blog has many more details, and CIO has more coverage of the improved container support.

Apache Flink 1.0.0 was released this week. Key highlights include public API compatibility for 1.x releases, support for complex event processing, improved support for high-memory operations, and improved monitoring.

The Hortonworks blog has details of the features of Apache Ambari 2.2, which is part of HDP 2.4. The most notable features are automated upgrades, simplified security options, and additional troubleshooting information.

Apache Apex Malhar 3.3.1-incubating was released this week. Malhar is the development library with prebuilt connectors/operators/etc for Apex. In this release, the team has fixed a number of bugs.

Apache Kudu (incubating) released version 0.7.1. This fixed a handful of high-priority bugs.

Apache Phoenix, the SQL-on-HBase system, announced version 4.7 this week. The new release includes beta support for ACID transactions, enhanced consistency guarantees for secondary indexes, improved performance, and over 150 bug fixes.


Curated by Datadog



#OCBigData Meetup (Irvine) - Wednesday, March 16

Big Data Application Meetup (Palo Alto) - Wednesday, March 16

Malhar & Geode Integration; Ingest: Kafka to Hadoop with Apex & Results Into Geode (San Jose) - Thursday, March 17


Big Data and Retail: Building Shopping Lists and Data Processing Engines (Bellevue) - Wednesday, March 16


Cleveland Big Data and Hadoop User Group (Mayfield Village) - Monday, March 14

North Carolina

Spark vs. Hadoop for Big Data (Durham) - Tuesday, March 15


Apache Spark Proof of Technology by IBM (McLean) - Tuesday, March 15

Analyzing Event Streams Using Spark and GraphX w/ Myles Baker (Richmond) - Tuesday, March 15

Real-Time Aggregations, Approximations, Similarities, and Recommendations (McLean) - Tuesday, March 15


LVTech TechTalk: Big Data, Hadoop, and All That (Bethlehem) - Tuesday, March 15

New Jersey

How to Build a Recommendation Engine Using Spark 1.6 and HDP (Princeton) - Thursday, March 17

New York

Introduction to Hadoop (Syracuse) - Wednesday, March 16

Integrating Apache Flink and Apache NiFi (New York) - Wednesday, March 16


Hybrid Solution Analysis of Streaming Sensor Data with Spark Streaming & Kafka (Boston) - Tuesday, March 15

St. Patty's Day Meet-Up on an Introduction to Apache Kudu (Boston) - Thursday, March 17


Apache Flink Real-World Use Cases with Slim Baltagi (Sao Paulo) - Thursday, March 17


Configuring the Layered Cake of Hadoop + Scaling Remote Engineering Teams (Sevilla) - Thursday, March 17


Data Munging with Spark, Part I (Toulouse) - Tuesday, March 15


Office Hours with Holden Karau (Amsterdam) - Monday, March 14


"Extreme" Apache Spark (Copenhagen) - Tuesday, March 15


Real-Life Apache Spark: Tips and Tricks from the Trenches (Zurich) - Monday, March 14


Drilling into Data with Apache Drill + Stream-based Microservice Architecture (Milano) - Thursday, March 17


Understanding and Building Big Data Architectures (Hyderabad) - Saturday, March 19