Data Eng Weekly

Hadoop Weekly Issue #182

14 August 2016

Welcome to a double-issue of Hadoop Weekly. There's tons of content this week—lots of technical posts covering Impala, Spark, Metron, Kafka, Hadoop, and more. In news, Hortonworks announced earnings, MapR announced Series E financing, and there are several upcoming conferences. Finally, there are eight releases (including two new open-source projects from Blizzard and Uber).


The Cloudera blog has an overview of speed-ups in the recently released Apache Impala (incubating) 2.6. The largest jump in performance is with encryption enabled, and there are also speedups due to improvements in runtime filters, dynamically sized bloom filters, improved codegen for top-N queries, faster decimal arithmetic, and more.

Another big part of the Impala 2.6 release is better support for data stored in Amazon S3. This post describes some of the optimizations in and architectural changes to Impala's read and write pipelines to make the most of S3. There are some benchmarks showing that doubling the size of an Impala cluster roughly doubles throughput.

Apache Spark has the momentum of a large developer community and many big companies behind it. But there are several other big data systems that are emerging with technical differentiators—Apache Apex, Apache Flink, Heron (from Twitter), and Onyx.

Mobius is a .NET framework for Apache Spark. This post gives an introduction to the library and provides an example of using the API from F#.

While stream processing is getting a lot of attention, real-time data is not always needed. Sometimes near real-time (i.e. >5 minutes of latency) is good enough, which is the case for many applications at Uber. With relaxed latency requirements, you can build a much more efficient (both in terms of CPU resources and developer time) pipeline by using short-lived, batch processing applications that operate on "mini batches." Given that Uber is operating at tremendous scale, this post offers an interesting take grounded in practical experience.
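The mini-batch idea can be sketched as a scheduled, short-lived job that drains whatever has accumulated since the last run. This is a hypothetical illustration, not Uber's code; the names (`run_pipeline`, `InMemorySource`) are invented for the sketch:

```python
def process_minibatch(records):
    """Placeholder for the real batch logic (e.g. a short-lived Spark job)."""
    return len(records)

class InMemorySource:
    """Stand-in for a queue or landing directory of newly arrived data."""
    def __init__(self, batches):
        self.batches = list(batches)

    def drain(self):
        # Return everything that arrived since the last run.
        return self.batches.pop(0) if self.batches else []

def run_pipeline(source, cycles=3):
    """Run one short-lived batch per cycle. In production each cycle would be
    scheduled every few minutes (cron, Oozie, etc.): a ~5-minute interval
    trades a little latency for far fewer, larger, cheaper jobs than a
    record-at-a-time streaming system."""
    processed = 0
    for _ in range(cycles):
        batch = source.drain()
        if batch:
            processed += process_minibatch(batch)
    return processed

src = InMemorySource([["a", "b"], [], ["c"]])
total = run_pipeline(src)  # 3 records processed across 3 mini batches
```

The efficiency win comes from amortizing startup and I/O costs over a whole batch while keeping each job small enough to finish well within the latency budget.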

Confluent's monthly "Log Compaction" post has an overview of Apache Kafka and related news. It describes a number of features (e.g. APIs to manage ACLs, query-able state for Kafka Streams) that are in discussion or under development for future Kafka releases.

Paige Roberts has published an interview with Ryan Merriman, an architect of the Apache Metron project. Metron is a real-time information security platform built on open-source tools like Apache Kafka and Apache Storm. The first part of the interview contains a good overview of Metron's architecture from a high-level, and the second part includes a discussion of why the project chose Apache Storm instead of other tools like Apache Flink and Apache Spark.

CPU scaling is a mechanism for making processors more power efficient. It's often a great idea, but it can cause mysterious performance problems when working with a big data system like Hadoop (speaking from experience). This post describes how to determine if CPU scaling is enabled and how to disable it.
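As a quick illustration (not taken from the post): on Linux, the active frequency governor is exposed through sysfs, and anything other than "performance" means the kernel may be scaling CPU frequency down under light load. The sysfs path is standard cpufreq; the helper function is invented for this sketch and returns None on machines (VMs, containers, non-Linux) that don't expose it:

```python
from pathlib import Path

GOVERNOR_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")

def cpu_scaling_governor():
    """Return the current CPU frequency governor, or None if cpufreq
    is not exposed on this host."""
    try:
        return GOVERNOR_PATH.read_text().strip()
    except OSError:
        return None

gov = cpu_scaling_governor()
if gov and gov != "performance":
    # Governors like "ondemand" or "powersave" clock cores down between
    # bursts of work, which can look like mysterious slowness on a Hadoop
    # node. Switching to "performance" (as root) pins cores at max frequency:
    #   echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    print(f"CPU scaling active: governor is {gov!r}")
```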

This post describes how to detect similar items across two lists using TF-IDF and cosine-similarity. From there, it explains how to use numpy and pyspark to do matrix multiplication and find the common items. There's some example code for this pipeline.
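A minimal numpy sketch of the pairing step described above (illustrative only; the hand-rolled TF-IDF here keeps it self-contained, and the post's pipeline uses pyspark to do the matrix multiply at scale):

```python
import numpy as np

def tfidf_matrix(docs):
    """Build a dense TF-IDF matrix for a small list of documents."""
    vocab = sorted({w for d in docs for w in d.split()})
    idx = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for word in doc.split():
            tf[row, idx[word]] += 1
    df = np.count_nonzero(tf, axis=0)   # document frequency per term
    idf = np.log(len(docs) / df)        # rarer terms get higher weight
    return tf * idf

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T                      # the matrix multiplication step

list_a = ["apache spark cluster", "kafka stream processing"]
list_b = ["spark cluster tuning", "hive metastore setup"]

m = tfidf_matrix(list_a + list_b)
sims = cosine_sim(m[: len(list_a)], m[len(list_a):])
best = sims.argmax(axis=1)  # most similar item in list_b for each item in list_a
```

Each row of `sims` scores one item from the first list against every item in the second, so a row-wise argmax (with a similarity threshold in practice) yields the candidate matches.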

Apache Commons Crypto is a library that takes advantage of the AES-NI (Advanced Encryption Standard New Instructions) instruction sets on modern hardware for improving performance of encrypting/decrypting data. A new initiative is using this library to add encryption to data shuffled by Spark. The post includes some performance benchmarks that show little overhead compared to unencrypted shuffle.

The AWS Big Data Blog has a crash course in Apache Bigtop (the packaging system for the Hadoop ecosystem) with an eye towards adding another application to an Amazon EMR cluster via bootstrap actions.

Hadoop clusters have traditionally married compute with storage to increase performance by taking advantage of data locality. But with improvements in network throughput and the adoption of the cloud, more deployments are decoupling compute and storage. The Pivotal blog has a post about this topic, including some recommendations on when to use network attached storage rather than direct attached storage.

This post describes the Hadoop daemon shell code in the 3.x branch and the flexibility that the rewritten scripts provide. It includes an example of adding support for launching the daemons with cgexec (for Linux cgroups).

Some posts end up being too hypothetical and not grounded in real-world examples. On that note, I'm not entirely sure what to make of this post, which is based on a hypothetical Facebook feature from an Onion News article. It does seem to be an interesting application of Apache Kafka Streams though—it illustrates a number of features like KTables, joins, and windowing.


The DBMS2 blog has two posts covering Databricks, the company's relationship with the Spark project, and Spark itself. The first captures some recent progress—Spark as a replacement for MapReduce, the success of Databricks' cloud platform, and Databricks' role in the Apache Spark community. The second captures details of recent improvements to Spark and the Databricks cloud.

The Apache Beam (incubating) project recently celebrated its sixth month of incubation. This post highlights the progress of the project, including accomplishments like contributions from 45 contributors, a new Flink runner, and work on a Spark 2.0 runner.

Hortonworks announced Q2 results last week. They reported a loss of $64.2 million, but a 46% year-over-year jump in revenues. They have published two posts reflecting on their momentum and describing how some of their customers are using Hortonworks.

MapR announced that they've raised $50 million in a Series E round of financing.

Doug Meil, co-founder of the Cleveland Big Data meetup, has written about the evolution of the meetup over the years. The post is full of tips on how to organize good talks and engage the meetup community.

The schedule for Flink Forward, which takes place September 12-14 in Berlin, has been posted.

Mesosphere and Confluent announced a partnership to bring the Confluent Platform to the Mesosphere Enterprise Datacenter Operating System. Mesosphere is calling this "container 2.0" because it's designed with stateful systems (like Kafka) in mind.

A half-day HBaseCon East has been announced. It takes place in New York on September 26th. The call for proposals is open until September 4th.

Hadoop Summit Melbourne takes place in just over two weeks, on August 31st and September 1st.


Uber has open-sourced their Kafka replication system, uReplicator. The system is built on Apache Helix and addresses some of the shortcomings of Apache Kafka's MirrorMaker, such as faster rebalancing, dynamic topic addition/removal, and no restarts. uReplicator has been running inside Uber for eight months.

Amazon EMR 5.0 was released. This version upgrades to Spark 2.0, Hive 2.1, makes Tez the default execution engine for Hive and Pig, and more.

Apache Flink released version 1.1.0 (and a fast follow-up 1.1.1 to fix a dependency issue). The release has a number of fixes, improvements, and new features, including support for Amazon Kinesis, an Apache Cassandra sink, the Table API (with SQL support), a Scala version of the CEP library, and more.

Apache Kafka 0.10.0.1 was released. It includes a number of bug fixes and improvements.

Blizzard has open sourced node-rdkafka, a Node.js library for Apache Kafka built on librdkafka. The readme has a quick introduction to building and using the library.

Apache Samza 0.10.1 was released this week. It includes a number of enhancements (including host affinity when restoring a worker) and performance improvements. The release announcement also describes some future enhancements (REST API, disk quotas, high-level language for Samza).

Apache Storm 1.0.2 was released. It includes bug fixes and performance enhancements.

Amazon Web Services announced Amazon Kinesis Analytics, a new hosted stream processing service that uses SQL. This post details how to get started by writing and deploying a streaming query.


Curated by Datadog



Getting Hot and Cold with Spark and Big Data (San Diego) - Tuesday, August 16

53rd Bay Area Hadoop User Group Meetup (Sunnyvale) - Wednesday, August 17

Interactive Applications on Spark: Big and Small Lessons Learned (Redwood City) - Wednesday, August 17


Performance in Spark 2.0 (Portland) - Thursday, August 18


Kudu: New Apache Hadoop Storage for Fast Analytics on Fast Data with Mike Percy (Greenwood Village) - Wednesday, August 17

Kudu: New Apache Hadoop Storage for Fast Analytics on Fast Data with Mike Percy (Broomfield) - Thursday, August 18


Spark Meetup 2016 #3 (Mason) - Wednesday, August 17


Streaming Design Patterns, Revolutionizing Architectures Using Kafka (Jacksonville) - Wednesday, August 17


Cutting Edge with HBase (Atlanta) - Wednesday, August 17

District of Columbia

Cloudera on Microsoft Azure for Real-Time Workloads (Washington) - Wednesday, August 17

New York

Apache NiFi - MiNiFi: Taking Dataflow Management to the Edge (New York) - Thursday, August 18


Apache NiFi - MiNiFi: Taking Dataflow Management to the Edge (Boston) - Wednesday, August 17


A Voldemort Introduction by a LinkedIn Engineer (Montreal) - Tuesday, August 16


Big Data Science Applications (Mexico City) - Wednesday, August 17


Big Data, No Fluff: Let’s Get Started with Hadoop #8 (Oslo) - Thursday, August 18


Big Data Infrastructure (Dubai) - Tuesday, August 16


Spark & Solr Integration (Bangalore) - Saturday, August 20