Data Eng Weekly

Hadoop Weekly Issue #197

18 December 2016

This week's issue is likely the last of 2016, and it's a good one. There's a post from Uber about their Kafka auditing system, a post on Blue Apron's Google BigQuery-based data platform, and a great interview about Apache Apex and Apache Flink. As always, if you come across any great articles, please send them my way!


Vessel is a new tool for writing MapReduce programs in Elixir (a language for the Erlang VM). It makes use of Hadoop Streaming and presents an API very similar to that of Hadoop's Java APIs.
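For readers who haven't used Hadoop Streaming, here's a rough sketch of the contract that a tool like Vessel builds on: the mapper turns input lines into key/value pairs, and the reducer sees those pairs sorted by key. It's written in Python rather than Elixir, it isn't Vessel's API, and the function names are made up for illustration.

```python
from itertools import groupby


def map_line(line):
    """Mapper step: emit one (word, 1) pair per whitespace-separated token."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_pairs(sorted_pairs):
    """Reducer step: sum the counts for each key.

    Assumes the pairs arrive sorted by key, which is what the
    shuffle phase provides to a streaming reducer.
    """
    return [
        (key, sum(count for _, count in group))
        for key, group in groupby(sorted_pairs, key=lambda kv: kv[0])
    ]


def simulate(lines):
    """Run map, sort, and reduce in-process to mimic a tiny job locally."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return reduce_pairs(sorted(pairs))
```

In a real streaming job the mapper and reducer would be separate executables that read lines from stdin and write tab-separated key/value lines to stdout; simulate() just wires the two steps together for local experimentation.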

Chaperone is Uber's system for auditing events as they pass through Kafka clusters. It's used to detect data loss, monitor end-to-end latency, and detect data duplication. Chaperone has an interesting architecture to audit each event exactly once and write the audit information back to Kafka for loading into monitoring dashboards and into a database.
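To make the idea of count-based auditing concrete, here's a minimal sketch that buckets events into fixed time windows at two points in a pipeline and compares the counts. It's an illustration of the general technique, not Chaperone's implementation; the window size, the event shape, and all function names are assumptions.

```python
from collections import Counter

# Hypothetical window size, in seconds; chosen only for the example.
WINDOW_SECONDS = 600


def window_start(timestamp):
    """Return the start of the fixed window containing an integer timestamp."""
    return timestamp - timestamp % WINDOW_SECONDS


def audit(events):
    """Count events per window; events are (event_id, timestamp) tuples."""
    return Counter(window_start(ts) for _, ts in events)


def compare(upstream, downstream):
    """Report windows where the downstream count differs from upstream."""
    report = {}
    for window in sorted(set(upstream) | set(downstream)):
        diff = downstream[window] - upstream[window]
        if diff < 0:
            report[window] = ("loss", -diff)
        elif diff > 0:
            report[window] = ("duplication", diff)
    return report
```

Comparing plain counts is cheap, but a lost event and a duplicated event in the same window cancel each other out, so a production system would need more than this.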

This post shows how to connect a MapR cluster running Apache Drill in Microsoft Azure to Azure Blob storage and Azure SQL Database.

The Cloudera blog has a post describing many different resource management knobs for Apache Impala (incubating). These include admission control, which limits resource utilization across the cluster, per-query memory limits, and dynamic pools. The article also has examples of various types of pools (e.g. analytics and real-time user facing) and recommended settings for max memory, default memory limit, max running queries, and queue timeout.

Joins in streaming data have, until recently, been very difficult to implement. With higher-level APIs for streaming systems, like Apache Kafka Streams and Apache Beam, it's become much easier. To that end, Amazon Kinesis Analytics supports joins with static data in S3 and between streams. This post has tutorials for those two types of joins as well as an example of enriching data on an Amazon Kinesis stream by joining it with metadata in Amazon DynamoDB using AWS Lambda.
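As a rough picture of the simpler of the two cases, joining a stream with static reference data, here's a small sketch in plain Python. It doesn't use Kinesis Analytics, DynamoDB, or Lambda; the record fields and the enrich() helper are hypothetical.

```python
def enrich(stream, reference, key):
    """Left-join each stream record with static reference data.

    stream:    iterable of dicts (the records arriving on the stream)
    reference: dict mapping a key value to a dict of extra fields
    key:       name of the field in each record used for the lookup

    Records with no match are passed through unchanged.
    """
    for record in stream:
        extra = reference.get(record[key], {})
        yield {**record, **extra}
```

Because enrich() is a generator, it handles records one at a time as they arrive, which is the property that makes this kind of join cheap compared to a stream-to-stream join, where both sides have to be buffered over a window.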

Blue Apron has written about their move from a Postgres-based data warehouse to Google BigQuery. They are using Apache Airflow (incubating) to coordinate dumps from Postgres into BigQuery and manage schema changes. For querying the data by data partition, they use a nifty trick of creating tables with the data partition included in the table name.
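Here's a small sketch of what the table-naming trick could look like, assuming the partition is a calendar date appended as a YYYYMMDD suffix. The base table name and both helpers are hypothetical, and the post may use a different naming scheme.

```python
from datetime import date, timedelta


def table_for(base, day):
    """Build a table name with the date partition as a suffix."""
    return f"{base}_{day:%Y%m%d}"


def tables_between(base, start, end):
    """List the table names covering an inclusive date range."""
    days = (end - start).days
    return [table_for(base, start + timedelta(days=i)) for i in range(days + 1)]
```

A query for a date range then only needs to touch the tables returned by tables_between(), rather than scanning one large table.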

This post, by way of talking through the goals of microservices, introduces the data dichotomy: "Data systems are about exposing data. Services are about hiding it." Essentially, while services solve some problems, they introduce others (for example, what if the service doesn't support batch analytics queries?). Apache Kafka and stream processing are one way to fight this dichotomy, by making streams and stateful processing easy.

If you want to get started with sparklyr, the Spark bindings for R/dplyr, then Cloudera Director offers some automation to spin up a cluster in AWS. This post walks through the necessary steps.

Apache Spark 2.1 (which is not yet released) will speed up the planning phase of query execution by fetching only the metadata from the Hive metastore that's relevant to the partitions being queried. For large tables with many partitions, this can lead to a massive speedup. This post describes the changes and how they were implemented.


Syncsort has a three-part interview with Ted Dunning on stream processing with Apache Flink and Apache Apex, how these compare to Apache Storm, and a bit about open-source communities. The first part has a great explanation of how Flink and Apex do snapshotting and how that relates to exactly-once delivery semantics.

ZDNet has an article about how MapR is a rebel in the big data industry, shipping proprietary rather than open-source systems. The other big competitors have a different stance: Cloudera is mostly open-source, and Hortonworks is 100% open-source. The article argues, though, that open-source isn't as big a differentiator as it used to be, especially as there are lots of incompatibilities among open-source projects (the number of SQL engines is given as an example).

Registration is now open for DataWorks Summit and Hadoop Summit, which takes place in Munich in April.


Apache Phoenix 4.9.0 was released earlier this month. The new release adds support for atomic upsert and fixes over 40 bugs.

Version 2.1 of Hortonworks DataFlow, which is powered by Apache NiFi, Apache Kafka, and Apache Storm, was released. The release adds better support for cloud services (including Amazon Kinesis and Microsoft Azure blob store and data lake), improves ease of use, and has new access control features.


Curated by Datadog



Machine Learning and the Future of Big Data (San Francisco) - Monday, December 19


Big Data Meetup 2016.4 (Las Condes) - Monday, December 19


Speak Spark SQL 2.0 for Much Better Performance (Wroclaw) - Monday, December 19


Data Christmas 2016 (Budapest) - Monday, December 19


The Case for Kudu from Cloudera (Tel Aviv) - Monday, December 19

View to Big Data and Real-Time Analytics (Tel Aviv) - Tuesday, December 20