Data Eng Weekly

Hadoop Weekly Issue #216

14 May 2017

Lots of great technical content this week, including a post on running a data team and the SIGMOD paper on Amazon Aurora. In news, Kafka Summit was last week, and Confluent announced a new Kafka-as-a-Service offering. This week is Apache Big Data North America, so stay tuned for coverage of some great presentations next week (and please send any you see my way!).


There are lots of posts on workflow tools and example data pipelines, but this post describes some of the bigger-ticket issues that are important for running a successful data team. Advice spans the technical ("instrument your pipeline"), the cultural ("teach each other how to fish"), and the larger picture ("know how the business has changed").

This post from Databricks shows how powerful Spark's Structured Streaming APIs are for doing windowed aggregations with support for late data/watermark calculations. The post describes and visualizes, at a high level, the logic that is being abstracted by these APIs.
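To make the watermark idea concrete, here is a minimal plain-Python sketch of tumbling-window counts that drop data arriving after a window has closed. It is not Spark's implementation or API: the window size, the allowed lateness, and the function names are all assumptions chosen for illustration, and the watermark here is updated per event rather than per batch.

```python
from collections import defaultdict

WINDOW_SECONDS = 10  # tumbling window size (illustrative assumption)
MAX_LATENESS = 5     # how far the watermark trails the newest event time


def window_start(event_time):
    """Return the start of the tumbling window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def windowed_counts(events):
    """Count events per (window start, key), dropping data behind the watermark.

    events: iterable of (event_time_seconds, key) in arrival order.
    Returns (counts, dropped).
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = None
    for event_time, key in events:
        # The watermark trails the largest event time seen so far.
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - MAX_LATENESS
        start = window_start(event_time)
        # A window is closed once the watermark has reached its end.
        if start + WINDOW_SECONDS <= watermark:
            dropped.append((event_time, key))
            continue
        counts[(start, key)] += 1
    return dict(counts), dropped


if __name__ == "__main__":
    arrivals = [(1, "a"), (12, "a"), (8, "a"), (27, "b"), (9, "a")]
    print(windowed_counts(arrivals))
```

In this sketch the event at time 8 arrives out of order but is still counted because its window has not yet closed, while the event at time 9 arrives after the watermark has passed the end of its window and is dropped.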

Uber is a big user of Apache Spark, and they recently worked on and deployed a Locality Sensitive Hashing (LSH) implementation for Spark for applications such as detecting fake accounts and payment fraud. The post has an example of using LSH for finding similar articles from the Wikipedia Extraction dataset. It also includes quite a bit of example code, performance test results, and a look at next steps.
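For readers new to LSH, here is a small plain-Python sketch of one common variant, MinHash with banding, applied to text similarity. It is not Uber's code or the Spark API: the shingle size, signature length, band count, and all function names are assumptions picked for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 32  # signature length (illustrative assumption)
BANDS = 8        # signature is split into BANDS bands of NUM_HASHES // BANDS rows


def shingles(text, k=3):
    """Return the set of k-word shingles in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of token, salted with seed."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(tokens):
    """MinHash signature: for each seed, the minimum hash over all tokens."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(NUM_HASHES)]


def candidate_pairs(docs):
    """Return pairs of document ids that share at least one identical band."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash(shingles(text))
        for band in range(BANDS):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumps over the lazy dog",
        "c": "stream processing with kafka and spark at scale",
    }
    print(candidate_pairs(docs))
```

Documents with identical shingle sets always land in the same buckets, and near-duplicates are likely (though not guaranteed) to collide in at least one band. Only the candidate pairs then need an exact similarity check, which is what makes the approach attractive on large datasets.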

The Hortonworks blog has an overview of YARN's support for running Docker containers via its "LinuxContainerExecutor." There is a DockerLinuxContainerRuntime under development, and an example application (distributed shell) is demonstrated in the article. The post has a list of future improvements, including volume support, service discovery, and image management.

At almost 30 pages, this post offers a complete and detailed overview of the Apache Beam model, which is built on streams and tables. To explain the finer points, it covers the what, where, when, and how of streaming data.

Druid is a system for interactive queries on large datasets. While it's great for answering a certain class of questions, it doesn't aim to support joining of data or queries that require scanning large portions of the dataset. For these types of queries, Apache Hive can query data stored in Druid. The Hortonworks blog has details on this integration and describes some plans for the future.

The Hortonworks blog has the first part of a series on new features in the security products (Apache Atlas, Apache Ranger, and Apache Knox) that are part of HDP 2.6. This part focuses on what's new in Atlas 0.8.0, which is the data governance product. Updates include a new REST API that includes Swagger documentation, a revamped search UX, visualization updates, and more.

The AWS Big Data blog has an example of building a Redshift data warehouse for healthcare data. The flow consists of an Amazon EMR job that performs ETL into a staging directory, a Lambda function to issue COPY commands into Redshift, and SNS as glue between the two.
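As a rough sketch of the Lambda step in a flow like this, the handler below turns each SNS notification into a Redshift COPY statement. It is not the code from the AWS post: the message format, table name, bucket path, and IAM role ARN are all hypothetical, and the statements are returned rather than executed so the example stays self-contained.

```python
import json

# Hypothetical role ARN; a real deployment would supply its own.
IAM_ROLE = "arn:aws:iam::123456789012:role/example-redshift-copy"


def build_copy_statement(table, s3_path, iam_role):
    """Build a Redshift COPY statement for CSV files under s3_path.

    Values are interpolated directly, so inputs are assumed to be trusted.
    """
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV;"
    )


def handler(event, context=None):
    """Turn each SNS record in a Lambda event into one COPY statement.

    Assumes a hypothetical JSON message body such as
    {"table": "...", "s3_path": "s3://..."} published by the ETL job.
    """
    statements = []
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        statements.append(
            build_copy_statement(message["table"], message["s3_path"], IAM_ROLE)
        )
    # A real function would run these against Redshift; here they are returned.
    return statements


if __name__ == "__main__":
    sample = {
        "Records": [
            {"Sns": {"Message": json.dumps(
                {"table": "claims", "s3_path": "s3://example-bucket/staging/claims/"}
            )}}
        ]
    }
    print(handler(sample))
```

Keeping the statement builder separate from the handler makes the SQL easy to test without any AWS resources.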

Amazon Web Services has published a paper as part of SIGMOD on the architecture and design considerations for Amazon Aurora. Aurora is an OLTP, scale-out database-as-a-service that can act as a drop-in replacement for MySQL and Postgres.


Apache Big Data North America is this week. The keynote addresses will be live-streamed on Tuesday and Wednesday. Sign up at the link below to see them.

ZDNet has a post that analyzes the maturity of the Hadoop market based on details from Cloudera's IPO and Hortonworks' earnings report. The quick takeaways are that while both companies are still losing money, the losses are falling and expansion into cloud is likely the next source of growth.

Confluent Cloud is a new SaaS offering of Apache Kafka. It is in early access with support for Amazon Web Services but plans to support Google Cloud and Microsoft Azure too.


Apache Arrow 0.3.0 was released. Arrow is a library and standard for columnar in-memory analytics (like those used by Kudu, Pandas, and Spark). The release adds support for dictionary encoding (not yet fully integration tested), improved support for date/time primitives, integration with PySpark, and much more.

Apache NiFi 1.2.0 was released. There are a number of under-the-hood improvements, new processors/connectors (including improved support for change data capture from databases), an updated UI, security improvements, and more.

kafka-streams is a new Node.js project that aims to "give at least the same options to a nodejs developer that kafka-streams provides for JVM developers." It's still under active development, but a large part of the core library is already implemented.


Curated by Datadog



Bay Area Apache Spark Meetup @ Salesforce (Palo Alto) - Tuesday, May 16


Cleveland Big Data and Hadoop User Group (Cleveland) - Monday, May 15

New Jersey

Hadoop Tools Overview: Hands-On (Princeton) - Tuesday, May 16

IRELAND

The Data Analytics Platforms at Zalando and Altocloud (Dublin) - Monday, May 15


Big Data, No Fluff: Let’s Get Started with Hadoop #12 (Oslo) - Thursday, May 18


Spark Meetup (Paris) - Tuesday, May 16


Delft Streaming Processing Meetup (Delft) - Thursday, May 18


Meet Gwen Shapira, the Master of Apache Kafka! (Copenhagen) - Tuesday, May 16


On the Changes That Big Data and Fast Data Bring to Software Development (Zurich) - Wednesday, May 17


Spark Serving Globally and Ingestion of IoT Big Data (Tel Aviv-Yafo) - Wednesday, May 17


Workshop on Spark Streaming (Pune) - Saturday, May 20


Managed Hadoop and Spark on Google Cloud (Kebayoran Baru) - Thursday, May 18


Scaling the Event Stream (Melbourne) - Thursday, May 18
