Data Eng Weekly

Hadoop Weekly Issue #201

22 January 2017

This week is a short but very sweet issue, with fantastic articles on Apache Kafka, Apache Spark, Apache Airflow (incubating), and more. Also, Twitter has written about the scale of its infrastructure, and there's a great post describing building materialized views with Kafka. In releases, Apache HBase and Apache Kudu both shipped new versions this week.


The morning paper is covering several papers from this year's Conference on Innovative Data Systems Research (CIDR). These include "SnappyData: A unified cluster for streaming, transactions, and interactive analytics" and "Dependency-driven analytics: a compass for uncharted data oceans." Rather than linking to each of the week's posts, here's the post that introduces the coverage for the week.

Twitter has written about the challenges and lessons learned in scaling its infrastructure, of which 19.6% is Hadoop (and close to 50% is made up of data systems). The post covers network traffic, storage (which mentions that Twitter stores 500PB of data across multiple Hadoop clusters, the largest of which is 10k nodes), Puppet at scale, and more.

The MapR blog has a post about using Apache Kafka, Apache Spark, and Apache Ignite for a streaming application that writes data out to Apache HBase. Using five or so performance tunings (varying from tweaking JVM settings to fixing timeouts), the Spark Streaming job became 12x faster. The post also covers some details of how the system was stabilized (such as running Spark in standalone mode rather than via Mesos).

This tutorial walks through using Spark's structured streaming to load CloudTrail audit logs into a data warehouse built on S3 and Apache Parquet. While this is mostly a getting started tutorial, it also includes a discussion of how to make this more production-ready by setting up fault tolerance using a checkpoint location.

There have recently been security incidents involving Hadoop clusters exposed on the internet. Cloudera has put together a guide (which is aimed at Cloudera's distribution but is in parts generally applicable) with basic steps for locking down a Hadoop cluster.

The Google Cloud Platform Medium publication has a post on using Apache Airflow (incubating) with BigQuery. It highlights some of the useful features of Airflow, such as support for Jinja template substitution when building SQL queries.
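Airflow renders placeholders like `{{ ds }}` (the execution date) into a query string before it runs, using the Jinja2 engine and a rich built-in context. As a rough, Airflow-free illustration of the idea, here's a minimal stand-in renderer (the table and column names are invented for the example):

```python
import re

def render(template, context):
    """Substitute {{ var }} placeholders from a context dict.
    A toy stand-in for the Jinja2 engine Airflow actually uses."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(context[m.group(1)]), template)

# Airflow exposes macros such as `ds` (the execution date) to templates,
# so the same parameterized query can run for any scheduled day.
sql = "SELECT * FROM events WHERE event_date = '{{ ds }}'"
print(render(sql, {"ds": "2017-01-22"}))
# SELECT * FROM events WHERE event_date = '2017-01-22'
```

Because the substitution happens at task-render time, each scheduled run of a DAG gets a query scoped to its own date partition without any code changes.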

This article illustrates common performance and caching problems in data systems, surveys some common solutions, and presents a high-performance approach based on materialized views. From there, the post contains code snippets and describes how to use Kafka Streams and a local cache to compute the views and serve requests. The post also describes how to integrate with other systems like Redshift and Elasticsearch.
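The core idea behind a materialized view built this way is to fold a stream of change events into local state that serves reads directly, instead of querying (and re-aggregating from) the source system on every request. The post uses Kafka Streams' Java API; the following is a language-neutral sketch of the pattern in Python, with invented event shapes:

```python
from collections import defaultdict

class MaterializedView:
    """Folds a stream of change events into a local key-value
    store, the way a Kafka Streams aggregation maintains its
    state store. Reads are then served from local state."""

    def __init__(self):
        self.totals = defaultdict(int)  # local state store

    def apply(self, event):
        # Each incoming event updates the view incrementally,
        # so the aggregate is always ready to serve.
        self.totals[event["user"]] += event["amount"]

    def get(self, user):
        return self.totals[user]  # read served from the local cache

view = MaterializedView()
for e in [{"user": "alice", "amount": 5},
          {"user": "bob", "amount": 3},
          {"user": "alice", "amount": 2}]:
    view.apply(e)
print(view.get("alice"))  # 7
```

In the real system, the change events come from a Kafka topic and the state store is replicated via a changelog topic, so the view can be rebuilt after a failure rather than living only in process memory.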


JanusGraph is a new effort to build a scalable graph database based on the Titan project. Interestingly, it's being run at the Linux Foundation rather than the Apache Software Foundation.

The Apache blog has a post about Apache Ignite, which is an in-memory data fabric. Ignite supports many use cases such as transactional updates and SQL queries, and it is integrated with Spark, Hadoop, YARN, and more.

As mentioned above in reference to the Cloudera post on securing Hadoop, there have been incidents in which publicly addressable Hadoop installs have had data deleted. This post provides more details on what's been happening and how to secure your setup.


Version 2.5.0 of the workflow system, Luigi, was released. There are a number of changes in the release, most notably improvements to the BigQuery support.

Apache HBase 1.3 was released this week, with over 1700 resolved issues. There are several improvements, including date-based tiered compactions, improvements to the metrics system, and client optimizations for looking up region locations.

Apache NiFi released versions 1.0.1 and 1.1.1 in December. If you haven't upgraded, there is some more urgency now that an XSS vulnerability has been disclosed.

Apache Kudu 1.2.0 was released. This release improves the implementation of strong consistency guarantees, fixes a corruption bug with ext4 on RHEL 6, and more.

At VLDB 2015, Facebook published a paper on their time series database, Gorilla. Recently, they open-sourced an implementation called Beringei. It's written in C++ and there's a Dockerfile to get started.


Curated by Datadog



Where the Worlds of Data Eng & Data Science Merge! (Santa Clara) - Thursday, January 26

VEGAS: The Missing Matplotlib for Spark, Presented by Netflix (San Francisco) - Thursday, January 26


Real-Time Data Ingestion & Streaming: Talks from Avvo, Expedia and Confluent (Seattle) - Wednesday, January 25

Seattle Scalability Meetup: Evolution of Machine Learning Sys w/ Stripe Radar (Seattle) - Wednesday, January 25


Jumpstart Your Big Data Analytics Journey with the Hortonworks Sandbox and Hive (Atlanta) - Thursday, January 26

North Carolina

January CHUG: What's the Big Deal with Hadoop? The Elephant in the Room (Charlotte) - Thursday, January 26


Big Data Tools in Azure (McLean) - Monday, January 23


Big Data Governance and Security in Apache Hadoop: Healthcare Client Use Case (Philadelphia) - Thursday, January 26


Seminar: Fundamentals of Apache Spark (Madrid) - Friday, January 27


Kafka Connect & Repeatable Deployment of Kafka Streams Topologies on Kubernetes (Utrecht) - Thursday, January 26


IoT Tech Meetup #3: Streaming Analytics (Berlin) - Tuesday, January 24

Apache Kafka Meetup with Jay Kreps and Michael Noll (Munich) - Wednesday, January 25

What Is New in Hadoop 3.0 (Dusseldorf) - Wednesday, January 25


19th Swiss Big Data User Group Meeting (Zurich) - Monday, January 23


Facebook Presto: SQL-on-Anything (Warsaw) - Tuesday, January 24


Processing Big Data Using Apache Hive & Microsoft Azure Machine Learning Studio (Colombo) - Tuesday, January 24