Data Eng Weekly

Hadoop Weekly Issue #205

19 February 2017

Tons of great content this week, including a look at Google's new Cloud Spanner, the YARN fair scheduler, connecting Splunk with Kafka, and Jepsen testing of CockroachDB. In news, there are CFPs open for Data Platforms and HBaseCon, and Kafka Summit New York has announced the conference schedule. In releases, there's a neat new command-line tool for interacting with HDFS.


Google announced that the Google Cloud Platform is adding "Cloud Spanner," which is a highly-available and consistent database. This article describes how they achieve high availability at the network level to ensure that their CP system (in terms of the CAP theorem) has five-nines of availability.
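To put "five-nines" in concrete terms, the downtime budget can be worked out with simple arithmetic. The snippet below is only an illustrative calculation; the function name and the 365-day year are assumptions made here, not something taken from the article.

```python
def allowed_downtime_minutes(availability: float, period_days: float = 365.0) -> float:
    """Return the minutes of downtime permitted over a period at a given availability.

    `availability` is a fraction, e.g. 0.99999 for five-nines.
    The 365-day default is an assumption for illustration.
    """
    minutes_in_period = period_days * 24 * 60
    return (1.0 - availability) * minutes_in_period


if __name__ == "__main__":
    # Five-nines over one year works out to a little over five minutes.
    print(round(allowed_downtime_minutes(0.99999), 2))  # 5.26
```

In other words, a system with five-nines of availability can be unavailable for only about five minutes per year.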

The Cloudera blog has the latest in their series on the YARN FairScheduler. In this post, there are a number of example queue configurations for common scenarios such as a best effort queue, low latency computations, limiting the size of ad-hoc queries, as well as more complicated configurations involving nested organizations with varying resource allotments.

The Hive metastore has a hard limit of 4000 characters for nested schemas in a single column. It's possible to work around this, but it requires a few different hacks (covered in this post) to add partitioned data.

As mentioned in last week's issue, the second alpha release of Apache Hadoop 3.0.0 is out. This article describes three highlights of the release: classpath isolation for client jars, support for Microsoft Azure Data Lake and the Aliyun Object Storage System, and support for opportunistic containers and distributed scheduling in YARN.

The video analytics company, Mux, has written about their use of Apache Flink with Amazon Kinesis to detect errors in video playback. Much of the post is devoted to an overview of Flink and the advantages of its event-time-based processing, but there is a bit at the end about Flink at Mux. Specifically, Mux mentions the usage of the "rolling-fold" operator to set a per-customer baseline for error rate.
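The idea behind a rolling fold can be sketched without Flink: keep one accumulator per key and emit the updated result after every event. The plain-Python version below is an assumption-laden illustration, not Mux's code or the Flink API; the function name, the `(customer, is_error)` event shape, and the choice of a running error rate as the baseline are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def rolling_error_rate(
    events: Iterable[Tuple[str, bool]],
) -> Iterator[Tuple[str, float]]:
    """Fold events per customer, yielding the running error rate after each one.

    Each event is a hypothetical (customer_id, is_error) pair.
    """
    # Per-customer accumulator: customer -> (error_count, total_count)
    state: Dict[str, Tuple[int, int]] = defaultdict(lambda: (0, 0))
    for customer, is_error in events:
        errors, total = state[customer]
        errors += int(is_error)
        total += 1
        state[customer] = (errors, total)
        # Emit the updated baseline for this customer only.
        yield customer, errors / total


if __name__ == "__main__":
    sample = [("a", True), ("a", False), ("b", False)]
    print(list(rolling_error_rate(sample)))
    # [('a', 1.0), ('a', 0.5), ('b', 0.0)]
```

A real streaming job would also need to bound or window this state; the sketch keeps everything in memory for clarity.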

This post describes (including the architecture and design choices) a new Kafka Connect plugin for sending data from Kafka to Splunk, and it provides a tutorial for setting up a Kafka Connect program to stream data from a Kafka topic to Splunk via the Splunk Heavy Forwarder.

The Jepsen blog has a post about recent testing of CockroachDB, which is a distributed SQL database. The post has some great background on the semantics and guarantees of the database (which has similar design goals to Google's Spanner), describes the tests and results in depth, and includes a discussion of some of the improvements that Cockroach Labs made as a result of the findings.

The data team at Stitch Fix has recently migrated from Amazon Redshift to Spark (including PySpark and Spark SQL). This presentation discusses some of the reasons that they made the move, some of the gotchas they encountered during the migration (e.g. differences in SQL syntax), their approach to multi-tenancy using the Netflix Genie job server, and more.

This tutorial shows how to run Spark locally (or some other place outside of Azure) to process data stored in the Azure Data Lake Store.

Cloudera has published an updated version of their Impala Cookbook, which covers topics like schema design, cluster sizing, hardware recommendations, and query tuning.

This post dives into the internals of Spark and the JVM to help understand an optimization in a Spark program that resulted in a particular query performing even faster than expected.

The AWS Big Data blog has a thorough look at Amazon Athena's support for JSON data. It looks at a simple example of nested JSON data (event data from the Amazon Simple Email Service), adding fields with special characters, auto-generating a DDL from sample data, and more.
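As a rough illustration of what generating a DDL from sample data involves, the sketch below walks one JSON record and maps each value to a Hive-style column type. It is not the tool or method from the AWS post; the function names and the type mapping (for example, treating all integers as `bigint` and nulls or empty arrays as `string`) are assumptions made here.

```python
import json


def hive_type(value) -> str:
    """Map a parsed JSON value to a Hive-style type name (assumed mapping)."""
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        # Infer the element type from the first element; empty arrays fall back to string.
        inner = hive_type(value[0]) if value else "string"
        return f"array<{inner}>"
    if isinstance(value, dict):
        fields = ",".join(f"{k}:{hive_type(v)}" for k, v in value.items())
        return f"struct<{fields}>"
    return "string"  # null or unknown values


def columns_from_sample(sample_json: str) -> str:
    """Return a column list suitable for pasting into a CREATE TABLE statement."""
    record = json.loads(sample_json)
    return ",\n".join(f"  `{name}` {hive_type(v)}" for name, v in record.items())


if __name__ == "__main__":
    sample = '{"eventType": "Bounce", "mail": {"source": "a@example.com", "tags": ["x"]}}'
    print(columns_from_sample(sample))
```

Field names containing special characters would need extra handling (renaming or mapping in the table definition), which this sketch does not attempt.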


Confluent has published the results from a survey of Apache Kafka users. This post describes feedback on which languages folks are using with Kafka and which client properties are most important.

Data Platforms is a new conference taking place in Phoenix in May. The call for papers is open through March 15th.

HBaseCon is June 12th in San Francisco. The call for abstracts is open until April 24th.

The agenda for Kafka Summit New York, which takes place on May 8th, has been posted.


Google has announced the public beta of their Cloud Spanner distributed relational database. It offers a pay-as-you-go pricing model, client libraries for most popular languages, and a JDBC driver.

Syncsort has announced a new version of their DMX-h software, which integrates Hadoop, Spark, mainframes, and other data systems. This version adds support for Spark 2.0 and a new integrated workflow.

Apache Storm 1.0.3 was released. It is mostly a bug-fix release, and the changelog contains over 60 resolved tickets.

HDFS shell is a new tool that provides an interactive shell for performing HDFS operations from the command line. There's a GIF on GitHub that gives a brief overview of its core functions.


Curated by Datadog



How Data Drives Decisions at Netflix (Mountain View) - Tuesday, February 21

Big Data Science Meetup (Fremont) - Friday, February 24


Distributed Persistent Memory for Spark (Portland) - Thursday, February 23


Streaming Data Platforms & Hotel Search in the Cloud (Bellevue) - Monday, February 20

IBM Presenting at Seattle Spark MeetUp (Seattle) - Tuesday, February 21

Spark Working with an IDE: Notebook/Shiny + Resource Managers: Which Is Best (Bellevue) - Tuesday, February 21

Seattle Scalability Meetup (Seattle) - Wednesday, February 22


Data Science and Hadoop Lunch (Lehi) - Thursday, February 23


Powering Near-Real-Time Decisioning with Impala (Addison) - Thursday, February 23


ChiPy Data Science SIG (Chicago) - Monday, February 20

Hands-on Apache Flink Workshop! (Chicago) - Tuesday, February 21

Building Streaming Data Applications Using Kafka (Chicago) - Thursday, February 23


SQL Server Polybase & Hadoop: The Powerful Combo (Fort Lauderdale) - Wednesday, February 22

Reactive Streams: Akka & Kafka (Miami) - Thursday, February 23


Kafka with Craig McCown (Atlanta) - Monday, February 20

North Carolina

Leveraging Hadoop for Advanced Cyber Security (Charlotte) - Thursday, February 23


Using SQL-Compliant Applications and Code to Get the Most Out of Hadoop Data (Vienna) - Wednesday, February 22

Ansible Use Cases: HortonWorks & Cumulus Networks (McLean) - Thursday, February 23

District of Columbia

IOT Real-time Big Data Analytics Using Kafka, Cassandra, and Spark (Washington) - Thursday, February 23


Big Data with Azure Data Lake Store and Data Lake Analytics (Pittsburgh) - Tuesday, February 21

DataPhilly Speaker Series (Philadelphia) - Thursday, February 23

New York

Crunching Streams of Data: An Introduction to Akka Streams (New York) - Thursday, February 23


Apache Spark #17 (Toronto) - Wednesday, February 22


Tutorial: Get Your Hands on Implementing a Flink App (London) - Wednesday, February 22

Apache Spark Real World Use-Cases (Manchester) - Wednesday, February 22


Data AZUG Meetup (Neuilly/Seine) - Wednesday, February 22

Criteo Infrastructure Platform Meetup (Paris) - Wednesday, February 22


Kafka All the Reactive Things (Amsterdam) - Tuesday, February 21

Big Data Ingestion Part 2 (Amsterdam) - Thursday, February 23


Let's Talk about Apache Flink 1.2, and Put It in a Container! (Karlsruhe) - Tuesday, February 21

Data Ingestion with Apache NiFi (Nuremberg) - Thursday, February 23

WebTech Night: Kafka Night! (Karlsruhe) - Thursday, February 23


Hadoop Rockstars (Budapest) - Tuesday, February 21


Introduction to Apache Spark and Build Your First Apache Spark Application (Bangalore) - Saturday, February 25


Real-Time Big Data Analytics Use Cases (Johannesburg) - Tuesday, February 21