Data Eng Weekly

Hadoop Weekly Issue #202

29 January 2017

This week's issue has a pretty even mix of tutorials and technical deep dives. In news, there are two calls for participation, and in releases, Apache Hadoop has a new alpha release along with minor releases of Amazon EMR and Apache Impala (incubating).


The Hortonworks blog has a tutorial describing how to quickly bring up an HDCloud cluster with Apache Spark 2.1 and Apache Zeppelin.

AWS Data Pipeline is a workflow engine with tight integration across AWS data services (e.g. SNS, EMR). This tutorial shows how it can be integrated with Databricks to run a Spark job that converts server log data from text files to Apache Parquet.

The upcoming Apache ZooKeeper 3.4.10 release adds SASL-based authentication and authorization for members of the ensemble. This post provides an introduction to the new feature, gives an overview of the technical design, and describes the rolling upgrade mechanism that was built.
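For a sense of what enabling the feature looks like, quorum SASL is driven by a handful of `zoo.cfg` properties (names as proposed in ZOOKEEPER-1045; check the 3.4.10 release notes for the final set, and treat the values below as illustrative):

```
# Enable SASL authentication between quorum members
quorum.auth.enableSasl=true
# Require learners/servers to authenticate (set to false during a
# rolling upgrade so unauthenticated peers can still join)
quorum.auth.learnerRequireSasl=true
quorum.auth.serverRequireSasl=true
# JAAS login contexts used by each side of the quorum connection
quorum.auth.learner.loginContext=QuorumLearner
quorum.auth.server.loginContext=QuorumServer
```

The rolling upgrade described in the post hinges on flipping the `*RequireSasl` flags from false to true only after every member of the ensemble has been restarted with SASL enabled.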

Version 4.0 of the Cask Data Application Platform adds a new mechanism to monitor Hadoop services (HDFS, YARN, HBase, Hive, Spark, and more). This post describes the design principles and implementation of the new monitoring system (including links to the code on GitHub).

Amazon Redshift, which is often coupled with Hadoop/Spark deployments in the AWS cloud, is a general-purpose data warehousing system. As such, it's often used for mixed workloads—interactive queries alongside batch processes. This post walks through the Redshift Workload Management features and provides some general guidelines.
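Workload Management is configured as a JSON array of queues in the cluster's parameter group. A minimal sketch of the kind of split the post discusses—one queue for interactive queries, one for batch—might look like this (queue names and values are illustrative, not from the post):

```json
[
  {
    "query_group": ["interactive"],
    "query_concurrency": 10,
    "memory_percent_to_use": 60
  },
  {
    "query_group": ["batch"],
    "query_concurrency": 3,
    "memory_percent_to_use": 40
  }
]
```

Queries are routed to a queue by setting `query_group` in the session, so batch jobs can't starve interactive users of slots or memory.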

This post details the rise of the data engineer (and what the scope of a data engineer role tends to be at small/medium/large companies), the tools that data engineers are using (noting the popularity of code-based workflow engines like Airflow and Luigi), common responsibilities (e.g. the data warehouse, performance tuning, data integration), and required skills (e.g. SQL, data modeling, ETL design).

The GoDataDriven blog has a post that discusses how to break up Spark code using the DataFrame API to best support composability and testing. The post includes Python sample code.
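The core of the pattern is writing each transformation as a small function that takes a DataFrame and returns a DataFrame, then chaining them with a helper. A minimal sketch (the `pipe` helper and transform names are illustrative, not from the post):

```python
from functools import reduce

def pipe(df, *transforms):
    """Apply each single-argument transform to df in order.

    Each transform takes a DataFrame and returns a new DataFrame,
    so individual steps stay small and independently unit-testable.
    """
    return reduce(lambda acc, fn: fn(acc), transforms, df)

# With Spark, each step is a plain function over a DataFrame, e.g.:
#   pipe(raw_df, drop_corrupt_rows, parse_timestamps, add_session_id)
# and each function can be tested in isolation with a tiny fixture
# DataFrame instead of a full end-to-end job.
```

Because `pipe` only relies on each transform being a one-argument callable, the same helper works on any value, which is what makes the individual steps easy to exercise in tests.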

Rocana has written about the technology stack that powers Rocana Search. Composed of Apache Lucene, Apache Hadoop, Apache Kafka, and Apache ZooKeeper, the service ingests and indexes data described by an Apache Avro schema. The post describes how they approach scalability and fault tolerance across the open-source projects that make up their system.


The DataWorks Summit/Hadoop Summit San Jose takes place in June. The call for abstracts is open through February 10th, and the program includes the following tracks: Applications, Enterprise Adoption, Data Process & Warehousing, Apache Hadoop, Governance & Security, IoT & Streaming, Cloud & Operations, and Apache Spark & Data Science.

Apache: Big Data North America takes place this May in Miami. The call for participation ends February 11th, travel assistance applications are accepted until March 8th, and early registration ends March 12th.


Apache Impala (incubating) released version 2.8.0. The release fixes a large number of bugs, adds support for Ubuntu 16.04, improves the Kudu integration (support for ADD/DROP range partition and completed support for ALTER commands), and more.

Apache Parquet Java version 1.8.2 was released with a number of bug fixes.

I don't usually cover pre-release versions of software, but the 3.0.0-alpha2 release of Apache Hadoop is a big milestone (and this is Hadoop Weekly after all!). Highlights of the release include that client jars are now shaded (to help avoid dependency conflicts), support for Microsoft Azure Data Lake and the Aliyun Object Storage System, improvements to scheduling, and improvements to features added in the first alpha (Timeline Server v2 and HDFS erasure coding).
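The shaded-client change means downstream applications can depend on client artifacts that hide Hadoop's third-party dependencies (Guava, Jackson, etc.) rather than pulling in the full `hadoop-client` dependency tree. In Maven terms, that looks roughly like the following (artifact IDs per the shaded-client work in HADOOP-11804; version shown for the alpha):

```xml
<!-- Compile against the shaded client API -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-api</artifactId>
  <version>3.0.0-alpha2</version>
</dependency>
<!-- Shaded implementation, needed only at runtime -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-runtime</artifactId>
  <version>3.0.0-alpha2</version>
  <scope>runtime</scope>
</dependency>
```

Because the transitive dependencies are relocated inside the shaded jars, an application can use its own version of a library like Guava without clashing with the version Hadoop uses internally.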

Amazon EMR version 5.3.0 includes Apache Spark 2.1.0, Apache Hive 2.1.1, Apache Flink 1.1.4, Hue 3.11.0, and Apache Oozie 4.3.0.


Curated by Datadog



Understanding Big Data Streaming and Apache Flink (Fremont) - Wednesday, February 1

Spark < MPI, plus Monolith => Microservices (San Francisco) - Thursday, February 2


Apache HBase Deep Dive (Saint Louis) - Wednesday, February 1


How Google BigQuery Enabled Real-Time Analytics at Motorola Mobility (Madison) - Tuesday, January 31


Confluent Engineering Presents: Introducing Kafka Streams (Chicago) - Thursday, February 2

The Beauty of Kafka (Chicago) - Friday, February 3


Cleveland Big Data and Hadoop User Group (Cleveland) - Monday, January 30

Spinning Up Big Data in the Cloud (Cincinnati) - Thursday, February 2


Apache Spark, Machine Learning, and Healthcare (Miami) - Wednesday, February 1


Data Lake Powered by Apache NiFi: Data Ingestion & Distribution Infrastructure (Ramat Gan) - Sunday, February 5


Meetup for Hands-On to Hortonworks Enterprise Spark (Pune) - Saturday, February 4

Spark 2.0 (Bangalore) - Saturday, February 4


Analyze Online Retailer Data with Hadoop (Christchurch) - Tuesday, January 31