Data Eng Weekly

Hadoop Weekly Issue #196

11 December 2016

Lots of great content this week, including two articles showing off the newly released Amazon Athena. Slack has also written about their AWS-based data infrastructure and some of the challenges they ran into when supporting multiple analysis systems. Finally, there are a handful of releases, including a new version of Apache Hive.


A common practice is to run batch processes (Spark, MapReduce) to produce a dataset (often key/value) for serving in production. Netflix has open-sourced their tool, Hollow, for serving up these read-only datasets. It has some neat optimizations to memory footprint as well as a history tool for inspecting how data is changing by specific record.

Amazon has published a five-part series on optimizing the performance of Redshift through improved table design. Topics covered include distribution keys, sort keys, compression encoding, and data durability.

This post shows a real-world example of using Amazon's recently announced Athena (which is based on Presto) to analyze logs produced by an Application Load Balancer and stored in S3.

The SNOW theorem formalizes a set of design trade-offs that most people building high performance systems have considered for years. The notion is that of Strict serializability, Non-blocking, One response per read, and Write transactions, you can choose three but not all four when designing a system. As usual, the morning paper has a great overview of the key concepts from the paper.

InfoQ has posted the slides and video from Confluent co-founder and CTO Neha Narkhede's QCon presentation entitled "ETL is Dead, Long-live streams." If you're interested in stream processing and haven't seen Neha speak, I highly recommend watching to understand the concepts and architecture that are driving much of the industry.

This post provides a quick walkthrough of getting started with Apache Kafka using the Clojure language.

For a more involved example of Amazon Athena, this post describes first preparing genomics data (by, among other things, converting to Apache Parquet) using Apache Spark. Next, there are example queries that run various simple aggregate analyses using Athena.

The data engineering team at Slack has written about their AWS-based data platform. They store data as Parquet in S3 and query it using Apache Hive, Apache Spark, and Presto. Each tool has its own subtle bugs or inconsistencies in how it interacts with Parquet data (there are several examples in the post), so the Slack team eventually built their own input/output formats to ensure consistency.

This post describes the various YARN-related memory settings as well as some of the main culprits for exhausting memory in a MapReduce job.
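As a rough illustration of how two of these settings relate, here is a minimal sketch that derives a JVM heap flag from a container size. The 0.8 heap-to-container ratio is a commonly used rule of thumb and an assumption here, not a figure taken from the post, and the helper names `heap_opts_for_container` and `map_task_settings` are hypothetical. The property names `mapreduce.map.memory.mb` and `mapreduce.map.java.opts` are the standard MapReduce settings.

```python
def heap_opts_for_container(container_mb: int, heap_ratio: float = 0.8) -> str:
    """Return a -Xmx flag that leaves headroom for non-heap JVM memory.

    heap_ratio is an assumed rule of thumb, not a value from the post.
    """
    if container_mb <= 0:
        raise ValueError("container_mb must be positive")
    if not 0 < heap_ratio < 1:
        raise ValueError("heap_ratio must be between 0 and 1")
    heap_mb = int(container_mb * heap_ratio)
    return f"-Xmx{heap_mb}m"


def map_task_settings(container_mb: int) -> dict:
    """Pair the YARN container size with a matching JVM heap setting."""
    return {
        "mapreduce.map.memory.mb": str(container_mb),
        "mapreduce.map.java.opts": heap_opts_for_container(container_mb),
    }


if __name__ == "__main__":
    for key, value in map_task_settings(2048).items():
        print(f"{key}={value}")
```

If the heap is set as large as the container itself, the JVM's non-heap memory can push the process over the container limit, which is one way a task ends up killed for exceeding its memory allocation.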

NOPaxos, which relies on Network Ordering, is an alternative to Paxos that offers better performance within the data center. It relies on a multicast network primitive in which all receivers process messages in the same order but in which some messages can be lost. The morning paper has details on the algorithm and how it performs in an experimental setting.
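To make the "same order, but lossy" primitive a bit more concrete, here is a toy sketch of a receiver that notices dropped messages. It assumes every message carries a consecutive sequence number stamped before it is multicast, and that the network never reorders messages, so any gap in the numbers means a loss. The class name `OrderedReceiver` and its fields are hypothetical, and this is only an illustration of gap detection, not the NOPaxos protocol itself.

```python
from dataclasses import dataclass, field


@dataclass
class OrderedReceiver:
    """Toy receiver for an ordered but unreliable multicast channel."""

    next_seq: int = 1  # sequence number we expect to see next (assumed to start at 1)
    delivered: list = field(default_factory=list)
    dropped: list = field(default_factory=list)

    def receive(self, seq: int, payload: str) -> None:
        if seq < self.next_seq:
            # Duplicate or stale message: ignore it.
            return
        # Ordering is assumed to be guaranteed, so any skipped
        # sequence numbers correspond to messages that were lost.
        self.dropped.extend(range(self.next_seq, seq))
        self.delivered.append((seq, payload))
        self.next_seq = seq + 1


if __name__ == "__main__":
    receiver = OrderedReceiver()
    for seq, payload in [(1, "a"), (2, "b"), (4, "d")]:
        receiver.receive(seq, payload)
    print("delivered:", receiver.delivered)
    print("dropped:", receiver.dropped)
```

In this sketch the receiver only records which sequence numbers were lost. Deciding what the replicas do about a lost message is where the actual protocol work lies, and that part is covered in the paper and the morning paper write-up.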

The Databricks blog has an example of using Apache Airflow (incubating) to manage a Databricks cluster using their REST API. Even if you're not using Databricks, the post has a useful introduction to Airflow.

The Cloudera blog has a post sharing some snippets for working with CSV files from Spark. Compared with a non-distributed approach, Spark provides excellent speedups and can also convert the data to optimized formats like Parquet.

The DataTorrent blog has a post about the process of integrating the Apache SAMOA streaming machine learning library with Apache Apex.


insideBIGDATA has an interview with Confluent's Gwen Shapira about the fast data movement (which is often powered by Kafka) and some of the key challenges that an organization will face when building real-time streaming systems.


Apache NiFi announced version 0.1.0 of MiNiFi for Java and C++. The C++ version isn't yet considered production-ready, and the Java version includes important new features like the ability to do pull-based configuration changes.

BlueData announced that BlueData EPIC is now GA on Amazon Web Services. EPIC is a tool for managing big data systems, and it supports multiple distributions of Hadoop, hybrid clouds, resource quotas on AWS, and more.

Apache Apex Malhar, the operator and codec library for Apex, has announced version 3.6.0. It is the first version to include SQL support (built via Apache Calcite) and includes several other improvements.

Apache Hive 2.1.1 has been released. There are over 200 resolved issues for the release. Many of those are bug fixes, including several for Hive's LLAP (Live Long and Process) support.

MapR has announced version 2.0 of their MapR Ecosystem Packs. The release includes new versions of Spark, Drill, Kafka (including Connect and the REST Proxy), Hue, and more.

kafka-connect-jmx is a new Kafka Connect module that pulls data from JMX for a Java process and writes it to Kafka as JSON.


Curated by Datadog



Streaming for Personalization Datasets at Netflix (Sunnyvale) - Tuesday, December 13

Using Apache Spark for Machine Learning (Santa Clara) - Tuesday, December 13


Kafka: Introduction and Internals (Bellevue) - Wednesday, December 14


December Austin Data Meetup (Austin) - Monday, December 12

Trevor Grant Tells Us about What's Next for Apache Mahout (Plano) - Tuesday, December 13


Big Data Pipelines/Flows, Use Case–Driven Comparison: Apache NiFi & StreamSets (Kansas City) - Friday, December 16


Hands-On Intro to Apache Spark for Data Engineers, Data Scientist and Developers (Atlanta) - Tuesday, December 13

District of Columbia

Data Processing with Zeppelin/Solr/Spark/Nifi (Washington) - Wednesday, December 14


Hands-On: Data Science at Scale with HAWQ and MADlib and Hadoop (Philadelphia) - Wednesday, December 14

New Jersey

I'm Being Followed by Drones: TensorFlow, HDF 2.0, Phoenix, Python, Zeppelin (Princeton) - Wednesday, December 14

New York

How Apache Spark Will Disrupt the Media Industry (New York) - Wednesday, December 14

How Nielsen Leverages Data & Machine Learning for Real-Time Analytics (New York) - Thursday, December 15


Big Data, No Fluff: Let’s Get Started with Hadoop #11 Julbord (Oslo) - Thursday, December 15


Apache Flink in Action (Madrid) - Thursday, December 15


100% Performance: Hadoop & Vertica... with Criteo and Ogury! (Paris) - Thursday, December 15


Real-Time Transactional SQL on Hadoop (Amsterdam) - Thursday, December 15

Streaming Machine Learning on Flink (Amsterdam) - Friday, December 16


Big Data, Berlin v10.0 (Berlin) - Thursday, December 15


Hadoop User Group Meetup @ IBM Client Center (Vienna) - Tuesday, December 13


Powering the Future of Data in the Cloud (Budapest) - Wednesday, December 14


2nd Apache Kafka Workshop (Cluj-Napoca) - Wednesday, December 14


Continuously Deploying Big Data Pipelines with Amaterasu (Tel Aviv-Yafo) - Thursday, December 15

RUSSIA 2016 Hadoop Meetup (Moscow) - Friday, December 16