Data Eng Weekly

Hadoop Weekly Issue #153

17 January 2016

With the first month of the year still under way, several articles this week review 2015 or make predictions for 2016. Others cover ecosystem projects that have seen major adoption, or are expected to in the coming year: Kafka, Flink, and Kudu, to name a few. Also of note, the program for Spark Summit East was announced, and InfoWorld has a look at the state of open-source big data software as a business.


The Cloudera blog has the third in a series on YARN. This post covers the scheduler, in particular the fair scheduler. It describes queues (including hierarchical queues), queue weights, and more. There are visual aids and XML snippets to demonstrate these main concepts.
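
To give a rough feel for how weights interact with hierarchical queues, here is a minimal Python sketch (not taken from the Cloudera post). It assumes every queue has enough demand to use its full share and ignores minimum/maximum resources and preemption, all of which the real fair scheduler takes into account. The queue names and weights are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Queue:
    name: str
    weight: float = 1.0
    children: list = field(default_factory=list)


def fair_shares(queue, capacity, prefix=""):
    """Split capacity among leaf queues in proportion to sibling weights.

    Simplification: assumes every queue can use its whole share.
    """
    path = f"{prefix}.{queue.name}" if prefix else queue.name
    if not queue.children:
        return {path: capacity}
    total_weight = sum(child.weight for child in queue.children)
    shares = {}
    for child in queue.children:
        child_capacity = capacity * child.weight / total_weight
        shares.update(fair_shares(child, child_capacity, path))
    return shares


# Hypothetical hierarchy: prod outweighs dev 3:1, and etl outweighs reports 2:1.
root = Queue("root", children=[
    Queue("prod", weight=3.0, children=[
        Queue("etl", weight=2.0),
        Queue("reports", weight=1.0),
    ]),
    Queue("dev", weight=1.0),
])

shares = fair_shares(root, capacity=100.0)
for path, share in shares.items():
    print(f"{path}: {share:.1f}% of the cluster")
```

In this sketch a weight only matters relative to the queue's siblings: "reports" has the same weight as "dev", but it splits the share of its parent rather than competing with "dev" directly.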

The Hortonworks blog has a guest post by one of its customers, Arkena. They are doing advanced analytics on massive amounts of streaming video data. The post details how they use Flume, Hive, Spark Streaming, Elasticsearch, and more.

The MapR blog has a whiteboard walkthrough (both a video and transcript) comparing Apache Spark and Apache Flink. The walkthrough covers the differences between the two, highlighting the key distinction of real time vs. microbatch. It also discusses several use cases (e.g. fraud detection, network anomaly detection), describing when microbatch or real-time streaming is more appropriate.

The Morning Paper looks at a new system for providing "distributed ACID transactions with strict serializability, high availability, high throughput and low latency." The protocols take advantage of FaRM (Fast Remote Memory) and RDMA (Remote Direct Memory Access). There are interesting discussions about the expected characteristics of data center hardware as well as how the protocol optimizes for CPU bottlenecks (given that RDMA eliminates other I/O bottlenecks).

The AWS blog has an example of using Spark Streaming from an Amazon EMR cluster to query data in Amazon Kinesis. Further, by microbatching the data and converting a DStream to a DataFrame, queries can be written using Spark SQL. The tutorial has a walkthrough of starting a cluster and a Python application for generating test data in Kinesis.

The IBM developer blog has an overview of some of the key benefits of using Parquet with Spark SQL. The post has a bunch of benchmarking numbers (e.g. 11x faster on Parquet than text files) as well as discussions around Spark internals (e.g. a look at how PushedFilters affect the query plan of a simple SQL query).
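
One reason pushed-down filters can help with Parquet is that the format keeps min/max statistics per row group, so a reader can skip row groups that cannot contain a match. The Python sketch below illustrates only that idea. It is not Spark's or Parquet's implementation, and the RowGroup class and scan_greater_than function are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_value: int   # smallest value of the column in this row group
    max_value: int   # largest value of the column in this row group
    rows: list       # the column values themselves


def scan_greater_than(row_groups, threshold):
    """Return values > threshold plus the number of row groups actually read."""
    matches = []
    groups_read = 0
    for group in row_groups:
        if group.max_value <= threshold:
            continue  # statistics show no row in this group can match
        groups_read += 1
        matches.extend(v for v in group.rows if v > threshold)
    return matches, groups_read


groups = [
    RowGroup(1, 10, [1, 5, 10]),
    RowGroup(11, 20, [11, 15, 20]),
    RowGroup(21, 30, [21, 25, 30]),
]

matches, groups_read = scan_greater_than(groups, threshold=18)
print(f"matched {matches} after reading {groups_read} of {len(groups)} row groups")
```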

It can be frustrating to use the HBase shell with binary data, because the output from read operations is hex-encoded binary (which is difficult to understand). But HBase supports custom formatters for converting bytes to a human-readable format when using the shell. This article describes how to build and use a custom formatter (with a focus on Avro data).
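
To show why the default output is hard to read, here is a small Python sketch (not the formatter from the article) that reverses the shell's escaping. It assumes the convention used by HBase's Bytes.toStringBinary, where printable ASCII is shown as-is and every other byte as \xHH, and it assumes the cell holds an 8-byte big-endian long. The function name decode_shell_bytes is made up for this example.

```python
import re
import struct

# Matches one escaped byte such as \x0A in shell output.
_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def decode_shell_bytes(text: str) -> bytes:
    """Turn an escaped string like '\\x00\\x00*' back into raw bytes."""
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(text):
        # Characters between escapes are printable ASCII, one byte each.
        out += text[pos:match.start()].encode("latin-1")
        out.append(int(match.group(1), 16))
        pos = match.end()
    out += text[pos:].encode("latin-1")
    return bytes(out)


# A long with value 42 shows up as seven escaped zero bytes followed by '*'.
raw = decode_shell_bytes(r"\x00\x00\x00\x00\x00\x00\x00*")
value = struct.unpack(">q", raw)[0]  # big-endian signed 64-bit integer
print(value)
```

The trailing '*' is the byte 0x2A, which is printable and so is not escaped. That mix of escapes and literal characters is what makes numeric or Avro-encoded cells so hard to eyeball.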

MapR has posted about their top-10 posts of 2015. There's a good mix of background, tutorial, and architecture posts covering topics like Spark, HBase, YARN, and Drill.

Apache Kafka and Amazon Kinesis provide a similar set of APIs and guarantees. One of the considerations when choosing one or the other is performance. This post compares throughput performance of the two, varying the number of parallel producers and batch sizes to quantify the impact.


DataInformed has a look at the potential of the Kudu project, which was open-sourced by Cloudera and submitted to the Apache Incubator. In the most extreme version, Kudu might become a complete replacement for HDFS (while offering new features like updates and random access). The post explores this idea and some of the milestones that might get the project to realize that potential.

In a few weeks, Hadoop will be 10 years old, and Cloudera has published an infographic celebrating the milestone. There are lots of numbers about contributions and many highlighted achievements.

InfoWorld has an article about "16 for '16" things to know about the Hadoop and Spark ecosystem (covering hot topics like Zeppelin, security, Kafka, and Impala). The post also discusses some up-and-coming technologies to keep an eye on and some "technologies I'd rather forget."

TechRepublic has a bearish outlook on the big data market. It mentions that only one company (Red Hat) has been able to make a strong go as an open source play. And while Hortonworks is public, they're not looking to turn a profit until 2017.

The agenda for Spark Summit East, which takes place in New York on February 16th-18th, is now available. The Databricks blog has highlighted several of the talks and training sessions. If you're planning on going, the post also includes discount information.

Datameer has an article analyzing several years of big data and Hadoop news articles based on data from TechNews.IO. There are a number of interesting outputs, including a look at the top news days of the year, the top publishers of big data and Hadoop articles, and the most prolific big data/Hadoop article authors.

In another look at the year ahead, the Pivotal blog has a post with five predictions for 2016. Forecasted items include increased productivity of the Apache Hadoop ecosystem, an increase in adoption of open-source in corporate governance, and real-time analytics going mainstream.

Hortonworks is launching a new partner program called "PartnerWorks," and this article highlights some of the key provisions of the new program.


MapR announced that they've added support for Apache Drill 1.4 to their distribution. In their post announcing the support, they highlight many of the key features of the release.


Curated by Datadog



Stream Processing Systems (Sunnyvale) - Wednesday, January 20

LinkedIn’s Big Data Pipeline with Kafka, Hadoop, and Couchbase (Mountain View) - Thursday, January 21

January Hive User Group Meeting (Palo Alto) - Thursday, January 21

Cassandra Data Maintenance with Spark (Santa Clara) - Thursday, January 21


Real-Time Operational Analytics with Apache Spark (Bellevue) - Wednesday, January 20


Deploying the Hadoop Ecosystem! (Salt Lake City) - Wednesday, January 20


An Evening with Chris Fregly, Spark Author/Contributor (Austin) - Tuesday, January 19


Flink and Nifi, 2 Stars in the Apache Big Data Constellation (Chicago) - Tuesday, January 19


Hadoop 101 (Brentwood) - Thursday, January 21


Introduction to Apache Kudu (Atlanta) - Tuesday, January 19

SparkR in Big Data (Atlanta) - Wednesday, January 20


Solr, Spark, and Zeppelin: The Analytics Toolkit for Distributed Big Data (Richmond) - Tuesday, January 19

District of Columbia

Hadoop and Metron: An Introduction to Open Source Security with CapitalOne (Washington) - Wednesday, January 20

Moving from Microsoft SQL to Hive (Washington) - Thursday, January 21


New Functions and Workflow Examples with Spark DataFrames (Vancouver) - Monday, January 18


Northern Spark Meetup (Groningen) - Wednesday, January 20


Real World Case: Spark and the Lifelog App by Sony Mobile (Copenhagen) - Thursday, January 21


Papers We Love: Resilient Distributed Datasets (Warsaw) - Monday, January 18


Second Apache Spark Workshop (Cluj-Napoca) - Wednesday, January 20


Spark Meetup (Shanghai) - Saturday, January 23