Data Eng Weekly

Hadoop Weekly Issue #148

06 December 2015

There's lots of great technical content this week covering large portions of the Hadoop ecosystem as well as distributed systems in general. In news, the CfP for Kafka Summit is open, and there's a new eBook covering "Data Munging with Hadoop." With only a couple of releases, there should be plenty of time to concentrate on the abundance of technical posts.


The Cloudera blog has a post on the past year of Apache Spark development, which has included a lot of work on Spark Streaming, Hive-on-Spark, and tools for data science. Two more articles cover the DataFrames API (which enables easier development, better performance, improved interoperability, and more), MLlib (which provides implementations of popular machine learning algorithms), and the Hive-on-Spark project (there have been a number of recent improvements towards a production-ready version).

The IBM Hadoop Dev blog describes some of the recent security features that IBM have added to their distribution via Knox and Ranger. It also describes some plans for the future of these two projects.

This tutorial describes integrating the H2O machine learning libraries with Spark (and in particular Databricks). Specifically, the post shows how to tokenize data, use the TF-IDF libraries from Spark to identify important words, and build an H2O deep learning model to detect spam.
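
To make the tokenize-then-weight step concrete, here is a minimal plain-Python sketch of TF-IDF. It is not the Spark or H2O API and is not taken from the tutorial; the function names, the sample messages, and the smoothed IDF formula log((N + 1) / (df + 1)) are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Weight = raw term count * log((N + 1) / (df + 1)), where N is the
    number of documents and df is how many documents contain the term.
    (Assumed smoothing; other variants of IDF are also common.)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: count each term once per document.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        weights.append({
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in term_counts.items()
        })
    return weights


# Hypothetical sample messages, not data from the tutorial.
messages = ["free prize now", "free lunch today", "free prize prize"]
scores = tf_idf(messages)
for message, score in zip(messages, scores):
    print(message, "->", score)
```

A term that appears in every message ("free" here) gets a weight of zero, while rarer terms score higher, which is why TF-IDF is useful for picking out the words that distinguish one message from the rest.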

This post aims to highlight the key concepts of distributed systems (with links out to the relevant papers). The content is available as a presentation, a video recording, and a blog post. In total, the post covers nine topics, including timing model, failure modes, and consensus. Whether you're new to distributed systems or are looking to brush up on the main concepts, this is an important resource.

The GoDataDriven blog has a two-part series describing how to configure Cloudera CDH on the Azure cloud. In addition to the common software installation and configuration, the series covers network topology/architecture (including setting up a VPN tunnel), Azure basics, and some modifications to the pre-built Cloudera-Azure template.

Region replicas are a relatively new feature of Apache HBase. By enabling them and specifying the correct flag at query time, HBase can deliver high availability of reads. This tutorial describes how to configure HBase for HA reads and gives a quick walkthrough of using the HBase CLI to create a table with replicas and query secondary regions.

This post describes how to build a standalone Hive metastore without a Hadoop cluster, which is the scenario for running Presto with a blobstore like S3 (the instructions also mention how to set up S3 access).

Apache Flink is a streaming-first system, which means it doesn't require micro-batching like Spark Streaming. But micro-batching, or windowing data, is often useful since it can be convenient to process a set of events at once. This post explores the rich windowing semantics in Flink streaming: time windows (with different notions of time), key-based partitioning of windows, count-based windows, and the base interfaces for building a new type of window function.
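
For readers new to windowing, the following plain-Python sketch illustrates two of the ideas mentioned: keyed tumbling time windows and keyed count windows. It is not the Flink API; the function names and the `(timestamp, key, value)` event shape are assumptions made for illustration, and it ignores concerns a real stream processor handles, such as out-of-order events.

```python
from collections import defaultdict


def tumbling_time_windows(events, size):
    """Group (timestamp, key, value) events into fixed, non-overlapping
    windows of `size` time units, partitioned by key."""
    windows = defaultdict(list)
    for timestamp, key, value in events:
        # Each timestamp belongs to exactly one window.
        window_start = (timestamp // size) * size
        windows[(key, window_start)].append(value)
    return dict(windows)


def count_windows(events, count):
    """Emit a (key, values) window each time `count` values have
    arrived for a key; incomplete windows are not emitted."""
    pending = defaultdict(list)
    emitted = []
    for _timestamp, key, value in events:
        pending[key].append(value)
        if len(pending[key]) == count:
            emitted.append((key, pending.pop(key)))
    return emitted


# Hypothetical events: (timestamp, key, value).
events = [
    (1, "a", 10), (2, "b", 20), (4, "a", 30),
    (6, "a", 40), (7, "b", 50), (11, "a", 60),
]
by_time = tumbling_time_windows(events, size=5)
by_count = count_windows(events, count=2)
print(by_time)
print(by_count)
```

The time-based version closes windows according to the clock carried by each event, while the count-based version closes them according to how much data has arrived per key, so the same stream is split differently under each policy.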

The MapR blog has a short video "Whiteboard Walkthrough" about Apache Myriad (incubating), which is a system for running YARN atop Apache Mesos. The full transcript of the walkthrough is posted, too, if you'd rather read the information.


Kafka Summit takes place in April in San Francisco. The call for proposals is open until January 11, and early bird registration is open until January 15. Disclosure: I'm on the program committee for the conference.

EMC Elastic Cloud Storage (ECS) has been certified for Hortonworks HDP. The integration makes use of Ambari to deploy HDP with the ECS file system instead of HDFS.

The spark-sql-perf project is used to benchmark Spark SQL. Recently, the benchmark has integrated all of the queries from the TPC-DS Benchmark, and IBM has been able to run 89 of the 99 queries when encoding data using the Parquet file format.

"Data Munging with Hadoop" is a new eBook by Ofer Mendelevitch and Casey Stella. It covers a range of topics, from implementing quality checks to handling time-series data.

Datamation has a list of twenty companies associated with Big Data. In addition to the big name Hadoop vendors, there are a number of companies that hadn't yet crossed my radar.


Cloudera announced a new QuickStart Docker image for evaluating CDH. As always, it's interesting to see the ways in which Hadoop and Docker are integrating.

Version 4.3.2 of Apache BookKeeper, the replicated log service, has been released with an important bug fix.


Curated by Datadog



A noETL Parallel Streaming Transformation Loader Using Spark, Kafka & Vertica (Los Angeles) - Monday, December 7

Kudu: New Apache Hadoop Storage for Fast Analytics on Fast Data (Palo Alto) - Wednesday, December 9

Hortonworks Community Celebration and Spark Meetup (San Francisco) - Wednesday, December 9

Spark, Streaming, BlinkDB, Approximate, Twitter Algebird, CountMin Sketch, HyperLogLog (San Francisco) - Thursday, December 10


Case Study: Machine Learning at Scale Using Spark and Hive (Westminster) - Wednesday, December 9


Scalding: A Better Way to Write MapReduce Jobs (Addison) - Monday, December 7

December 2015 Meetup: YARN (Plano) - Monday, December 7


December Edition of MOHUG (Dublin) - Tuesday, December 8


December 2015 Meetup: Kafka (Atlanta) - Thursday, December 10


Spark Hands-on Workshop (Laurel) - Monday, December 7

New Jersey

Continuous Data Management for Hadoop and Spark (Jersey City) - Wednesday, December 9

New York

Scaling Spark (New York) - Monday, December 7

Big Data Warehousing Innovation: Introducing Kudu (New York) - Wednesday, December 9

Database Seminar: Hadoop (Buffalo) - Thursday, December 10


December Presentation Night (Boston) - Thursday, December 10


Apache Spark: Why Should I Care? + Spark in Production (Montreal) - Wednesday, December 9


Spark Meetup (Paris) - Monday, December 7

Data Munging with Apache Spark (Toulouse) - Tuesday, December 8 Saturday, December 12

Integrating Spark/Cassandra, Theory and Practice (Talence) - Thursday, December 10


Spark on Azure + Spark Streaming (Zaventem) - Thursday, December 10


Introduction to Spark Streaming and Deep Dive (Bangalore) - Saturday, December 12

SOUTH KOREA

Kudu: New Apache Hadoop Storage for Fast Analytics on Fast Data (Seoul) - Tuesday, December 8


Spark Technical Deep Dive with Chris Fregly (Sydney) - Tuesday, December 8

Spark after Dark with Chris Fregly and Jamie Engesser (Melbourne) - Wednesday, December 9