Data Eng Weekly

Hadoop Weekly Issue #158

21 February 2016

It was a busy week of announcements and releases, many of which coincided with Spark Summit East, which took place in New York. Among the highlights are the new Apache Arrow project (in-memory columnar storage) and the community edition of Databricks. Netflix and Google both wrote a bit about their big data infrastructure, and there are great articles about Cassandra, MLlib, Python & Hadoop, Kafka Connect, and more.


This post looks at why a time series is a good model for storing many types of medical data, why Cassandra is a good system for storing time series data, how to model time series data in Cassandra (including some example table definitions and row-level inserts), and more.
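As a rough illustration of the modeling idea (not taken from the linked post), a common approach is to bucket readings by source and time window in the partition key and keep rows ordered by timestamp within each partition. The sketch below mimics that layout in plain Python; the names `partition_key`, `insert`, and the patient and metric values are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(patient_id, ts):
    """Bucket readings by patient and calendar day so partitions stay bounded."""
    return (patient_id, ts.strftime("%Y-%m-%d"))


def insert(table, patient_id, ts, metric, value):
    """Append a reading to its partition and keep rows ordered by timestamp,
    similar to what a clustering column on the timestamp would provide."""
    rows = table[partition_key(patient_id, ts)]
    rows.append((ts, metric, value))
    rows.sort(key=lambda row: row[0])


# Hypothetical sample data: two readings on one day, one on the next.
table = defaultdict(list)
insert(table, "p-001", datetime(2016, 2, 21, 9, 30, tzinfo=timezone.utc), "heart_rate", 72.0)
insert(table, "p-001", datetime(2016, 2, 21, 9, 0, tzinfo=timezone.utc), "heart_rate", 70.0)
insert(table, "p-001", datetime(2016, 2, 22, 9, 0, tzinfo=timezone.utc), "heart_rate", 68.0)
```

Bucketing by day is only one possible choice; the right window depends on how many readings each source produces.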

The Netflix blog has a post on the evolution of their data pipeline, which is currently handling over 1 petabyte of data per day. Originally built on Chukwa and Amazon EMR, the latest system (called Keystone) is based on Kafka and EMR, and is supplemented by Elasticsearch and streaming consumers (Spark and others).

The Cloudera blog has an example of using Spark's MLlib to do churn analysis. The post shows how to use Spark SQL (from Python) to define the schema of a CSV file, build a feature vector, use a RandomForestClassifier to train a model, and validate the generated model.

The IBM Hadoop blog has a presentation on Spark troubleshooting—covering topics ranging from compiling Spark to optimizing cluster utilization to collecting thread dumps for debugging production issues. The presentation covers over 10 different troubleshooting tasks.

Hortonworks is starting a new weekly blog series highlighting articles from their Hortonworks Community Connection. This week, the selected articles cover NiFi, Storm, Kafka, and Ambari. They also highlighted three community questions from the week.

Apache Arrow is a new top-level project for columnar, in-memory data spun out of the Apache Drill project. The Apache blog has the announcement, MapR has a post about the origins and design of Arrow (aka Value Vectors), and Cloudera has a post about their plans for integrating Arrow with other big data projects.

Cloudera has an update on the state of Python and Hadoop. The past year has seen improvements to PySpark and the emergence of several Python DSLs, which have greatly improved the utility of Python for big data processing. The post recaps the current landscape and describes two new initiatives: efficient data interchange via Apache Arrow and Cloudera Manager integration with Continuum Analytics' Anaconda Python.

While modern tools have automated much of Hadoop configuration, many settings aren't "set it and forget it." As a cluster grows or utilization goes up, settings that made sense initially can cause major problems. This post describes one such issue (and how to resolve it): an HDFS NameNode that became overwhelmed due to the number of HDFS blocks.
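To get a feel for why block counts matter, here is a back-of-the-envelope sketch. It assumes the often-quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file or block); that figure is an assumption for illustration, not a number from the linked post, and the function name is hypothetical.

```python
BYTES_PER_OBJECT = 150  # assumption: rule-of-thumb figure, not a measured value


def estimate_namenode_heap_bytes(num_files, blocks_per_file,
                                 bytes_per_object=BYTES_PER_OBJECT):
    """Estimate heap used for file and block metadata (directories are ignored)."""
    num_blocks = num_files * blocks_per_file
    return (num_files + num_blocks) * bytes_per_object


# Same number of blocks, very different file counts.
many_small_files = estimate_namenode_heap_bytes(100_000_000, 1)
fewer_large_files = estimate_namenode_heap_bytes(1_000_000, 100)
print(many_small_files / 1e9, "GB vs", fewer_large_files / 1e9, "GB")
```

Under this assumption, storing the same number of blocks as many small files needs about twice the metadata heap, which is one reason block and file counts are worth monitoring as a cluster grows.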

The Confluent blog has a post on Kafka Connect, a new tool for moving data between Kafka and other systems. Part of the recently released Kafka 0.9.0, Kafka Connect ships with connectors for HDFS and JDBC. The introductory blog post has many more details on the design and implementation.

This post describes how to build a Flume Source that plays well with the JMX MBeans that keep track of message statistics for reporting to Cloudera Manager (and other systems).

Google has written about several large-scale sorting experiments that they've run over the past 10 years. Describing the road to sorting 50PB in 2012, the post shares a lot of interesting anecdotes (such as the use of Reed-Solomon encoding in 2010 and that their benchmark from 2012 is faster than the 2015 GraySort winner).


This post looks at the big data landscape, which the author argues has matured and is in a deployment phase. It's still difficult to start a big data system from scratch, but the ecosystem has grown up (the post includes some financial numbers to back up this claim). The post also has a map of the landscape, which covers several areas like infrastructure, analytics, and open source.

Hortonworks has a post about what they've seen with enterprise adoption of Spark and how they're helping to speed up adoption. This includes making analytics/data science easier, hardening Spark for enterprise, and innovating core Hadoop.

Spark Summit East was this week in NYC. The Databricks blog has a post highlighting several keynotes and themes from the conference.

Qubole has announced that they're making their Qubole Data Service available to university classes at no cost. Their website has more details on eligibility and the process for applying.

For the Cassandra users on the list, Planet Cassandra has a new "This Week in Cassandra." It's a collection of news and blog posts, Jira updates, upcoming events, and more.


Versions 2.1.10 and 3.1.0 of Apache Curator, the high-level framework for interacting with Apache ZooKeeper, were released this week. The releases each contain several bug fixes, improvements, and new features.

At Spark Summit in NYC this week, Databricks announced a new community edition of their Spark-as-a-Service platform and Databricks Dashboards. The community edition (currently in beta) provides free access to Databricks, and Dashboards provide a mechanism for building interactive web pages, logically separating charts and graphs from a single notebook, and more.

Apache Bigtop 1.1.0 was released this week. Bigtop is a packaging, smoke/integration testing, and virtualization system for the Hadoop ecosystem. This release is built on Hadoop 2.7.1, supports five operating systems, adds support for Apache Hama and Zeppelin (incubating), adds support for producing Docker images, and more.

The Apache Hive team disclosed CVE-2015-7521, which is an authorization bug that can be used to circumvent some authorization checks. The team has provided a new jar and configuration that produce a run-time work-around. Affected versions are Hive 0.13.x through various versions of 1.2.x (see the announcement for the full list).

Cloudera and Continuum have announced an integration for deploying Anaconda Python via Cloudera Manager. This introductory post describes how to configure CM to find and install the Anaconda Parcels and demonstrates how to take advantage of the install using pyspark.

Qubole has announced a new Qubole Data Service SDK for Airflow. With the SDK, it's easy to integrate Airflow workflow steps that execute via the Qubole service.

Version 0.5.1 (with bug fixes to the recently announced version 0.5.0) of Apache NiFi was released this week. NiFi is a system for processing and distributing data. The new release improves S3, Hive, and encryption support, adds new extensions for Riemann, Elasticsearch, and Avro, and has improved data inspection and state management.

Version 1.7.0 of Apache Accumulo, the distributed key-value store, was released this week. The new version is backwards compatible with previous versions, and focusses on bug fixes and improvements in the areas of security (new Kerberos authentication), availability (improved data center replication), and extensibility (support for HTrace for distributed tracing). There are many more details about the release on the Accumulo web site.

Apache Kafka was released this week. The Confluent blog has a good recap of bug fixes and improvements from the release.

Apache ZooKeeper 3.4.8 was released. The new version addresses several bugs.

IBM has announced the IBM Platform Conductor for Spark. The product focusses on easier deployment and optimized resource scheduling.


Curated by Datadog



How Netflix Handles Data Streams + Intro to Apache Kudu (San Francisco) - Tuesday, February 23

Building Robust, Scalable, and Adaptive Applications on Spark Streaming (San Francisco) - Tuesday, February 23

Enterprise Grade Streaming on Hadoop in Under 2ms (San Francisco) - Wednesday, February 24

The Future of Apache Storm w/ Taylor Goetz (Los Angeles) - Thursday, February 25

Streaming App w/ under 2ms Latency in Hadoop + Ingestion from Nifi (San Jose) - Thursday, February 25

Hadoop and Spark: A Perfect Duo for Big Data (San Francisco) - Thursday, February 25


Using Cassandra with Spark (Portland) - Monday, February 22


Seattle Scalability Meetup: Riak TS + Kudu (Seattle) - Wednesday, February 24

SparkCLR and Kafka+Spark (Bellevue) - Thursday, February 25


February Edition of MOHUG (Dublin) - Thursday, February 25

North Carolina

MemSQL on "Building Real-Time Data Pipelines" (Charlotte) - Wednesday, February 24


H2O Rains with Databricks Cloud (Washington, DC) - Wednesday, February 24


Advanced Spark Meetup (Laurel) - Monday, February 22

District of Columbia

Moving from Microsoft SQL to Hive (Washington) - Tuesday, February 23


February Meetup (Ottawa) - Tuesday, February 23

Hadoop Night Featuring Hortonworks and Pivotal HAWQ (Toronto) - Wednesday, February 24

Toronto Apache Spark #6 (Toronto) - Wednesday, February 24

Automatic Features Generation and News from Spark Summit (Montreal) - Wednesday, February 24

The Emerging Fast Data Architecture by Dean Wampler (Toronto) - Thursday, February 25


February 2016 Meetup (London) - Tuesday, February 23


Intro to Apache Flink: Networking (Madrid) - Thursday, February 25


Kafka (Prague) - Thursday, February 25


Spark in Action: Spark Streaming (Warsaw) - Monday, February 22


The First Meetup for 2016! (Singapore) - Thursday, February 25


Big Data on Azure Using HDInsight (Perth) - Wednesday, February 24