Data Eng Weekly

Hadoop Weekly Issue #220

18 June 2017

This is quite a monstrous double issue with coverage of announcements and releases from the recent DataWorks Summit and Spark Summit. In addition to all the news, there are great technical articles/presentations (be sure to send any others from the Summits my way) on Spark Streaming, Kafka, HBase, and more.


The Hortonworks blog has a post on some new features that are being added to HDFS to detect slow data nodes and slow disks.

Databricks is touting the throughput of Spark Streaming in a new benchmark, and they're proposing an enhancement to Spark Streaming for "continuous processing" that replaces microbatches and reduces end-to-end latency to the order of 1ms.

You may have heard of the Apache Arrow project, which implements a cross-language in-memory columnar data format. While most devs won't deal with Arrow directly, it provides an under-the-hood mechanism to speed up a number of things like PySpark. This presentation gives a good overview of what Arrow is and how it helps achieve speedups.

The Confluent blog has an introduction to the librdkafka-based Python APIs for Kafka. In short, the post shows how to use the APIs to produce and consume records from a Kafka cluster and how to set up a local development environment.
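For a rough idea of what producing and consuming with the librdkafka-based client looks like, here is a minimal sketch. It is not taken from the Confluent post: it assumes the `confluent-kafka` package is installed and a broker is listening on `localhost:9092`, and the helper names (`client_config`, `produce`, `consume`) are made up for illustration.

```python
def client_config(bootstrap_servers, group_id=""):
    """Build a librdkafka-style config dict (hypothetical helper)."""
    conf = {"bootstrap.servers": bootstrap_servers}
    if group_id:
        # Consumers need a group id; start from the oldest available record.
        conf["group.id"] = group_id
        conf["auto.offset.reset"] = "earliest"
    return conf


def produce(topic, values, bootstrap_servers="localhost:9092"):
    """Send each string in `values` to `topic` and wait for delivery."""
    # Imported lazily so the sketch can be loaded without the client installed.
    from confluent_kafka import Producer

    producer = Producer(client_config(bootstrap_servers))
    for value in values:
        producer.produce(topic, value=value.encode("utf-8"))
        producer.poll(0)  # serve any pending delivery callbacks
    producer.flush()  # block until all queued messages are delivered


def consume(topic, group_id, max_messages, bootstrap_servers="localhost:9092"):
    """Read up to `max_messages` records; blocks until that many arrive."""
    from confluent_kafka import Consumer

    consumer = Consumer(client_config(bootstrap_servers, group_id))
    consumer.subscribe([topic])
    received = []
    try:
        while len(received) < max_messages:
            msg = consumer.poll(1.0)  # wait up to one second for a record
            if msg is None:
                continue
            if msg.error():
                raise RuntimeError(str(msg.error()))
            received.append(msg.value().decode("utf-8"))
    finally:
        consumer.close()  # leave the group cleanly
    return received
```

With a broker running, `produce("test", ["hello"])` followed by `consume("test", "demo-group", 1)` would round-trip a single record; the topic and group names here are placeholders.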

The Google Cloud Platform team makes the argument that one-job-per-cluster is the right approach for Hadoop. The penalty for starting up a cluster is low on the Google Cloud (under 2 minutes), and you don't need to worry about optimizing for multitenancy like on a long-running Hadoop cluster.

The Insight Data Science blog has a useful post showing how to spin up a set of AWS and Kafka services (a Postgres database on Amazon RDS, an EC2 instance running Kafka Connect, and an Amazon Redshift cluster) to perform near real-time change data capture to propagate changes from Postgres to Redshift.

Apache HBase has special support for "Medium Object Storage" or MOB, which separately stores files from references when a value is larger than a particular size. This post describes an enhancement (support for weekly and monthly partitioning) in the design which solves memory problems on the NameNode due to the number of MOB files that could be created.
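To see why coarser partitioning cuts down the number of files, here is a small illustrative sketch. It is not HBase code: the function names are hypothetical, and treating Monday as the start of a week is an assumption made only for this example.

```python
from datetime import date, timedelta


def partition_key(day, policy):
    """Map a calendar day to the partition its data would be grouped into."""
    if policy == "daily":
        start = day
    elif policy == "weekly":
        # Assumption for this sketch: weeks start on Monday.
        start = day - timedelta(days=day.weekday())
    elif policy == "monthly":
        start = day.replace(day=1)
    else:
        raise ValueError("unknown policy: " + policy)
    return start.strftime("%Y%m%d")


def count_partitions(first_day, num_days, policy):
    """Count distinct partitions touched by `num_days` consecutive days."""
    keys = set()
    for offset in range(num_days):
        keys.add(partition_key(first_day + timedelta(days=offset), policy))
    return len(keys)
```

Over the 365 days of 2017, this grouping yields 365 daily partitions but only 12 monthly ones, which is the kind of reduction that eases pressure on the NameNode.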

Netflix has written about the third generation of their Genie tool for executing queries (around 150k/day) on YARN and Presto clusters. New features include a redesigned job execution engine, cluster leadership (via ZooKeeper), security (via Spring Security), and dependency caching.

Pinterest has written about the process of, and their experience in, upgrading from a custom 0.94 build of HBase to version 1.2. The protocols and APIs between these versions aren't compatible, so the team had to jump through some extra hoops to ensure that data was replicated (using a custom, thrift-based solution), verify the status of replicated data, ensure their client could support both versions simultaneously, tune performance, and more. Pinterest has a non-trivial amount of data in HBase, so it's very interesting to see what they've done.

The Databricks blog (and accompanying notebook) has a tutorial for using several of the built-in Spark SQL functions for processing JSON and nested structures.

The Cloudera blog has a post on the evolution of Apache ZooKeeper's Four Letter Words support. For these admin commands, there isn't a good security solution (as a connection is over the normal ZK port). As an alternative, ZooKeeper provides support for JMX and, as of the 3.5.x release, an AdminServer on a separate port.
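For context on why these commands are hard to secure, a four letter word is just four ASCII bytes written to the regular client port, with the reply read back until the server closes the connection. A minimal sketch using only the standard library (the function name is hypothetical, and it assumes the command is enabled on the target server):

```python
import socket


def four_letter_word(host, port, command, timeout=5.0):
    """Send a four-letter admin command (e.g. "ruok") and return the reply."""
    if len(command) != 4:
        raise ValueError("command must be exactly four characters")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # No authentication or handshake: the command is sent as plain bytes.
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Against a local server, `four_letter_word("localhost", 2181, "ruok")` would be the typical health check; the host and port are assumptions about your setup.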


Databricks and O'Reilly are providing free access (behind an email/phone num wall) to several chapters from the upcoming "Spark: The Definitive Guide."

MapR-XD is a new product that provides data tiering by storing cold data in an object store like Amazon S3. It also supports flash storage for hot data.

Databricks has announced a push towards a managed, auto-scaling Spark service called Databricks Serverless. Their first step in that direction is Databricks Serverless Pools, in which compute and storage autoscale on a set of instances running within a customer's own AWS account.

Hortonworks has announced a new type of support contract, called a Hortonworks Flex Support Subscription, that allows a single subscription to support both cloud and on-prem deployments. This move seems to reflect the number of customers that are migrating to or experimenting with hybrid clouds.

The ODPi is doing a 6-month grant fund program to make improvements to Apache BigTop (including code, documentation, and more). Applications are due on July 14th.

Cloudera posted their first quarterly earnings report. Datanami has a good breakdown (revenue and GAAP losses were better than expected, but billings missed expectations).

Confluent has posted slides and videos from the recent Kafka Summit NYC.

Videos for Spark Summit have also been posted.

Insight's Data Science, Data Engineering, Health Data, and AI programs are accepting applications through Monday, June 26th for the fall program. Insight's programs are free and over 900 alumni have already completed the 7-week program.

IBM and Hortonworks announced a partnership in which IBM will distribute HDP and HDF rather than IBM BigInsights. In turn, Hortonworks is including some items from the IBM distro, including BigSQL.


Amazon EMR 5.6.0 includes new versions of Apache Spark, Apache HBase, Apache Flink, and Apache Mahout. The release also supports in-transit encryption for Apache Presto.

Version 4.2 of the Cask Data Application Platform has been released. The new version supports interactive Spark queries, event-driven workflows, change data capture for SQL Server and Oracle, several Azure services, and Amazon EMR.

Apache HBase 1.2.6 was released with several critical fixes.

Apache NiFi 1.3.0 and 0.7.4 were released to address two vulnerabilities: a cross-frame scripting issue and a cross-site scripting issue. The 1.3.0 release also includes major improvements and features related to clustering, UI/UX, and the core framework.

Hortonworks DataFlow 3.0 was released with a new Streaming Analytics Manager and Schema Registry. The post has screenshots of the Streaming Analytics Manager, which enables building of streaming applications without writing code.

A new version of the StreamSets Data Collector includes an automatic Data Drift Synchronization feature (which automatically updates schemas in Hive), improved speed of ingesting TCP payloads, support for Spark 2.0, and more.

The 3.0 release of BlueData EPIC has a number of new features. These include a new gateway service, cluster monitoring with Elasticsearch and Kibana, performance optimizations, improved deployment options and templates, and data science Docker containers with support for R, Jupyter, etc. The release also includes a Kerberos Passthrough service, which enables secure Hadoop features when compute and storage are separated across several clusters.

Apache Zeppelin 0.7.2 resolves over 40 issues, including a number of bug fixes and improvements (mostly with the Livy Spark job server integration).

Apache Kudu 1.4.0 was released. There are a number of new features (such as a new file system check util) and many optimizations and improvements (including updates to the web interface and write performance).

Apache Impala 2.9.0-incubating was released. The release includes a number of documentation updates and many other fixes and improvements (over 350 resolved issues!).


Curated by Datadog



LVTech Meetup: Hadoop & Impala (Bethlehem) - Tuesday, June 20

New Jersey

Latest HDF Innovation: Schema Registry and More (Princeton) - Tuesday, June 20

New York

MapR @ the New York Big Data Meetup (New York) - Tuesday, June 20


Analytics with Azure Machine Learning Studio + Hadoop For Everyone (Medellin) - Thursday, June 22


DIY Data Analytics with Apache Spark (London) - Thursday, June 22

Drivetribe’s Kappa Architecture with Apache Flink (London) - Thursday, June 22

Big Data, London v 9.0 (London) - Thursday, June 22


Big Data & Stream Processing with Apache Spark (Dusseldorf) - Tuesday, June 20


20th Swiss Big Data User Group Meeting (Zurich) - Monday, June 19


Data Science Workbench with Cloudera Experts (Herzliya) - Monday, June 19


First Johannesburg Meetup (Sandton) - Thursday, June 22

If you didn't receive this email directly, and you'd like to subscribe to weekly emails please visit