Data Eng Weekly

Hadoop Weekly Issue #210

26 March 2017

Lots and lots of open-source releases this week: Apache NiFi, Apache Knox, Apache Kudu, Apache Flink, and more (including a new open-source time series database). There are also some great technical posts on HDFS erasure coding, Apache Phoenix, and Amazon Athena/Presto.


Sendence has written about Wallaroo, their distributed event processing framework. The team plans to open-source it soon, but in the meantime this post describes what it is, the core abstractions, key features (like exactly-once processing), and future plans. Impressively, Wallaroo has median processing latencies in the microseconds and 99.99th-percentile latencies around 1ms (their example use case is a trading system). Currently, APIs are in C++ and Pony, but support is planned for other languages too.

Hortonworks has the fourth part in their "Data Lake 3.0" series. This part describes the evolution of HDFS storage, specifically the heterogeneous storage system introduced in Hadoop 2.3 and the erasure coding implementation that is underway now. The post has a good description of how erasure coding is implemented, and it describes the main practical challenges (like small files and write-throughput overhead).
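For a rough sense of why erasure coding is attractive for storage, here is a minimal Python sketch. It assumes a Reed-Solomon layout with 6 data units and 3 parity units (an illustrative configuration, not one taken from the post), and it uses simple XOR parity as a stand-in for the real Reed-Solomon math, so it can only recover a single lost block. The function names are hypothetical.

```python
from functools import reduce


def storage_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per byte of user data."""
    return (data_units + parity_units) / data_units


def xor_parity(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks (a toy stand-in for Reed-Solomon)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three-way replication stores one original plus two extra copies.
replication = storage_overhead(data_units=1, parity_units=2)

# Assumed erasure coding layout: 6 data units plus 3 parity units.
erasure_coded = storage_overhead(data_units=6, parity_units=3)

# Toy recovery: lose the middle block, rebuild it from the others plus parity.
data = [b"hdfs", b"cell", b"demo"]
parity = xor_parity(data)
recovered = xor_parity([data[0], data[2], parity])
```

Under these assumptions, replication costs three raw bytes per user byte while the erasure-coded layout costs one and a half. The trade-off is the extra computation on writes and on recovery that the post lists among its practical challenges.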

This post provides a brief introduction to connecting Apache Spark to Solr, with examples using the Spark shell.

The team at Sky Gaming and Betting has written about how they use the Confluent Schema Registry with Apache Avro and Apache Kafka to enable decentralized implementations across squads within the organization. They are using Node.js, so there's also an overview of the state of the schema registry for a Node.js client.

The IBM Hadoop Dev blog has a look at how they've integrated Jupyter notebooks with the IBM Open Platform using Apache Knox for authentication.

The Apache Software Foundation blog has a post on the new Column Mapping and Immutable Data Encoding features of Apache Phoenix 4.10 (more below). In short, the column mapping switches Phoenix to use integers rather than strings for column names, which has a number of advantages (including both significant speedups and space savings of around 40% on a TPC-H benchmark).
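The idea behind column mapping can be illustrated with a small Python sketch. This is not Phoenix's implementation: the fixed two-byte qualifier width, the function names, and the sample column names are all assumptions chosen for illustration. It only shows why storing a short integer per cell instead of the full column name shrinks each stored cell.

```python
def encode_columns(names: list[str]) -> dict[str, int]:
    """Assign each column name a small integer qualifier (hypothetical scheme)."""
    return {name: index for index, name in enumerate(names)}


def qualifier_bytes_by_name(names: list[str]) -> int:
    """Bytes spent on qualifiers per row when the name is stored with every cell."""
    return sum(len(name.encode("utf-8")) for name in names)


def qualifier_bytes_by_number(names: list[str], width: int = 2) -> int:
    """Bytes spent on qualifiers per row with fixed-width integer qualifiers."""
    return width * len(names)


# Hypothetical column names, for illustration only.
columns = ["customer_name", "shipping_address", "order_total"]
mapping = encode_columns(columns)
by_name = qualifier_bytes_by_name(columns)
by_number = qualifier_bytes_by_number(columns)
```

The mapping itself has to be stored once in metadata, but the per-cell saving repeats on every row, which is where the footprint reduction would come from.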

Amazon has posted performance tips for Amazon Athena (since Athena uses Presto, many of the tips are applicable outside of Athena, too). There are five tips for storing data (covering partitioning and file formats) and five tips on querying data (e.g. avoiding order by without limit and projecting columns early).
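To see why partitioning helps, here is a toy Python model of partition pruning over Hive-style `key=value` paths. It is a sketch of the general idea only, not how Athena or Presto plans queries. The file paths and function names are hypothetical.

```python
def partition_values(path: str) -> dict[str, str]:
    """Extract key=value segments from a Hive-style partitioned path."""
    pairs = (segment.split("=", 1) for segment in path.split("/") if "=" in segment)
    return dict(pairs)


def prune(paths: list[str], **wanted: str) -> list[str]:
    """Keep only the files whose partition values match every requested key."""
    return [
        path
        for path in paths
        if all(partition_values(path).get(key) == value for key, value in wanted.items())
    ]


# Hypothetical layout: one directory per year and month.
files = [
    "logs/year=2017/month=02/part-0000.parquet",
    "logs/year=2017/month=03/part-0000.parquet",
    "logs/year=2017/month=03/part-0001.parquet",
]

# A filter on the partition columns means only matching files are read.
to_scan = prune(files, year="2017", month="03")
```

With this layout, a query that filters on `year` and `month` skips the February file entirely, so less data is scanned.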


Since hearing that the Strata + Hadoop World conference is being renamed Strata Data Conference, I've been curious to hear more about what the feeling was there. Datanami has some detail with a look at the "shift to real-time," the challenges due to the complexity of Hadoop, and the (perceived?) momentum due to all the companies built around Hadoop.

The DBMS2 blog has a great look at the recently announced Cloudera Data Science Workbench. It adds some new details, like the fact that it's Docker-based to allow teams to install whatever software they need and that it's been beta tested by a number of big companies.


The Apache Tephra (incubating) transaction engine for Apache HBase and other distributed data stores has released version 0.11.0-incubating. The release includes a few improvements and bug fixes.

Apache NiFi has released version 1.2.0 of the NiFi Archive bundle plugin, which can be used for class loader isolation in NiFi.

Version 0.12.0 of Apache Knox was released. There are a number of improvements and new features in the release, including improved proxy support, a YARN HA implementation of the REST API and UI, and pluggable pre-auth header provider support.

Amazon EMR has added the ability to specify "instance fleets" of up to five instance types, on which to build a cluster of mixed on-demand and spot instances.

Apache Kudu 1.3.0 was released with a bunch of new features: Kerberos authentication, encryption in transit using TLS, coarse-grained authorization, background tasks to clean up old data, and a new crash reporter. There are also several optimizations (such as a switch to LZ4 compression) as part of the release.

The 1.1.5 release of Apache Flink includes fixes for high availability, fault tolerance, and Kryo serialization (among a dozen or so bug fixes).

Apache Gora, which provides an in-memory data model for several different big data frameworks (including Avro, HBase, MongoDB, Spark, and more), has released version 0.7. The release includes over 80 issue resolutions.

Apache Phoenix 4.10 was released. This version of the SQL-on-HBase engine adds a reduced on-disk storage footprint (see separate post above), Apache Spark 2.0 integration, support for consuming data out of Apache Kafka, improved Hive integration, and more.

TimescaleDB is a new, open-source time series database that's built on the Postgres engine. It's currently available in a single-node version, and there's an interesting whitepaper describing its design.


Last week, I misstated that the Microsoft announcements were made at Hadoop Summit. These were actually made at the Strata + Hadoop World conference.


Curated by Datadog



Big Data App Meetup (Palo Alto) - Wednesday, March 29

Data Science Monthly Talk: Apache Kafka (Sunnyvale) - Thursday, March 30

Stream Processing With Apache Kafka and .NET (Mountain View) - Thursday, March 30

Robust Stream Processing with Apache Flink, with Jamie Grier (San Francisco) - Friday, March 31


Fast Data With Open Source Solutions (Laurel) - Tuesday, March 28


Big Data All the Things (Philadelphia) - Tuesday, March 28

New Jersey

Apache NiFi: Ingesting Enterprise Data at Scale (Princeton) - Tuesday, March 28

Security Analytics: Securonix + Cloudera + Spark + Solr (Princeton) - Tuesday, March 28


Spark in the NHS and Cloud Object Stores (London) - Thursday, March 30


How to Monitor and Optimize Spark Processes (Madrid) - Tuesday, March 28

Fast Analytics on Fast Data With Apache Kudu (Madrid) - Thursday, March 30


PyData Munich March Meetup (Munich) - Tuesday, March 28


Hadoop Stories with Owen O'Malley (Budapest) - Thursday, March 30


Cloud Data Analytics: Trends, Technologies, Challenges, and Opportunities (Tel Aviv-Yafo) - Tuesday, March 28


Real-Time Analytics with Spark Streaming by Padma Chitturi (Bangalore) - Saturday, April 1


Insights From Recent Strata + Hadoop World Conferences (Auckland) - Tuesday, March 28