Data Eng Weekly

Hadoop Weekly Issue #171

22 May 2016

There were quite a few releases this week, including a new open-source project from LinkedIn. On the technical and news front, there are several articles recapping Apache: Big Data North America, and there's an excellent series of posts about analyzing NYC Taxi data across several different data systems.


The Databricks blog has a post about two approximation algorithms that are available in Apache Spark. They are approxCountDistinct, which estimates the number of distinct values, and approxQuantile, which generates approximate percentiles. The post describes the algorithms and visualizes the accuracy for varying residuals.
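To give a feel for how a distinct-count estimate can work in bounded memory, here is a minimal pure-Python sketch. It is not Spark's implementation and does not use the Spark API; it uses a simpler k-minimum-values (KMV) estimator as a stand-in, and the function names and the default k are hypothetical choices made for illustration. Approximate quantiles are not covered.

```python
import hashlib
import heapq


def _unit_hash(value):
    """Map a value to a pseudo-random float in [0, 1] using MD5."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def approx_count_distinct(values, k=1024):
    """Estimate the number of distinct values with a k-minimum-values sketch.

    Illustrative only: this is a KMV estimator, not the algorithm Spark uses.
    """
    heap = []        # negated hashes, so the largest retained hash is -heap[0]
    members = set()  # hashes currently retained, used to skip duplicates
    for v in values:
        h = _unit_hash(v)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the largest retained one: swap them.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        # Fewer than k distinct hashes were seen, so the count is exact.
        return len(heap)
    kth_smallest = -heap[0]
    # If n distinct hashes are uniform on [0, 1], the k-th smallest sits
    # near k / n, which gives the estimate (k - 1) / kth_smallest.
    return int((k - 1) / kth_smallest)
```

Memory use is bounded by k regardless of input size, and a larger k trades memory for a tighter estimate (the relative error shrinks roughly as 1/sqrt(k)).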

This tutorial describes how to use Apache Hadoop HDFS, Apache Solr, and Hue to store, index, and search for medical images stored in the DICOM format. The post includes a walkthrough of the steps needed to load and fetch the data.

MapR Streams is a system that is API compatible with Apache Kafka. This post describes, at a high level, the similarities and differences between MapR Streams and Kafka. There's also a clarification of how Kafka Streams relates to MapR Streams.

This post is one of the clearest explanations of Paxos, the consensus protocol for distributed systems, that I've seen. The article includes examples of plotting computers and distributed auctions to help illustrate the protocol.

Based on a presentation at the recent Apache: Big Data North America conference, Datanami looks at the new features in the upcoming Apache Hadoop 3 release. Among the highlights are the shell script rewrite, task-level native optimization, the capability to derive memory sizes automatically, and support for erasure coding in HDFS. The post looks closely at erasure coding, which should improve storage efficiency (1.5x disk consumption rather than 3x).
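The 1.5x versus 3x comparison can be checked with a little arithmetic. The sketch below assumes a Reed-Solomon layout with 6 data units and 3 parity units per stripe; that layout is an assumption chosen because it reproduces the 1.5x figure, and the function names are hypothetical.

```python
def erasure_coding_multiplier(data_units, parity_units):
    """Raw bytes stored per logical byte for a (data, parity) striping layout."""
    return (data_units + parity_units) / data_units


def replication_multiplier(replicas):
    """Raw bytes stored per logical byte when every block is fully replicated."""
    return float(replicas)


def raw_terabytes(logical_terabytes, multiplier):
    """Disk consumption for a given amount of logical data."""
    return logical_terabytes * multiplier


# Assumed layout: 6 data units + 3 parity units per stripe.
ec = erasure_coding_multiplier(6, 3)   # (6 + 3) / 6 = 1.5
rep = replication_multiplier(3)        # 3.0

# 100 TB of logical data under each scheme.
ec_disk = raw_terabytes(100, ec)       # 150.0 TB
rep_disk = raw_terabytes(100, rep)     # 300.0 TB
```

Under this assumed layout a stripe can lose any 3 of its 9 units and still be reconstructed, while 3x replication survives the loss of 2 of its 3 copies, so the disk savings do not come at the cost of weaker failure tolerance.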

This presentation from PyData Berlin describes a future in which Apache Arrow and the Feather file format are the main mechanism for interoperability for data across languages/frameworks.

Videos of two Apache Kafka-related talks from two separate conferences have been posted. The first describes the new security features in Kafka, and the second explores using Kafka to share data across systems.

This blog has a collection of posts about loading/querying the New York City taxi data via various data systems like Amazon Redshift, Google BigQuery, Postgres, and Presto. In addition to raw benchmarking, there are details about troubleshooting, optimizations, and comparing alternatives (such as S3 vs HDFS in AWS).

O'Reilly has an article describing how to implement the kappa architecture with Kafka, Flink, Elasticsearch, and Kibana. The post gives an overview of the lambda and kappa architectures, describes the major architecture components, and describes how to use the setup to detect novelties using Bayesian models.


This post about the recent Apache: Big Data North America conference enumerates many of the big data ecosystem projects that were covered at the conference. There are quite a few, including several that weren't yet on my radar.

The Pivotal blog has an interesting post on big data and agile development. Big data systems are often stuck in a non-agile world in which requirements are gathered and schemas are defined well before data is pulled in. The post argues that the constraints that necessitate this approach (limited capacity and performance, siloed data, etc.) are no longer valid in a cloud-based environment.

Databricks has published a recording of their webinar "Apache Spark MLlib: From Quick Start to Scikit-Learn" for on-demand viewing. In addition to the webinar content, they've posted answers to eight common questions from the session.

The Hortonworks blog has a post overviewing the history of Apache Storm. Open-sourced in 2011, Storm moved to the Apache incubator in 2013, became a top-level project in 2014, and hit its 1.0 release earlier this year. The article discusses the major technical advances in each of those milestones and more.

HBaseCon is this week in San Francisco. The conference includes keynotes from Apple, Yahoo, and Facebook.

MapR has an infographic celebrating the last year of Apache Drill. In that time, the project has had seven releases and hit a number of impressive milestones.

Datanami has an article covering a Q&A at Apache: Big Data North America with ASF director Jim Jagielski and ODPi program director John Mertic. The main topic, as expected, was the relationship between the ASF and ODPi.


LinkedIn has open-sourced Ambry, their distributed object store. The code for Ambry is on GitHub, and the introductory blog post has a thorough overview of Ambry's targeted SLAs, design goals, architecture, and interfaces.

Pivotal HDB 2.0, which is powered by Apache HAWQ (incubating) and provides an analytics database for Hadoop, was released this week.

Version 0.12.1 of Apache Mahout, the machine learning and data mining system, was released this week. The release addresses a number of issues with the Flink/Mahout integration.

Version 0.11.3 of Apache Tajo, the data warehouse for Hadoop, was released. The new release fixes 5 bugs.

MongoDB has announced a new MongoDB Connector for Apache Spark. Compared with the Hadoop InputFormat shim for Spark, this connector adds a number of features. In addition to the announcement, there's another post explaining some of the key features.

Syncsort has released DMX-h v9, which adds support for Kafka and a new Intelligent Execution framework.


Curated by Datadog



Meetup on the Night before HBaseCon2016! (San Francisco) - Monday, May 23

Solr as a SparkSQL DataSource (San Francisco) - Monday, May 23

PhoenixCon (San Francisco) - Wednesday, May 25

Storm/Kafka Meetup: “Securing Kafka Clusters” (San Francisco) - Wednesday, May 25

The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO for Data, Microsoft (Mountain View) - Wednesday, May 25

Stream Your Operational Data w/ Apache Spark & Kafka into Hadoop Using Couchbase (Santa Monica) - Thursday, May 26


Apache Phoenix + More (Seattle) - Wednesday, May 25

Spark Streaming Primer and TUNE Case Study (Seattle) - Thursday, May 26


Spark Hands-on 1-Day Workshop for Data Engineers, Data Scientists and Developers (Coppell) - Tuesday, May 24

Cloudy to Clear: Big Data and Insights with Azure (Houston) - Tuesday, May 24

What Is All the Hype about Apache Spark (Coppell) - Tuesday, May 24

Cloudera User Group Meetup (Plano) - Wednesday, May 25


Apache Kudu: New Apache Hadoop Storage for Fast Analytics on Fast Data (Saint Paul) - Thursday, May 26


Flinking Even Faster with Iterations and Delta Iterations (Chicago) - Thursday, May 26

North Carolina

May CHUG: Cloudera on Kafka (Charlotte) - Wednesday, May 25


Spark Streaming and the Internet of Things (Arlington) - Tuesday, May 24

New Jersey

Interactive Real-Time Streaming with Spark 2.0: Structured Streaming (Princeton, NJ 08544) - Wednesday, May 25

New York

KNN with Apache Flink by the Implementor, Dan Blazevski (New York) - Tuesday, May 24


Special Spark Presentation Night (Somerville) - Tuesday, May 24


Toronto Apache Spark #9 (Toronto) - Wednesday, May 25


Python for Data Engineers and How to Blend the Database World with Apache Spark (London) - Tuesday, May 24


Apache Flink Meetup Berlin #14 (Berlin) - Tuesday, May 24


HBase and MySQL Ecosystem for Real-Time Views of Data (Prague) - Thursday, May 26


DataFrames and Spark SQL in Network Analytics (Budapest) - Wednesday, May 25


Big Data Meetup: Apache Storm, Backgammon AI Agents (Athens) - Tuesday, May 24


Hortonworks Data Platform: International Speakers (Dubai) - Monday, May 23


Understanding and Building Big Data Architectures, Part 3: Kafka (Hyderabad) - Saturday, May 28

Machine Learning Pipelines with Spark ML (Bangalore) - Saturday, May 28