Data Eng Weekly

Hadoop Weekly Issue #162

20 March 2016

Technical news about the core Hadoop ecosystem was rather slow this week, but there are interesting big data articles from Netflix, Facebook, and AWS. In addition, there were a lot of releases this week—Drill, Kylin, Mahout, Hama, and Ibis. Finally, Altiscale announced a new product, and rdkafka-dotnet is a new Kafka library for the .NET ecosystem.


The Netflix blog has a post about Mantis, their stream-processing platform for realtime analytics. The post talks about the system architecture (built on Mesos, RxJava, and RxNetty), autoscaling (the number of ec2 instances cycles up and down throughout a given day), use-cases (such as anomaly detection to alert engineering teams), and future plans. The comments also have a good overview of how Mantis compares to a system like Spark Streaming.

This article demonstrates how to use the recently released GraphFrames API for Spark to do graph analysis of flight datasets. In addition to typical summary statistics, the post shows how to use motifs (patterns), run a PageRank calculation, use bread-first search to find flight connections, and more.

The AWS Big Data blog has a tutorial showing how to build a server-less stream processing system using several AWS services (Lambda, Kinesis, DynamoDB streams). The post describes all the system setup and IAM permission configs needed to get going, and it includes example Lambda functions for updating event counters in DynamoDB and deriving metrics from what's stored in DynamoDB for publishing to CloudWatch.

This post aims to define a framework for empowering data scientists and data engineers to focus one what they do best. Rather than the labor categories inherited from the BI departments of the past, data scientists should focus on building end-end products and data engineers focus on abstraction. Having worked in organizations with both strategies, I agree that the latter has a lot of pros.

Apache Beam (incubating)—the framework previously known as the Dataflow SDK—has put together a capability matrix for the various backends (Google Cloud Dataflow, Apache Flink, Apache Spark). The matrix covers the What (types of operators), Where (windows), When (triggers), and How parts of the model. In addition to showing what's currently supported, the matrix also shows what is and isn't possible in the future.

Twitter has published a new post about Manhattan, their distributed key-value store. The topic of this post is consistency, which is implemented using their DistributedLog service (which shares similarities to Kafka). For those interested in distributed systems, this describes some of the practical challenges of implementing one at scale.

The MapR blog describes four different ways to manage passwords for Sqoop—protected files, storing passwords in a db, storing passwords in a db + expect scripting, and the Hadoop CredentialProvider API. The last option is the cleanest, and Sqoop added support for it in version 1.4.5.

Facebook has one of the largest and richest datasets in the world, and it faces data challenges not seen in many other companies. One such challenge is how to index and query a giant distributed graph. This post gives a preview overview of their distributed graph query engine, Dragon, which is powered by systems like RocksDB and MySQL.


The Call for Papers for Cassandra Summit 2016, which takes place in San Jose in September, is now open. The CFP closes on June 1st.

Altiscale has announced a new product, Altiscale Insight Cloud, which is a hosted service for dynamic visualizations, dashboards, and more. The new service adds to the Altiscale Big Data-as-a-service platform to form a complete out-of-the-box solution.

"Hadoop: What You Need to Know" is a new free eBook aimed at Enterprise leadership. The report introduces the basics of Hadoop and why it's important. Access is restricted behind an e-mail wall.

Teams from Pivotal and the University of Wisconsin have submitted a proposal for Quickstep, a modern high-performance database engine, to the Apache Incubator. The project started in 2011 and is implemented in C++. The proposal highlights potential alignments with Hive, HAWQ (incubating), YARN, and Mesos.


Apache Mahout 0.11.2 was released just over a week ago. The new release adds Spark 1.5.2 support, several performance improvements, and a handful of bug fixes.

Apache Hamma, the Bulk Synchronous Parallel computing framework, released version 0.7.1 this week. The new release fixes two high profile bugs for Python and YARN, and it adds a new task scheduling mechanism.

Version 1.3.0 of Apache Kylin, the OLAP analytics engine with Hive integration, was released this week. The new version includes Hive Beeline support, precise distinct counts, and many improvements.

Ibis, the Python analytics framework with support for the Hadoop ecosystem, has released version 0.7.0. This release adds Impala-Kudu integration, bug fixes, and a smarter SQL compiler. If you're interacting with the Hadoop ecosystem from Python, be sure to check it out.

The Apache Drill project has one of the fastest release velocities in the Hadoop ecosystem with a new release nearly every month. This week saw the 1.6.0 release, which adds support for Java 8, inbound impersonation, and additional custom window frames.

rdkafka-dotnet is an Apache Kafka client for C#/F#. Built on librdkafka, it has a lot of the functionality seen in other clients.