Data Eng Weekly

Hadoop Weekly Issue #113

22 March 2015

This issue has more variety than we've seen in recent months. There are great technical articles covering everything from tuning AWS for Hadoop to Apache Flink to Hadoop with Python to Apache Tajo. In news, Tachyon Nexus announced a Series A round. And in releases, two exciting new projects provide the ability to run HDFS on Mesos and to stream MySQL replication events to Kafka.


The Confluent blog has a post that provides suggestions for choosing the number of partitions in a Kafka topic. While more partitions will help improve throughput, increasing the number will result in more open file handles, (potentially) longer unavailability in certain circumstances, higher end-to-end latency, and additional memory requirements in clients. The post describes each of these trade-offs in depth.
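The post's sizing rule of thumb can be sketched as a quick back-of-the-envelope calculation (the throughput numbers below are illustrative, not measurements):

```python
import math

def min_partitions(target_mb_s, producer_mb_s_per_partition, consumer_mb_s_per_partition):
    """Lower bound on partition count for a target throughput.

    Following the post's guideline: with per-partition producer throughput p
    and per-partition consumer throughput c, sustaining a target throughput t
    requires at least max(t/p, t/c) partitions.
    """
    return math.ceil(max(target_mb_s / producer_mb_s_per_partition,
                         target_mb_s / consumer_mb_s_per_partition))

# e.g. a 200 MB/s target, with producers measured at 20 MB/s per partition
# and consumers at 25 MB/s per partition
print(min_partitions(200, 20, 25))  # 10
```

Since the costs of extra partitions grow with the count, the post suggests sizing close to this bound rather than vastly over-provisioning.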

This presentation provides an up-to-date overview of the state of Hadoop with Python. It looks at several open-source frameworks, including mrjob and Pydoop for MapReduce jobs, snakebite for interacting with HDFS, and the Python APIs included with Spark and Pig.

The Qubole blog has a post looking at the effects of different types and features of virtualization on the Amazon Web Services cloud. The post is worth reading in its entirety, but the key takeaways are that switching from PV to HVM instances and enabling enhanced networking are major wins. They didn't see large gains from placement groups. As always, it's worth validating these results with your own application.

This is a good read about how Apache Flink, a distributed data processing framework, solves a lot of distributed systems problems. Focusing on equi-joins, the post describes the high-level Flink API, join strategies, memory management, join optimization, and performance.

This post on the Cloudera blog describes the Spark-Kafka integration in the recent 1.3 release of Spark. Topics include creating RDDs for batch jobs, RDDs for streaming, and an overview of strategies for building at least once/at most once/exactly once delivery of results. The exactly-once section describes two strategies—idempotent writes based on unique keys and transactional writes.
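The idempotent-write strategy can be illustrated with a toy example: results are written under a unique key (here, partition and offset), so reprocessing the same input after a failure overwrites the previous result instead of duplicating it. The dict below stands in for a key-value sink such as HBase or Cassandra:

```python
# Toy sketch of idempotent writes keyed by (partition, offset).
store = {}

def process(partition, offset, value):
    result = value.upper()               # stand-in for the real computation
    store[(partition, offset)] = result  # unique key makes the write replay-safe

# first attempt
process(0, 42, "hello")
# the same message is redelivered after a failure; the rewrite changes nothing
process(0, 42, "hello")

assert len(store) == 1
print(store)  # {(0, 42): 'HELLO'}
```

With deliveries made safe to repeat, at-least-once delivery from Kafka yields exactly-once results.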

A new paper on Spark analyzed performance on the BDBench and TPC-DS benchmarks and found some surprising results. Specifically, they found that CPU is often the limiting factor and not disk or network I/O. It's a big paper with a lot of interesting findings and suggestions for improvement.

The Hortonworks blog has a post on several new features that have been added to the Hadoop ecosystem in order to support rolling upgrades. It discusses some operational items like software packaging and configuration as well as the changes in core HDFS, YARN, Hive, and more. There are also instructions for the order in which to upgrade services as part of a full upgrade.

This post looks at how to package jars into an uber-jar, package a third-party library that isn't available via maven central, and use a jar with the Spark shell.
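As a rough command-line sketch of those three steps (all project names, coordinates, and paths below are illustrative, and the uber-jar step assumes the sbt-assembly plugin is configured):

```shell
# Install a third-party jar that isn't in Maven Central into the local repo
mvn install:install-file -Dfile=libs/vendor-lib-1.0.jar \
  -DgroupId=com.example -DartifactId=vendor-lib -Dversion=1.0 -Dpackaging=jar

# Package the application and its dependencies into an uber-jar
sbt assembly

# Make the jar available on the Spark shell's classpath
spark-shell --jars target/scala-2.10/myapp-assembly-1.0.jar
```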

This post starts out with a story that's all too familiar for many people working with Hadoop—you have a seemingly simple query, but you spend a lot of time finding the right data to query. One solution to this problem is to keep every dataset in Hive and to use comments to describe the dataset. Then, Apache Falcon provides a nice interface to view and search datasets in Hive (in addition to several other features, which the article describes).

Hortonworks has a recap of talks at the recent Apache Slider meetup. There was a talk on running dockerized applications on YARN and another on KOYA (Kafka on YARN). The post also has links to the presenter slides.

This post describes how to convert data from Avro to Parquet. The instructions utilize a simple tool which runs a map-only job to do the conversion.

While MongoDB has a built-in MapReduce framework, there are often advantages to processing data outside of Mongo. To that end, this post gives an introduction on how to integrate MongoDB with Spark using the Hadoop input format for Mongo.

The LinkedIn Site Reliability team has pulled back the curtain to reveal a lot about how LinkedIn uses Apache Kafka. Topics covered include scale (175 terabytes/day), the types of applications (queueing, logging, metrics, and more), their multi-datacenter setup, and integration into the application stack.

Apache Tajo version 0.10 was released last week, and this tutorial provides all the instructions needed to get started with Tajo on an Amazon Elastic MapReduce cluster. With a Tajo bootstrap action specified for the cluster, data is stored in HDFS by default. If you want to integrate directly with S3, the post describes the additional configuration required to do so.
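Launching an EMR cluster with a bootstrap action looks roughly like the following (the cluster parameters and the S3 path to the Tajo install script are placeholders, not the actual values from the tutorial):

```shell
# Sketch of launching an EMR cluster with a bootstrap action via the AWS CLI;
# substitute the Tajo bootstrap script location given in the tutorial.
aws emr create-cluster \
  --name "tajo-cluster" \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --bootstrap-actions Path=s3://YOUR-BUCKET/path/to/install-tajo.sh
```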

MapR has posted a new whiteboard walkthrough, which compares and contrasts Hadoop with NoSQL systems. In addition to a short video, the transcript of the presentation is available on the MapR blog. It covers the strengths of Hadoop vs. NoSQL and when each one is appropriate.


Tachyon Nexus is a new company from the folks at UC Berkeley's AMPLab behind Tachyon, the memory-centric distributed storage system. This week, they announced a Series A round of $7.5 million, led by Andreessen Horowitz.

InfoWorld has an article that recounts some of the themes of Cloudera's analyst day, which took place earlier this week. These include Cloudera's goal of being "the big data company," revenue (and how it relates to customers using Cloudera's free software), and competition with Hortonworks and the Open Data Platform.

A post on the Enterprise Software Musing blog also reports on Cloudera's analyst day. This post is more focused on the specifics of Cloudera's business—the scale of their traction (adding on average two new employees and two new partners each day), their plans to expand into new verticals like financial services and telcos, and the importance of partners.


MapR has announced support in their distribution for new versions of Hue, Oozie, Spark and Pig. The announcement has more details on the features.

Databricks Cloud added the ability to schedule periodic execution of Spark applications and Databricks notebooks. The new feature, called Jobs, is aimed at running production workloads.

Hortonworks has announced general availability of the Hortonworks sandbox on Microsoft Azure. A post on their blog has a walkthrough of how to get started.

Hadoop-as-a-Service vendor Qubole has announced a new connector between their platform and Amazon Redshift. The integration provides the ability to save the output of queries run in Spark and Hive to a table in Redshift.

Mesosphere has announced a new open-source project to run HDFS on Mesos. When running HDFS via the system, all Datanodes, NameNodes, and Quorum JournalNodes are launched automatically. Enabling "Super High Availability" allows the system to automatically re-provision NameNodes.

The sqlstream project provides an integration between MySQL replication and Apache Kafka. Replication events are translated to JSON and sent to a Kafka topic. The README shows some examples of the types of events one can expect.


Curated by Datadog



Putting Apache Kafka to Use: Building Real-Time Data Platform for Event Streams (San Jose) - Monday, March 23

Getting Started with Spark & Cassandra (Santa Monica) - Tuesday, March 24

SQL in Hadoop with Actian Vortex (San Jose) - Tuesday, March 24

Update on Complex Types, Contributing to Impala, and an atScale Demo (Palo Alto) - Tuesday, March 24

Spark Data Sources: Overview of API & HBase Data Source from Huawei (Santa Clara) - Wednesday, March 25

Taking the "Oops" out of Hadoop (Santa Clara) - Thursday, March 26


Moneyball & Spark (Seattle) - Wednesday, March 25


Learn about Apache Mesos (Houston) - Wednesday, March 25


What's the Open Data Platform? (Overland Park) - Thursday, March 26


St. Louis Hadoop Users Group Meetup (Saint Louis) - Tuesday, March 24


Introduction to Apache Kafka (Alpharetta) - Thursday, March 26


Apache Drill (St Petersburg) - Wednesday, March 25


"The Future of Big Data" with M.C. Srivas, CTO and Co-founder of MapR (Pittsburgh) - Tuesday, March 24


What is Lambda? And Securing Hadoop (Boston) - Tuesday, March 24

Apache Ignite: Introducing the Future of Fast Data (Cambridge) - Wednesday, March 25


Apache Kafka, from High Level to Deep Dive (Ottawa) - Wednesday, March 25

Distributed Scala: Easy Scalability with Akka and Spark in Action (Toronto) - Wednesday, March 25

Apache Ignite: Introducing the Future of Fast Data (Toronto) - Thursday, March 26


Introducing Myriad, a Mesos Framework for Dynamically Scaling Hadoop Workloads (London) - Wednesday, March 25


Large-scale Machine Learning with Spark and Flink (Stockholm) - Monday, March 23

Apache Spark Show and Tell (Stockholm) - Thursday, March 26


Anatomy of the RDD: A Deeper Dive into RDD Abstraction (Bangalore) - Saturday, March 28

Baby Hadoop Meet-Up (Bangalore) - Saturday, March 28


Spark Meetup 3 (Hangzhou) - Sunday, March 29