Data Eng Weekly

Hadoop Weekly Issue #102

04 January 2015

The first issue of 2015 starts off the year relatively quiet, although there are a number of good technical and industry news articles. Technical articles cover the Kite SDK, Kafka, HBase, Hive, and Sqoop; news articles include a number of year-end synopses. Based on these, it looks like 2015 will be an interesting year for the Hadoop ecosystem.


This post describes how to use the Kite SDK’s command-line tools to ingest data into HDFS. Using the kite-dataset command, the author sets up the schema for datasets, creates tables in Hive, and loads the datasets. It also describes how to use the tools to build Parquet files from CSV.

Kafka has a built-in tool called MirrorMaker for replicating data between Kafka clusters. This post looks at an alternative implementation written in Go that supports similar functionality. The tool, Go Kafka Mirror Maker, also supports a few additional features, like adding a topic prefix to avoid collisions (and it doesn’t run on the JVM, so it has less overhead).
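To illustrate the topic-prefix idea, here is a minimal Python sketch. It is not taken from Go Kafka Mirror Maker (which is written in Go); the function names, the `dc1_` prefix, and the in-memory message list are all hypothetical stand-ins for a real consumer and producer.

```python
def mirrored_topic(source_topic: str, prefix: str = "") -> str:
    """Return the destination topic name for a mirrored source topic."""
    return f"{prefix}{source_topic}"


def mirror(messages, prefix):
    """Yield (destination_topic, payload) for each (topic, payload) pair.

    `messages` stands in for records read from the source cluster; a real
    mirror would consume from one cluster and produce to another.
    """
    for topic, payload in messages:
        yield mirrored_topic(topic, prefix), payload


# Hypothetical records from a source cluster labeled "dc1".
source = [("clicks", b"a"), ("orders", b"b")]
mirrored = list(mirror(source, prefix="dc1_"))
```

With a prefix, a `clicks` topic mirrored from one cluster can't collide with a `clicks` topic that already exists on the destination.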

This post provides a thorough overview of HBase Coprocessors, which (as the article describes) are analogous to triggers and stored procedures, MapReduce, and aspect-oriented programming. The post describes the main interfaces used to implement the two categories of coprocessors, complete with a walkthrough of a coprocessor implementation. It also details the steps needed to deploy a coprocessor.

Recent releases of Apache Hive have a number of new optimizations, which must be enabled (possibly by reprocessing data). This post describes several of those features (Tez, cost-based optimization, vectorization) and provides instructions for enabling them.
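As a rough sketch of what enabling these features looks like, the Python snippet below builds the session-level `SET` statements for the three optimizations named above. The property names are standard Hive configuration keys, but the exact settings recommended in the post may differ, and the helper function is hypothetical. Vectorized execution in Hive releases of this period generally required ORC-formatted data, which is likely where reprocessing comes in.

```python
# Session-level Hive properties for the three features mentioned above.
# Assumption: a Hive release recent enough to support Tez and CBO.
HIVE_OPTIMIZATIONS = {
    "hive.execution.engine": "tez",               # run queries on Tez
    "hive.cbo.enable": "true",                    # cost-based optimization
    "hive.vectorized.execution.enabled": "true",  # vectorized execution
}


def session_settings(options):
    """Render a dict of Hive properties as SET statements."""
    return [f"SET {key}={value};" for key, value in options.items()]


statements = session_settings(HIVE_OPTIMIZATIONS)
print("\n".join(statements))
```

The resulting statements can be pasted at the top of a Hive script or run in an interactive session before the queries they should affect.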

This two-part blog series looks at several SQL-on-Hadoop engines. The first post looks at various storage backends (HBase, HDFS, etc.) and file formats, while the second describes several major systems (Hive, Impala, Presto, Drill, and Spark SQL). There’s also a discussion of whether these systems are appropriate for OLTP and OLAP.

Security has been a hot topic for the Hadoop ecosystem recently, and most systems are adopting or improving their enterprise security features. This post describes a new security feature in Sqoop2: support for Kerberos. It walks through the steps necessary to enable security as part of the Sqoop2 server.


This post presents a bullish view on the future of big data. The author argues that big data provides a previously unavailable mechanism for measuring and analyzing systems. Referencing success stories from the energy and agriculture sectors, the post describes some of the new capabilities that big data offers.

This year-end retrospective piece looks at some of the advances in machine learning tools for big data in 2014 (Spark 1.0, H2O, GraphLab) and makes some predictions for the industry in 2015.

Based on data from the recent Hortonworks IPO and reports from Wikibon and Forrester, this post speculates on the future of the Hadoop industry. The article points out some bearish observations, such as the scarcity of successful companies built on open source and the fact that Google has moved away from MapReduce.

We’ve seen quite a few end-of-year posts, and Qubole’s list of Hadoop Happenings this week includes several articles that touch on this theme.


MemSQL has open-sourced a new tool for loading data from S3 and HDFS to MemSQL and MySQL. The system is somewhat similar to Sqoop, but provides additional features like deduplication and failure handling.


Curated by Mortar Data



Docker: New Approaches to Software Development (San Ramon) - Tuesday, January 6

A Real-Time Streaming Implementation of Markov Chain–Based Fraud Detection (Newport Beach) - Thursday, January 8

Hadoop Framework and Tools, Spark Intro, plus Data Science with R and Python (Fremont) - Thursday, January 8

How We Used Storm Wrong, Then Right, at AdRoll (Emeryville) - Thursday, January 8


Hadoop Past, Present and Future, plus NoSQL Data Modeling and Couchbase Mobile (Las Vegas) - Wednesday, January 7

New York

BDW Meetup: Spark SQL (New York) - Wednesday, January 7


RBelgium #9, Including “RHadoop, Using R in Hadoop” (Brussels) - Friday, January 9


Spark Meetup (Shanghai) - Saturday, January 10


Streaming and Apache Spark (Bangalore) - Saturday, January 10