Data Eng Weekly

Hadoop Weekly Issue #191

23 October 2016

This week's issue is short and sweet with a few technical posts, two interesting news articles, and several exciting releases (including Apache Kafka). With Spark Summit Europe this week, expect lots of great content in the next issue. And if you're attending, please send interesting slides/talks my way!


Cloudera's CDH has supported intra-node disk balancing since version 5.8.2 (it's also part of the 3.0.0 alpha Apache release). Using this feature, a data node can rebalance data blocks across its disks using the hdfs diskbalancer command. This post describes how the tool works and shows how to run it.
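To make the idea of intra-node balancing concrete, here is a minimal sketch in Python. It is a toy illustration, not the algorithm the hdfs diskbalancer tool uses: the `Disk` class, the `plan_moves` function, and the 10% threshold are all hypothetical names and values chosen for this example. It computes the node-wide used ratio and greedily pairs over-full disks with under-full ones.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    """Hypothetical model of one disk on a data node (sizes in GB)."""
    name: str
    capacity_gb: float
    used_gb: float

    @property
    def used_ratio(self) -> float:
        return self.used_gb / self.capacity_gb


def plan_moves(disks, threshold=0.10):
    """Return (source, destination, gigabytes) moves that even out usage.

    If every disk is already within `threshold` of the node-wide used
    ratio, no moves are planned.
    """
    ideal = sum(d.used_gb for d in disks) / sum(d.capacity_gb for d in disks)
    if all(abs(d.used_ratio - ideal) <= threshold for d in disks):
        return []

    # Positive surplus: the disk holds more than its fair share.
    surplus = {d.name: d.used_gb - ideal * d.capacity_gb for d in disks}
    donors = sorted(([n, s] for n, s in surplus.items() if s > 0),
                    key=lambda item: -item[1])
    receivers = sorted(([n, -s] for n, s in surplus.items() if s < 0),
                       key=lambda item: -item[1])

    moves = []
    i = j = 0
    while i < len(donors) and j < len(receivers):
        amount = min(donors[i][1], receivers[j][1])
        moves.append((donors[i][0], receivers[j][0], round(amount, 2)))
        donors[i][1] -= amount
        receivers[j][1] -= amount
        if donors[i][1] <= 1e-9:
            i += 1
        if receivers[j][1] <= 1e-9:
            j += 1
    return moves


node = [
    Disk("disk1", capacity_gb=1000, used_gb=900),
    Disk("disk2", capacity_gb=1000, used_gb=100),
    Disk("disk3", capacity_gb=1000, used_gb=500),
]
moves = plan_moves(node)
for source, destination, gigabytes in moves:
    print(f"move {gigabytes} GB from {source} to {destination}")
```

A real balancer also has to respect block boundaries, bandwidth limits, and concurrent writes, none of which this sketch attempts.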

This post demonstrates the capabilities of the library by building a logistic regression model to predict malignancy of cases from the Wisconsin Diagnostic Breast Cancer data set. The example code covers parsing, exploring a dataset with built-in statistics, extracting features from the input dataset, training the model, and evaluating the model.
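The train-then-evaluate part of that workflow can be sketched in a few lines. The example below is an assumption-laden stand-in, not the code from the post: it uses plain Python instead of the library the post covers, and seeded synthetic two-feature data instead of the Wisconsin data set. The helper names (`make_synthetic`, `train`, `predict`, `accuracy`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic(n, seed=0):
    """Generate n labeled points: class 1 near (2, 2), class 0 near (-2, -2)."""
    rng = random.Random(seed)
    rows, labels = [], []
    for i in range(n):
        label = i % 2
        center = 2.0 if label else -2.0
        rows.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        labels.append(label)
    return rows, labels


def train(rows, labels, epochs=200, lr=0.1):
    """Fit weights and bias with batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = sigmoid(score) - y
            for k, xi in enumerate(x):
                grad_w[k] += err * xi
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


def accuracy(weights, bias, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(weights, bias, x) == y for x, y in zip(rows, labels))
    return hits / len(rows)


# Hold out the last 50 points for evaluation.
rows, labels = make_synthetic(200)
train_rows, test_rows = rows[:150], rows[150:]
train_labels, test_labels = labels[:150], labels[150:]

weights, bias = train(train_rows, train_labels)
test_accuracy = accuracy(weights, bias, test_rows, test_labels)
print(f"held-out accuracy: {test_accuracy:.2f}")
```

On real data, the parsing, exploration, and feature-extraction steps the post walks through would come before the call to `train`.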

The Amazon Big Data blog has a tutorial for running RStudio with sparklyr on EMR. Thanks to a bootstrap action, a cluster complete with RStudio running on the master can be launched with a single command.

The Databricks blog features a list of seven tips for debugging Apache Spark code on Databricks. Most of the suggestions, like "Scale up Spark jobs slowly for really large datasets" and "Examine the partitioning for your dataset," are generally applicable to all Spark users.


InfoQ has an interview with Yahoo VP of Engineering, Peter Cnudde. Topics covered include Hadoop, Spark adoption at Yahoo (mostly for in-memory computing, not for ETL), and Caffe-on-Spark for deep learning.

ZDNet contributor Tony Baer has read between the lines when it comes to recent benchmarks by Cloudera and Hortonworks. The takeaways are as follows: 1) "SQL's the gateway drug to Hadoop," 2) Cloudera is trying to challenge Amazon (in this case Redshift), and 3) Hortonworks (via Hive's LLAP, short for Live Long and Process) has caught up on the investment Cloudera made in Impala.


A new version of Apache Kafka was released this week. It contains improvements from over 500 pull requests and the implementation of 15 Kafka Improvement Proposals. The Confluent blog has the highlights of additions/improvements to Kafka Server (time-based indexes, replication quotas, and improved log compaction), improvements to Kafka client APIs (interactive queries for Kafka Streams, improved memory management, secure quotas, and more), and bug fixes.

Apache Fluo (incubating) recently had its first release since entering the incubator. Fluo is a tool for making "incremental updates to large data sets stored in Apache Accumulo" a la Google's Percolator.

Apache Flume 1.7.0 was released. It adds support for a taildir source and includes a number of improvements and bug fixes. Many of these are around Flume's integration with Apache Kafka.

Apache NiFi 0.7.1 was released as a follow-up to July's 0.7.0 release (version 1.0.0 was also released recently, in August). This release adds a number of improvements and bug fixes.

Apache Giraph 1.2.0 was released. Highlights of the release include a new blocks API, support for graphs that don't fit in memory, and the addition of a new set of default configuration options based on Facebook's experience with Giraph.

deeplearning4j is a deep learning implementation that integrates with Hadoop and Spark and supports GPUs. Version 0.6.0 was recently released.


Curated by Datadog



Uber Engineering Tech Talk Series (San Francisco) - Monday, October 24

Real-Time Streaming and Exactly-Once Semantics with Kafka (San Francisco) - Tuesday, October 25

Building Your First Spark & C* App + SMACK Stack + The Cassandra Odyssey (San Francisco) - Wednesday, October 26

Apache YARN Committers/Contributors Meetup #4 (Sunnyvale) - Thursday, October 27


Kafka Palooza: LinkedIn, Microsoft Azure, MapR (Bellevue) - Monday, October 24


PixieDust: Making Python Visualizations Easier for Jupyter Notebooks with Spark (Las Vegas) - Monday, October 24


O&G Big Data Use Cases, by Hortonworks (Houston) - Thursday, October 27


Using Data Quality to Support Analytics in Hadoop (Overland Park) - Tuesday, October 25


Using Data Quality to Support Analytics in Hadoop (Kansas City) - Tuesday, October 25


Big Data Streaming Platform Ecosystem (Chicago) - Tuesday, October 25

Apache Spark 101 (Chicago) - Tuesday, October 25


October Edition of MOHUG (Dublin) - Tuesday, October 25


Apache Spark (Miami) - Wednesday, October 26

New York

Lambda-in-a-Box: Merging Apache Spark & HBase into an Open-Source Database (New York) - Thursday, October 27

October Data Engineering Meetup (New York) - Thursday, October 27


Toronto Apache Spark #14 (Toronto) - Wednesday, October 26

Introduction to MapR (Toronto) - Thursday, October 27


Why SMACK for Fast Data (London) - Monday, October 24

Building Scalable Systems in a Changing Data Landscape (London) - Tuesday, October 25

Spark Structured Streaming in Practice (London) - Wednesday, October 26


Season Premiere with Reynold Xin, Co-Founder & Chief Architect at Databricks (Barcelona) - Thursday, October 27

Introduction to Kafka (Malaga) - Friday, October 28


Spark Pre-Summit Meetup (Brussels) - Tuesday, October 25

Meeting on Streamsets, Datameer and Kudu (Kontich) - Tuesday, October 25

Spark & Machine Learning Meetup (Brussels) - Thursday, October 27


Introduction to Spark & Use Cases (Hyderabad) - Monday, October 24


Rethink SQL for Big Data with Apache Drill (Barton) - Tuesday, October 25

Spark Meetup October (Sydney) - Wednesday, October 26

Rethink SQL for Big Data with Apache Drill (Melbourne) - Thursday, October 27


Big Data: Spark and TensorFlow (Tallinn) - Monday, October 24