Data Eng Weekly

Hadoop Weekly Issue #178

10 July 2016

This week's issue covers quite the medley of big data topics: Apache Kafka, tips for migration from Apache Pig to Apache Spark, SparkR, a tour of ten Apache stream processing projects, Apache Kudu (incubating), and more. There are also several releases this week, including new open source tools from Yahoo (a distributed optimization problem solver for Spark) and LinkedIn (tools for Apache Kafka). And if you missed the recent Hadoop Summit, videos of presentations have been published to YouTube.


Confluent has an overview of their Control Center project, which provides web-based configuration for Kafka Connect and a Kafka monitoring tool to ensure that data from producers reaches all consumers. The Control Center is a paid add-on for Apache Kafka, and it is included as part of Confluent Platform Enterprise.

This post introduces Apache Spark to Apache Pig developers. It describes some of the trade-offs between the two, and it has line-by-line examples written in both Pig and PySpark. The examples demonstrate some advanced topics like full outer joins, Spark SQL (imposing a schema on an RDD), integration with the Hive metastore, user-defined functions, map-side joins, and window functions. Even if you're not a Pig developer, this is a great introduction to PySpark.

The Confluent Log Compaction post has highlights of in-progress improvements to Apache Kafka and links to several recent Kafka-related presentations/blog posts. New features targeted for the next release include a new time-based index and improved timeout handling. Work on adding new security features and improving Kafka Streams is also underway.

Amazon S3 popularized object storage. By offering a subset of features of a normal file system, it has enormous scale and provides very high availability/durability. This post looks at why object storage is blossoming outside of AWS, and highlights three popular open-source systems.

WePay has written about their analytics pipeline built on Google BigQuery and Apache Airflow (incubating). Airflow powers ETL from MySQL to BigQuery as well as production analytics.

The Databricks blog has a recap of a recent tutorial on SparkR. The tutorial covered data exploration and advanced analytics. Both parts of the tutorial are available as notebooks with inline descriptions of the various calculations performed.

Amazon has written about how they implement personalized recommendations using Spark and the Deep Scalable Sparse Tensor Neural Engine (DSSTNE). Spark is the driver, which runs within Amazon EMR and synchronizes data to S3. From S3, the data is processed by DSSTNE on GPU nodes managed by Amazon Elastic Container Service and Auto Scaling Group. DSSTNE is open-source, and there's a walkthrough of using a similar setup with the MovieLens dataset via an Apache Zeppelin notebook.

Salesforce recently open sourced Runway, which is a tool for modeling and simulating distributed systems. This code introduces a Runway model for the popular Apache BookKeeper project.

Whether you're relatively new to the big data ecosystem or not, it can be really difficult to keep track of all the relevant Apache projects. When it comes to scalable stream processing, there are ten different ASF projects. The New Stack has an overview and sample use case for each (Flume, Flink, Beam, Apex, Ignite, Kafka Streams, Nifi, Samza, Spark, and Storm).

AgilData has a post that enumerates several of Apache Kudu's (incubating) differentiators, from being based on the Raft distributed consensus algorithm to baked-in optimizations for SSDs to first-class support for SQL. The post also argues that although Kudu is in beta, it's ready now for production use cases.


Videos of presentations from the recent Hadoop Summit have been posted on YouTube. There are over 150 presentations and keynotes.

The Syncsort blog has a two-part interview with Holden Karau, Spark committer and author of O'Reilly books. The discussion covers data hubs, data formats, mainframes, Spark 2.0, machine learning, and more.

Talend, the big data integration company, has filed for an IPO to raise $86.25 million.

A few years ago, it was generally accepted that Hadoop ran best on bare metal in the data center. Advances in virtualization, improvements to big data software for cloud storage, and cost reductions have recently made the cloud much more practical. Qubole has an overview of some of the reasons why you might want to build your big data system in the cloud.


The MongoDB Connector for Apache Spark is now generally available. The connector has many advanced features, and the announcement includes a tutorial illustrating several of these.

Version 2.4.0 of Spring for Apache Hadoop was released. This release adds support for Hortonworks HDP 2.4, for YARN resource labels, and more.

Yahoo has open sourced SparkADMM, a system for solving optimization problems using the Alternating Directions Method of Multipliers.
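For readers unfamiliar with ADMM, here is a minimal single-machine sketch of the consensus form of the method. It is not SparkADMM's API or Yahoo's implementation; the function name `consensus_admm`, the toy objective (each worker i holds a private cost 0.5 * (x - a_i)^2), and the parameter defaults are all assumptions chosen for illustration. The workers must agree on one shared value, and the minimizer of the summed costs is the mean of the a_i.

```python
def consensus_admm(targets, rho=1.0, iterations=100):
    """Toy consensus ADMM: minimize sum_i 0.5 * (x - a_i)^2 over a shared x.

    Hypothetical illustration only. Each "worker" i keeps a local copy x[i]
    and a scaled dual variable u[i]; z is the shared consensus value.
    """
    n = len(targets)
    x = [0.0] * n   # local copies, one per worker
    u = [0.0] * n   # scaled dual variables, one per worker
    z = 0.0         # consensus variable

    for _ in range(iterations):
        # Local step (parallelizable): closed-form minimizer of
        # 0.5 * (x_i - a_i)^2 + (rho / 2) * (x_i - z + u_i)^2
        x = [(a + rho * (z - ui)) / (1.0 + rho) for a, ui in zip(targets, u)]
        # Consensus step: average the local estimates plus duals
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: accumulate each worker's disagreement with the consensus
        u = [ui + xi - z for ui, xi in zip(u, x)]
    return z


if __name__ == "__main__":
    print(consensus_admm([1.0, 2.0, 6.0]))
```

The local step is the part a distributed system would run in parallel across workers; only the averaging step needs communication.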

Cloudera released version 2.1 of Cloudera Director, their tool for running Hadoop clusters in cloud environments. This new version adds support for Microsoft Azure, cross-region and cross-cloud deployments, and usage-based billing.

Oracle released a new version of Oracle GoldenGate for Big Data. GoldenGate is a tool for mirroring an Oracle database to Apache Kafka.

LinkedIn has open sourced their kafka-assigner tool for managing cluster partitions (removing brokers, rebalancing partitions).
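To make "removing a broker" concrete, the sketch below shows one simple way a partition-to-replica assignment could be rewritten so that a retiring broker no longer holds any replicas. This is not LinkedIn's algorithm or the tool's interface; the function `remove_broker`, the dictionary layout, and the least-loaded placement rule are hypothetical choices for illustration.

```python
from collections import Counter


def remove_broker(assignment, broker_to_remove, brokers):
    """Return a new assignment with broker_to_remove replaced everywhere.

    assignment: dict mapping partition id -> list of broker ids (replicas).
    brokers: all broker ids in the cluster, including the one being removed.
    Hypothetical rule: each displaced replica goes to the least-loaded
    remaining broker that does not already hold that partition.
    """
    remaining = [b for b in brokers if b != broker_to_remove]

    # Count replicas currently held by each remaining broker.
    load = Counter({b: 0 for b in remaining})
    for replicas in assignment.values():
        for b in replicas:
            if b != broker_to_remove:
                load[b] += 1

    new_assignment = {}
    for partition, replicas in sorted(assignment.items()):
        new_replicas = list(replicas)
        if broker_to_remove in new_replicas:
            candidates = [b for b in remaining if b not in new_replicas]
            if not candidates:
                raise ValueError("no broker available for partition %r" % partition)
            # Tie-break on broker id so the result is deterministic.
            target = min(candidates, key=lambda b: (load[b], b))
            new_replicas[new_replicas.index(broker_to_remove)] = target
            load[target] += 1
        new_assignment[partition] = new_replicas
    return new_assignment


if __name__ == "__main__":
    current = {0: [1, 2], 1: [2, 3], 2: [3, 1]}
    print(remove_broker(current, broker_to_remove=3, brokers=[1, 2, 3, 4]))
```

A real tool also has to account for leader placement, rack awareness, and the cost of moving data, none of which this sketch models.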

Version 2.2.0 of Luigi, the open-source workflow tool for Hadoop and other data systems, was released. This release has many bug fixes, improvements to integration with AWS, Salesforce, and FTP, support for MSSQL, the ability to print the dependency tree as ASCII art, and much more.


Curated by Datadog



Dr. Elephant: Pills for Your Problematic Hadoop Jobs (San Francisco) - Tuesday, July 12

Building/Running Netflix's Data Pipeline Using Apache Kafka (Redwood City) - Wednesday, July 13

Building a Machine Learning Pipeline Using Aerosolve (San Francisco) - Wednesday, July 13

Expert Panel on Streaming Analytics Technologies (San Francisco) - Thursday, July 14

Recent Developments in SparkR for Advanced Analytics (Sunnyvale) - Friday, July 15


Apache Kafka at Ebay and Salesforce (Bellevue) - Tuesday, July 12


Kafka-Streams Talk! (Austin) - Thursday, July 14


Data Gymnastics: Using HPCC Systems for Processing Big Data (Alpharetta) - Tuesday, July 12

New Jersey

Apache NiFi (Princeton) - Thursday, July 14

New York

Streaming and Akka Persistence + Cassandra Availability Management (New York) - Tuesday, July 12

Robust Stream Processing with Apache Flink (New York) - Wednesday, July 13


Introduction to Spark (Montreal) - Wednesday, July 13

IRELAND

Apache Flink... Don’t Cross the Streams! Modern Data Science Workflows (Dublin) - Monday, July 11


Join Us to Learn about Apache Ignite (London) - Wednesday, July 13


Data-Intensive Recommenders and Machine Learning Applications in Spark & Flink (Milan) - Wednesday, July 13


Understanding and Building Big Data Architectures, Part 3: Messaging/Kafka (Hyderabad) - Saturday, July 16


Apache Spark Meetup (Wellington) - Wednesday, July 13