Data Eng Weekly

Hadoop Weekly Issue #174

12 June 2016

Spark Summit was this week in San Francisco, and (as expected) this week's issue has a bunch of Apache Spark-related news, announcements, and releases. In addition to Spark coverage, this issue features articles on Kafka, Cask, Ambari, and more. Notable releases include the first release of Apache Pig in almost exactly a year, a neat new tool for distributed system design called Runway, and a new version of Apache Kudu (incubating).


Debezium is a relatively new project that captures row-level changes to a database and publishes them to an Apache Kafka topic. It currently supports MySQL, and there's a new tutorial for configuring ZooKeeper, Kafka, MySQL, and more using Docker and Kubernetes.
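To make "row-level changes" concrete, here is a minimal Python sketch of a consumer that replays change events to rebuild a copy of a table. The event shape (an `op` code plus `before` and `after` row images) and the `apply_change` helper are assumptions for illustration, not Debezium's documented schema or API.

```python
# Hypothetical change events, one per row-level change.
# The field names ("op", "before", "after") are illustrative assumptions.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "email": "a@example.com"}},
    {"op": "c", "before": None, "after": {"id": 2, "email": "b@example.com"}},
    {"op": "u", "before": {"id": 1, "email": "a@example.com"},
     "after": {"id": 1, "email": "a2@example.com"}},
    {"op": "d", "before": {"id": 2, "email": "b@example.com"}, "after": None},
]


def apply_change(table, event, key="id"):
    """Apply one create/update/delete event to an in-memory copy of the table."""
    if event["op"] in ("c", "u"):
        row = event["after"]
        table[row[key]] = row
    elif event["op"] == "d":
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unknown op: {event['op']!r}")
    return table


replica = {}
for event in events:
    apply_change(replica, event)

print(replica)
```

Replaying the events in order leaves only row 1, with its updated email address, which is why per-key ordering matters for a log of this kind.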

Some folks were surprised when the Apache Kafka project announced yet another stream processing engine, Kafka Streams. Kafka Streams has some key differentiators vs. other systems, though. This post gives a good overview of these differentiators—such as the abstractions, deployment model, and support for state-based calculations.

Everyone who has ever worked with MapReduce, Spark, or similar systems has run into difficult-to-debug, data-specific bugs. BigDebug is a research project/paper out of UCLA that aims to give developers the kinds of tools available for single-machine programs: identifying the input records that cause a crash, tracing, breakpoints, watchpoints, latency alerts, and more. The tool is available for download for Apache Spark 1.2.1.

Cask has written about running Spark inside of the open-source Cask Data Application Platform (CDAP). Spark programs running in CDAP have access to Apache Tephra (incubating) for fine-grained transaction support. This makes it easy to make a consistent copy of data from one table to another by leveraging snapshot isolation. Spark in CDAP also has access to Cask Tracker, which provides data lineage information (when it was created, used, etc.). Depending on your application, the CDAP tools might add a lot of value.

This tutorial from the IBM Hadoop Dev blog walks through using the Ambari REST APIs from cURL. There are examples of establishing a session with both vanilla and Kerberos-enabled clusters, and of reusing the session for future queries.
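The same session-reuse idea can be sketched in Python with only the standard library. This is not taken from the tutorial: the server URL and the admin credentials are placeholder assumptions, the helper names are hypothetical, and Kerberos-enabled clusters (which need SPNEGO negotiation) are not covered.

```python
import base64
import http.cookiejar
import json
import urllib.request

# Assumed values for illustration only; substitute your own server.
AMBARI_URL = "http://ambari.example.com:8080"


def make_session():
    """Return an opener that stores cookies, so later calls reuse the session."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login_request(base_url, user, password):
    """Build a GET for the cluster list that authenticates with HTTP Basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(f"{base_url}/api/v1/clusters")
    request.add_header("Authorization", f"Basic {token}")
    return request


def list_clusters(opener, request):
    """Send the request; any session cookie the server sets lands in the jar."""
    with opener.open(request) as response:
        return json.load(response)


opener, jar = make_session()
request = login_request(AMBARI_URL, "admin", "admin")
# list_clusters(opener, request) would perform the call against a live server.
# Later requests sent through the same opener carry the stored cookie.
```

The cookie jar plays the role of cURL's cookie file: authenticate once, then send follow-up queries through the same opener.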

The Google Cloud Platform blog has a post about debugging an Apache Beam (incubating) job running on the Google Cloud Dataflow backend. To help track down a bottleneck, Dataflow provides some really useful statistics and a UI that let you dig into each step of the pipeline.


The Transaction Processing Performance Council (TPC) announced the TPCx-BB benchmark, which is designed for big data systems. In addition to SQL, the benchmark includes support for machine learning clustering and classification problems.

Strata + Hadoop World London was about 2 weeks ago. Videos of the keynotes and slides from many of the presentations have been posted on the conference website.

Splice Machine, makers of the RDBMS built on Hadoop, have announced that they're open sourcing their software. They're currently looking for contributors/mentors/champions to help in the open sourcing effort. Splice Machine has a number of interesting features, such as ACID transactions, secondary indexing, and referential integrity.

The Altiscale blog has compiled a number of big data use-case articles describing applications in sentiment analysis in customer service, climate change, smart cities, bias, and more. The collection also covers some articles by big data skeptics.

Spark Summit took place this past week in San Francisco. Databricks, the conference organizer, has posted a recap of the two-day event with links to various presentations and keynotes.

Qubole, a big-data-as-a-service company, has written about Spark adoption among its customers. Adoption has been fast: over half of their customers now use Spark. Qubole also supports Presto, and they've seen similar growth in adoption of that tool.

Twitter has submitted DistributedLog, their replicated log service, to the Apache Incubator.

Big Data Day LA takes place at West Los Angeles College on July 9th. The event is free (with up-front registration) and features speakers from Confluent, Databricks, Yahoo, Netflix, and more.


Apache Spark recently released a preview of Spark 2.0. The release announcement notes that the API and functionality are not considered final.

JustOne has built and open-sourced a Kafka-to-PostgreSQL connector. This post introduces the connector, describes the performance, details how messages are converted to rows, describes the configuration settings, and more.

Salesforce has open sourced Runway, which is a tool for modeling, simulating, and visualizing distributed systems. There's a live demo at runway.systems with examples of "too many bananas," elevators, and the Raft consensus algorithm.

Bloomberg recently open-sourced Presto Accumulo, a Presto connector for Apache Accumulo. In the announcement, there's a link to an 11-page paper which compares the Presto-based queries to those written with the Accumulo Java APIs and provides some benchmark results.

Microsoft Azure has announced the general availability of Apache Spark 1.6.1 for Azure HDInsight. The release contains support for the Project Livy REST job service for Spark, integration with Azure Data Lake Store (and its role-based access controls), integration with IntelliJ, support for Jupyter notebooks, and more.

LinkedIn has open sourced Photon ML, their library for large-scale regression. Photon is built on Spark and is run on YARN at LinkedIn (it used to run on MapReduce and has seen major speed improvements since migrating).

Hortonworks has announced a technical preview of the Spark-HBase connector which they developed in conjunction with Bloomberg. The connector features native Avro support, support for running in a secure cluster, native Spark Datasource APIs, and optimizations such as partition pruning, column pruning, and predicate pushdown.
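For readers new to these optimizations, here is a toy Python sketch of what partition pruning, predicate pushdown, and column pruning mean at the data source. It is a generic illustration, not the connector's implementation; the `PARTITIONS` table and the `scan` function are made up for the example.

```python
# Toy table split into partitions by date; each partition holds full rows.
PARTITIONS = {
    "2016-06-10": [
        {"user": "ann", "clicks": 3, "country": "US"},
        {"user": "bob", "clicks": 9, "country": "DE"},
    ],
    "2016-06-11": [
        {"user": "cat", "clicks": 7, "country": "US"},
    ],
}


def scan(columns, partition_filter, row_filter):
    """Read only the matching partitions, rows, and columns at the source."""
    rows_read = 0
    result = []
    for partition, rows in PARTITIONS.items():
        if not partition_filter(partition):  # partition pruning
            continue
        for row in rows:
            rows_read += 1
            if row_filter(row):  # predicate pushdown
                # column pruning: keep only the requested fields
                result.append({c: row[c] for c in columns})
    return result, rows_read


result, rows_read = scan(
    columns=["user"],
    partition_filter=lambda day: day == "2016-06-10",
    row_filter=lambda row: row["clicks"] > 5,
)
print(result, rows_read)
```

Only the two rows in the selected partition are touched, and a single one-column row is returned, so far less data travels from the store to the query engine.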

Databricks has announced the first phase of security features for their Apache Spark platform. This phase adds cluster ACLs, SAML 2.0 support, and end-to-end audit logs.

Apache ORC 1.1.0 was released this week. This release completes the migration of Java code out of the Apache Hive code base, fixes the C++ reader's handling of timestamps, and adds connectors for Hadoop MapReduce.

Version 0.9.0 of Apache Kudu has been released. Major features of the release are an UPSERT command, a new Spark data source that doesn't rely on the MapReduce APIs, and improvements to Tablet Server write performance.
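As a quick refresher on what an UPSERT does compared with a plain insert, here is a small Python sketch over an in-memory table. It shows the general semantics only and does not use the Kudu client API; `DuplicateKeyError`, `insert`, and `upsert` are hypothetical names.

```python
class DuplicateKeyError(Exception):
    """Raised when a plain insert hits an existing primary key."""


def insert(table, key, row):
    """Insert a new row; fail if the primary key already exists."""
    if key in table:
        raise DuplicateKeyError(key)
    table[key] = row


def upsert(table, key, row):
    """Insert the row if the key is new, otherwise overwrite the existing row."""
    table[key] = row


metrics = {}
insert(metrics, "host-1", {"cpu": 0.42})
upsert(metrics, "host-1", {"cpu": 0.87})  # key exists: row is replaced
upsert(metrics, "host-2", {"cpu": 0.10})  # key is new: row is added

try:
    insert(metrics, "host-1", {"cpu": 0.99})
    duplicate_rejected = False
except DuplicateKeyError:
    duplicate_rejected = True

print(metrics, duplicate_rejected)
```

With an upsert, a writer does not need to check first whether a row exists, which is convenient when replaying or re-ingesting data.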

The Google Cloud Platform team has announced support for the Spark 2.0 preview release with Google Cloud Dataproc.

Dory, the successor to Bruce, has been announced. It is a Kafka producer daemon that supports ingesting data via UNIX domain sockets or local TCP.

Apache Pig 0.16.0 is out, the project's first release in about a year. This version stabilizes Pig on Tez.


Curated by Datadog



June 2016 Meetup (San Francisco) - Tuesday, June 14

Airflow Meetup (Redwood City) - Tuesday, June 14

Stream Processing Meetup (Mountain View) - Wednesday, June 15

Big Data Application Meetup (Palo Alto) - Wednesday, June 15


Spark 2.0 Highlights from Spark Summit SF 2016 (Seattle) - Wednesday, June 15

Taking Spark to the Clouds (Bellevue) - Thursday, June 16


Spark Hands-on 1-Day Workshop for Data Engineers, Data Scientists and Developers (Houston) - Tuesday, June 14

What Is All the Hype about Apache Spark? (Houston) - Tuesday, June 14


Spark 2.0 Performance Improvements & Blazegraph GPU (Laurel) - Monday, June 13

De-Siloing Data with Apache Drill (Baltimore) - Tuesday, June 14


Introduction to Spark In-Memory Computing (Philadelphia) - Tuesday, June 14

New York

Data Driven NYC #48 (New York) - Monday, June 13

Introducing Scio, a Scala API for Google Cloud Dataflow (New York) - Thursday, June 16


Mike Olson, Co-Founder of Cloudera, to Present at TDWI Boston Chapter (Waltham) - Thursday, June 16


Spark Streaming: Dealing with State (Vancouver) - Thursday, June 16

IRELAND

Production Quality Data Science: Building Rapid Ingestion Data Pipelines (Dublin) - Monday, June 13


All about Apache Spark (Lisbon) - Thursday, June 16


Apache Kafka: A High-Throughput Distributed Messaging System, by Samuel Kerrien (Geneva) - Tuesday, June 14


Spark Meetup: GraphX and DataFrames (Vienna) - Wednesday, June 15


Recommendation Systems: From Scratch to a Working Spark Implementation (Tel Aviv-Yafo) - Monday, June 13

Lessons Learned While Running Kafka at Scale, by Gwen Shapira (Tel Aviv) - Tuesday, June 14


Spark Meetup (Shanghai) - Saturday, June 18