Data Eng Weekly

Hadoop Weekly Issue #177

04 July 2016

There were several announcements this week out of Hadoop Summit—HDP 2.5 preview, open-source software from Qubole and LinkedIn, and MapR's Spyglass Initiative. In terms of technical articles, there are two themes: tutorials (walkthroughs for StreamSets, Mesos, and Sqoop) and data management (a post from Twitter and details about Gobblin and Samza at LinkedIn).


Activision Game Science has published microbenchmarks of several Apache Kafka clients for Python. Confluent's client, which is based on librdkafka (a C/C++ library), has the best performance.
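The general shape of such a microbenchmark can be sketched with Python's timeit. Everything below is a stand-in (no Kafka client or broker involved): it times a per-message serialization step the way the post times produce calls, reporting messages per second.

```python
import json
import timeit

def benchmark(fn, n=100_000):
    """Return approximate calls/second for invoking fn() n times."""
    elapsed = timeit.timeit(fn, number=n)
    return n / elapsed

# Stand-in for the per-message work a client benchmark measures:
# serialize a small record to bytes.
record = {"user": 42, "event": "click", "ts": 1467590400}
rate = benchmark(lambda: json.dumps(record).encode("utf-8"))
print(f"{rate:,.0f} messages/sec")
```

With a real client you would time the produce (or consume) call instead of the serialization lambda; librdkafka-backed clients tend to win exactly because that per-message path is C rather than Python.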

Hortonworks has a post about some of the features planned for upcoming releases of Apache Zeppelin. These include pluggable visualization, collaboration via Git, multi-user support, and more.

This tutorial demonstrates how to load time series data from a Raspberry Pi to Apache Cassandra using the StreamSets Data Collector.

LinkedIn has the first post in a series on hard problems in stream processing, and how they're working to solve them at LinkedIn. The post describes how they handle late data with Apache Samza as well as how they can use Samza (rather than a batch system) for reprocessing. Finally, the post also mentions how folks are integrating Siddhi into Samza for complex event processing.
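As a rough illustration of the late-data idea (a toy sketch, not Samza's actual API), an event-time tumbling window can track a watermark and accept events within an allowed-lateness bound, dropping anything older:

```python
from collections import defaultdict

class TumblingWindow:
    """Toy event-time tumbling window with allowed lateness."""
    def __init__(self, size, allowed_lateness):
        self.size = size
        self.allowed_lateness = allowed_lateness
        self.windows = defaultdict(list)   # window start -> events
        self.watermark = 0                 # highest event time seen

    def add(self, event_time, value):
        self.watermark = max(self.watermark, event_time)
        # Accept the event unless the watermark has passed it by more
        # than the allowed lateness.
        if event_time >= self.watermark - self.allowed_lateness:
            start = (event_time // self.size) * self.size
            self.windows[start].append(value)
            return True
        return False   # dropped as too late

w = TumblingWindow(size=60, allowed_lateness=30)
w.add(10, "a")             # on time
w.add(100, "b")            # advances watermark to 100
print(w.add(80, "late"))   # within the lateness bound -> True
print(w.add(5, "stale"))   # too late -> False
```

Real systems layer retries, state stores, and triggers on top of this, but the watermark-plus-lateness bookkeeping is the core of handling late arrivals.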

MapR has a tutorial for deploying and configuring a Mesos cluster with the Marathon scheduler, Docker support, and the ability to run Spark jobs on Mesos. Once configured, it's possible to drop into an interactive Spark shell or use spark-submit to run a job.

Apache Sqoop is a tool for moving data between relational databases and Hadoop clusters. The AWS blog has a walkthrough describing how to use Sqoop with Amazon EMR (Hadoop-as-a-service) and Amazon RDS (RDBMS-as-a-service). The tutorial uses MySQL and loads data from S3 into a database table.

MapR has a tutorial for Apache Spark in which S&P 500 and oil stocks are compared using data frames and Spark SQL. The post is a good overview of the DataFrame and SQL APIs (e.g. showing the physical plan via explain).

Last week's issue had a post that looked at Google's data search platform. This week, Twitter wrote about their discovery and consumption tools. This includes items like data lifecycle management, a data explorer (which has example code snippets and a wiki-like metadata management interface), and lineage information. The post also has a section about lessons learned and observations.

In the first of (what I assume will be) many slide decks from Hadoop Summit, this presentation benchmarks various file formats. There are some expected results (such as Parquet and ORC being space-efficient across multiple data sets) as well as some unexpected ones (a dataset in which Avro and JSON compress better than ORC and Parquet due to redundancy within rows).
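The redundancy effect is easy to reproduce with the standard library: rows that repeat field names and near-identical values compress to a small fraction of their raw size, which is how a row-oriented format can come out ahead on such data. A minimal demonstration (synthetic data; zlib standing in for whatever codecs the benchmark used):

```python
import json
import zlib

# 1,000 JSON rows with highly redundant field names and values.
rows = [{"symbol": "XOM", "exchange": "NYSE", "sector": "energy",
         "price": 80 + i % 10}
        for i in range(1000)]
raw = "\n".join(json.dumps(r) for r in rows).encode("utf-8")
compressed = zlib.compress(raw)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes "
      f"({len(raw) / len(compressed):.0f}x smaller)")
```

On genuinely varied data the repeated field names stop dominating, and columnar formats with per-column encoding regain the advantage.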


ODPi has announced that Altiscale, ArenaData, Hortonworks, IBM, and Infosys distributions are all ODPi Runtime compliant. Pivotal has a post about how Pivotal HDB (built on the incubator project Apache HAWQ) benefits from runtime compatibility via ease of integration.

Hadoop Summit was this week. ZDNet has a good overview of major news (ODPi, Hortonworks announcements, and more). The Hortonworks blog has a post based on CEO Rob Bearden's keynote on how big data is transforming the enterprise across all industries.

At the Summit, Hortonworks and Microsoft announced that Azure HDInsight is Hortonworks' premier cloud solution.

Pepperdata has announced a free diagnostic report for Hadoop clusters. The software analyzes the cluster for 72 hours to produce recommendations.

Apache Bahir is a new top-level project that's been created by extracting several plugins and connectors from Apache Spark. In its initial form, Bahir supports four streaming extensions for Apache Spark but expansion to other platforms (e.g. Flink, Beam) is planned.


Hortonworks announced a preview of HDP 2.5 at Hadoop Summit this week. The release includes integrated governance and security via Apache Atlas and Apache Ranger, improvements to Apache Zeppelin and to Spark's support for ORC files, improved Apache Ambari support for Apache HBase, Hive's new "Live Long and Process" (LLAP) support for sub-second queries, and Hortonworks Connected Data Cloud (HCDC) for launching Hive/Spark clusters. LLAP and HCDC are the focus of separate blog posts from Hortonworks.

Quark is a new open-source project from Qubole. Built with Apache Calcite for SQL and JDBC support, Quark is a system for centrally managing data and optimizing queries across multiple data backends.

In another open-source announcement this week, Qubole has open-sourced RubiX. RubiX is a caching engine for Presto and Apache Hive that stores data on local disk when source data is stored in Amazon S3. It exposes a new FileSystem interface and can do column-based caching for columnar storage formats.
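A toy sketch of the caching idea (hypothetical names, not RubiX's actual interface): key locally cached blocks by (file, column), and fall back to the remote source only on a miss, so repeated columnar scans are served from local disk instead of S3.

```python
import os
import tempfile

class ColumnCache:
    """Toy local-disk cache keyed by (remote file, column)."""
    def __init__(self, cache_dir):
        self.cache_dir = cache_dir

    def _path(self, remote_file, column):
        name = f"{remote_file.replace('/', '_')}.{column}"
        return os.path.join(self.cache_dir, name)

    def read_column(self, remote_file, column, fetch):
        """Return column bytes, calling fetch() only on a cache miss."""
        path = self._path(remote_file, column)
        if not os.path.exists(path):          # miss: pull from the source
            data = fetch(remote_file, column)  # e.g. an S3 range read
            with open(path, "wb") as f:
                f.write(data)
        with open(path, "rb") as f:            # serve from local disk
            return f.read()

cache = ColumnCache(tempfile.mkdtemp())
calls = []
def fetch(f, c):
    calls.append((f, c))
    return b"column-data"

cache.read_column("bucket/table.orc", "price", fetch)
cache.read_column("bucket/table.orc", "price", fetch)  # served locally
print(len(calls))  # the source was read only once
```

Caching per column rather than per file is what makes this a good fit for columnar formats: a query touching two columns warms only those columns' blocks.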

LinkedIn has released version 0.7.0 of Gobblin. Previously focused on ingesting data from Kafka into HDFS, Gobblin now supports more data lifecycle features such as distributed copies between clusters and data retention (via a dataset configuration file). The post has more information about Gobblin at LinkedIn and also includes some example configuration files.

Apache Drill 1.7.0 was released with support for JMX monitoring, data stored in HBase 1.x, and more. The new version also resolves a number of bugs.

Apache Slider 0.91.0-incubating was released this week. This release includes phase 2 of Slider's support for integrating Docker-based applications with YARN. It also includes a number of bug fixes and improvements.

MapR has announced the Spyglass Initiative for administering MapR clusters. MapR Monitoring, the first phase of the initiative, is built on Elasticsearch, OpenTSDB, Kibana, Grafana, and more.

StreamSets has released version 1.5.0 of their Data Collector with new support for automatically creating Hive schemas, reading/writing data from/to Redis, a new HTTP client processor, and more.

schema-registry-ui, as the name suggests, is a UI for the Confluent Schema Registry. It includes access to schemas, version information, example curl commands, and more.


Curated by Datadog



Building (And Running) Netflix's Data Pipeline Using Apache Kafka (San Francisco) - Tuesday, July 5

Big Data Day LA 2016 (Los Angeles) - Saturday, July 9


Building Custom Big Data Integrations (Tempe) - Wednesday, July 6


StreamSets, for the Coding Minimalist in All of Us (Atlanta) - Wednesday, July 6


Intro to Apache Apex & Comparison with Spark Streaming (Sterling) - Wednesday, July 6

Beyond ETL: Real-Time, Streaming Architectures (Arlington) - Thursday, July 7

New York

Stream Processing Using Spring Cloud Data Flow (New York) - Tuesday, July 5


Can Machines See the Invisible? Plus, Building a Data Infrastructure at Scale (London) - Tuesday, July 5

Big Data Presents... Spark in the Real World (London) - Tuesday, July 5

Productionising Data Science (London) - Thursday, July 7

Apache Flink London: July Meetup (London) - Thursday, July 7


Apache Spark & Bluemix (Paris) - Wednesday, July 6


Knime Italy Meetup Goes Big Data on Apache Spark (Milano) - Tuesday, July 5


Event Data Pipelines (Tel Aviv-Yafo) - Sunday, July 10


Talk about Big Data Tools: Spark, Hadoop, Scala (Tokyo) - Friday, July 8


Spark Meetup (Auckland) - Tuesday, July 5