Data Eng Weekly

Hadoop Weekly Issue #142

18 October 2015

This week's issue contains a wide variety of technical content, covering Apache Apex, Ibis, Apache Kafka, Apache Spark, Apache Ambari, Apache Samza, and Apache Hadoop YARN. There's a good mix of practical (e.g. how Collective uses Spark) and forward-looking (what's next for Kafka and Samza) content. On the release front, Apache Drill 1.2 is out with some exciting new features and improvements.


The DataTorrent blog has a post about Apache Apex (incubating) that describes how Apex's architecture is built around DAGs. At a high level, a data flow is specified as a DAG (the Logical Plan) using the Java API or JSON, and the Streaming Application Master converts this logical plan into a physical plan for execution on a cluster. The post gives an overview of how Apex performs the conversion and executes the resulting plan on a distributed platform like YARN.

Ibis, the Python data analysis framework for big data, integrates with SQL engines (in particular, Ibis aims to work well with Cloudera Impala). This post describes Ibis' SQL API, which is used to build and run SQL queries.

The Confluent blog has an update on Apache Kafka, which includes news on a number of features in various stages of development. Of particular note, support for authorization and the new Kafka Streams library have both been committed to trunk.

This post describes how Collective is using a long-running Spark cluster to power interactive dashboards. The system uses HyperLogLog to estimate the cardinality of the audiences they measure, and the post describes the custom Spark aggregation function they've built for merging HyperLogLogs. With all of these pieces put together, a 40-node cluster with 100GB of cached data can answer queries in under 2 seconds.
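To make the idea of merging HyperLogLogs concrete, here is a minimal sketch in plain Python. It is not Collective's Spark aggregation function, which the post itself describes; the class name, the SHA-1 based hash, and the default precision of 12 bits are all assumptions chosen for illustration. The key property is that two sketches with the same number of registers merge by taking the element-wise maximum, which yields the same registers as a sketch built over the union of the inputs.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch with 2**p registers (hypothetical helper)."""

    def __init__(self, p=12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash from the first 8 bytes of SHA-1 (an assumption;
        # real implementations typically use a faster non-cryptographic hash).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank is the position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Merging is an element-wise max over registers of equal-sized sketches.
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self):
        # Standard bias constant for m >= 128 registers.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw
```

Because the merge is associative and commutative, partial sketches can be combined in any order, which is what makes this structure a natural fit for a distributed aggregation.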

The Bay Area Samza Meetup hosted a presentation about the next release of Apache Samza, version 0.10.0. In addition to support for new consumers and producers (Amazon Kinesis, HDFS, Elasticsearch), version 0.10.0 adds support for dynamic configuration and for host affinity, which can improve job startup/recovery time when tasks have a lot of local state. Samza 0.10.0 is expected to be released in November.

Also at the Bay Area Samza Meetup, Netflix presented on how Samza fits into the Netflix data pipeline. Netflix processes over 1 petabyte per day (550 billion events), using Samza instances running inside Docker on hosts in an EC2 auto-scaling group. The presentation describes their production experience (and some improvements/workarounds they're using) and the number and types of instances that they use for both Samza and Kafka.
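For a sense of scale, the two headline figures above can be turned into rough averages. This is back-of-the-envelope arithmetic only: it assumes a decimal petabyte (10^15 bytes), treats "over 1 petabyte" as exactly 1 petabyte, and assumes traffic is spread evenly across the day, none of which comes from the presentation.

```python
# Figures quoted above; the decimal-petabyte interpretation is an assumption.
BYTES_PER_DAY = 1e15          # 1 PB per day, treated as a lower bound
EVENTS_PER_DAY = 550e9        # 550 billion events per day
SECONDS_PER_DAY = 24 * 60 * 60

# Average size of a single event, in bytes.
avg_event_bytes = BYTES_PER_DAY / EVENTS_PER_DAY

# Average event rate and data rate, assuming perfectly even traffic.
events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
gb_per_second = BYTES_PER_DAY / SECONDS_PER_DAY / 1e9

print(f"average event size: {avg_event_bytes:.0f} bytes")
print(f"average event rate: {events_per_second:,.0f} events/s")
print(f"average throughput: {gb_per_second:.2f} GB/s")
```

Under these assumptions the averages come out to roughly 1.8 KB per event, about 6.4 million events per second, and about 11.6 GB per second. Real peak rates would be higher than these daily averages.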

This tutorial shows how to update configuration settings in Apache Ambari using the Ambari web UI. The UI exposes knobs for common settings and supports configuration of additional settings by setting raw property values. Ambari also supports comparison/diff across configuration versions.

This post on the Cloudera blog discusses how to calculate resources for YARN by taking into consideration common cluster scenarios and accounting for operating system overhead. It also shows how to verify configuration using the ResourceManager Web UI.
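The general shape of this kind of calculation can be sketched as follows. The function name, the default reservations, and the example node size are hypothetical and are not taken from the Cloudera post; only the two NodeManager property names are actual YARN settings. The idea is simply to subtract what the operating system and other daemons need from the node's totals and hand the remainder to YARN.

```python
def yarn_node_resources(total_mem_gb, total_cores,
                        os_mem_gb=8, os_cores=1,
                        other_mem_gb=0, other_cores=0):
    """Estimate per-node YARN capacity after reserving OS and daemon overhead.

    All reservation defaults are illustrative assumptions, not recommendations.
    """
    mem_gb = total_mem_gb - os_mem_gb - other_mem_gb
    vcores = total_cores - os_cores - other_cores
    if mem_gb <= 0 or vcores <= 0:
        raise ValueError("reservations exceed the node's physical resources")
    return {
        # Memory the NodeManager may allocate to containers, in megabytes.
        "yarn.nodemanager.resource.memory-mb": mem_gb * 1024,
        # Virtual cores the NodeManager may allocate to containers.
        "yarn.nodemanager.resource.cpu-vcores": vcores,
    }


# Hypothetical worker node: 128 GB RAM and 32 cores, with 8 GB / 1 core held
# back for the OS and 2 GB / 1 core for other daemons on the same host.
settings = yarn_node_resources(128, 32, other_mem_gb=2, other_cores=1)
for name, value in settings.items():
    print(f"{name} = {value}")
```

For this hypothetical node the sketch leaves 118 GB (120832 MB) and 30 vcores for containers. The values that a running cluster actually picked up can then be checked in the ResourceManager Web UI, as the post describes.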


Spark Summit East is taking place on February 16-18, 2016 in New York City. The call for presentations is open until November 22nd.

Videos from AWS re:Invent 2015 have been posted. There are both customer deep dives (including Netflix, Zillow, New Relic, and Coursera) and talks about AWS services.


Apache Curator 3.0.0 was released this week. It requires ZooKeeper 3.5.x and supports ZooKeeper's new dynamic reconfiguration, which allows clients to become aware of (and rebalance against) updated server lists and ports.

Apache Drill 1.2 was released with several new features. These include support for RDBMSs, new window functions (bringing the total of supported window functions to 15), Parquet metadata caching (to reduce file scans across multiple queries), and performance improvements for HBase and Hive tables.

Apache Cassandra 2.1.11 and 2.2.3 were released this week. Both versions contain a small number of bug fixes.


Curated by Datadog



#OCBigData Monthly Meetup #14 (Irvine) - Wednesday, October 21

Sub-Second Querying with In-Memory/Roxie + Hadoop (Mountain View) - Wednesday, October 21

Kudu: Data Store for the New Era (San Francisco) - Thursday, October 22

Ibis: Operating the Python Data Ecosystem at Hadoop Scale (San Francisco) - Thursday, October 22


Putting Together the Platform: Riak, Solr, Redis and Spark (Dallas) - Thursday, October 22


Parquet (Clayton) - Tuesday, October 20


What Is Spark? (Milwaukee) - Tuesday, October 20


Hadoop + HDInsight = Big Data on Azure! (Alpharetta) - Tuesday, October 20

New York

Developing Real-Time Data Pipelines with Spring and Kafka (New York) - Tuesday, October 20

Stopping Invalid Traffic Using Spark Streaming, Kafka and Science (New York) - Tuesday, October 20

Spark on Mesos (New York) - Wednesday, October 21

Hadoop-Based Data Lake (New York) - Thursday, October 22


Connecticut Big Data #2 (Windsor) - Wednesday, October 21


October Presentation Night (Cambridge) - Monday, October 19


Apache Storm (Bogota) - Tuesday, October 20


Chris Fregly of IBM Spark Tech (Barcelona) - Tuesday, October 20

Spark and the Hadoop Ecosystem (Madrid) - Thursday, October 22


Meetup on Big Data Technologies (Paris) - Wednesday, October 21


17th Swiss Big Data User Group Meeting (Zurich) - Monday, October 19


Big Data Community Workshops (Beijing) - Thursday, October 22