Hadoop Weekly Issue #142

18 October 2015

This week's issue contains a wide variety of technical content, covering Apache Apex, Ibis, Apache Kafka, Apache Spark, Apache Ambari, Apache Samza, and Apache Hadoop YARN. There's a good mix of practical (e.g. how Collective uses Spark) and forward-looking (what's next for Kafka and Samza) content. On the release front, Apache Drill 1.2 is out with some exciting new features and improvements.

Technical

The DataTorrent blog has a post about Apache Apex (incubating) that describes how Apex's architecture is built around DAGs. At a high level, a data flow is specified as a DAG (the Logical Plan) using the Java API or JSON, and the Streaming Application Master converts this logical plan into a physical plan for execution on a cluster. The post gives an overview of how Apex performs the conversion and executes the resulting plan on a distributed platform like YARN.

https://www.datatorrent.com/blog-tracing-dags-from-specification-to-execution/
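To make the logical-to-physical idea concrete, here is a toy sketch in Python. It is not Apex's API or its actual planning algorithm: the function name, the operator names, and the rule that every upstream instance connects to every downstream instance are all simplifying assumptions made for illustration.

```python
from itertools import product


def to_physical_plan(operators, streams):
    """Expand a logical DAG into operator instances and instance-level edges.

    operators: dict mapping a logical operator name to its partition count
    streams:   list of (source, destination) logical edges
    Both structures are hypothetical and exist only for this sketch.
    """
    # Each logical operator becomes `count` physical instances.
    instances = {
        name: [f"{name}#{i}" for i in range(count)]
        for name, count in operators.items()
    }
    # Simplification: fully connect upstream and downstream instances.
    edges = [
        (up, down)
        for src, dst in streams
        for up, down in product(instances[src], instances[dst])
    ]
    return instances, edges


# Hypothetical three-operator pipeline with a two-way partitioned parser.
logical_operators = {"reader": 1, "parser": 2, "writer": 1}
logical_streams = [("reader", "parser"), ("parser", "writer")]

instances, edges = to_physical_plan(logical_operators, logical_streams)
for up, down in edges:
    print(up, "->", down)
```

A real planner also has to decide where each instance runs and how partitioned streams are merged, which this sketch leaves out.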

Ibis, the Python data analysis framework for big data, integrates with SQL engines (in particular, Ibis aims to work well with Cloudera Impala). This post describes Ibis' SQL API, which is used to build and run SQL queries.

http://blog.ibis-project.org/ibis-for-sql-programmers/

The Confluent blog has an update on Apache Kafka, which includes news on a number of features in various stages of development. Of particular note, support for authorization and the new Kafka Streams library have both been committed to trunk.

http://www.confluent.io/blog/log-compaction-highlights-in-the-kafka-and-stream-processing-community-october-2015

This post describes how Collective is using a long-running Spark cluster to power interactive dashboards. The system uses HyperLogLog to estimate the cardinality of the audiences they measure, and the post describes the custom Spark aggregation function they've built for merging HyperLogLogs. With all of these pieces in place, a 40-node cluster with 100GB of cached data can answer queries in under 2 seconds.

https://databricks.com/blog/2015/10/13/interactive-audience-analytics-with-spark-and-hyperloglog.html
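For readers unfamiliar with why HyperLogLogs are convenient to aggregate, the following is a minimal, self-contained Python sketch. It is not Collective's Spark code: the class, the register count, and the use of SHA-1 as the hash are assumptions chosen for illustration. The key property is that two sketches merge by taking the element-wise maximum of their registers, which yields the same sketch as if all items had been added to one.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash from the item's string form.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                 # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Merging is an element-wise max over the registers.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)    # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


# Two overlapping audiences with made-up user ids.
a, b, union = HyperLogLog(), HyperLogLog(), HyperLogLog()
for uid in range(0, 6000):
    a.add(uid)
    union.add(uid)
for uid in range(4000, 10000):
    b.add(uid)
    union.add(uid)

merged = a.merge(b)
print(round(merged.estimate()))
```

Because the merge is associative and commutative, partial sketches can be combined in any order, which is what makes this kind of structure a natural fit for a distributed aggregation.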

The Bay Area Samza Meetup hosted a presentation about the next release of Apache Samza, version 0.10.0. In addition to support for new consumers and producers (Amazon Kinesis, HDFS, ElasticSearch), version 0.10.0 adds support for dynamic configuration and host affinity, which can improve job startup/recovery time when tasks have a lot of local state. Samza 0.10.0 is expected to be released in November.

http://www.slideshare.net/NavinaRamesh/apache-samza-new-features-in-the-upcoming-samza-release-0100

Also at the Bay Area Samza Meetup, Netflix presented on how Samza fits into the Netflix data pipeline. Netflix processes over 1 petabyte per day (550 billion events), using Samza instances running inside Docker on hosts in an EC2 auto-scaling group. The presentation describes their production experience (and some improvements/workarounds they're using) and the number and types of instances that they use for both Samza and Kafka.

http://www.slideshare.net/mmddtmp/netflix-keystone-samzaeetup10132015

This tutorial shows how to update configuration settings in Apache Ambari using the Ambari web UI. The UI exposes knobs for common settings and supports configuration of additional settings by setting raw property values. Ambari also supports comparison/diff across configuration versions.

https://developer.ibm.com/hadoop/blog/2015/10/15/update-open-source-component-configurations-via-ambari-part-one/

This post on the Cloudera blog discusses how to calculate resources for YARN by taking into consideration common cluster scenarios and accounting for operating system overhead. It also shows how to verify configuration using the ResourceManager Web UI.

http://blog.cloudera.com/blog/2015/10/untangling-apache-hadoop-yarn-part-2/
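As a rough illustration of the kind of arithmetic involved, the Python sketch below subtracts a reservation for the operating system and other services from a node's hardware and reports what is left for YARN containers. The two property names are real NodeManager settings, but the function, the node size, and the reserved amounts are hypothetical and are not taken from the Cloudera post.

```python
def yarn_node_resources(total_mem_gb, total_vcores, reserved_mem_gb, reserved_vcores):
    """Return NodeManager settings after reserving resources for the OS and other services.

    All inputs are assumptions supplied by the caller; this is a sketch,
    not a sizing recommendation.
    """
    if reserved_mem_gb >= total_mem_gb or reserved_vcores >= total_vcores:
        raise ValueError("reservation leaves nothing for YARN")
    return {
        "yarn.nodemanager.resource.memory-mb": (total_mem_gb - reserved_mem_gb) * 1024,
        "yarn.nodemanager.resource.cpu-vcores": total_vcores - reserved_vcores,
    }


# Hypothetical worker node: 128 GB RAM and 32 vcores, with 28 GB and
# 4 vcores set aside for the OS and other services on the host.
settings = yarn_node_resources(128, 32, reserved_mem_gb=28, reserved_vcores=4)
for key, value in settings.items():
    print(f"{key} = {value}")
```

With these made-up inputs, the node would offer 102400 MB and 28 vcores to YARN.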

News

Spark Summit East is taking place on February 16-18, 2016 in New York City. The call for presentations is open until November 22nd.

https://databricks.com/blog/2015/10/14/call-for-presentation-for-the-2016-spark-summit-east-is-now-open.html

Videos from AWS re:Invent 2015 have been posted. There are both customer deep dives (including Netflix, Zillow, New Relic, and Coursera) and talks about AWS services.

http://blogs.aws.amazon.com/bigdata/post/Tx3D3UYOXB9XG6Z/Videos-now-available-for-AWS-re-Invent-2015-Big-Data-Analytics-sessions

Releases

Apache Curator 3.0.0 was released this week. It requires ZooKeeper 3.5.x and supports ZooKeeper's new dynamic reconfiguration, which allows clients to become aware of (and rebalance against) updated server lists and ports.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201510.mbox/%3CCABRiMSE96T4EHbrqDYV=m+OL_Wgf-t-Dxg4OXrMS4hE9yFgMOw@mail.gmail.com%3E

Apache Drill 1.2 was released with several new features. These include support for RDBMSs, new window functions (bringing the total of supported window functions to 15), Parquet metadata caching (to reduce file scans across multiple queries), and performance improvements for HBase and Hive tables.

http://drill.apache.org/blog/2015/10/16/drill-1.2-released/

Apache Cassandra 2.1.11 and 2.2.3 were released this week. Both versions contain a small number of bug fixes.

http://www.mail-archive.com/dev@cassandra.apache.org/msg08354.html
http://www.mail-archive.com/user@cassandra.apache.org/msg44361.html

Events

Curated by Datadog ( http://www.datadoghq.com )

UNITED STATES

California

#OCBigData Monthly Meetup #14 (Irvine) - Wednesday, October 21
http://www.meetup.com/OCBigData/events/224867815/

Sub-Second Querying with In-Memory/Roxie + Hadoop (Mountain View) - Wednesday, October 21
http://www.meetup.com/HPCC-SV/events/224784517/

Kudu: Data Store for the New Era (San Francisco) - Thursday, October 22
http://www.meetup.com/SF-Spark-and-Friends/events/226023299/

Ibis: Operating the Python Data Ecosystem at Hadoop Scale (San Francisco) - Thursday, October 22
http://www.meetup.com/Data-Mining/events/225841066/

Texas

Putting Together the Platform: Riak, Solr, Redis and Spark (Dallas) - Thursday, October 22
http://www.meetup.com/DFW-BigData/events/225350884/

Missouri

Parquet (Clayton) - Tuesday, October 20
http://www.meetup.com/St-Louis-Hadoop-Users-Group/events/223964173/

Wisconsin

What Is Spark? (Milwaukee) - Tuesday, October 20
http://www.meetup.com/Milwaukee-Big-Data-Users-Group/events/226010392/

Georgia

Hadoop + HDInsight = Big Data on Azure! (Alpharetta) - Tuesday, October 20
http://www.meetup.com/Azure-in-the-ATL/events/224689964/

New York

Developing Real-Time Data Pipelines with Spring and Kafka (New York) - Tuesday, October 20
http://www.meetup.com/Apache-Kafka-NYC/events/225697500/

Stopping Invalid Traffic Using Spark Streaming, Kafka and Science (New York) - Tuesday, October 20
http://www.meetup.com/TechTalks-AppNexus-NYC/events/225732780/

Spark on Mesos (New York) - Wednesday, October 21
http://www.meetup.com/Apache-Mesos-NYC-Meetup/events/225116955/

Hadoop-Based Data Lake (New York) - Thursday, October 22
http://www.meetup.com/NYC-Hortonworks-User-Group/events/225930095/

SPAIN

Chris Fregly of IBM Spark Tech (Barcelona) - Tuesday, October 20
http://www.meetup.com/Spark-Barcelona/events/225868339/

Spark and the Hadoop Ecosystem (Madrid) - Thursday, October 22
http://www.meetup.com/Madrid-Bluemix-Meetup/events/223340232/

FRANCE

Meetup on Big Data Technologies (Paris) - Wednesday, October 21
http://www.meetup.com/Paris-Big-Data-Classes/events/224883661/

SWITZERLAND

17th Swiss Big Data User Group Meeting (Zurich) - Monday, October 19
http://www.meetup.com/swiss-big-data/events/225582915/

CHINA

Big Data Community Workshops (Beijing) - Thursday, October 22
http://www.meetup.com/Big-Data-Community-China/events/225655083/
