Hadoop Weekly Issue #180

24 July 2016

As stream processing continues to be a hot topic, Kafka is showing some maturity—there are two articles this week on Kafka security. In addition to stream processing, there's a good mix of content with articles on core Hadoop, Hive, and data infrastructure automation.

Technical

Datadog has a four-part blog post series on monitoring Hadoop. The first three parts are Datadog-agnostic and describe Hadoop architecture, important metrics for HDFS, MapReduce & YARN, and strategies for collecting metrics. This will likely prove to be a valuable guide for building out Hadoop monitoring and alerting infrastructure.

https://www.datadoghq.com/blog/hadoop-architecture-overview/

Heroku has written about their move to an asynchronous, Kafka-based integration pattern. They've built an HTTP proxy for Kafka, which, in addition to HTTP POST for publishing, supports consuming via WebSockets. The post has many more details about the rollout of this infrastructure component at Heroku.

https://blog.heroku.com/powering-the-heroku-platform-api-a-distributed-systems-approach-using-streams-and-apache-kafka

The latest post in the Altiscale blog's series on debugging Hadoop NodeGroup performance issues gets to the bottom of the two problems previously discovered.

https://www.altiscale.com/blog/how-to-identify-and-resolve-hadoop-nodegroup-performance-problems-part-2-2/

The Hortonworks blog has an overview of the various types of disaster recovery and backup support in HBase. Recently, the community has been working on incremental backup tools. The article describes several backup targets (intra-cluster, inter-cluster, and S3 or other long-term storage) as well as the commands needed to perform incremental backups and restore from a backup. In terms of restoration strategies, there are several approaches, each with its own trade-offs.

http://hortonworks.com/blog/coming-hdp-2-5-incremental-backup-restore-apache-hbase-apache-phoenix/

WePay has another post about their BigQuery-powered data platform. This time they look at loading data into Google Cloud Storage and BigQuery from production MySQL databases as well as real-time writes using the streaming API. The post discusses several nuances of the process—handling of mutable data, data quality checks, permissions, service accounts, and automation.

https://wecode.wepay.com/posts/bigquery-wepay

The Confluent blog has an article that describes security features in Apache Kafka, with a focus on the Kafka Streams use case. Features include encryption in transit (for both client-server and server-server communication) and client authentication/authorization. These settings are disabled by default, and the post has an example of configuring them for a Kafka Streams application.

http://www.confluent.io/blog/secure-stream-processing-with-kafka-streams
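
As a rough illustration of the kind of client settings involved (this is not the example from the Confluent post), the sketch below builds SSL client properties in Python and renders them in .properties format. The property keys are standard Kafka client configuration names; the helper functions, file paths, and passwords are hypothetical placeholders.

```python
def build_ssl_client_properties(truststore_path, truststore_password,
                                keystore_path=None, keystore_password=None):
    """Return Kafka client properties that enable encryption in transit.

    The keystore arguments are optional and only needed when the brokers
    require SSL client authentication. All values are placeholders.
    """
    props = {
        # Switch the client from the default PLAINTEXT protocol to SSL.
        "security.protocol": "SSL",
        # The truststore lets the client verify the brokers' certificates.
        "ssl.truststore.location": truststore_path,
        "ssl.truststore.password": truststore_password,
    }
    if keystore_path is not None:
        # The keystore holds the client's own certificate for authentication.
        props["ssl.keystore.location"] = keystore_path
        props["ssl.keystore.password"] = keystore_password
    return props


def render_properties(props):
    """Render the properties as key=value lines, sorted by key."""
    return "\n".join(f"{key}={value}" for key, value in sorted(props.items())) + "\n"


if __name__ == "__main__":
    # Hypothetical paths and passwords, for illustration only.
    print(render_properties(build_ssl_client_properties(
        "/etc/kafka/client.truststore.jks", "changeit",
        "/etc/kafka/client.keystore.jks", "changeit")))
```

In a Kafka Streams application, the same keys would typically be set on the configuration passed to the application, which forwards them to its underlying clients.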

Many data processing environments start out as a set of cron jobs, but that's usually not a good long-term solution. This post describes the major problems with cron and suggests some alternative systems that aim to solve these and other problems with job scheduling.

http://beekeeperdata.com/posts/hadoop/2016/07/19/Cron-Alternatives-For-Hadoop.html

Hortonworks has written about the recently released Apache Hive 2.1. This is the first version with Hive's Live Long and Process (LLAP) support. In addition to LLAP, Hive 2.1 has smarter map joins, better vectorization, and a better cost-based optimizer. The post includes some benchmarks and related configuration tweaks needed to get the best performance for Hive.

http://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/

In another Kafka security post, the IBM Hadoop Dev blog describes how to enable and configure Kerberos. In addition to configuration settings, there are examples of several admin functions (such as adding ACLs for a new user account).

https://developer.ibm.com/hadoop/2016/07/20/kafka-acls/

MapR has a whiteboard walkthrough on Apache Flink's savepoints for stream processing. Savepoints address operational issues commonly found in stream processing frameworks, such as reprocessing and no-downtime upgrades. As usual, there's both a video and transcript of the presentation.

https://www.mapr.com/blog/savepoints-apache-flink-stream-processing-whiteboard-walkthrough

This presentation from Data Day Seattle gives an overview of Apache Airflow (incubating). After motivating Airflow and introducing its major features, the presentation describes use cases at Agari: 1) Message Scoring, which involves Spark, Amazon S3, and managing importers via AWS Auto Scaling Groups, and 2) Model Building, which is performed with Amazon EMR. The presentation also looks at SLAs for correctness and timeliness with Airflow.

http://www.slideshare.net/r39132/introduction-to-apache-airflow-data-day-seattle-2016

News

Kafka Summit was a few months ago now, but this is a great summary of the conference themes, lessons learned, stream processing presentations, and more.

http://aviflax.com/post/notes-from-the-first-kafka-summit/

The MapR blog has a post that revisits, now that we're halfway into 2016, some big data predictions made at the start of the year. Many of the predictions have come true, and I think it's interesting to see which ones missed (healthcare) and what wasn't anticipated (containerization).

https://www.mapr.com/blog/mid-year-updates-big-data-trends-apache-kafka-spark-flink-drill-and-more

Syncsort has another in its series of expert interviews, this time with Dr. Ellen Friedman, author of "Streaming Data Architecture: New Designs Using Apache Kafka and MapR Streams." The three-part interview covers Hadoop in industry, big data stream processing, and more.

http://blog.syncsort.com/2016/07/big-data/dr-ellen-friedman-discusses-increased-flexibility-in-big-data-tools-and-changing-business-cultures/

InfoQ has posted videos from QCon New York 2016. There are a number of relevant presentations, including those covering streaming data at Spotify and stream processing with Apache Kafka.

https://www.infoq.com/qcon-newyork-2016

Splice Machine has open sourced their RDBMS built on Hadoop, HBase, and Spark. As part of the announcement, they've also provided the ability to launch Splice Machine in an AWS-powered sandbox.

http://www.infoworld.com/article/3096252/hadoop/spark-powered-splice-machine-goes-open-source.html

Altiscale has announced that the Altiscale Data Cloud is now compliant with the ODPi Runtime Specification.

https://www.altiscale.com/blog/altiscale-now-odpi-runtime-compliant/

Releases

Apache Chukwa was one of the first log aggregation and analysis frameworks. Development stalled for some years, but the project has now seen two releases in the past 8 months. The 0.8.0 release has a new file format (based on Parquet), an improved HBase schema, and a number of bug fixes and improvements.

http://chukwa.apache.org/docs/r0.8.0/releasenotes.html

Apache HBase 1.2.2 was released this week. The maintenance release includes a number of bug fixes.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201607.mbox/%3CCAN5cbe6-9x=O7gtSVFaev5PTtVLRrfhwuFfahSxw2b4kaAjVqQ@mail.gmail.com%3E

Cloudera has announced Cloudera Enterprise 5.8. This release brings Cloudera Navigator Optimizer to general availability. It also features new versions of Impala and Hue. The Cloudera blog has a post on the release and the new optimizer.

http://blog.cloudera.com/blog/2016/07/cloudera-enterprise-5-8-is-now-available/
http://blog.cloudera.com/blog/2016/07/cloudera-navigator-optimizer-graduates-from-beta-is-now-generally-available/

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Big Data Application Meetup (Palo Alto) - Wednesday, July 27
http://www.meetup.com/BigDataApps/events/230880075/

Apache Ignite In-Memory Data Fabric for .NET (Mountain View) - Wednesday, July 27
http://www.meetup.com/BayNET/events/231211547/

Hands-On with Twitter Heron (San Francisco) - Saturday, July 30
http://www.meetup.com/Data-Engineers-Guild/events/231874486/

Oregon

Walking a Fine Line: Using Apache Spark and Cassandra (Portland) - Wednesday, July 27
http://www.meetup.com/DataStax-Cassandra-Portland-Users/events/232541640/

Washington

Seattle Scalability Meetup (Seattle) - Wednesday, July 27
http://www.meetup.com/Seattle-Scalability-Meetup/events/230760466/

Spark at Zillow & Realtime Analytics: Spark, NiFi, Kafka, Cassandra, ES, Docker (Seattle) - Thursday, July 28
http://www.meetup.com/Seattle-Spark-Meetup/events/229476550/

Utah

Databricks Community Edition: Spark 2.0 (Lehi) - Thursday, July 28
http://www.meetup.com/BigDataUtah/events/231260601/

Ohio

Cleveland Big Data and Hadoop User Group (Mayfield Village) - Monday, July 25
http://www.meetup.com/Cleveland-Hadoop/events/231755988/

North Carolina

July CHUG: Leveraging Mainframe Data in Hadoop (Charlotte) - Wednesday, July 27
http://www.meetup.com/CharlotteHUG/events/227293996/

New York

Apache Phoenix NYC Meetup (New York) - Monday, July 25
http://www.meetup.com/futureofdata-newyork/events/231536453/

Combining Spark and Open Source Elements (New York) - Tuesday, July 26
http://www.meetup.com/New-York-ODSC/events/232053300/

Massachusetts

Google Cloud Dataflow via Scio & Google Bigtable Learnings (Somerville) - Tuesday, July 26
http://www.meetup.com/Boston-Data-Engineering-Meetup/events/232393968/

Apache Phoenix and HBase: Past, Present, and Future of SQL Over HBase (Bedford) - Tuesday, July 26
http://www.meetup.com/futureofdata-boston/events/231928732/

GERMANY

Building a Fully Automated Fast Data Platform (Munich) - Thursday, July 28
http://www.meetup.com/Hadoop-User-Group-Munich/events/230739156/

ROMANIA

Apache Kafka Workshop (Cluj-Napoca) - Wednesday, July 27
http://www.meetup.com/Big-Data-Data-Science-Meetup-Cluj-Napoca/events/232587654/

AUSTRALIA

Spark 2.0 101 + Spark on Knime (Sydney) - Wednesday, July 27
http://www.meetup.com/Sydney-Apache-Spark-User-Group/events/230892723/
