Data Eng Weekly


Hadoop Weekly Issue #207

05 March 2017

A short but sweet issue this week with posts on Apache Spark at Facebook, Apache Sqoop, and Apache Ranger as well as coverage of a number of releases (including Apache HBase and Apache Accumulo).

Technical

Facebook has written about their experience converting their n-gram language model training pipeline from Apache Hive to Apache Spark. The post describes their Hive-based solution, their Spark-based solution, and the scalability challenges (mostly around data skew for popular combinations like "how to..."). The post also covers the overall differences between the two implementations (e.g. the flexibility of the Spark DSL vs. Hive QL) and shares some performance numbers.

https://code.facebook.com/posts/678403995666478/using-apache-spark-for-large-scale-language-model-training/
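
To make the data-skew problem concrete, here is a minimal sketch of key salting, a common general remedy for hot keys. It is not taken from the Facebook post and may differ from what they did; the function name `salted_count`, the `num_salts` parameter, and the sample data are all hypothetical, and plain Python stands in for a distributed job.

```python
import random
from collections import Counter

def salted_count(pairs, num_salts=4, seed=0):
    """Count (prefix, word) pairs in two stages (illustrative only).

    Stage 1 appends a random salt to each key, so a hot prefix such as
    "how to" would be spread over up to `num_salts` partial groups
    instead of landing on a single worker.
    Stage 2 drops the salt and sums the partial counts.
    """
    rng = random.Random(seed)
    partial = Counter()
    for prefix, word in pairs:
        salt = rng.randrange(num_salts)      # hypothetical salt choice
        partial[(prefix, salt, word)] += 1   # stage 1: salted partial counts
    final = Counter()
    for (prefix, _salt, word), n in partial.items():
        final[(prefix, word)] += n           # stage 2: merge partial counts
    return final

# Made-up sample where one prefix dominates.
sample = [("how to", "cook")] * 5 + [("how to", "draw")] * 3 + [("rare prefix", "x")]
counts = salted_count(sample)
```

The salted result matches a direct count of the same pairs; the benefit only shows up in a distributed setting, where the first stage spreads the work for a hot key across several tasks.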

This tutorial on the IBM blog provides a brief introduction to Apache Sqoop that demonstrates a common use case—capturing changed rows from a database table and merging them with a previous data set to flatten to a single universal dataset.

https://developer.ibm.com/hadoop/2017/02/28/typical-scenario-sqoop-incremental-import-merge/
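
The merge step described above can be sketched in a few lines of plain Python. This illustrates the idea only and is not Sqoop code or the tutorial's commands; the column names `id` and `updated_at`, the function name `merge_incremental`, and the sample rows are hypothetical.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]

def merge_incremental(base: Iterable[Row], changes: Iterable[Row],
                      key: str = "id", ts: str = "updated_at") -> List[Row]:
    """Flatten a base data set and a batch of changed rows into one data set.

    For every value of `key`, keep the row with the newest `ts`.
    On a tie, the row seen later (i.e. from `changes`) wins.
    """
    latest: Dict[Any, Row] = {}
    for row in list(base) + list(changes):
        current = latest.get(row[key])
        if current is None or row[ts] >= current[ts]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])

# Made-up rows: id 2 was updated and id 3 is new since the last import.
base = [
    {"id": 1, "name": "a", "updated_at": 1},
    {"id": 2, "name": "b", "updated_at": 1},
]
changes = [
    {"id": 2, "name": "b2", "updated_at": 2},
    {"id": 3, "name": "c", "updated_at": 2},
]
merged = merge_incremental(base, changes)
```

Here `merged` holds one row per `id`, with the updated version of row 2 replacing the old one.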

The Hortonworks blog has a thorough overview of Apache Ranger's feature set, including how it provides attribute-based access control, its policy engine framework, its Key Management Service (which can integrate with a hardware security module), dynamic column masking capabilities for Apache Hive, central auditing, and more.

https://hortonworks.com/blog/morphing-time-apache-ranger-graduates-top-level-project-part-2/

Most AWS services integrate with CloudTrail for auditing. Once you start adding a few services, this can generate an awful lot of data that is overwhelming to consume. The new Amazon Athena is a useful tool for analyzing that data, given that it doesn't require any additional infrastructure. The AWS big data blog has a tutorial with several example queries to get started analyzing the data.

https://aws.amazon.com/blogs/big-data/aws-cloudtrail-and-amazon-athena-dive-deep-to-analyze-security-compliance-and-operational-activity/

News

The Next Platform has an interview with Cloudera's Mike Olson about Hadoop and its seemingly falling popularity. The discussion covers how "Hadoop-at-large" is much bigger than just the Hadoop project (much like the coverage of technologies in this newsletter).

https://www.nextplatform.com/2017/03/01/looking-long-enterprise-road-hadoop/

DataEngConf is in just over a month (April 25-28) in San Francisco. The website also has videos and slides of many talks from past conferences.

http://www.dataengconf.com/

dotScale takes place in Paris on April 24th. The conference covers a number of topics that should be of interest to Hadoop Weekly subscribers, including scalability, devops, and distributed systems. dotScale is offering a 20% discount to subscribers of Hadoop Weekly using the link below.

https://dotscale2017.eventbrite.com?discount=HADOOPWEEKLY

Releases

Apache Accumulo 1.8.1 is out with some major changes, including a fix for an issue with scans after a minor compaction, improvements to tablet server performance, and a fix for a synchronization issue with deep copies. There are several other major changes and a dozen other notable changes called out in the release announcement.

http://accumulo.apache.org/release/accumulo-1.8.1/

On the heels of the Apache Kafka 0.10.2 release, Confluent has announced Confluent 3.2. On top of the Kafka features, they've added support for .NET and JMS clients, a new S3 connector, and improvements to the Confluent Control Center.

https://www.confluent.io/blog/confluent-3-2-apache-kafka-0-10-2-now-available/

Apache HBase 1.1.9 was released with several correctness fixes.

http://mail-archives.apache.org/mod_mbox/hbase-user/201702.mbox/%3CCANZa%3DGuTGmkh64Of8%3DSDVtyRE%2BbmWRo_HQ8t6PMgqm-dcCk3HQ%40mail.gmail.com%3E

StreamSets has announced version 2.4.0.0 of the StreamSets Data Collector. This release improves support for multi-tenancy, adds support for additional versions of external dependencies, and more.

https://streamsets.com/blog/announcing-data-collector-ver-2-4-0-0/

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

Washington

Overkill Analytics + More (Seattle) - Wednesday, March 8
https://www.meetup.com/Seattle-Scalability-Meetup/events/235874411/

Texas

Dean Wampler: Stream All the Things! (Austin) - Tuesday, March 7
https://www.meetup.com/Austin-Apache-Kafka-Meetup-Stream-Data-Platform/events/237913791/

Holden Karau: Debugging Apache Spark, Making Sense of Stack Traces & More (Austin) - Tuesday, March 7
https://www.meetup.com/austin-spark-meetup/events/237914759/

Beyond Stream Analytics (Houston) - Thursday, March 9
https://www.meetup.com/Houston-Data-Science/events/237532793/

IRELAND

Fast Analytics on Fast Data for Hadoop + Processing Geo Information in Big Data (Dublin) - Monday, March 6
https://www.meetup.com/hadoop-user-group-ireland/events/237498007/

SPAIN

Patterns of Integration between Kafka and Couchbase (Madrid) - Thursday, March 9
https://www.meetup.com/Couchbase-Espana/events/237520027/

GERMANY

Best of Spark Summit East + More (Berlin) - Thursday, March 9
https://www.meetup.com/Berlin-Apache-Spark-Meetup/events/237849059/

TAIWAN

Exciting New Features in Flink 1.2, Flink-Ppml, and Kafka Streams (Taipei) - Sunday, March 12
https://www.meetup.com/flink-tw/events/237720043/

NEW ZEALAND

Apache Spark Meetup (Auckland) - Tuesday, March 7
https://www.meetup.com/Auckland-Apache-Spark-User-Group/events/236219374/