Data Eng Weekly


Hadoop Weekly Issue #205

19 February 2017

Tons of great content this week including a look at Google's new Cloud Spanner, the YARN fair scheduler, connecting Splunk with Kafka, and Jepsen testing of CockroachDB. In news, there are CFPs open for Data Platforms and HBaseCon, and Kafka Summit New York has announced the conference schedule. In releases, there's a neat new command-line tool for interacting with HDFS.

Technical

Google announced that the Google Cloud Platform is adding "Cloud Spanner," which is a highly-available and consistent database. This article describes how they achieve high availability at the network level to ensure that their CP system (in terms of the CAP theorem) has five-nines of availability.

https://cloudplatform.googleblog.com/2017/02/inside-Cloud-Spanner-qand-the-CAP-Theorem.html
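
For a sense of scale, "five nines" can be turned into a yearly downtime budget. The snippet below is only a back-of-the-envelope sketch: the function name is made up for illustration and it assumes a 365-day year, neither of which comes from the article.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def allowed_downtime_minutes(availability):
    """Minutes of downtime per year permitted at the given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Five nines = 99.999% availability.
five_nines = allowed_downtime_minutes(0.99999)
print(f"{five_nines:.2f} minutes per year")
```

That works out to a little over five minutes of unavailability per year.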

The Cloudera blog has the latest in their series on the YARN FairScheduler. In this post, there are a number of example queue configurations for common scenarios such as a best effort queue, low latency computations, limiting the size of ad-hoc queries, as well as more complicated configurations involving nested organizations with varying resource allotments.

http://blog.cloudera.com/blog/2017/02/untangling-apache-hadoop-yarn-part-5-using-fairscheduler-queue-properties/

The Hive metastore has a hard limit of 4000 characters for nested schemas in a single column. It's possible to work around this, but doing so requires a few different hacks (covered in this post) to add partitioned data.

https://blog.godatadriven.com/import-google-analytics-hive
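
To see how quickly a nested schema can reach that limit, here is a rough sketch that renders a nested schema as a Hive type string and checks its length. The helper names and the dict/list schema representation are hypothetical and not taken from the post; the 4000-character figure is the one quoted above.

```python
HIVE_TYPE_NAME_LIMIT = 4000  # limit quoted above


def hive_type(schema):
    """Render a nested schema as a Hive type string.

    A dict stands for a struct, a one-element list for an array, and a
    plain string for a primitive type name such as "string" or "bigint".
    """
    if isinstance(schema, dict):
        fields = ",".join(f"{name}:{hive_type(sub)}" for name, sub in schema.items())
        return f"struct<{fields}>"
    if isinstance(schema, list):
        return f"array<{hive_type(schema[0])}>"
    return schema


def exceeds_limit(schema, limit=HIVE_TYPE_NAME_LIMIT):
    """True if the rendered type string is longer than the limit."""
    return len(hive_type(schema)) > limit


small = {"a": "string", "b": {"c": "int"}}
wide = {f"field_{i}": "string" for i in range(400)}
print(hive_type(small), exceeds_limit(small), exceeds_limit(wide))
```

A struct with a few hundred fields is already enough to go over the limit, which is easy to hit with wide, deeply nested exports.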

As mentioned in last week's issue, the second alpha release of Apache Hadoop 3.0.0 is out. This article describes three highlights of the release: classpath isolation for client jars, support for Microsoft Azure Data Lake and the Aliyun Object Storage System, and support for opportunistic containers and distributed scheduling in YARN.

http://blog.cloudera.com/blog/2017/02/apache-hadoop-3-0-0-alpha2-released/

The video analytics company, Mux, has written about their use of Apache Flink with Amazon Kinesis to detect errors in video playback. Much of the post is devoted to an overview of Flink and the advantages of its event-time-based processing, but there is a bit at the end about Flink at Mux. Specifically, Mux mentions the usage of the "rolling-fold" operator to set a per-customer baseline for error rate.

https://mux.com/blog/discovering-anomalies-in-real-time-with-apache-flink/
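
As a rough illustration of what a keyed rolling fold does, the sketch below keeps one accumulator per customer and emits an updated error rate after every event. It is plain Python rather than Flink code, and the function name and the event shape are assumptions made for the example, not details from the Mux post.

```python
from collections import defaultdict


def rolling_error_rates(events):
    """Yield (customer, running_error_rate) after each event.

    events: iterable of (customer_id, is_error) pairs.
    """
    # Per-customer accumulator: [error_count, total_count]
    state = defaultdict(lambda: [0, 0])
    for customer, is_error in events:
        acc = state[customer]
        acc[0] += 1 if is_error else 0
        acc[1] += 1
        # Emit the updated aggregate for this key, as a rolling fold would.
        yield customer, acc[0] / acc[1]


events = [("a", False), ("a", True), ("b", False), ("a", False)]
rates = list(rolling_error_rates(events))
print(rates)
```

Each customer's rate only changes when that customer's events arrive, which is what makes the result usable as a per-customer baseline.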

This post describes (including the architecture and design choices) a new Kafka Connect plugin for sending data from Kafka to Splunk, and it provides a tutorial for setting up a Kafka Connect program to stream data from a Kafka topic to Splunk via the Splunk Heavy Forwarder.

https://lilgreenwein.com/2017/02/16/splunking-kafka-with-kafka-connect/

The Jepsen blog has a post about recent testing of CockroachDB, which is a distributed SQL database. The post has some great background on the semantics and guarantees of the database (which has similar design goals to Google's Spanner), describes the tests and results in depth, and includes a discussion of some of the improvements that Cockroach Labs made as a result of the findings.

https://jepsen.io/analyses/cockroachdb-beta-20160829

The data team at Stitch Fix has recently migrated from Amazon Redshift to Spark (including PySpark and Spark SQL). This presentation discusses some of the reasons that they made the move, some of the gotchas they encountered during the migration (e.g. differences in SQL syntax), their approach to multi-tenancy using the Netflix Genie job server, and more.

http://www.slideshare.net/piggybox/migration-from-redshift-to-spark

This tutorial shows how to run Spark locally (or some other place outside of Azure) to process data stored in the Azure Data Lake Store.

https://medium.com/azure-data-lake/connecting-your-own-hadoop-or-spark-to-azure-data-lake-store-93d426d6a5f4

Cloudera has published an updated version of their Impala Cookbook, which covers topics like schema design, cluster sizing, hardware recommendations, and query tuning.

http://blog.cloudera.com/blog/2017/02/latest-impala-cookbook/

This post dives into the internals of Spark and the JVM to help understand an optimization in a Spark program that resulted in a particular query running even faster than expected.

https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html

The AWS Big Data blog has a thorough look at Amazon Athena's support for JSON data. It looks at a simple example of nested JSON data (event data from the Amazon Simple Email Service), adding fields with special characters, auto-generating a DDL from sample data, and more.

https://aws.amazon.com/blogs/big-data/create-tables-in-amazon-athena-from-nested-json-and-mappings-using-jsonserde/

News

Confluent has published the results from a survey of Apache Kafka users. This post describes feedback on which languages folks are using with Kafka and which client properties are most important.

https://www.confluent.io/blog/first-annual-state-apache-kafka-client-use-survey/

Data Platforms is a new conference taking place in Phoenix in May. The call for papers is open through March 15th.

https://www.dataplatforms.com/

HBaseCon is June 12th in San Francisco. The call for abstracts is open until April 24th.

https://easychair.org/cfp/hbasecon2017

The agenda for Kafka Summit New York, which takes place on May 8th, has been posted.

https://kafka-summit.org/kafka-summit-ny/schedule/

Releases

Google has announced the public beta of their Cloud Spanner distributed relational database. It has a pay-as-you-go pricing model and provides client libraries for several popular languages as well as a JDBC driver.

https://cloudplatform.googleblog.com/2017/02/introducing-Cloud-Spanner-a-global-database-service-for-mission-critical-applications.html

Syncsort has announced a new version of their DMX-h software, which integrates Hadoop, Spark, mainframes, and other data systems. This version adds support for Spark 2.0 and a new integrated workflow.

http://www.businesswire.com/news/home/20170216005164/en

Apache Storm 1.0.3 was released. It is mostly a bug-fix release, and the changelog contains over 60 resolved tickets.

http://storm.apache.org/2017/02/14/storm103-released.html

HDFS shell is a new tool that provides an interactive shell for performing HDFS operations from the command line. There's a GIF on GitHub that gives a brief overview of its core functions.

https://github.com/avast/hdfs-shell

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

How Data Drives Decisions at Netflix (Mountain View) - Tuesday, February 21
https://www.meetup.com/SiliconValley-Microsoft-Data-Platform/events/236735786/

Big Data Science Meetup (Fremont) - Friday, February 24
https://www.meetup.com/Big-Data-Science/events/234127846/

Oregon

Distributed Persistent Memory for Spark (Portland) - Thursday, February 23
https://www.meetup.com/Portland-Spark-User-Group/events/237248020/

Washington

Streaming Data Platforms & Hotel Search in the Cloud (Bellevue) - Monday, February 20
https://www.meetup.com/Eastside-DevOps-Meetup/events/237421589/

IBM Presenting at Seattle Spark MeetUp (Seattle) - Tuesday, February 21
https://www.meetup.com/BleedingBlue/events/236344327/

Spark Working with an IDE: Notebook/Shiny + Resource Managers: Which Is Best (Bellevue) - Tuesday, February 21
https://www.meetup.com/Seattle-Spark-Meetup/events/232754940/

Seattle Scalability Meetup (Seattle) - Wednesday, February 22
https://www.meetup.com/Seattle-Scalability-Meetup/events/235874411/

Utah

Data Science and Hadoop Lunch (Lehi) - Thursday, February 23
https://www.meetup.com/BigDataUtah/events/237701077/

Texas

Powering Near-Real-Time Decisioning with Impala (Addison) - Thursday, February 23
https://www.meetup.com/DFW-BigData/events/237420632/

Illinois

ChiPy Data Science SIG (Chicago) - Monday, February 20
https://www.meetup.com/Metis-Chicago-Data-Science/events/237585592/

Hands-on Apache Flink Workshop! (Chicago) - Tuesday, February 21
https://www.meetup.com/Chicago-Apache-Flink-Meetup-CHAF/events/237385428/

Building Streaming Data Applications Using Kafka (Chicago) - Thursday, February 23
https://www.meetup.com/ChicagoRealTimeStreamingAnalytics/events/237343071/

Florida

SQL Server Polybase & Hadoop: The Powerful Combo (Fort Lauderdale) - Wednesday, February 22
https://www.meetup.com/Microsoft-Business-Intelligence-User-Group-of-South-Florida/events/237687844/

Reactive Streams: Akka & Kafka (Miami) - Thursday, February 23
https://www.meetup.com/Miami-Scala-Enthusiasts/events/236899551/

Georgia

Kafka with Craig McCown (Atlanta) - Monday, February 20
https://www.meetup.com/Docker-Atlanta/events/237072881/

North Carolina

Leveraging Hadoop for Advanced Cyber Security (Charlotte) - Thursday, February 23
https://www.meetup.com/CharlotteHUG/events/235107598/

Virginia

Using SQL-Compliant Applications and Code to Get the Most Out of Hadoop Data (Vienna) - Wednesday, February 22
https://www.meetup.com/bigdatadc/events/237330685/

Ansible Use Cases: HortonWorks & Cumulus Networks (McLean) - Thursday, February 23
https://www.meetup.com/Ansible-NOVA/events/236853616/

District of Columbia

IOT Real-time Big Data Analytics Using Kafka, Cassandra, and Spark (Washington) - Thursday, February 23
https://www.meetup.com/BusinessIntelligentsiaDC/events/234908223/

Pennsylvania

Big Data with Azure Data Lake Store and Data Lake Analytics (Pittsburgh) - Tuesday, February 21
https://www.meetup.com/Pittsburgh-Azure-Meetup/events/237190559/

DataPhilly Speaker Series (Philadelphia) - Thursday, February 23
https://www.meetup.com/DataPhilly/events/237673648/

New York

Crunching Streams of Data: An Introduction to Akka Streams (New York) - Thursday, February 23
https://www.meetup.com/New-York-Scala-University/events/237134973/

CANADA

Apache Spark #17 (Toronto) - Wednesday, February 22
https://www.meetup.com/Toronto-Apache-Spark/events/237474395/

UNITED KINGDOM

Tutorial: Get Your Hands on Implementing a Flink App (London) - Wednesday, February 22
https://www.meetup.com/Apache-Flink-London-Meetup/events/237335603/

Apache Spark Real World Use-Cases (Manchester) - Wednesday, February 22
https://www.meetup.com/HadoopManchester/events/237427329/

FRANCE

Data AZUG Meetup (Neuilly/Seine) - Wednesday, February 22
https://www.meetup.com/AZUG-FR/events/237604268/

Criteo Infrastructure Platform Meetup (Paris) - Wednesday, February 22
https://www.meetup.com/Criteo-Labs-Tech-Talks/events/237401817/

NETHERLANDS

Kafka All the Reactive Things (Amsterdam) - Tuesday, February 21
https://www.meetup.com/Reactive-Amsterdam/events/237449829/

Big Data Ingestion Part 2 (Amsterdam) - Thursday, February 23
https://www.meetup.com/Dutch-Azure-Meetup/events/235816631/

GERMANY

Let's Talk about Apache Flink 1.2, and Put It in a Container! (Karlsruhe) - Tuesday, February 21
https://www.meetup.com/inovex-karlsruhe/events/237131183/

Data Ingestion with Apache NiFi (Nuremberg) - Thursday, February 23
https://www.meetup.com/Nuernberg-Big-Data/events/237392793/

WebTech Night: Kafka Night! (Karlsruhe) - Thursday, February 23
https://www.meetup.com/WebTechNight-Karlsruhe/events/237169748/

HUNGARY

Hadoop Rockstars (Budapest) - Tuesday, February 21
https://www.meetup.com/futureofdata-budapest/events/236853376/

INDIA

Introduction to Apache Spark and Build Your First Apache Spark Application (Bangalore) - Saturday, February 25
https://www.meetup.com/Bangalore-Spark-Enthusiasts/events/237679999/

SOUTH AFRICA

Real-Time Big Data Analytics Use Cases (Johannesburg) - Tuesday, February 21
https://www.meetup.com/RSA-Real-Time-Big-Data-Analytics-and-Machine-Learning/events/236628971/