Data Eng Weekly Issue #259

08 April 2018

Lots of great content this week: stream processing with Apache Kafka, consistent hashing strategies, several posts on data infra architecture, and two posts related to Apache Hive performance. Flink Forward is this week and DataEngConf is next week (see News for a discount code). In releases, Apache Hadoop 3.1.0 is out, Apache Hive released a new version to fix some security issues, and ElastiCluster is a cluster provisioning tool that recently gained support for Hadoop and Spark.

Sponsor

Are you looking to elevate your data culture (http://bit.ly/culture-at-netflix)? Do you want to work with stunning colleagues, solve complex data problems and contribute to the best video viewing experience in the industry? Come press ‘play’ with Netflix. We’re hiring for many types of data roles. Look for us at the upcoming DataEngConf and say hello to learn more.

http://bit.ly/netflix-jobs-data

Technical

This post describes one company's journey with BigQuery: the advantages it provides in performance and usability, but also how that ease of use can be a double-edged sword, because it's easy to forget to optimize when it's so easy to throw more compute (and money) at a query.

https://labs.unacast.com/one-year-with-bigquery-e3ebd73749cd

While it's been around for a while, I only recently came across the WSO2 streaming SQL system. In addition to analyzing data from Kafka, the open-source engine can take in data from lots of other sources (like HTTP, MQTT, JMS, and email). This post provides a good introduction to its SQL syntax, including how to use it for lots of common operations.

https://wso2.com/library/articles/2018/02/stream-processing-101-from-sql-to-streaming-sql-in-ten-minutes/

If you want to test out an unreleased or custom build of Spark, this post walks through the build and deploy steps for running on a YARN cluster or Kubernetes. The tutorial is tailored to Google Cloud Platform, but most of the steps are broadly applicable.

https://cloud.google.com/blog/big-data/2018/03/testing-future-apache-spark-releases-and-changes-on-google-kubernetes-engine-and-cloud-dataproc

This post is a great overview of consistent hashing algorithms, which are a common building block in a distributed system. The post starts off with the classic ring-based consistent hashing mechanism and goes into more recent inventions like Jump Hash, Multi-Probe Consistent Hashing, Rendezvous Hashing, and Maglev Hash (three of the four are from Google).

https://medium.com/@dgryski/consistent-hashing-algorithmic-tradeoffs-ef6b8e2fcae8
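As a rough illustration of one of the algorithms named above, here is a minimal Python sketch of Jump Hash, following the published algorithm by Lamping and Veach. The `key_from_string` helper and the choice of SHA-256 to derive a 64-bit key are assumptions made for this example and are not taken from the linked post.

```python
import hashlib


def key_from_string(s: str) -> int:
    """Hypothetical helper: derive a 64-bit integer key from a string."""
    return int.from_bytes(hashlib.sha256(s.encode("utf-8")).digest()[:8], "big")


def jump_hash(key: int, num_buckets: int) -> int:
    """Map a 64-bit key to a bucket in [0, num_buckets) using Jump Hash."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step; the mask emulates uint64 overflow.
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        # Jump ahead to the next bucket count at which this key would move.
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


if __name__ == "__main__":
    for name in ("user-1", "user-2", "user-3"):
        k = key_from_string(name)
        print(name, jump_hash(k, 10), jump_hash(k, 11))
```

The property that makes this a consistent hash: when the bucket count grows from n to n+1, a key either keeps its bucket or moves to the new bucket n, so only about 1/(n+1) of keys are remapped. The trade-off is that buckets are numbered, so only the last one can be removed cheaply.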

This post summarizes a recent presentation on building and deploying ML models at Booking.com. They deploy docker containers using Kubernetes (which has some experimental support for GPUs) and, for serving, pull the pre-trained models from HDFS.

https://www.infoq.com/news/2018/04/booking-kubernetes-machine-learn

The Teads data analytics team writes about their data infrastructure, which is spread across Google Cloud Platform for BigQuery and DataFlow and Amazon Web Services for Kafka and Redshift. They describe a number of key architectural decisions that they made for ingesting, storing, and querying data in BigQuery as well as some limitations. Lots of good tips if you're considering a similar architecture (including some gotchas related to pricing to keep an eye on).

https://medium.com/teads-engineering/give-meaning-to-100-billion-analytics-events-a-day-d6ba09aa8f44

MR3 is a new execution engine for Apache Hadoop. It claims feature parity with Tez and improved performance. This two-part series describes the performance gains when running Hive on MR3 vs. Tez.

https://mr3.postech.ac.kr/blog/2018/04/02/performance-evaluation-sequential-tpcds/

Hortonworks has analyzed Hive LLAP runtime when data is stored in HDFS or Amazon S3 (tiered vs. decoupled storage strategies) for cloud workflows. They found a small fixed-cost overhead for storing data in S3 rather than HDFS (at least for their test scenario).

https://hortonworks.com/blog/cloud-architectures-interactive-analytics-apache-hive/

A quick overview and performance evaluation of Spark RDDs and DataFrames. The performance eval uses the Wikipedia dataset and some simple aggregations both with Spark and PySpark.

https://mindfulmachines.io/blog/2018/4/3/spark-rdds-and-dataframes

Using the Syslog Apache Kafka Connect plugin, you can get syslog data into Kafka in Avro format for analysis. Once you've collected that data, this post shows how to analyze it using KSQL.

https://www.confluent.io/blog/real-time-syslog-processing-apache-kafka-ksql-part-1-filtering

Great overview of the stream and table concepts in stream processing. The post has lots of animations to help illustrate the relationship between the two, and it has some examples in Scala and KSQL for building a table from a stream.

http://www.michael-noll.com/blog/2018/04/05/of-stream-and-tables-in-kafka-and-stream-processing-part1/
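The stream and table relationship can be sketched in a few lines of plain Python (no Kafka involved). This is only an illustration under assumed names: the event shape `(key, value)` and the functions `latest_value_table` and `count_changelog` are hypothetical and do not come from the linked post.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, str]  # assumed shape: (key, value)


def latest_value_table(stream: Iterable[Event]) -> Dict[str, str]:
    """Stream -> table: replay events in order, keeping the latest value per key."""
    table: Dict[str, str] = {}
    for key, value in stream:
        table[key] = value  # upsert: a later event overwrites an earlier one
    return table


def count_changelog(stream: Iterable[Event]) -> List[Tuple[str, int]]:
    """Table -> stream: emit every update of a per-key count table as a new event."""
    counts: Counter = Counter()
    changelog: List[Tuple[str, int]] = []
    for key, _ in stream:
        counts[key] += 1
        changelog.append((key, counts[key]))
    return changelog


if __name__ == "__main__":
    events = [("alice", "Paris"), ("bob", "Sydney"), ("alice", "Rome")]
    print(latest_value_table(events))
    print(count_changelog(events))
```

Replaying the changelog through `latest_value_table`-style upserts rebuilds the count table, which is the duality in miniature: a table is a snapshot of a stream, and a stream is the history of a table's changes.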

This tutorial walks through running Apache Kafka on Microsoft HDInsight and hooking it up to a Spark cluster on Databricks for stream processing.

https://lenadroid.github.io/posts/kafka-hdinsight-and-spark-databricks.html

News

Flink Forward is this week in San Francisco. The team at data Artisans has been previewing the sessions on their blog for the past few weeks. Here's part 6 of their coverage.

https://data-artisans.com/blog/flink-forward-san-francisco-preview-part-6-of-6-technology-deep-dive

DataEngConf, as mentioned last week, is in just over a week in San Francisco. The conference is offering 25% off for readers of Data Eng Weekly with the code DEWEEKLY.

https://www.eventbrite.com/e/dataengconf-sf-18-tickets-42458685070?discount=DEWEEKLY

The InfoQ podcast has an interview with Uber's Danny Yuan on their streaming systems, which include Apache Kafka, Apache Flink, Apache HDFS, and more. They process around 1 million messages per second, which feeds into OLAP systems, ML models, and more.

https://www.infoq.com/podcasts/Danny-Yuan-uber

RedMonk has analyzed the state of purpose-built time series database engines as well as some general-purpose databases that are often used for time series data. They look at popularity on GitHub and Stack Overflow and discuss the large amount of fragmentation we're currently seeing.

https://redmonk.com/rstephens/2018/04/03/the-state-of-the-time-series-database-market/

Releases

HUE 4.2 is out. The release focuses on cloud, analytics, and supportability. More details on those improvements in the release notes.

http://gethue.com/hue-4-2-and-its-self-service-bi-improvements-are-out/

Apache Hadoop 3.1.0 was released with the caveat that it's not yet considered production-ready. Major enhancements include updates to S3 support/performance and new features of the capacity scheduler. YARN also gets GPU and FPGA features as well as support for long-running services.

https://lists.apache.org/thread.html/8313e605c0ed0012f134cce9cc6adca738eea81feccea99c8de87cd9@%3Cgeneral.hadoop.apache.org%3E

Apache Hive 2.3.3 is out. It contains security fixes related to 3 CVEs, one of which has been around as far back as Hive 0.6.0.

https://lists.apache.org/thread.html/cc41d3b2e1c176b10cba7518edd968e84fc95927deb5225967602310@%3Cuser.hive.apache.org%3E
http://bit.ly/CVE-2018-1282
http://bit.ly/CVE-2018-1284
http://bit.ly/CVE-2018-1315

Scio, the Scala library for Apache Beam and Google Cloud Dataflow, has released version 0.5.2 with new JavaConverters implicits, custom validation for BigQuery, and more.

https://github.com/spotify/scio/releases/tag/v0.5.2

ElastiCluster, a tool for provisioning compute clusters, has added support for Hadoop and Spark clusters by way of BigTop. It's built with Ansible, has great documentation, and is licensed GPLv3.

http://elasticluster.readthedocs.io/en/latest/

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Airflow Meetup @ WePay (Redwood City) - Wednesday, April 11
https://www.meetup.com/Bay-Area-Apache-Airflow-Incubating-Meetup/events/247127561/

Ohio

Cleveland Red Hat User Group Meeting (Cleveland) - Tuesday, April 10
https://www.meetup.com/Cleveland-Red-Hat-Meetup/events/247175740/

New York

High-Performance Data Analytics & Visualizations for Volume, Variety, Velocity (New York) - Tuesday, April 10
https://www.meetup.com/mysqlnyc/events/249163161/

CANADA

Implementing Microservices with Domain Events + Replacing MirrorMaker (Vancouver) - Tuesday, April 10
https://www.meetup.com/vancouver-kafka/events/248682368/

SWEDEN

GCP Meetup at Google (Stockholm) - Tuesday, April 10
https://www.meetup.com/gcp-stockholm/events/249164223/

FINLAND

Apache Kafka Meetup @ Zalando (Helsinki) - Wednesday, April 11
https://www.meetup.com/Helsinki-Apache-Kafka-Meetup/events/249025327/

SPAIN

Creating a Big Data Architecture with Apache Spark (Barcelona) - Thursday, April 12
https://www.meetup.com/Spark-Barcelona/events/249359355/

FRANCE

Collaborative Music with Kafka! (Bordeaux) - Thursday, April 12
https://www.meetup.com/IppEvents/events/249380005/

SWITZERLAND

Lugano Tech Talks (Lugano) - Tuesday, April 10
https://www.meetup.com/Lugano-Tech-Talks/events/248955346/

CZECH REPUBLIC

Deep Learning on Hadoop (Prague) - Thursday, April 12
https://www.meetup.com/CS-HUG/events/249386944/

POLAND

Event-Driven Architecture with Kafka Streams (Katowice) - Friday, April 13
https://www.meetup.com/Silesia-JUG/events/249090295/

RUSSIA

Introduction to Hadoop and Spark (Moscow) - Thursday, April 12
https://www.meetup.com/BigAlgo/events/247865467/

INDIA

Real-Time Data Pipelines with Apache Kafka & Java Futurity (Bangalore) - Saturday, April 14
https://www.meetup.com/Core-Java-Meetup-Bangalore/events/249219945/

AUSTRALIA

Sydney Data Engineering Meetup (Surry Hills) - Thursday, April 12
https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/248326298/

SOUTH AFRICA

ZA HUG #4 (Johannesburg) - Thursday, April 12
https://www.meetup.com/ZA-Hadoop-User-Group/events/247574002/