Data Eng Weekly Issue #281

16 September 2018

New open source projects from Facebook, LinkedIn, Two Sigma, and Oath this week. Several great posts about companies' data experiences: the Netflix Keystone platform, Hike's experiences with BigQuery, Clio's experience sharding a production database, a next-gen time series database at Pinterest, optimizing Redshift at Plaid, and more. And based on some of the news out of Strata, it sounds like Hadoop is really getting ready to ride the Kubernetes wave.

Sponsor

Crunch Data Engineering and Analytics Conf (Oct 29-31, Budapest) is offering you a $60 discount with code WeeklyCrunch on Regular and Late Bird tickets. Google, Slack, Apache Arrow… full lineup below.

https://goo.gl/PfCa49

Technical

Azure Data Factory is a tool for visually designing and running ETLs between various systems (it has a bunch of connectors). This tutorial demonstrates setting up a job to load data from blob storage to a SQL database.

https://medium.com/@karandama2006/data-load-using-azure-data-factory-2528747752fc

Hike shares their experiences in moving from a Hive-based ad hoc analytics system to Google BigQuery. They saw good speedups, especially after making use of clustered tables. They detail their tooling and why they enabled require_partition_filter as a guard rail. Overall, they're seeing 50x speedups and half the cost.

https://blog.hike.in/moving-to-bigquery-data-at-our-fingertips-2273a71252ce

Clio recently went through the process of sharding their online MySQL database, and they've documented the details of the transition. As part of the work, they applied a regex to detect which operations contained joins and transactions that might be problematic. Lots of practical advice if you're facing something similar.

https://labs.clio.com/sharding-clios-database-part-1-710ec8f4861c
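
As a rough illustration of the regex screening idea mentioned above, here is a minimal Python sketch. The patterns and the flag_statement helper are hypothetical and are not taken from Clio's post; a real audit would need to handle comments, string literals, and ORM-generated SQL.

```python
import re
from typing import List

# Hypothetical patterns for illustration only; Clio's actual regex is not shown here.
JOIN_RE = re.compile(r"\bJOIN\b", re.IGNORECASE)
TXN_RE = re.compile(r"^\s*(BEGIN|START\s+TRANSACTION)\b", re.IGNORECASE)


def flag_statement(sql: str) -> List[str]:
    """Return the reasons a SQL statement deserves a manual look before sharding."""
    reasons = []
    if JOIN_RE.search(sql):
        # Joins may span tables that end up on different shards.
        reasons.append("join")
    if TXN_RE.search(sql):
        # Transactions may touch rows that end up on different shards.
        reasons.append("transaction")
    return reasons
```

Running something like this over a query log gives a first-pass list of statements to review by hand; the word-boundary match keeps column names such as joined_at from being flagged.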

Autotrader has a good walkthrough of setting up Spark to send logs to Logstash using the logstash-gelf library.

https://engineering.autotrader.co.uk/2018/09/10/sending-spark-logs-to-elk-using-logstash-gelf.html

The Plaid technology blog has a great overview of how they analyzed Redshift performance of the queries powering their BI dashboards in Periscope and what changes and improvements they made. These included some well known patterns like VACUUM/ANALYZE and also adding jobs to their Airflow workflow to precompute some rollups.

https://blog.plaid.com/managing-your-amazon-redshift-performance-how-plaid-uses-periscope-data/

Keystone is Netflix's platform for real-time stream processing for analytics. It's built on Apache Kafka and Apache Flink (in addition to a number of Netflix tools). This overview shows just how big the challenges are for building a multi-tenant tool at their scale—all the various flavors of stream processing are needed. The post then describes how they've built the system to meet those requirements and to be self-service with good operational characteristics.

https://medium.com/netflix-techblog/keystone-real-time-stream-processing-platform-a3ee651812a

"Streams and Tables: Two Sides of the Same Coin" formalizes some of the key concepts in Kafka Streams. It describes the trade-offs related to processing and event time, and walks through the Kafka implementation as a case study.

https://dl.acm.org/citation.cfm?id=3242155
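
For readers new to the stream/table idea, here is a minimal Python sketch of the general intuition: a table can be seen as the result of folding a changelog stream, keeping the latest value per key. This is an illustration under simple assumptions (in-order records, None as a delete marker), not the paper's formal model or the Kafka Streams API.

```python
from typing import Dict, Iterable, Optional, Tuple


def materialize(changelog: Iterable[Tuple[str, Optional[int]]]) -> Dict[str, int]:
    """Fold a changelog stream of (key, value) records into a table."""
    table: Dict[str, int] = {}
    for key, value in changelog:
        if value is None:
            # Assumed convention: a None value deletes the key (a "tombstone").
            table.pop(key, None)
        else:
            # A later record for the same key overwrites the earlier one.
            table[key] = value
    return table


changelog = [("alice", 1), ("bob", 5), ("alice", 3), ("bob", None)]
snapshot = materialize(changelog)
```

Replaying the same changelog always rebuilds the same table, which is the sense in which the two views carry the same information.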

Heap Analytics has a fascinating debugging story about how unexpected ClassLoader behavior led to problems in their Flink jobs. The post includes a lot of great JVM debugging tools to add to your tool belt (javap, BTrace, and -verbose:class).

https://heapanalytics.com/blog/engineering/missing-scala-class-noclassdeffounderror

This presentation discusses the evolution of the Hadoop ecosystem, and it argues that we are currently in the state of a "deconstructed database." That is, there are a number of components (storage, query model, data exchange, etc.) that have evolved and can often be swapped in and out. The slides close with some predictions about the future.

https://www.slideshare.net/julienledem/strata-ny-2018-the-deconstructed-database

Pinterest writes about their OpenTSDB replacement, Goku, which is wire-compatible and written in C++. The post talks about the architecture of the system (data replication and disk-based storage are forthcoming) and describes its performance characteristics. Pinterest is known to be big users of Apache HBase, so it's notable that they had enough practical challenges running a large cluster for OpenTSDB to motivate building a non-HBase replacement.

https://medium.com/@Pinterest_Engineering/goku-building-a-scalable-and-high-performant-time-series-database-system-a8ff5758a181

This KSQL tutorial shows how to build a streaming application that tracks music streaming events and maintains both an all-time and a last-30-seconds play count.

https://www.confluent.io/blog/building-streaming-application-ksql/
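
To make the two kinds of counts concrete, here is a minimal Python sketch of the same aggregation outside of KSQL. It assumes events arrive as (timestamp in seconds, song) pairs and that the 30-second count uses fixed, non-overlapping (tumbling) windows; the play_counts function and the event shape are hypothetical and not taken from the tutorial.

```python
from collections import Counter
from typing import Iterable, Tuple

WINDOW_SECONDS = 30  # assumed tumbling window size


def play_counts(events: Iterable[Tuple[int, str]]) -> Tuple[Counter, Counter]:
    """Return (all_time, windowed) play counts.

    all_time maps song -> total plays.
    windowed maps (window_start, song) -> plays within that 30-second window.
    """
    all_time: Counter = Counter()
    windowed: Counter = Counter()
    for ts, song in events:
        all_time[song] += 1
        # Align the timestamp to the start of its window.
        window_start = ts - ts % WINDOW_SECONDS
        windowed[(window_start, song)] += 1
    return all_time, windowed


events = [(0, "a"), (10, "a"), (31, "a"), (45, "b")]
all_time, windowed = play_counts(events)
```

A streaming engine keeps these counters updated continuously as events arrive; the batch loop here only shows what is being counted.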

The AWS blog has published a sample Complex Event Processing application built on Apache Flink and EMR. It's built to detect bushfires based on sensor data.

https://aws.amazon.com/blogs/big-data/real-time-bushfire-alerting-with-complex-event-processing-in-apache-flink-on-amazon-emr-and-iot-sensor-network/

Sponsor

Mount Sinai School of Medicine is hiring Data Engineers; come work on cool research and important applied problems in NYC's largest healthcare system!

https://careers.mountsinai.org/jobs/2311556 to apply

News

Hortonworks has announced the Open Hybrid Architecture Initiative, a project to improve the hybrid architecture of Hortonworks' products. There are three phases: containerization, separating compute and storage, and Kubernetes / OpenShift integration (in partnership with Red Hat and IBM). Along with a glimpse at future plans from Cloudera (second post), which is looking at moving to Kubernetes (MapR has supported K8s for a while), we're likely seeing the beginning of the end of YARN for workflow management.

https://hortonworks.com/blog/bringing-cloud-native-architecture-to-big-data-in-the-data-center/
https://www.datanami.com/2018/09/13/cloud-looms-large-at-strata-and-so-does-kubernetes/

Videos and slides from Flink Forward Berlin, which took place two weeks ago, have been posted.

https://data-artisans.com/flink-forward-berlin-2018

Strata NYC was this week. This article has coverage of a number of announcements and themes from the conference.

https://www.zdnet.com/article/strata-nyc-2018-ai-data-governance-containers-and-the-production-ready-data-lake/

Jobs

Have you checked out the Data Eng Weekly job board yet? https://jobs.dataengweekly.com/ New job this week:

Linux Big Data Engineer, G-Research, London: https://jobs.dataengweekly.com/jobs/cc513d48-56d0-4818-8364-84b1319a9411

Post a job for $99. https://jobs.dataengweekly.com/

Releases

Flint is a new open source library for time series data from Two Sigma. It provides primitives for manipulating entire time series, such as joining, windowing, and resampling. This blog post gives an overview of the API, which is in Python.

https://databricks.com/blog/2018/09/11/introducing-flint-a-time-series-library-for-apache-spark.html

Yahoo/Oath have open-sourced the Oak library, which implements a hybrid on-heap/off-heap concurrent ordered map for the JVM. It has impressive scaling and memory improvements over similar implementations, and it is already being integrated into Druid, a system that makes heavy use of this type of data structure.

https://yahoodevelopers.tumblr.com/post/178045146133/introducing-oak-an-open-source-scalable-key-value

Hortonworks Data Analytics Studio is now generally available. DAS is a web application that provides things like auto-complete of Hive queries, recommendations to improve query performance, and much more. There are some good screenshots and GIFs in the announcement.

https://hortonworks.com/blog/announcing-general-availability-data-analytics-studio/

BlueData has announced support for Google Cloud Platform and Microsoft Azure in their Big-data-as-a-Service platform.

https://www.bluedata.com/blog/2018/09/hybrid-and-multi-cloud-playbook-for-ai-and-big-data-workloads/

Version 2.4.1 of Apache Kylin, the OLAP engine for big data systems, was released. It includes 22 bug fixes and improvements.

https://kylin.apache.org/docs/release_notes.html

LinkedIn has open sourced TonY, their TensorFlow on YARN library. It includes fault tolerance via checkpointing to HDFS, leverages Hadoop's GPU scheduling and isolation, and has forthcoming support for TensorBoard, which is a diagnostic tool for TensorFlow.

https://engineering.linkedin.com/blog/2018/09/open-sourcing-tony--native-support-of-tensorflow-on-hadoop

Facebook has open sourced LogDevice, its distributed log system (with many similarities to Apache Kafka / Pulsar). It's written in C++ and there's a new website with docs on the architecture, configuration, running locally, and more.

https://logdevice.io/blog/2018/09/12/open-sourcing-announcement.html
