Data Eng Weekly Issue #263

06 May 2018

Great posts this week covering Kubernetes, Apache Kafka, Apache Flink, Alibaba's OceanBase, and more. We also have several articles on evolving data architectures inside of organizations, including at M6 and Netflix. Finally, there are many notable releases, like Apache HBase 2.0.0 and Apache Impala 2.12.0.

Sponsor

We’re Shopify. Our mission is to make commerce better for everyone (http://bit.ly/shopify-careers), and in the last few years we have built a data warehouse that is now ready to power data products and insights for all of our 500,000+ merchants! We thrive on change, operate on trust, and leverage the diverse perspectives of people on our team in everything we do. We solve problems at a rapid pace and at scale. In short, we get shit done. Join us!

http://bit.ly/join-us-shopify-data

Technical

The RocksDB key-value store has been adopted by a number of projects in the big data ecosystem, such as Apache Samza and Apache Flink. Its popularity is largely due to its consistent performance irrespective of available RAM. That's highlighted here, in a performance comparison between MySQL running InnoDB vs. the MyRocks storage engine. As long as the data set doesn't fit in RAM, MyRocks performs better than InnoDB.

https://www.percona.com/blog/2018/04/30/a-look-at-myrocks-performance/

The data team at M6 has written about the evolution of their data platform. It started with the Krux data management platform and has more recently landed on Hadoop. Along the two-year journey, they considered AWS, Google Cloud, and other options for building out their Hadoop cluster and supporting Tableau. There are lots of tips and lessons learned in the post, as well as several data points describing how long various phases of their migration took.

https://medium.com/@corychaplin/genesis-of-m6s-datalake-edf2524b7d67

Datanami has coverage of a talk from Flink Forward on Netflix's move to Apache Flink. There are some impressive numbers (e.g. 12PB of data per day) as well as several AWS-specific optimizations that the Netflix team implemented (such as randomness in output file names to prevent S3 hot-spotting).

https://www.datanami.com/2018/04/30/how-netflix-optimized-flink-for-massive-scale-on-aws/
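
The hot-spotting fix mentioned above can be sketched in a few lines. This is a minimal illustration, not Netflix's implementation: the function name, the prefix length, and the example path are assumptions, and a hash-derived prefix is used in place of true randomness so that the same logical path always maps to the same key.

```python
import hashlib


def spread_key(logical_path: str, prefix_len: int = 4) -> str:
    """Prepend a short hash-derived prefix to an object key.

    Keys that share a long common prefix (for example a date) tend to land
    in the same storage partition; a pseudo-random leading component
    spreads them out. `prefix_len` is an illustrative choice.
    """
    digest = hashlib.md5(logical_path.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{logical_path}"


# Hypothetical output path for one file written by a streaming job.
key = spread_key("events/2018/05/06/part-0001.parquet")
```

Deriving the prefix from the path keeps the mapping repeatable, so a reader can recompute the key without a lookup table. A purely random prefix would need to be recorded somewhere.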

At some point in the Apache Kafka journey, you're faced with a trade-off: either invest more in hardware to store historic data or ship data out of the cluster to another storage system (or let it expire out). This post argues that this implementation constraint is the biggest thing preventing an idyllic Kafka-centric architecture.

https://medium.com/@colin.hicks/what-keeps-apache-kafka-from-eating-the-world-ea1caeef99e4
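
The hardware side of that trade-off is simple arithmetic. The sketch below is a back-of-the-envelope estimate only: the function name and the example figures are assumptions, and it ignores compression, index files, and headroom.

```python
def kafka_disk_bytes(ingest_bytes_per_sec: float,
                     retention_days: float,
                     replication_factor: int = 3) -> float:
    """Rough cluster-wide disk needed to retain a stream in Kafka.

    Every byte produced is stored once per replica for the whole
    retention window.
    """
    seconds = retention_days * 24 * 60 * 60
    return ingest_bytes_per_sec * seconds * replication_factor


# Hypothetical workload: 10 MiB/s of ingest, 7 days of retention, 3 replicas.
needed = kafka_disk_bytes(10 * 1024**2, retention_days=7)
needed_tib = needed / 1024**4  # about 17.3 TiB
```

Because the estimate grows linearly with retention, keeping a year of the same stream instead of a week multiplies the disk requirement by roughly 52.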

The Landoop blog has a post walking through their new Lenses JDBC driver for SQL on Apache Kafka. They've published a Docker image that has everything you need to get going and try it out.

http://www.landoop.com/blog/2018/05/introducing-kafka-lenses-jdbc-driver/

Alibaba has written about their distributed relational database, OceanBase. The post focuses on the consistency and ACID semantics in OceanBase and how they compare to Azure's CosmosDB and Apache Cassandra.

https://medium.com/@alitech_2017/keeping-data-doppelgangers-in-synch-50a2c2863975

This post shares a number of best practices for getting an Apache NiFi project from POC to production.

https://medium.com/@abdelkrim.hadjidj/best-practices-for-using-apache-nifi-in-real-world-projects-3-takeaways-1fe6912101db

Cockroach Labs has a great post on best practices for running stateful applications (e.g. databases) on a Kubernetes 1.9 cluster. There's an in-depth look at the StatefulSets and DaemonSets features that Kubernetes added for these types of workloads. In a second post, they have recommendations specific to running CockroachDB with either of these two approaches.

https://www.cockroachlabs.com/blog/kubernetes-state-of-stateful-apps/
https://www.cockroachlabs.com/blog/kubernetes-orchestrate-sql-with-cockroachdb/

This post has a great overview of Amazon Redshift. It covers the fundamental concepts (like sort keys and distribution style) and how they play out in practice. There are also tips for system maintenance, including commands to detect how much data is marked for deletion and optimize table compression. Chances are that you'll learn something new, even if you're already using Redshift.

https://www.jefclaes.be/2018/05/amazon-redshift-fundamentals.html

A CVE for Apache Hadoop 2.2.0 through 2.7.3 was disclosed this week. The root of the issue was that the Hadoop scripts were trusting argv[0]. This post walks through the low-level details of the problem and the eventual solution.

https://effectivemachines.com/2018/05/03/fixing-apache-hadoop-cve-2016-6811-argv0-vs-security/

Databricks has a post pushing back on the conventional notion that big data tools are slower than standard tools for analyzing a data set that fits on a single machine. In a comparison with Pandas, they show good performance and better scaling with Spark and PySpark.

https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html

This post details a migration from Oracle to Redshift + Presto on Elastic MapReduce to save money and improve performance.

https://medium.com/@yogeshkansal/cost-saving-of-4m-dollars-per-year-by-migrating-from-oracle-to-redshift-emr-bb2895a4b98e

Data Eng Jobs

Check out the first two data engineering job postings on our new job board:

https://dataengweekly.com/jobs

News

FoundationDB was open sourced by Apple two weeks ago. It sounds like uptake on the project is strong: new language bindings, work on GraphDB and network block device layers, and more.

https://www.foundationdb.org/blog/foundationdb-community-highlights-two-weeks-in/

Confluent has announced the Confluent Operator for running Kafka on Kubernetes. While it's not out yet, it will be batteries-included, with automated deployments of multiple Kafka clusters using StatefulSets, support for operational tasks like scaling up and down, resiliency, and monitoring via Prometheus.

https://www.confluent.io/blog/introducing-the-confluent-operator-apache-kafka-on-kubernetes/

Releases

GitHub's gh-ost tool for online MySQL schema migrations has reached its general availability milestone. Written in Go, it copies data to a ghost table in the background and uses the binlog to stream updates to the ghost table before swapping the ghost table into place to complete the migration.

https://github.com/github/gh-ost
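
The copy, replay, and swap flow described above can be modeled in a few lines. This is a toy sketch using Python's built-in sqlite3 module, not gh-ost's code: the table names, the added `email` column, and the `pending_changes` list (standing in for the binlog stream) are all assumptions made for illustration.

```python
import sqlite3


def migrate_with_ghost_table(conn, pending_changes, chunk_size=2):
    """Toy ghost-table migration: copy rows, replay changes, then swap."""
    cur = conn.cursor()
    # 1. Create the ghost table with the new schema (adds an 'email' column).
    cur.execute(
        "CREATE TABLE _users_gho (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    # 2. Copy existing rows in small chunks, walking the primary key.
    last_id = 0
    while True:
        rows = cur.execute(
            "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        cur.executemany(
            "INSERT OR IGNORE INTO _users_gho (id, name) VALUES (?, ?)", rows
        )
        last_id = rows[-1][0]
    # 3. Replay writes that happened during the copy (binlog stand-in).
    for op, row_id, name in pending_changes:
        if op == "upsert":
            cur.execute(
                "INSERT OR REPLACE INTO _users_gho (id, name) VALUES (?, ?)",
                (row_id, name),
            )
        elif op == "delete":
            cur.execute("DELETE FROM _users_gho WHERE id = ?", (row_id,))
    # 4. Swap the ghost table into place.
    cur.execute("ALTER TABLE users RENAME TO _users_old")
    cur.execute("ALTER TABLE _users_gho RENAME TO users")
    conn.commit()
```

The point of the chunked copy is that the original table stays available for reads and writes for the whole migration; only the final rename needs to be quick.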

The Wallaroo streaming engine has released a Docker image to improve usability for building streaming applications in Python.

https://medium.com/wallaroo-labs-engineering/simplify-stream-processing-in-python-and-wallaroo-using-docker-97cbc57339f6

StreamSets has released new versions of StreamSets Data Collector and StreamSets Data Collector Edge. Features include support for HDFS as an origin, ORC files, and writing to Apache Kafka from the Collector Edge.

https://streamsets.com/blog/announcing-streamsets-data-collector-and-data-collector-edge-3-2/

Databricks has announced a new optimized auto-scaling feature. It aggressively scales the instances for a job up and down based on collected statistics. This can save significant costs by using fewer resources.

https://databricks.com/blog/2018/05/02/introducing-databricks-optimized-auto-scaling.html

In a somewhat similar tool, Banzai's cluster recommender provides an API that suggests a spot-instance cluster composition based on current pricing and the best practice of diversifying cluster nodes to survive instance termination.

https://banzaicloud.com/blog/cluster-recommender/
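
The diversification idea can be illustrated with a toy calculation. This is not Banzai's API or algorithm: the function name, the even split across types, and the prices in the example are assumptions made purely for illustration.

```python
import math


def recommend_spot_mix(price_per_vcpu, vcpus_per_node, total_vcpus, max_types=3):
    """Split the requested capacity evenly across the cheapest instance types.

    Spreading nodes over several types means that losing one spot market
    takes out only part of the cluster.
    """
    if not price_per_vcpu:
        raise ValueError("at least one instance type is required")
    # Pick the cheapest `max_types` instance types by price per vCPU.
    chosen = sorted(price_per_vcpu, key=price_per_vcpu.get)[:max_types]
    share = total_vcpus / len(chosen)
    # Round up so each type covers at least its share of the capacity.
    return {t: math.ceil(share / vcpus_per_node[t]) for t in chosen}


# Hypothetical spot prices in USD per vCPU-hour.
prices = {"m4.xlarge": 0.020, "c4.xlarge": 0.015, "r4.xlarge": 0.030, "m5.2xlarge": 0.018}
sizes = {"m4.xlarge": 4, "c4.xlarge": 4, "r4.xlarge": 4, "m5.2xlarge": 8}
mix = recommend_spot_mix(prices, sizes, total_vcpus=48)
```

A real recommender would also weigh memory, availability zones, and interruption rates; this sketch only shows how price and diversification pull in different directions.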

Apache HBase 2.0.0 was released. It has a number of new major features including off-heap read/write paths, a new region assignment manager, and in-memory compaction. The release has several backwards-incompatible changes.

https://lists.apache.org/thread.html/96079c034599fe374e0950e271c41f93d16db0998ba562e3bfa4f6bb@%3Cannounce.apache.org%3E

Version 2.12.0 of Apache Impala, the distributed SQL engine, was released. The new version adds support for decimal types on Apache Kudu tables, TABLESAMPLE functionality, and many improvements and bug fixes.

https://lists.apache.org/thread.html/325dd219b3250f063de48053b80bdea107d0dfb1b090ebf1f086b72e@%3Cannounce.apache.org%3E

The Kafka Security Manager project has been updated to support Apache Kafka 1.1.0.

https://github.com/simplesteph/kafka-security-manager/releases/tag/v0.2

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Hive User Group Meeting @Cloudera (Palo Alto) - Tuesday, May 8
https://www.meetup.com/Hive-User-Group-Meeting/events/249641278/

Processing Streaming Data with KSQL (Hollywood) - Thursday, May 10
https://www.meetup.com/Los-Angeles-Apache-Kafka-Meetup/events/250219714/

Georgia

Confluent and Striim Talk Kafka (Atlanta) - Wednesday, May 9
https://www.meetup.com/Kafka-ATL/events/250175814/

New York

Streaming Architectures in IoT (New York) - Tuesday, May 8
https://www.meetup.com/IoT-NY/events/249676865/

Using Apache Airflow to Create Dynamic, Extensible Data Workflows on GCP (New York) - Tuesday, May 8
https://www.meetup.com/Big-Data-Warehousing/events/250142741/

BRAZIL

Hortonworks & IBM Cognitive: The Future of Data Science (Sao Paulo) - Monday, May 7
https://www.meetup.com/futureofdata-saopaulo/events/249474208/

ITALY

Kafka Connect in Practice (Milan) - Wednesday, May 9
https://www.meetup.com/Milano-Kafka-meetup/events/249504469/

ROMANIA

Testing Big Data Solutions, Apache NiFi (Bucharest) - Tuesday, May 8
https://www.meetup.com/Bucharest-Big-Data-Meetup/events/250060831/

CHINA

Spark & Flink Meetup (Hangzhou) - Saturday, May 12
https://www.meetup.com/Hangzhou-Apache-Spark-Meetup/events/249424093/

AUSTRALIA

Big Data in Action: Azure HDInsight at Your Service (Brisbane) - Thursday, May 10
https://www.meetup.com/Brisbane-Data-Science-Meetup/events/248886466/