Data Eng Weekly


Data Eng Weekly Issue #290

18 November 2018

This week's issue covers several different technologies: Kafka, Pulsar, Spark, Druid, Airflow, and HDFS. In open-source news, Etsy announced an Airflow companion tool, Edmunds announced two tools for working with Databricks deployments, and Pravega announced a ZooKeeper Operator for Kubernetes. There's also an interesting look at how WeChat operates at massive scale. On a scheduling note, I'll be skipping next week's issue, so look for issue #291 on December 2nd. Please send articles you find my way in the meantime!

Sponsor

Interested in extracting value from large quantities of data? That's what we do at Criteo, and it's actually pretty much all we do. We have R&D job openings in data engineering, machine learning, and related areas. We use a large number of Big Data technologies, mainly open source, and like to contribute to the open-source community as much as possible. We are looking for people who are interested in tackling the challenges that come with ingesting and analyzing hundreds of terabytes of data per day. Never heard of Criteo? We are an international ad-tech company with offices all over the world. We have a pretty cool corporate culture: http://bit.ly/criteo-culture. We take Big Data seriously enough that we have our own conference, NABD (http://bit.ly/NABD-Conf), which is guaranteed to be free of marketing talks.

Check out our R&D jobs at our offices in Palo Alto, Ann Arbor, Paris, and Grenoble: http://bit.ly/criteo-rd-jobs

Technical

Edmunds writes about how they've scaled their use of Databricks notebooks. They've implemented and open-sourced several tools to automate checks (e.g., to validate job names), establish common software patterns, and implement a deployment pipeline. The open-source tools are a Java REST client for Databricks and a Maven plugin.

https://databricks.com/blog/2018/11/12/open-sourcing-databricks-integration-tools-at-edmunds.html
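
Edmunds' client and plugin are written in Java, but the kind of check they describe is easy to sketch. Here's a rough Python illustration of validating job names via the Databricks REST API (not their implementation; the host, token, and naming pattern are all made up):

    import re

    import requests

    # Hypothetical workspace URL and API token.
    HOST = "https://example.cloud.databricks.com"
    TOKEN = "dapi-example-token"

    # List all jobs in the workspace via the Jobs API.
    resp = requests.get(
        HOST + "/api/2.0/jobs/list",
        headers={"Authorization": "Bearer " + TOKEN},
    )
    resp.raise_for_status()

    # Flag jobs whose names don't match a (made-up) team convention,
    # e.g. "team-env-description".
    pattern = re.compile(r"^[a-z]+-(dev|prod)-[a-z0-9-]+$")
    for job in resp.json().get("jobs", []):
        name = job["settings"]["name"]
        if not pattern.match(name):
            print("Non-conforming job name: %s (job_id=%s)" % (name, job["job_id"]))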

Streamlio has examples of implementing several approximation algorithms as Apache Pulsar Functions. Their post covers Bloom filters, Count-Min Sketch, HyperLogLog, and more. There are good visuals accompanying each of the algorithms.

https://streaml.io/blog/eda-real-time-analytics-with-pulsar-functions
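
To give a flavor of the space/accuracy trade-off these data structures make, here's a toy Count-Min Sketch in Python. This is a from-scratch illustration, not the Pulsar Functions code from the post:

    import hashlib

    class CountMinSketch:
        """Toy Count-Min Sketch: estimates item counts in sub-linear space.

        Estimates never undercount; they may overcount due to hash collisions.
        """

        def __init__(self, width=1000, depth=5):
            self.width = width
            self.depth = depth
            self.table = [[0] * width for _ in range(depth)]

        def _buckets(self, item):
            # One hash-derived bucket per row; md5 here is just a cheap,
            # well-distributed hash, not a security choice.
            for row in range(self.depth):
                h = hashlib.md5(("%d:%s" % (row, item)).encode()).hexdigest()
                yield row, int(h, 16) % self.width

        def add(self, item, count=1):
            for row, col in self._buckets(item):
                self.table[row][col] += count

        def estimate(self, item):
            # The minimum across rows is the least-inflated estimate.
            return min(self.table[row][col] for row, col in self._buckets(item))

    sketch = CountMinSketch()
    for word in ["kafka", "pulsar", "kafka"]:
        sketch.add(word)
    print(sketch.estimate("kafka"))  # 2 (possibly higher with collisions)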

A tutorial that covers configuring rsyslog to send data to Apache Kafka and pulling data out of Kafka via Logstash.

https://sematext.com/blog/recipe-rsyslog-apache-kafka-logstash/
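
On the rsyslog side, the recipe boils down to loading the omkafka output module and pointing it at a broker. A minimal sketch, with placeholder broker and topic (see the tutorial for the full template, including JSON formatting):

    # /etc/rsyslog.d/kafka.conf -- forward syslog messages to Kafka
    module(load="omkafka")

    action(type="omkafka"
           broker=["localhost:9092"]
           topic="syslog")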

The slides and accompanying source code from "Build Event-Driven Microservices with Apache Kafka," a three-hour workshop at QCon SF, have been posted. There are three labs that cover building a web service for an online orders application.

https://drive.google.com/file/d/1vq1tcLk4FWK7BKRG5cEhOzbeXoy65hUK/view
https://github.com/confluentinc/qcon-microservices

The Confluent blog has a post explaining Kafka Connect's converters, which are used to serialize and deserialize data. It describes common errors related to Apache Avro and JSON (two common formats), how to troubleshoot problems by looking at logs and inspecting data directly in Kafka, and how to work with other data formats.

https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained
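
As a refresher, converters are configured per worker (or overridden per connector) with properties along these lines; the Schema Registry URL is a placeholder:

    # Avro with the Confluent Schema Registry
    key.converter=org.apache.kafka.connect.storage.StringConverter
    value.converter=io.confluent.connect.avro.AvroConverter
    value.converter.schema.registry.url=http://localhost:8081

    # Or, for schemaless JSON:
    # value.converter=org.apache.kafka.connect.json.JsonConverter
    # value.converter.schemas.enable=false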

Airbnb writes about their experiences with Druid for analytics. They describe how Druid complements their other big data systems, how they ingest data with Spark Streaming, integration with Presto, monitoring, and challenges/future improvements.

https://medium.com/airbnb-engineering/druid-airbnb-data-platform-601c312f2a4c
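
If you haven't used Druid, queries are served over HTTP by broker nodes, and there's a SQL endpoint in addition to the native JSON query language. A minimal sketch of querying it from Python (the broker address, datasource, and column names are made up):

    import requests

    sql = (
        "SELECT channel, COUNT(*) AS events "
        "FROM wikipedia "
        "GROUP BY channel ORDER BY events DESC LIMIT 10"
    )
    resp = requests.post("http://druid-broker:8082/druid/v2/sql", json={"query": sql})
    resp.raise_for_status()
    for row in resp.json():  # results come back as a JSON array of objects
        print(row)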

Etsy has open-sourced boundary-layer, a tool for defining Apache Airflow workflows in YAML. One of the reasons for the popularity of Python for workflow engines (e.g., Luigi and Airflow) is that there is inevitably scripting involved in running and deploying a workflow. But not everyone knows Python, and there are other reasons a declarative format might make sense (a compelling one here is that boundary-layer can convert Oozie workflows to its YAML format). This post has more details on their reasoning and how the tool is used at Etsy.

https://codeascraft.com/2018/11/14/boundary-layer%e2%80%89-declarative-airflow-workflows/
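
For context, this is the kind of Python boilerplate that a declarative format generates for you. A minimal Airflow DAG (using the Airflow 1.x API current at the time) looks something like this:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # Even a trivial two-task pipeline requires Python code like this;
    # boundary-layer produces equivalent DAG files from YAML definitions.
    dag = DAG(
        dag_id="example_etl",  # hypothetical DAG name
        start_date=datetime(2018, 11, 1),
        schedule_interval="@daily",
    )

    extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
    load = BashOperator(task_id="load", bash_command="echo load", dag=dag)
    extract >> load  # run extract before load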

Criteo (disclosure: they're sponsoring this week's issue!) has written about the challenges of scaling their Hadoop cluster. There are a number of interesting anecdotes from debugging NameNode performance issues and outages. The Criteo HDFS cluster adds 40PB of data each year, and the NameNode stores hundreds of millions of blocks.

https://medium.com/criteo-labs/40pb-per-year-the-challenge-of-data-growth-at-criteo-5d5b73ec5294

The Databricks blog has an introduction to the new higher-order functions for complex data types in Spark SQL, which debuted in Spark 2.4. For example, transform applies a lambda function to each element of an array (there are analogous functions for maps, such as transform_values).

https://databricks.com/blog/2018/11/16/introducing-new-built-in-functions-and-higher-order-functions-for-complex-data-types-in-apache-spark.html
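
A quick example of what that looks like in a Spark 2.4 session (via PySpark here):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hof-demo").getOrCreate()

    # transform applies the lambda x -> x + 1 to each array element.
    spark.sql(
        "SELECT transform(array(1, 2, 3), x -> x + 1) AS plus_one"
    ).show()
    # Prints a single row containing [2, 3, 4].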

WeChat recently published a paper about how they scale their microservices architecture, which is made up of over 3,000 services. The morning paper has a great summary of the paper and of the overload control mechanism, which is essential for handling large-scale spikes in traffic.

https://blog.acolyer.org/2018/11/16/overload-control-for-scaling-wechat-microservices/
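
The heart of the mechanism is admission control: when queueing delay signals overload, requests below a priority bar are shed at the front door. Here's a toy sketch of that idea in Python; the thresholds and adaptation rule are made up and are not the paper's actual algorithm:

    class OverloadController:
        """Toy priority-based admission control (illustration only)."""

        def __init__(self, max_delay_ms=20):
            self.max_delay_ms = max_delay_ms
            self.priority_bar = 0  # 0 admits everything

        def observe_delay(self, delay_ms):
            # Raise the bar under overload; relax it when there's headroom.
            if delay_ms > self.max_delay_ms:
                self.priority_bar += 1
            elif self.priority_bar > 0:
                self.priority_bar -= 1

        def admit(self, priority):
            # Higher-priority requests survive overload longer.
            return priority >= self.priority_bar

    controller = OverloadController()
    controller.observe_delay(35)         # overload detected; bar rises
    print(controller.admit(priority=0))  # False: low-priority request shed
    print(controller.admit(priority=2))  # True: high-priority request admitted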

Jobs

Post a job to the Data Eng Weekly job board for $99. https://jobs.dataengweekly.com/

News

The Ceph Foundation, which is organized under the Linux Foundation, is a new development community for the Ceph project. Ceph includes among its features an S3-compatible object store.

https://www.zdnet.com/article/ceph-open-source-storage-takes-an-organizational-step-forward/

Amazon has announced that they're maintaining a distribution of OpenJDK called Amazon Corretto. The OpenJDK 8-based version is available in preview now (with GA targeted for Q1 2019), and an OpenJDK 11-based version is planned for April of next year.

https://aws.amazon.com/blogs/opensource/amazon-corretto-no-cost-distribution-openjdk-long-term-support/

Releases

Version 2.8.0 of Luigi, the workflow engine, has been released. This version fixes a security issue with CORS, adds Python 3.7 compatibility, and includes a handful of other updates.

https://github.com/spotify/luigi/releases/tag/2.8.0

Amazon EMR 5.18.0 was released with Apache Flink 1.6.0, Apache Zeppelin 0.8.0, S3 Select support for Hive and Presto, and lots of other updates.

https://aws.amazon.com/about-aws/whats-new/2018/11/support-for-flink-160-zeppelin-080-and-s3-select-with-hive-and-presto-on-amazon-emr-release-5180/

Kafka Hawk is a tool that reports consumer group offset commit data to Prometheus.

https://github.com/farmdawgnation/kafka-hawk/releases/tag/v1.1.0

Red Hat announced AMQ Streams, which brings Apache Kafka to Red Hat's OpenShift platform.

https://www.redhat.com/en/blog/announcing-red-hat-amq-streams-apache-kafka-red-hat-openshift

Apache Phoenix 4.14.1 was released with critical fixes to secondary index functionality.

https://lists.apache.org/thread.html/2061617de5e8fbbd81a7781d074892f0279881f2f90b57c4ef582c06@%3Cannounce.apache.org%3E

ZooKeeper Operator is a new open-source project from Pravega for running Apache ZooKeeper on Kubernetes.

https://github.com/pravega/zookeeper-operator/releases/tag/0.1.1


Events

Curated by Datadog (http://www.datadog.com)

UNITED KINGDOM

Mnemosyne: a Distributed Bitmapped Indexing Layer for Big Data (Brighton) - Wednesday, November 21
https://www.meetup.com/Brighton-Java/events/255503632/

BELGIUM

Workshop: Event-Sourced System through Reactive Streams (Leuven) - Thursday, November 22
https://www.meetup.com/Reactive-Programming-Belgium/events/254708703/

ISRAEL

From a Parquet File's Structure to Designing and Developing BD Architectures (Tel Aviv) - Wednesday, November 21
https://www.meetup.com/Machine_Learning_and_Big_Data_hands_on/events/255020524/

INDIA

Apache Spark for Machine Learning & Data Science (Hyderabad) - Saturday, November 24
https://www.meetup.com/Hyderabad-Python-Meetup-Group/events/256034088/

AUSTRALIA

November Sydney Data Engineering Meetup (Sydney) - Thursday, November 22
https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/255623523/