Data Eng Weekly Issue #251

11 February 2018

Lots of great content this week—LinkedIn's Apache HDFS load testing tool, a few articles on Apache Flink, example microservices built on Apache Kafka, and details on replacing Kafka's usage of ZooKeeper with etcd. There are also a couple of posts on Hadoop 3.0, and a year-ender from Confluent.

Technical

This post gives a pretty thorough overview of the architecture behind Apache Flink's incremental checkpointing feature that provides resiliency for stateful stream processing. Flink leverages RocksDB for local state, and it keeps track of which sstables (the underlying file format storing the data) need to be backed up to stable storage to create a snapshot. There's quite a bit that goes into this, as is described in the post.

http://flink.apache.org/features/2018/01/30/incremental-checkpointing.html

This in-depth post compares ClickHouse, Druid, and Pinot, all open-source OLAP distributed storage engines. It describes the similarity between systems (e.g. the storage and indexing strategies), compares performance characteristics, and highlights key differences in data ingestion, data replication, and query execution.

https://medium.com/@leventov/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7

Hortonworks has two posts on new features in HDF 3.1. The first covers the NiFi registry, which facilitates versioning of NiFi flows when promoting them across environments. The second looks at the Kafka integration, which has some neat features like support for ingesting data from the edge using the MiNiFi C++ agent. It also integrates with Apache Ranger for security and Apache Ambari for monitoring.

https://hortonworks.com/blog/hdf-3-1-blog-series-part-2-introducing-nifi-registry/
https://hortonworks.com/blog/hdf-3-1-blog-series-part-3-kafka-1-0-support-powerful-hdf-integrations/

Confluent KSQL doesn't yet have support for runtime configuration of User Defined Functions (UDFs), but it is possible to write a custom function and to rebuild from source. This tutorial walks through the steps to do that.

https://www.confluent.io/blog/write-user-defined-function-udf-ksql/

This post walks through configuring Amazon EMR with Kerberos for security and using IAM roles to implement fine-grained access to data stored in Amazon S3. It also integrates with HUE and Hive, which are part of the demo.

https://aws.amazon.com/blogs/big-data/build-a-multi-tenant-amazon-emr-cluster-with-kerberos-microsoft-active-directory-integration-and-emrfs-authorization/

LinkedIn has written about how they test Apache Hadoop DFS performance before upgrading versions using a load-testing tool called Dynamometer. It simulates a production load by bootstrapping from the NameNode FS image, running a large number of simulated DataNodes, and replaying real operations from the HDFS audit log. There are lots of interesting details and stories in the post. Dynamometer is also now available on github.

https://engineering.linkedin.com/blog/2018/02/dynamometer--scale-testing-hdfs-on-minimal-hardware-with-maximum

The dataArtisans' blog has coverage of a number of upcoming features in Apache Flink. These include a new deployment model with better support for Kubernetes, work on speeding up failure recovery, improved network stack performance, support for the Swift filesystem, and several updates to Flink Streaming SQL.

https://data-artisans.com/blog/apache-flink-master-branch-monthly-whats-new-flink-january-2018

This post details a talk from QCon New York on Netflix's stream processing system. Built on Apache Kafka, Apache Flink, Apache Mesos, and more, the system analyzes data from video playback/discovery events. The post describes some of the challenges that Netflix encountered (such as getting access to live data and JAR conflicts) and strategies implemented (such as caching of data and data recovery) along the way.

https://www.infoq.com/articles/netflix-migrating-stream-processing

This example implements the event-sourcing / CQRS model with Apache Kafka Streams. It includes several diagrams to explain what the service is doing and bundles code on github. That code includes Docker image definitions, making it easy to try out locally.

https://sleeplessinslc.blogspot.com/2018/02/inventory-microservice-example-with.html

MapR has the third part in its series on stream processing to predict flight delays. This part looks at Kafka and Spark Streaming. The implementation makes heavy use of JSON, which is natively supported by MapR-DB.

https://mapr.com/blog/fast-data-processing-pipeline-predicting-flight-delays-using-apache-apis-pt-3/

The folks at Banzai have a fork of Apache Kafka that uses etcd rather than Apache ZooKeeper. The post describes their motivation, how hard it is to do the replacement, and some of the issues that they ran into along the way. They write about some simple testing, but you probably want to do quite a bit more before trying this in production.

https://banzaicloud.com/blog/kafka-on-etcd/

Sponsor

Hello Fresh: Change the way people eat forever. Work with our data technology to deliver healthy meals to millions of customers, with a cutting-edge tech stack (Hadoop, Kafka, Impala, pyspark, AWS, Airflow) and time for personal and engineering development. Click the link for more info on becoming a Data Engineer at Hello Fresh in Berlin!

http://bit.ly/hello-fresh-data-eng-weekly

News

Confluent has written a year-ender celebrating milestones for 2017 for their company and Apache Kafka. Highlights include Confluent's growth, KSQL, Kafka 1.0, and exactly-once in Kafka.

https://www.confluent.io/blog/confluent-apache-kafka-2017/

This post notes that Hadoop 3.0's erasure coding provides huge improvements to storage efficiency for Hadoop clusters at the expense of network and CPU. This changes the way that companies will have to plan to grow capacity, and it may shift some of the cost calculations when it comes to on-prem vs. SaaS.

https://www.datanami.com/2018/02/07/erasure-coding-changes-hadoop-storage-economics/

Erasure coding isn't the only new feature in Hadoop 3. This post summarizes several other major features like support for containers, support for additional standby NameNodes, and intra-node disk balancing.

https://hortonworks.com/blog/hadoop-3-adds-value-hadoop-2/

Releases

Apache Atlas version 0.8.2 was released. Atlas provides governance for Hadoop ecosystem projects, including Hive and Storm. The new release includes search and UI improvements, fixes for high availability, and more.

https://lists.apache.org/thread.html/4044ded10f08f0f75461f26494681bde959274e4b66d51b7e57a68c2@%3Cannounce.apache.org%3E

Version 2.7.1 of Apache Lens, the analytics engine for Hadoop, Hive, services supporting JDBC, and more, was released. Major features include support for Java 8, improvements to per user configuration in the job scheduler, cube segmentation, retries to recover from transient errors, and UNION support across fact tables. There are also a number of bug fixes.

https://lists.apache.org/thread.html/a743fa18eb86c2aa830a0ec73de011e7b8219a10b021818fcf4f4596@%3Cannounce.apache.org%3E

Apache Knox, the REST API gateway to Hadoop, released version 1.0.0. It has a few new features since the 0.14.0 release, but the major changes are to the package namespace.

https://lists.apache.org/thread.html/be18fe2451c0431cf51709368bb81c343b0beb2707cdc5134d63178e@%3Cannounce.apache.org%3E

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Processing 100 Billion Events a Day at GumGum (El Segundo) - Thursday, February 15
https://www.meetup.com/South-Bay-JVM-User-Group/events/246058042/

Building a Big Data Stack on Kubernetes (San Jose) - Thursday, February 15
https://www.meetup.com/BayLISA/events/245919029/

North Carolina

Jeff Dutton from HPE Talking Domain-Driven Design, CQRS, and Event Sourcing (Raleigh) - Monday, February 12
https://www.meetup.com/Raleigh-Apache-Kafka-Meetup-by-Confluent/events/246915440/

New Jersey

Apache NiFi Mardi Gras (Princeton) - Tuesday, February 13
https://www.meetup.com/futureofdata-princeton/events/247285317/

New York

Event Processing at Scale + Advocating for Continuous Improvement (New York) - Thursday, February 15
https://www.meetup.com/ThoughtWorks-Tech-Talks-NYC/events/247431380/

CANADA

Building Multi-Region & Multi-Cloud Services with Kafka (Kanata) - Thursday, February 15
https://www.meetup.com/Big-Data-Developers-in-Ottawa/events/247320681/

SWEDEN

Kafka Meetup with Confluent and Forefront Consulting (Stockholm) - Thursday, February 15
https://www.meetup.com/Stockholm-Apache-Kafka-Meetup-by-Confluent/events/246762760/

Apache Beam: Unified Batch and Stream Processing! (Stockholm) - Thursday, February 15
https://www.meetup.com/stockholm-hug/events/247080325/

FRANCE

Human Talks (Montpellier) - Tuesday, February 13
https://www.meetup.com/HumanTalks-Montpellier/events/247135496/

CZECH REPUBLIC

Peek at Avast Big Data Kitchen Tools (Prague) - Thursday, February 15
https://www.meetup.com/CS-HUG/events/247381691/

ISRAEL

Running Microservices on Apache Kafka (Tel Aviv-Yafo) - Wednesday, February 14
https://www.meetup.com/ApacheKafkaTLV/events/246838417/