Data Eng Weekly


Hadoop Weekly Issue #239

05 November 2017

Apache Kafka hit the 1.0.0 milestone this week. Congrats to all those involved! Whether in celebration or by consequence, there are a bunch of Kafka-related posts this week covering applications built with Kafka Streams, operational considerations for Kafka, the differences between the Kafka Streams DSL and Processor API, and debugging a production Kafka issue. There are also great posts on structured logging, Scio, and Python UDFs for Spark.

Technical

When it comes time to analyze data, it's important to have well-structured and enriched source data. Structured logging, in which metadata and a schema are applied to events, is a good way to ensure high-quality data. This post looks at how to build an API for structured logging in just a few lines of code.

https://honeycomb.io/blog/2017/10/you-could-have-invented-structured-logging/
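
To give a flavor of the idea, here is a minimal sketch of a structured logger in Python. It is not the code from the post: the make_logger helper and the field names (service, host, user_id, duration_ms) are hypothetical and chosen only for illustration.

```python
import json
import sys
import time


def make_logger(stream=sys.stdout, **base_fields):
    """Return a log function that writes one JSON object per event."""

    def log(event, **fields):
        record = {"timestamp": time.time(), "event": event}
        record.update(base_fields)  # metadata attached to every event
        record.update(fields)       # fields specific to this event
        line = json.dumps(record, sort_keys=True)
        stream.write(line + "\n")
        return line

    return log


# Hypothetical usage: the service and host names are made up.
log = make_logger(service="checkout", host="web-1")
log("payment_processed", user_id=42, duration_ms=187)
```

Because every event is a flat JSON object with shared metadata, downstream tools can filter and aggregate on fields instead of parsing free-form strings.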

Part two of the blog post series on Scio covers the core features of the API, additional syntactic sugar (such as support for side-input joins), and support for type-safe BigQuery (via a macro that performs a dry-run query to generate a case class for output rows). The post also describes some of the ways it is used at Spotify—ad hoc queries (by over 200 people), a tool called BigDiffy (for pairwise comparisons), and another tool, Featran, for feature transformations.

https://labs.spotify.com/2017/10/23/big-data-processing-at-spotify-the-road-to-scio-part-2/

This post demonstrates how to use Kafka for a non-trivial stream processing application. Specifically, data is pulled from an HTTP endpoint and inserted into Kafka using the Producer API. From there, a Kafka Streams application performs fraud detection (with a stubbed-out method in this example) and computes long-term and 90-day aggregated statistics. Finally, Kafka Connect writes data to a PostgreSQL database for serving via a REST API.

https://medium.com/@stephane.maarek/how-to-use-apache-kafka-to-transform-a-batch-pipeline-into-a-real-time-one-831b48a6ad85
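
The processing stage described above can be illustrated in plain Python. This is a rough sketch of the logic only, not Kafka Streams code and not the post's implementation: the event shape (a dict with "timestamp" and "value"), the averages as the chosen statistics, and all names are assumptions, and events are assumed to arrive in timestamp order.

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(days=90)


def looks_fraudulent(event):
    """Stubbed-out fraud check; a real one would inspect the event."""
    return False


class Stats:
    """Tracks an all-time average and a trailing 90-day average."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.recent = deque()  # (timestamp, value) pairs inside the window
        self.recent_total = 0.0

    def add(self, timestamp, value):
        self.count += 1
        self.total += value
        self.recent.append((timestamp, value))
        self.recent_total += value
        # Evict events that have fallen out of the 90-day window.
        while self.recent and self.recent[0][0] <= timestamp - WINDOW:
            _, old_value = self.recent.popleft()
            self.recent_total -= old_value

    def summary(self):
        if self.count == 0:
            return None
        return {
            "long_term_avg": self.total / self.count,
            "recent_avg": self.recent_total / len(self.recent),
        }


def process(events):
    """Drop suspicious events, then aggregate the rest."""
    stats = Stats()
    for event in events:
        if looks_fraudulent(event):
            continue
        stats.add(event["timestamp"], event["value"])
    return stats.summary()
```

In the pipeline the post describes, the equivalent state would live inside the Kafka Streams application rather than in an in-memory object like this one.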

The MSDN blog has a four-part blog series focused on learning to use Azure for folks who are already familiar with Hadoop. It teaches the basics like access management, networking, storage, compute, and the Azure Data Services (including HDFS-compatible PaaS offerings).

https://blogs.msdn.microsoft.com/cloud_solution_architect/2017/10/26/just-enough-azure-for-hadoop/

Zalando has a guest post on the Confluent blog about several operational / SRE considerations for running Kafka Streams applications on AWS. These include the type of EBS volume to use (volume type matters because RocksDB makes lots of writes, and some volume types have a burst capacity budget that is important to monitor), tuning some defaults, monitoring consumer lag, monitoring memory usage, and optimizing instance memory (with the counter-intuitive realization that a smaller JVM heap is better, given that Kafka uses lots of off-heap memory).

https://www.confluent.io/blog/running-kafka-streams-applications-aws/
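
As a reminder of what consumer lag measures, here is a small Python sketch. It is not taken from the post: the function name and the offset numbers are hypothetical, and it simplifies by treating a partition with no committed offset as if consumption starts at offset 0.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: newest offset in the log minus the committed offset."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end_offset - committed, 0)
    return lag


# Made-up offsets for a three-partition topic.
end_offsets = {0: 1500, 1: 1200, 2: 900}
committed = {0: 1450, 1: 1200}
lag_by_partition = consumer_lag(end_offsets, committed)
total_lag = sum(lag_by_partition.values())
```

A lag that keeps growing over time suggests the application is not keeping up with its input.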

Via an integration using Apache Arrow, Spark 2.3 will include support for so-called vectorized user-defined functions (UDFs). Performance is greatly improved when a UDF is annotated with @pandas_udf and consumes/returns a pandas.Series. There are several example UDFs, including simple addition, cumulative probability, and OLS, and the post has a microbenchmark that shows 3x-100x speedups.

https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
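
The shape of such a function can be shown with pandas alone. This is a minimal sketch that assumes pandas is installed; the function name plus_one is made up here, and the Spark wiring (the @pandas_udf annotation the post describes) is left out because it needs a running Spark session.

```python
import pandas as pd


def plus_one(values: pd.Series) -> pd.Series:
    # Works on a whole batch of values at once rather than one row at a time.
    return values + 1


batch = pd.Series([1, 2, 3])

# Row-at-a-time style: one Python-level operation per value.
row_at_a_time = [value + 1 for value in batch]

# Vectorized style: a single call over the whole Series.
vectorized = plus_one(batch)
```

Both styles produce the same values; the vectorized form simply moves the per-row work out of the Python interpreter loop.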

The team at Honeycomb has written about a recent outage they had with their Kafka cluster. It's interesting to see all the metrics and graphs they used to debug the issue, and to learn about some of the new instrumentation that they plan to add.

https://honeycomb.io/blog/2017/10/bitten-by-a-kafka-bug---postmortem/

With an example of stream processing of page view and event data, this post describes how to implement a non-trivial application using both the Kafka Streams DSL and the underlying Processor API. In this example (based on a real-world use case), the Processor API version is significantly more efficient for only a bit more code.

https://mkuthan.github.io/blog/2017/11/02/kafka-streams-dsl-vs-processor-api/

News

The Magpie talk show podcast has an interview with Confluent CEO Jay Kreps.

http://samnewman.io/blog/2017/10/24/magpie-talkshow-episode-23-jay-kreps/

The Call for Papers for Kafka Summit London, which takes place April 23-24, has been announced. Submissions are due December 1st.

https://www.confluent.io/blog/london-kafka-summit-2018-call-papers-open/

A CVE for Apache Hive has been announced. Users are encouraged to upgrade to version 2.3.1, but the announcement also describes mitigations for lower versions (2.1.0 through 2.3.0 are affected).

http://mail-archives.apache.org/mod_mbox/www-announce/201710.mbox/%3C3791103E-80D5-4E75-AF23-6F8ED54DDEBE%40apache.org%3E

The Insight blog has a good post on some of the data-related talks, advice, and themes from this year's Grace Hopper Celebration.

https://blog.insightdatascience.com/grace-hopper-celebration-draws-wide-interest-in-data-careers-d0859304877e

Releases

This week, Apache Kafka hit version 1.0.0. The new release includes performance improvements, faster TLS, Java 9 support, and more. The Apache Software Foundation has coverage of the milestone, and the Confluent blog describes several of the improvements in recent versions leading up to version 1.0.0.

https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces21
https://www.confluent.io/blog/apache-kafka-goes-1-0/

Version 2.6.3 of HDP has been released. It's the first release to support the Hortonworks DataPlane Service, and there are several new versions of packages (Spark, Zeppelin, Livy, Druid, Atlas, Knox, Ambari, SmartSense, and Ranger).

https://hortonworks.com/blog/hdp-2-6-3-dataplane-service/

Qubole has released a Qubole Jump Start to run the Qubole Data Service on the Oracle Cloud Infrastructure.

https://www.qubole.com/blog/jump-start-big-data-oracle-cloud-infrastructure/

A new project providing JDBC support for Confluent KSQL has been announced. The initial 0.1.0 release was made earlier this week.

https://github.com/mmolimar/ksql-jdbc-driver

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Big Data Security, Apache Pulsar, and More! (Palo Alto) - Wednesday, November 8
https://www.meetup.com/BigDataApps/events/243529551/

Women in Big Data Apache Spark Meetup (San Francisco) - Wednesday, November 8
https://www.meetup.com/spark-users/events/244065865/

Hands-On Workshop: Apache Pulsar (Santa Clara) - Saturday, November 11
https://www.meetup.com/datariders/events/243704410/

Illinois

Big Data Solutions in Azure (Chicago) - Monday, November 6
https://www.meetup.com/PyDataChi/events/244007266/

Cliff Gilmore on KSQL (Chicago) - Wednesday, November 8
https://www.meetup.com/Chicago-Area-Kafka-Enthusiasts/events/244551906/

Wisconsin

Apache Spark Saturday Hands-On Workshop and Lecture (Madison) - Saturday, November 11
https://www.meetup.com/BigDataMadison/events/244321203/

FRANCE

Spark with Tim Hunter (Villeneuve d’Ascq) - Thursday, November 9
https://www.meetup.com/Lille-Big-Data-and-Machine-Learning-Meetup/events/243878036/

GERMANY

Kafka Meetup (Berlin) - Tuesday, November 7
https://www.meetup.com/Zalando-Tech-Events-Berlin/events/243644893/

RUSSIA

Kafka Use Cases in Sberbank & Kafka Multi–Data Center Cluster Topology (Moscow) - Thursday, November 9
https://www.meetup.com/Moscow-Kafka-Meetup/events/244761507/

INDIA

Scale Up! Meetup (Bangalore) - Monday, November 6
https://www.meetup.com/scale-up-meetup/events/244276505/

Productionalizing Spark ML Pipelines (Bangalore) - Saturday, November 11
https://www.meetup.com/Bangalore-Apache-Spark-Meetup/events/244194263/

TAIWAN

Stateful Stream Processing Explained with Apache Flink (Taipei) - Monday, November 6
https://www.meetup.com/Taiwan-R/events/243265907/