Data Eng Weekly Issue #254

04 March 2018

Two articles this week on Apache Kafka Connect, two articles on best practices for AWS deployments (Cassandra and Kafka), and an interesting look at Appsflyer's NoSQL migration. In addition to a smattering of other news and technical posts, Apache Spark and Apache Fluo both had releases this week.

Sponsors

Hey, it’s us at Foursquare. Right now, 98% of our data eng work is done on-premise. We’re considering re-architecting many workloads to run in AWS. It’s a meaty project that’ll surely be interesting. Want to get involved?

Jobs: https://bitly.com/foursquare-jobs-data-eng-weekly

Stephane Maarek has created the Apache Kafka Series (online courses, 10,000+ students) and you can find the first course, Kafka for Beginners, below with a special discount for Data Eng Weekly readers. It is the best way to start learning Kafka, with over 4 hours of video tutorials. Happy learning!

http://bit.ly/learn-apache-kafka

Technical

This post details the transformations needed to get metrics for Apache Zeppelin and Apache Spark into Prometheus in a useable format.

https://banzaicloud.com/blog/prometheus-application-monitoring/

Appsflyer has written about their migration of online data from Couchbase to Aerospike. The migration consists of two parts—for new data writes, they write to both systems. For bulk transfer of other data, they copy data using a background process built on Apache Kafka. Overall, it's a good overview of strategies for migrating a live database without too much pain.

https://medium.com/appsflyer/large-scale-nosql-database-migration-under-fire-bf298c3c2e47
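
As a rough illustration of the dual-write-plus-backfill strategy summarized above, here is a minimal sketch. It is not Appsflyer's code: plain Python dicts stand in for both databases, a simple function stands in for the Kafka-based background copier, and every name (DualWriteStore, backfill, the keys) is hypothetical.

```python
class DualWriteStore:
    """Sends every write to both stores; reads stay on the old store until cutover."""

    def __init__(self, old, new):
        self.old = old
        self.new = new

    def put(self, key, value):
        # New writes land in both systems so the new store never falls behind.
        self.old[key] = value
        self.new[key] = value

    def get(self, key):
        # Reads keep using the old store during the migration.
        return self.old.get(key)


def backfill(old, new, batch):
    """Copy a batch of existing keys, skipping keys a live write already migrated."""
    copied = 0
    for key in batch:
        if key in old and key not in new:
            new[key] = old[key]
            copied += 1
    return copied


old_store = {"user:1": "a", "user:2": "b"}
new_store = {}
store = DualWriteStore(old_store, new_store)

store.put("user:2", "b2")  # live update during the migration
store.put("user:3", "c")   # brand-new record during the migration
copied = backfill(old_store, new_store, ["user:1", "user:2", "user:3"])
```

The backfill skips keys that already exist in the new store, so a stale copy from the background process cannot overwrite a fresher live write. A real system would also need to handle deletes and concurrent updates, which this sketch ignores.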

This post describes using Google Cloud Pub/Sub and Cloud Dataflow (the Apache Beam SDK) to land streaming data in Google BigQuery. There's an interesting strategy for implementing schema evolution for JSON: when a write fails, a schema update in BigQuery adds the new fields.

https://cloud.google.com/blog/big-data/2018/02/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix
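
The "update the schema when a write fails, then retry" idea can be sketched without any cloud dependencies. The example below is only an illustration under stated assumptions: Table is a hypothetical in-memory stand-in for a warehouse table, not the BigQuery client API, and the error type and function names are invented for the sketch.

```python
class UnknownFieldError(Exception):
    """Raised when a record contains fields the table schema does not know."""

    def __init__(self, fields):
        super().__init__(f"unknown fields: {sorted(fields)}")
        self.fields = fields


class Table:
    """Hypothetical in-memory table with a fixed set of columns."""

    def __init__(self, columns):
        self.columns = set(columns)
        self.rows = []

    def insert(self, record):
        unknown = set(record) - self.columns
        if unknown:
            raise UnknownFieldError(unknown)
        self.rows.append(record)

    def add_columns(self, fields):
        self.columns |= set(fields)


def write_with_schema_evolution(table, record):
    """Try the write; on an unknown-field failure, widen the schema and retry once."""
    try:
        table.insert(record)
    except UnknownFieldError as err:
        table.add_columns(err.fields)
        table.insert(record)


table = Table(["id", "event"])
write_with_schema_evolution(table, {"id": 1, "event": "login"})
write_with_schema_evolution(table, {"id": 2, "event": "buy", "price": 9.99})
```

The second write fails at first because of the new "price" field, which triggers the schema update before the retry. Only additive changes are handled here; type changes or removed fields would need a different strategy.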

This post gives an overview of the common pattern of managing schemas for data warehouses in version control, describes some common pitfalls, and suggests some mitigations to those problems.

http://danosipov.com/?p=869

This post has some great tips for running Apache Cassandra clusters in AWS. They include selecting instance types, when to use EBS or instance storage, deploying across AZs/regions for high availability, using a secondary elastic network interface to preserve IPs and node ids, security best practices, and backups/upgrades.

https://aws.amazon.com/blogs/big-data/best-practices-for-running-apache-cassandra-on-amazon-ec2/

The latest in a series covering Spark on Kubernetes details how to make the Spark driver resilient to node failure.

https://banzaicloud.com/blog/spark-job-resilience/

When you can't or don't want to commit lots of memory to run a JVM at the edge, the Apache NiFi MiNiFi C++ agent is a low-memory alternative to the full-blown client. This post describes some of the features of the MiNiFi client included in Hortonworks DataFlow 3.1, such as support for Apache Kafka and MQTT data transports.

https://hortonworks.com/blog/hortonworks-dataflow-hdf-3-1-blog-series-part-6-introducing-apache-minifi-c-agent/

As this post demonstrates, Kafka Connect can really simplify a streaming architecture. Rather than Spark Streaming (which was memory hungry and had other drawbacks), the authors ran Kafka Connect on Kubernetes for this IoT data collection use case.

http://www.landoop.com/blog/2018/03/kafka-elasticsearch-kubernetes-iot/

Another great article on best practices for AWS, this post from Intuit covers running Apache Kafka: options for deployment (cross-AZ/Region configs, instance types, EBS vs. instance storage) and best practices for encryption, backup and restore, and more.

https://aws.amazon.com/blogs/big-data/best-practices-for-running-apache-kafka-on-aws/

Good high-level overview of Apache Kafka Connect covering the architecture, example configuration, command-line options, and more.

https://blog.codecentric.de/en/2018/03/etl-kafka/
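
For readers who have not seen a connector definition, here is a small sketch of the JSON body that Kafka Connect's REST API accepts on POST /connectors when running in distributed mode. It uses the FileStreamSourceConnector that ships with Apache Kafka; the connector name, file path, and topic are made-up values for illustration, and the example is not taken from the linked article.

```python
import json


def connector_payload(name, file_path, topic):
    """Build the request body for creating a file source connector."""
    return {
        "name": name,
        "config": {
            # Simple source connector bundled with Apache Kafka.
            "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
            "tasks.max": "1",
            "file": file_path,  # file to tail (hypothetical path)
            "topic": topic,     # destination topic (hypothetical name)
        },
    }


payload = connector_payload("demo-file-source", "/tmp/input.txt", "demo-topic")
body = json.dumps(payload, indent=2)
print(body)
```

The printed JSON would be sent to a Connect worker's REST endpoint (port 8083 by default) to create the connector; actually sending the request is left out so the sketch runs without a cluster.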

News

Heron, which was open-sourced in 2016 as an API-compatible alternative to Apache Storm, was accepted into the Apache Incubator last June. After sitting in limbo for some time, Twitter has announced that the code is now donated to the ASF.

https://blog.twitter.com/engineering/en_us/topics/open-source/2018/heron-donated-to-apache-software-foundation.html

This article makes a compelling argument that a data strategy is more important than a cloud strategy—the latter is just an enabler of the former.

https://hortonworks.com/blog/need-data-strategy-cloud-strategy/

The agenda for Spark + AI Summit, which takes place in June in San Francisco, has been announced. Early bird pricing ends later this week.

https://databricks.com/sparkaisummit/north-america

Releases

Apache Fluo, an open source implementation of Google's Percolator design for incremental processing, released version 1.2.0. The release page has more details on the changes.

http://fluo.apache.org/release/fluo-1.2.0/

Apache Spark 2.3.0 was released this week. Major features include experimental support for Spark on Kubernetes, a vectorized ORC file reader, new versions of the Spark History Server and Data source APIs, and performance enhancements for PySpark. The release also includes several other performance improvements, a few new features for structured streaming (including stream-stream joins), a bunch of changes to MLlib, and more.

https://spark.apache.org/releases/spark-release-2-3-0.html

Databricks announced support for Spark 2.3. Their post also has a good overview of the major new features in the Apache release.

https://databricks.com/blog/2018/02/28/introducing-apache-spark-2-3.html

Syncsort and Cloudera have announced a new integration between Cloudera Navigator and DMX-h that provides data lineage information for data originating outside of Hadoop.

http://www.syncsort.com/en/About/News-Center/Press-Release/Syncsort-and-Cloudera-Integrate-Detailed-Data-Line

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Real-Time Event Monitoring in Large Kafka-Based Systems (Irvine) - Tuesday, March 6
https://www.meetup.com/Orange-County-Gophers/events/247660998/

Tuning Spark & Hadoop Jobs + Improving Cluster Efficiency with Dr Elephant (San Jose) - Tuesday, March 6
https://www.meetup.com/Big-Data-Meetup-LinkedIn/events/248234305/

Washington

Seattle Apache Flink Meetup: Flink SQL by dataArtisans and Pravega by Dell EMC (Seattle) - Monday, March 5
https://www.meetup.com/seattle-apache-flink/events/247603184/

CANADA

Live-Coding Kafka Stream Processing Applications with KSQL (Vancouver) - Thursday, March 8
https://www.meetup.com/vancouver-kafka/events/247776961/

UNITED KINGDOM

Streaming SQL + Beam for ML Use Case + Beam in Production (London) - Monday, March 5
https://www.meetup.com/London-Apache-Beam-Meetup/events/247868648/

SPAIN

Intro to Spark Structured Streaming (Madrid) - Tuesday, March 6
https://www.meetup.com/Madrid-Apache-Spark-Meetup/events/247346722/

FRANCE

Real-Life Big Data Use Cases + BI on Hadoop (Paris) - Thursday, March 8
https://www.meetup.com/futureofdata-paris/events/247993258/

NETHERLANDS

Data-Intensive Processing in Practice: Kafka Streams and Spark (Nieuwegein) - Thursday, March 8
https://www.meetup.com/Code-Star-Night/events/246608174/

GERMANY

The World of Kafka + A Roadmap Overview (Frankfurt) - Wednesday, March 7
https://www.meetup.com/Frankfurt-Apache-Kafka-Meetup-by-Confluent/events/247589611/

ITALY

Edge Data Processing: Architectures and Production Challenges (Milano) - Wednesday, March 7
https://www.meetup.com/Milano-Kafka-meetup/events/247647060/

POLAND

WHUG: Apache Gobblin and TouK Nussknacker (Warsaw) - Tuesday, March 6
https://www.meetup.com/warsaw-hug/events/247797926/

RUSSIA

KSQL (Moscow) - Tuesday, March 6
https://www.meetup.com/Moscow-Kafka-Meetup/events/247589707/

INDIA

Data @ Uber Tech Talk in Association with DataGiri (Bengaluru) - Thursday, March 8
https://www.meetup.com/DataGiri-Bengaluru/events/247868852/

Building a Serverless Data Lake @ Amazon (Bengaluru) - Saturday, March 10
https://www.meetup.com/meetup-group-mdhldGoH/events/247731514/