Data Eng Weekly Issue #257

25 March 2018

This week's issue is quite large, with great technical content covering BigGraphite (a Cassandra backend for Graphite), Apache Spark Streaming, Postgres on Kubernetes, KSQL, and a DynamoDB/API Gateway solution for serving ML data. There are also a bunch of releases, including Apache Trafodion, Apache Kudu, Apache Drill, and Apache HAWQ. There is hopefully something for everyone, but if you're looking for even more, check out @DataEngWeekly on Twitter throughout the week.

Sponsor

Cut HBase storage needs by 50%; elevate job throughput by 260%. ScaleFlux Computational Storage cohesively optimizes your Hadoop HBase infrastructure. Significant performance leaps on Spark, RocksDB, and Aerospike as well. Now shipping - see more:

http://bit.ly/scaleflux-computational-storage-hbase-260percent-faster

Technical

Last week, I included the wrong link to the GO-JEK article—here's the right one. In addition to that post on their data infrastructure (built on Apache Kafka, Apache Flink, Kubernetes, Chef, and InfluxDB), there's another post from their blog. This second article has a great overview of some resilience strategies—timeouts, idempotency, and the circuit breaker pattern—in distributed systems.

https://blog.gojekengineering.com/data-infrastructure-at-go-jek-cd4dc8cbd929
https://blog.gojekengineering.com/resiliency-in-distributed-systems-efd30f74baf4
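
For readers who haven't seen the circuit breaker pattern mentioned above, here is a minimal sketch in Python. It is not taken from the GO-JEK post; the class name, the thresholds, and the injectable clock are all assumptions made for illustration. The idea is that after a number of consecutive failures the breaker "opens" and rejects calls immediately, then lets a single trial call through once a reset timeout has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical, single-threaded circuit breaker for illustration only."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of waiting on a dependency known to be down.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open state, one trial call is allowed through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Usage would look something like breaker.call(fetch_profile, user_id), where fetch_profile is whatever remote call you want to protect. A production version would also need thread safety and a per-call timeout, which this sketch leaves out.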

The OpenMessaging Project is a new initiative to 1) provide unifying APIs across messaging systems and 2) provide standard tools for benchmarking those systems. It is able to generate synthetic workloads that are tunable on several axes and comes with tools to provision Apache Kafka and Apache Pulsar on AWS.

https://streaml.io/blog/openmessaging-benchmarks/

Criteo has open sourced a new Apache Cassandra backend for Graphite called BigGraphite. With it, they store over 1 million metrics per second, and they take advantage of Cassandra's secondary indexes to support the different types of Graphite queries. This presentation goes into further detail on the new backend and also links out to the design document and code on GitHub.

https://docs.google.com/presentation/d/17opE2U2aIe1TFYJgr0Z8Dd2fJhy2x0YbReCVeCz6uhA/edit

Apache Spark 2.3.0 adds experimental support for continuous mode in stream processing. Compared to microbatch latency of 100s of milliseconds, continuous mode uses long-running processes that can take a record end to end in 10s of milliseconds.

https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html

The Pivotal blog has a post from Crunchy Data, makers of an open source tool for running Postgres on Kubernetes. The article provides a brief introduction to the tool, a tutorial for getting a DB up and running, and a demo of scaling up the number of read replicas.

https://content.pivotal.io/blog/postgresql-for-kubernetes-quickstart-for-pks

A tutorial on the Confluent Blog covers loading data into Kafka using Kafka Connect for MySQL (via Debezium for change data capture) and from CSV files, querying and joining those two streams, and writing the results out to Amazon S3 using the Kafka Connect S3 sink.

https://www.confluent.io/blog/ksql-in-action-enriching-csv-events-with-data-from-rdbms-into-AWS/

This article enumerates several of the mechanisms that can be used to access data in Apache Kudu (such as Impala and Spark), and it provides some performance comparisons to HDFS using the TPC-H synthetic dataset. While it's always important to test with your own dataset, the experiment shows that Impala performs comparably when querying Parquet data in HDFS and data in Kudu (with some exceptions). Kudu additionally has the benefit of efficiently supporting CRUD and random access via a primary key.

https://blog.clairvoyantsoft.com/guide-to-using-apache-kudu-and-performance-comparison-with-hdfs-453c4b26554f

Great overview of Impala best practices, such as using Parquet, optimizing file size (in this case, they saw a 2,000x speedup by combining small files), configuring separate coordinators and executors (a new feature as of Impala 2.9), and computing statistics. The post also describes a strategy for automatically detecting and canceling long-running queries.

https://medium.com/@adirmashiach/apache-impala-my-insights-and-best-practices-8e0507089935

Azure Event Hubs is a managed service for storing data for streaming applications (with similarities to Apache Kafka or Google Cloud PubSub). This post describes using Event Hubs with Apache Spark (via Azure Databricks) to stream data from the Twitter API into Event Hubs and then to consume the tweet data from Event Hubs.

https://lenadroid.github.io/posts/connecting-spark-and-eventhubs.html

This post has five tips for working with data. If you haven't yet learned these the hard way, then they're good advice.

https://medium.com/@EasySize/5-crucial-tips-we-learned-watching-over-our-clients-data-226bf795fbdf

Rue La La is using AWS DynamoDB and API Gateway for serving data generated by their data science team via Spark jobs. This post provides an overview and links out to two other posts: the first on how to store data in DynamoDB to efficiently answer API queries (including over a 20B element sparse matrix), and the second on using AWS Batch and Lambda to load data from S3 to DynamoDB (there are some interesting details on long-running S3 reads and efficiently loading data).

https://medium.com/rue-la-la-tech/rich-data-science-apis-at-rue-la-la-3d44bd23b529

This article describes how one org does non-destructive schema changes using Rails' ActiveRecord migration utilities. They've codified a number of these best practices, to the point where exceptions are raised if you try to drop a table or column.

https://samsaffron.com/archive/2018/03/22/managing-db-schema-changes-without-downtime
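
To make the "raise an exception on destructive changes" idea concrete, here is a rough sketch in Python. This is not the Rails/ActiveRecord code from the article; it's a stand-in that scans a migration's SQL text, and the names check_migration and UnsafeMigrationError are hypothetical.

```python
import re


class UnsafeMigrationError(Exception):
    """Raised when a migration contains a destructive statement."""


# Only the explicit forms are matched. A statement such as
# "ALTER TABLE t DROP col" (without the COLUMN keyword) would slip through,
# so treat this as an illustration rather than a complete guard.
_DESTRUCTIVE = re.compile(r"\bDROP\s+(TABLE|COLUMN)\b", re.IGNORECASE)


def check_migration(sql):
    """Return the SQL unchanged, or raise if it drops a table or column."""
    match = _DESTRUCTIVE.search(sql)
    if match:
        raise UnsafeMigrationError(
            "destructive statement %r is not allowed in a regular migration"
            % match.group(0)
        )
    return sql
```

A migration runner could pass each statement through check_migration before executing it, so that dropping a table or column requires a separate, deliberate step.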

News

The Data Engineering Podcast had its 23rd episode this week—it covers Elasticsearch and the Elastic stack.

https://www.dataengineeringpodcast.com/elastic-stack-with-philipp-krenn-episode-23/

Apache Airflow has gained so much popularity that there's now a podcast from the folks at Astronomer. This episode describes Airflow best practices.

https://soundcloud.com/the-airflow-podcast/best-practices

This post defines several job areas related to data: Data Analytics, Data Science, Machine Learning Engineering, Data Engineering, Backend Engineering, and DevOps. It also describes how they overlap with a 6-circle Venn diagram. Whether or not these are the exact definitions you use at your company, it's useful to have clear definitions to help identify gaps or blind spots.

https://mindfulmachines.io/blog/2018/3/17/the-flavors-of-data-science-and-data-engineering

PhoenixCon, the conference on Apache Phoenix, has announced that they're holding the next instance in San Jose on June 18th. The CFP is open through April 16th.

https://phoenix.apache.org/phoenixcon-2018/

Releases

Astronomer recently announced Astronomer Cloud Edition—a managed Apache Airflow service.

https://www.astronomer.io/blog/announcing-apache-airflow-on-astronomer/

Spring Cloud Data Flow 1.4 is out. The new version is mostly a UI/UX refresh, which includes a new builder for configuring streaming applications. There are also some security feature enhancements in the release.

https://content.pivotal.io/blog/spring-cloud-data-flow-1-4-ui-ux-refresh-stream-deployment-builder-and-security-improvements

Kafka Streams Viz is a new tool for visualizing streaming topologies. It's a single page webapp that takes a plain-text definition of a streaming topology and outputs a user-friendly diagram.

https://zz85.github.io/kafka-streams-viz/

The first of four SQL/data query engine Apache releases this week, Apache Trafodion 2.2.0 is also the inaugural release since the project graduated from the Apache incubator. This is a bug fix release with a few enhancements.

https://trafodion.apache.org/release-notes-2-2-0.html

Apache Drill 1.13.0 is out with YARN support for Drill, support for JDK 8, improvements to security and memory usage, and more.

https://medium.com/@ApacheDrill/apache-drill-1-13-released-35a1663249ee

Apache HAWQ 2.3.0.0-incubating was announced. This is a bug fix and feature enhancement release that adds support for HDFS transparent data encryption and improved Ranger support (Kerberos authentication and RPS HA).

https://cwiki.apache.org/confluence/display/HAWQ/Apache+HAWQ+2.3.0.0-incubating+Release

Apache Kudu 1.7.0 was released. New features include support for the decimal data type, improvements to self-healing through re-replication, and a new READ_YOUR_WRITES scan mode. There are a bunch of improvements to the Java client, performance during inserts and updates, and more. See the release notes for full details.

https://kudu.apache.org/releases/1.7.0/docs/release_notes.html

The 0.3.0 release of Banzai Pipeline, the PaaS system for deploying Spark, Kafka, and others on Kubernetes, is out. This version adds support for Google Kubernetes Engine, strengthens security, and includes several bug fixes and improvements.

https://banzaicloud.com/blog/pipeline-third-release/

Databricks and Microsoft announced that Azure Databricks is now generally available.

https://databricks.com/blog/2018/03/22/azure-databricks-industry-leading-analytics-platform-powered-by-apache-spark.html

Sponsor

http://bit.ly/scaleflux-computational-storage-hbase-260percent-faster

Interested in sponsoring? See: https://www.dataengweekly.com/advertising.html for details and email info@dataengweekly.com

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

WePay Engineering Meetup: Managing Data at Scale (Redwood City) - Wednesday, March 28
https://www.meetup.com/WePay-Engineering-Meetup/events/248205670/

Bay Area Apache Spark & Women in Big Data @ Databricks HQ (San Francisco) - Thursday, March 29
https://www.meetup.com/spark-users/events/248239522/

Arizona

An Introduction & Dive into Apache Kafka and the Power of Messaging Systems (Phoenix) - Thursday, March 29
https://www.meetup.com/Phoenix-Java/events/248305322/

Illinois

Stealing the Jewels from Python: Putting PySpark Code into Production (Chicago) - Monday, March 26
https://www.meetup.com/Chicago-Spark-Users/events/248836419/

Pennsylvania

Event Sourcing from Scratch with Apache Kafka and Spring (Pittsburgh) - Tuesday, March 27
https://www.meetup.com/The-Pittsburgh-Java-Meetup-Group/events/246494113/

Massachusetts

March Presentation Night: Spark 2.3, Cloud, and Use Cases (Cambridge) - Wednesday, March 28
https://www.meetup.com/Boston-Apache-Spark-User-Group/events/248548389/

CANADA

Reactive Applications with the SMACK Stack, Led by Duncan DeVore! (Kanata) - Thursday, March 29
https://www.meetup.com/Big-Data-Developers-in-Ottawa/events/248471095/

UNITED KINGDOM

Mesosphere and Confluent at Our First Kafka Meetup (Manchester) - Wednesday, March 28
https://www.meetup.com/Manchester-Kafka/events/248451554/

FRANCE

Apache Kafka March Meetup (Paris) - Monday, March 26
https://www.meetup.com/Paris-Apache-Kafka-Meetup/events/248693926/

GERMANY

IoT Stream Processing and Analytics in Real-time (Frankfurt) - Wednesday, March 28
https://www.meetup.com/Future-of-Data-Frankfurt-am-Main-Germany/events/247939726/

SWITZERLAND

Let's Talk about KSQL for Kafka (Zurich) - Tuesday, March 27
https://www.meetup.com/Zurich-Apache-Kafka-Meetup-by-Confluent/events/246915204/

AUSTRIA

Get Started with Data Integration Using Apache NiFi (Vienna) - Tuesday, March 27
https://www.meetup.com/Hadoop-User-Group-Vienna/events/248266228/

CZECH REPUBLIC

Big Data: How to Deal with Sensorics Data Easily (Prague) - Tuesday, March 27
https://www.meetup.com/Prague-Data-Management-Meetup/events/248157876/

HUNGARY

Big Data on Kubernetes (Budapest) - Tuesday, March 27
https://www.meetup.com/Big-Data-Meetup-Budapest/events/248053604/