Data Eng Weekly Issue #283

30 September 2018

Some good tutorials this week covering streaming with Python and Kafka, building a logging pipeline, writing a Spark UDF, and Apache Kylin. Also, an interesting postmortem from the Azure DevOps team and looks at Zenreach's and PayPal's streaming platforms, both of which are built on Kafka.

Sponsor

Built by narwhals, just for you – Dremio simplifies data engineering and data analytics with the power of Apache Arrow. Connect almost any data source. Accelerate queries up to 1,000x. Let your BI and data science users curate their own data with our nautically-themed user interface. Open source.

Visit https://bit.ly/about-dremio to learn more, or download for free.

Technical

Earlier this month, Microsoft Azure DevOps had an extended outage. They've published a postmortem of the event, which was triggered by lightning storms and a data center shutdown. The postmortem revolves around data replication across regions. It's a great discussion of the tradeoffs of synchronous vs. asynchronous replication, and it has a good list of changes that they'll implement to improve their system reliability.

https://blogs.msdn.microsoft.com/vsoservice/?p=17485
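
For readers who want the tradeoff in concrete terms, here is a toy Python model (not taken from the postmortem). The `Region` and `ReplicatedStore` classes and their methods are hypothetical names made up for this sketch. It assumes a synchronous write is acknowledged only after the second region has a copy, while an asynchronous write is acknowledged immediately and shipped later.

```python
class Region:
    """A region holding an ordered log of records."""

    def __init__(self, name):
        self.name = name
        self.log = []


class ReplicatedStore:
    """Toy two-region store; all names here are hypothetical."""

    def __init__(self, synchronous):
        self.primary = Region("primary")
        self.secondary = Region("secondary")
        self.synchronous = synchronous
        self._pending = []  # acknowledged writes not yet shipped

    def write(self, record):
        self.primary.log.append(record)
        if self.synchronous:
            # The ack waits for the remote copy (slower, but nothing is lost).
            self.secondary.log.append(record)
        else:
            # The ack returns right away; the remote copy happens later.
            self._pending.append(record)
        return "ack"

    def replicate_pending(self):
        """Background replication catching up with the primary."""
        self.secondary.log.extend(self._pending)
        self._pending.clear()

    def fail_over(self):
        """Primary is gone: list acknowledged writes the secondary never saw."""
        return [r for r in self.primary.log if r not in self.secondary.log]


def simulate(synchronous):
    store = ReplicatedStore(synchronous)
    store.write("w1")
    store.write("w2")
    store.replicate_pending()
    store.write("w3")  # acknowledged just before the outage
    return store.fail_over()


lost_sync = simulate(synchronous=True)
lost_async = simulate(synchronous=False)
print(lost_sync, lost_async)
```

In this model the asynchronous store loses the last acknowledged write on failover, which is the price paid for not waiting on a cross-region round trip for every write.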

This Kafka tutorial uses docker-compose to spin up Kafka and Zookeeper alongside a docker container running a fraud detection app written in Python.

https://blog.florimondmanca.com/building-a-streaming-fraud-detection-system-with-kafka-and-python
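
To give a flavor of what the detection step in an app like this can look like, here is a minimal Python sketch. It is not the tutorial's code: the `is_suspicious` rule, the 900 threshold, the message fields, and the topic names are all assumptions for illustration, and a plain list of JSON payloads stands in for the messages a Kafka consumer would deliver.

```python
import json

# Hypothetical rule for illustration: flag transactions at or above this amount.
SUSPICIOUS_AMOUNT = 900


def is_suspicious(transaction: dict) -> bool:
    """Return True when the (assumed) 'amount' field reaches the threshold."""
    return transaction["amount"] >= SUSPICIOUS_AMOUNT


def route(raw_messages):
    """Decode JSON-encoded messages and yield (topic, transaction) pairs.

    In a real deployment the input would be message values read from a
    Kafka consumer, and each pair would be sent on with a producer.
    The topic names are made up.
    """
    for raw in raw_messages:
        transaction = json.loads(raw)
        topic = "fraud" if is_suspicious(transaction) else "legit"
        yield topic, transaction


# A list of bytes payloads stands in for the Kafka topic.
incoming = [
    json.dumps({"source": "alice", "target": "bob", "amount": 120.0}).encode(),
    json.dumps({"source": "carol", "target": "dave", "amount": 950.0}).encode(),
]
routed = list(route(incoming))
for topic, transaction in routed:
    print(topic, transaction["amount"])
```

Keeping the rule in a plain function makes it easy to test without a running broker; only the thin consume-and-produce loop needs Kafka.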

This tutorial builds a logging pipeline using Fluentd, Kafka Connect, Elasticsearch, and S3. Everything is deployed locally using docker-compose to quickly get off the ground.

https://hackernoon.com/distributed-log-analytics-using-apache-kafka-kafka-connect-and-fluentd-303330e478af

Apache Ignite, which is an in-memory data engine, uses the H2 embedded database internally to store and query data. This post shows how to enable the H2 debug web console to dig into how Ignite is managing your data. I can recall a number of times a web console like this would have been quite useful.

http://frommyworkshop.blogspot.com/2018/09/apache-ignite-deep-dive-sql-engine.html

A good, practical discussion of how much a company should invest in data pipelines based on its size and scale. It also includes some discussion of the tools they use at Grubhub.

https://bytes.grubhub.com/scaling-etl-how-data-pipelines-evolve-as-your-business-grows-72ff6c744e6e

Apache Hivemall (incubating) is a machine learning library for Hive and Spark. This presentation gives a good introduction to the library, and it has examples for several of Hivemall's features. There's also coverage of the new algorithms and changes coming in future releases.

https://www.slideshare.net/myui/introduction-to-apache-hivemall-v050-116293454

Apache Kylin comes up from time to time since it provides an "OLAP on Hadoop" capability. This post provides a good introduction to what that means and describes some practical considerations when using Kylin.

https://medium.com/@mvneethu90/olap-in-hadoop-apache-kylin-bf0377d8b44f

Zenreach writes about the evolution of their real-time data platform, which started as Python scripts, moved to Spark Streaming, and is now Kafka Streams. They share lessons learned across scaling, persistent storage for stateful streaming, fault tolerance, logging/monitoring/alerting, and more.

https://www.confluent.io/blog/real-time-presence-detection-apache-kafka-aws

Apache Flink 1.6.0 has a new State Time-To-Live (TTL) feature. This post describes how to use the API for the new feature and provides some details on how it's implemented (and the current limitations).

https://data-artisans.com/blog/state-ttl-for-apache-flink-how-to-limit-the-lifetime-of-state
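
The general idea behind a state TTL can be shown in a few lines of Python. This is a conceptual sketch only, not Flink's API: the `TtlValueState` class and its methods are hypothetical, and it assumes the timestamp is refreshed on every write and that expired values are dropped lazily when they are next read.

```python
import time


class TtlValueState:
    """A single value that expires `ttl_seconds` after its last write.

    Hypothetical class for illustration. The clock is injectable so the
    behavior can be checked without sleeping.
    """

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._written_at = None

    def update(self, value):
        # Every write refreshes the timestamp stored next to the value.
        self._value = value
        self._written_at = self._clock()

    def value(self):
        # Nothing has been written yet (or it was already cleaned up).
        if self._written_at is None:
            return None
        # Expired entries are removed lazily, on the next read.
        if self._clock() - self._written_at >= self._ttl:
            self._value = None
            self._written_at = None
            return None
        return self._value


# Simulated clock: a one-element list holding the current time in seconds.
now = [0.0]
state = TtlValueState(ttl_seconds=10, clock=lambda: now[0])
state.update("last-login")
now[0] = 5.0
fresh = state.value()    # still within the TTL
now[0] = 15.0
expired = state.value()  # past the TTL, so the value is gone
print(fresh, expired)
```

Lazy cleanup keeps reads correct, but state that is never read again is never reclaimed in this sketch, which is the kind of limitation a real implementation has to address separately.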

Apache Parquet, ORC, and other column-oriented data frameworks have been quite important in many of the speedups that Hadoop-ecosystem projects have gained in recent years. This post summarizes a 2012 survey paper on modern column-oriented database systems, and it's a good way to dive into the techniques that the storage formats and runtimes use to speed things up.

https://blog.acolyer.org/2018/09/26/the-design-and-implementation-of-modern-column-oriented-database-systems/
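
One technique that column orientation makes practical is run-length encoding, since values of the same column sit next to each other. Here is a small Python sketch of the idea. The sample rows and the helper names (`to_columns`, `run_length_encode`, `run_length_decode`) are made up for illustration and do not come from the paper or from any storage format's code.

```python
from itertools import groupby

# Made-up sample data, stored row by row.
rows = [
    {"country": "US", "clicks": 3},
    {"country": "US", "clicks": 7},
    {"country": "US", "clicks": 1},
    {"country": "DE", "clicks": 4},
    {"country": "DE", "clicks": 2},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in runs for _ in range(count)]


columns = to_columns(rows)
encoded = run_length_encode(columns["country"])
print(encoded)
```

The five country values collapse into two runs, and a filter or count on that column can work on the runs directly instead of touching every row.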

This post previews what is to come in Spark 2.4's improved Kubernetes support. Support for both PySpark and SparkR is being added, there's new support for some Kubernetes volumes, and more.

https://databricks.com/blog/2018/09/26/whats-new-for-apache-spark-on-kubernetes-in-the-upcoming-apache-spark-2-4-release.html

Confluent has a guide for troubleshooting common KSQL problems. Lots of good tips that are more generally applicable to Kafka if you're not yet using KSQL.

https://www.confluent.io/blog/troubleshooting-ksql-part-1

The PayPal engineering blog has a post about their big data platform. They use the Squbs framework (which is based on Akka Streams) with Kafka. At their scale, they also needed an intermediary buffer to sit in front of Kafka, which they discuss in the post.

https://medium.com/paypal-engineering/https-medium-com-paypal-engineering-tracking-user-behavior-at-scale-f0c584c4ddd4

Good example of writing a bit-more-than-trivial Spark UDF in Scala.

https://medium.com/@sfranks/i-had-trouble-finding-a-nice-example-of-how-to-have-an-udf-with-an-arbitrary-number-of-function-9d9bd30d0cfc

Sponsor

"We are no tables, but you might join us." If you find this as funny as we do, you might be perfect for Runtastic's data engineering team, building algorithms for our suite of fitness apps.

Apply here: http://bit.ly/runtastic-data-engineer

News

Apache Pulsar, the distributed pub-sub system, has been promoted to a top-level project. Originally developed at Yahoo, Pulsar has similarities to Apache Kafka and also several differentiators.

https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces39

Microsoft announced that the forthcoming SQL Server 2019 will have built-in support for Apache Spark and the Hadoop Distributed File System. It's an interesting development that shows that Hadoop's storage layer is still in demand and that interoperability can be a key differentiator of a database system.

https://techcrunch.com/2018/09/24/microsofts-sql-server-gets-built-in-support-for-spark-and-hadoop/

InfoWorld has crowned Apache Spark, Apache Pulsar, Apache Beam, Vitess, TiDB, InfluxDB and several other projects as the "best open source software for data storage and analytics."

https://www.infoworld.com/article/3306454/big-data/the-best-open-source-software-for-data-storage-and-analytics.html

Java 11 was released this week, with some exciting new features (such as the now-open-source Flight Recorder and dynamic class-file constants) that should be quite helpful in large-scale distributed systems.

https://www.infoq.com/news/2018/09/java11-released

Jobs

Have you checked out the Data Eng Weekly job board yet? https://jobs.dataengweekly.com/. Jobs:

Linux Big Data Engineer, G-Research, London: https://jobs.dataengweekly.com/jobs/cc513d48-56d0-4818-8364-84b1319a9411
Data Engineer, AginicX, Sydney: https://jobs.dataengweekly.com/jobs/07f44617-4048-4236-beb7-9b7ae47fb849

Post a job for $99. https://jobs.dataengweekly.com/

Releases

Ko is a new type-safe programming language focussed on building concurrent, deadlock-free systems. It's built on the Go runtime.

https://github.com/kocircuit/kocircuit

The AWS Database Migration Service now supports migrating from Apache Cassandra to Amazon DynamoDB.

https://aws.amazon.com/about-aws/whats-new/2018/09/aws-dms-aws-sct-now-support-the-migration-of-apache-cassandra-databases/

Apache Parquet C++ 1.5.0 is out. Release highlights include the ability to split RowGroups based on size and initial support for encryption and bloom filters.

https://lists.apache.org/thread.html/6c5702a5e5cdadb021ab372ff4cfe42e44e76515f0a37648a6f1e731@%3Cannounce.apache.org%3E

Apache HAWQ, the MPP database, released version 2.4.0.0. It's the first release as a top-level project.

https://lists.apache.org/thread.html/f5c076f19d0098a443a73eea04f81e78af414e07fb85d65b6874ba21@%3Cannounce.apache.org%3E

Microsoft and Starburst have announced that Starburst's distribution of Presto is now available on Azure HDInsight.

https://azure.microsoft.com/en-us/blog/azure-hdinsight-and-starburst-brings-presto-to-microsoft-azure-customers/

Apache HBase 1.2.7 is out. It has a number of critical fixes, and there are a handful of backward incompatible changes.

https://lists.apache.org/thread.html/68df512a3f55b699478e018299027b0e91ea9438bd98a53ebd9cc106@%3Cannounce.apache.org%3E

Events

Curated by Datadog ( http://www.datadog.com )

California

Data Engineering SD Meetup (San Diego) - Thursday, October 4
https://www.meetup.com/Data-Engineering-San-Diego/events/255035034/

Oklahoma

Streaming Data (Oklahoma City) - Thursday, October 4
https://www.meetup.com/Big-Data-in-Oklahoma-City/events/251214678/

Virginia

MemSQL: How Kafka and Modern Databases Benefit Apps & Analytics (McLean) - Thursday, October 4
https://www.meetup.com/BusinessIntelligentsiaDC/events/248502453/

UNITED KINGDOM

Beam Summit London 2018 (London) - Monday, October 1
https://www.meetup.com/London-Apache-Beam-Meetup/events/254297128/

Spark + AI Meetup (London) - Tuesday, October 2
https://www.meetup.com/Spark-London/events/254679118/

How Much Kafka Do You Need? The Art and Science of Capacity Planning (London) - Thursday, October 4
https://www.meetup.com/Apache-Kafka-London/events/254610038/

Apache Kafka’s Role in Modern Data Architectures (Leeds) - Thursday, October 4
https://www.meetup.com/Leeds-JVMThing/events/254404621/

FINLAND

Apache Kafka Meetup @ Aiven (Helsinki) - Monday, October 1
https://www.meetup.com/Helsinki-Apache-Kafka-Meetup/events/254753142/

FRANCE

Beyond the Brokers: A Tour of the Kafka Environment (Rennes) - Thursday, October 4
https://www.meetup.com/BreizhJUG/events/254737310/

INDIA

Apache Arrow and PySpark (Bangalore) - Saturday, October 6
https://www.meetup.com/Bangalore-Apache-Spark-Meetup/events/255019701/

AUSTRALIA

Kafka Talk with Tim Berglund and Servian (Melbourne) - Tuesday, October 2
https://www.meetup.com/KafkaMelbourne/events/254251775/

Data Engineering Meetup (Sydney) - Wednesday, October 3
https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/254467121/