Data Eng Weekly

Data Eng Weekly Issue #261

22 April 2018

Lots of great content this week, including a couple of system migration stories (Hive+Sqoop to Spark, Cron to Airflow), paradigms of stream processing, DynamoDB at Nike, and LinkedIn's Aeon system for latency tracking. In news, there's a great post on data engineering vs. data science roles and video interviews from DataWorks. In releases, Apple open sourced FoundationDB and Apache Hadoop 2.7.6 is out.


Data Eng Weekly is starting a job board! For the next month, postings are discounted at $149 (regularly $249) for 31 days. Hopefully this will be a useful service for both job seekers and companies hoping to reach the data engineering community. Questions or comments?


This is a great, Azure-focused whirlwind tour of Hadoop (and briefly MapReduce), Pig (on Tez), Storm (with Azure Event Hubs), and Spark. It uses PowerShell and the Azure UI to deploy clusters to crunch data from the Global Database of Events, Language, and Tone (GDELT) dataset.

Wayfair has written a post on the scalability and reliability issues they faced with a large Graphite deploy. They're moving to a new system built on InfluxDB with Kafka as the transport.

Qubole's AIR platform analyzes data about Hive, Spark, and Presto clusters. The system is powered by Apache Airflow for job orchestration, which is the topic of this post. Qubole discusses why they chose Airflow and what some of the pitfalls have been as they've worked with it.

This post describes how to use Scala for data prep in Apache Spark. Once that's done, Spark SQL and Apache Zeppelin can be used to query and visualize the results. This type of hybrid solution seems like a great way to make sure you're using the best tool at each step of your analysis.
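To make that hybrid shape concrete, here's a minimal sketch (in Python/PySpark rather than the post's Scala, with a made-up log format and file path): the prep logic is plain Python so it can be tested on its own, then applied across the cluster and registered for Spark SQL queries.

```python
# Hypothetical example: clean raw "user,action,ms" log lines, then query
# the cleaned data with Spark SQL. The parse function is plain Python so
# the same logic can be unit-tested without a Spark cluster.

def parse_line(line):
    """Return a record dict for a well-formed 'user,action,ms' line, else None."""
    parts = line.strip().split(",")
    if len(parts) != 3 or not parts[2].isdigit():
        return None
    return {"user": parts[0], "action": parts[1], "ms": int(parts[2])}

if __name__ == "__main__":
    # Deferred import so the helper above stays usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("prep-then-sql").getOrCreate()
    cleaned = (spark.sparkContext.textFile("s3://my-bucket/logs.csv")  # hypothetical path
               .map(parse_line)
               .filter(lambda r: r is not None))
    df = spark.createDataFrame(cleaned)
    df.createOrReplaceTempView("events")
    # The same view could then be queried from a Zeppelin notebook paragraph.
    spark.sql("SELECT action, avg(ms) AS avg_ms FROM events GROUP BY action").show()
```

The temp view is the handoff point: prep happens in code, and everything downstream is SQL.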

TechTarget has coverage of the Flink Forward conference talks by Capital One and Comcast. There are some interesting insights into how the companies are supporting data science and machine learning—e.g. both are using Jython to bridge the gap between data science and production systems.

Here's a story of migrating from a Hive+Sqoop setup to a Spark-based one. The system is running in AWS, integrates with S3 and Redshift, and uses Zeppelin notebooks for Spark.

This post covers a few paradigms for stream processing. The "Kafka abstraction funnel" is a new one to me—it describes the fallback approach of using KSQL first, then the Streams DSL, then the Processor API, and then the raw Producer/Consumer APIs. There's also a new project that demos "hello world" in a bunch of different stream processing frameworks, to give you a good flavor of each.
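As an illustration only (the predicate names here are mine, not from the post), the funnel's fallback order can be sketched as a tiny decision helper: start at the highest-level API and drop a level only when requirements force it.

```python
# The "Kafka abstraction funnel": try the highest-level API first and fall
# back a level only when the job demands it. Predicates are illustrative.
FUNNEL = ["KSQL", "Kafka Streams DSL", "Processor API", "Producer/Consumer API"]

def pick_api(sql_expressible=True, needs_custom_state=False,
             needs_low_level_control=False):
    """Walk down the funnel until an abstraction level fits the job."""
    if sql_expressible and not needs_custom_state and not needs_low_level_control:
        return FUNNEL[0]
    if not needs_custom_state and not needs_low_level_control:
        return FUNNEL[1]
    if not needs_low_level_control:
        return FUNNEL[2]
    return FUNNEL[3]
```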

Videoamp migrated from Cron to Apache Airflow, and they have a lot of lessons learned (both good things and some pitfalls!) to share about the transition.
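A common first step in that kind of migration is translating each crontab entry into a DAG. Here's a hypothetical helper (the cron line, DAG id, and operator are my own examples, and the Airflow-specific code is sketched in comments so this runs standalone):

```python
def cron_to_dag_kwargs(crontab_line, dag_id):
    """Split a crontab entry like '0 2 * * * /usr/local/bin/etl.sh' into the
    pieces a minimal Airflow DAG plus BashOperator would need."""
    fields = crontab_line.split()
    schedule = " ".join(fields[:5])   # cron's five time fields
    command = " ".join(fields[5:])    # everything after is the command
    return {"dag_id": dag_id,
            "schedule_interval": schedule,  # Airflow accepts cron expressions
            "bash_command": command}

# In an actual DAG file (Airflow 1.x-era names, sketched for illustration):
#   from airflow import DAG
#   from airflow.operators.bash_operator import BashOperator
#   kw = cron_to_dag_kwargs("0 2 * * * /usr/local/bin/etl.sh", "nightly_etl")
#   dag = DAG(kw["dag_id"], schedule_interval=kw["schedule_interval"],
#             start_date=..., catchup=False)
#   BashOperator(task_id="run", bash_command=kw["bash_command"], dag=dag)
```

The payoff over cron is everything around the schedule: retries, dependencies between tasks, and backfills.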

This presentation introduces Pachyderm, a data management platform for Kubernetes that includes data versioning, data provenance, and more.

Apache Heron (incubating) supports running streaming jobs with HashiCorp Nomad as the cluster scheduler. This post walks through the steps to get that setup running.

The LinkedIn engineering blog has a post about Aeon, their event and latency tracking system built on Apache Kafka and Apache Samza.

Apple open-sourced FoundationDB this week, which got a lot of attention. Snowflake is a happy FoundationDB user, as it powers their metadata store, and they share some more details in this post.

Nike is using DynamoDB as the data store for many of their microservices. Dynamo replaces Couchbase and Cassandra—the main advantages are lower operational overhead and additional features like secondary indexes and encryption at rest. It's not without its pitfalls, though—hot-spotting (in which case requests are throttled) and big launch events require careful consideration.
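One standard mitigation for hot partitions is write sharding (a general DynamoDB technique, not necessarily what Nike does; the key names and shard count below are made up): append a suffix derived from a per-item attribute so writes to one hot logical key spread across several physical partitions, at the cost of fanning reads out over the shards.

```python
import hashlib

N_SHARDS = 10  # hypothetical; sized to the expected write rate

def sharded_key(hot_key, discriminator, n_shards=N_SHARDS):
    """Derive a partition key like 'sku123#4' by hashing a per-item
    discriminator (e.g. the order id), spreading one hot logical key's
    writes across n_shards physical partitions."""
    digest = hashlib.md5(discriminator.encode("utf-8")).hexdigest()
    return f"{hot_key}#{int(digest, 16) % n_shards}"

def all_shard_keys(hot_key, n_shards=N_SHARDS):
    """Reads must fan out: query every shard of the logical key and merge."""
    return [f"{hot_key}#{i}" for i in range(n_shards)]
```

With boto3, a put_item call would use sharded_key(...) as the table's partition key, while a read for the logical key issues one query per entry of all_shard_keys and merges the results.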

If you have a lot of small, time-sensitive data tasks, then Apache Airflow might not be the best fit. This post describes that situation and the tradeoffs of switching to Celery for task scheduling.
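For flavor, here's a minimal Celery task sketch (the broker URL, module name, and task are hypothetical; the import falls back to a plain decorator so the example also loads where Celery isn't installed):

```python
try:
    from celery import Celery
    app = Celery("tasks", broker="redis://localhost:6379/0")  # hypothetical broker
    task = app.task
except ImportError:
    def task(f):  # plain-function fallback for illustration
        return f

@task
def resize_image(image_id):
    """A small, time-sensitive unit of work: Celery pushes it to a worker as
    soon as it's enqueued, rather than waiting on a scheduler loop."""
    return f"resized:{image_id}"

# The producer side would enqueue with a freshness bound, e.g.:
#   resize_image.apply_async(args=["img42"], expires=30)  # drop if stale
```

That push model, with per-task deadlines, is the draw for latency-sensitive work; what you give up is Airflow's dependency graph and backfill machinery.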

ScaleFlux has published a whitepaper that describes how they've achieved performance speedups for HBase. Using hardware acceleration, their solution achieves GZIP compression ratios with Snappy speed and throughput.


This is a great article on the core competencies of data engineers and data scientists, several negative scenarios that can occur when a data scientist spends their time on data engineering, and the growing role of machine learning engineer, which sits in the middle.

theCUBE was at DataWorks Summit EU last week, and they have posted a number of interviews with folks from Hortonworks, IBM, Accenture, and more.

The ARCHITECHT podcast has an interview this week with Jay Kreps. Topics covered include the rise of Kafka alternatives, big data IPOs, and open source business models.

Videos and slides from the Women in Big Data meetup have been posted. There are talks on Python+JVM, DevOps for Data Scientists, and visualizations with big data.

As mentioned above, Apple has open sourced FoundationDB, which is a distributed key-value store with ACID transactions. The project is getting off the ground and building a community.

This post recaps a number of distributed systems talks from QCon London, which took place earlier this year. There's a summary and a link to slides for each.




Luigi 2.7.5 was released. It includes the fixes and new features from the 2.7.4 release, as well as a fix for a cross-site scripting vulnerability in the visualizer UI.

At DataWorks Summit, Hortonworks announced Data Steward Studio. It's a security and governance focused product with applications to GDPR. SiliconANGLE has more coverage.

Version 4.1 of the Confluent Platform includes the GA of KSQL (less than a year after the developer preview started). The release also includes enhanced clients and Apache Kafka 1.1.

Apache Oozie 5.0.0 was released. Highlights include JDK 8 support, a new YARN-based launcher, and several updates.

This tool for running a Kafka cluster via docker compose has been updated to support version 4.1 of the Confluent Platform.

Apache Hadoop 2.7.6 is out. It includes a total of 46 bug fixes and optimizations.


Curated by Datadog



First StreamSets User Group Meetup: Scale Out with StreamSets (San Francisco) - Tuesday, April 24


First Denver Data Engineering Meetup (Denver) - Thursday, April 26


Spark 2.3 Update, Machine Learning Pipelines Intro, and CI/CD How-to (Atlanta) - Thursday, April 26


Spark 2.3 and Azure Databricks (Reston) - Wednesday, April 25


2 Billion Messages in Kafka (Montreal) - Tuesday, April 24


Apache Beam Meetup 4: Use Case on Beam + Becoming a Committer + Discussions (London) - Tuesday, April 24

Recap and Summary from Flink Forward SF 2018 (London) - Tuesday, April 24

Streaming with KSQL + Monitoring Kafka Like a Pro (London) - Wednesday, April 25


Processing Hierarchical Tables with Spark, by Jose Luis Sanchez from Zurich (Barcelona) - Thursday, April 26


Disaster Recovery Solutions for Hadoop Clusters (Neuilly-Sur-Seine) - Tuesday, April 24


SageMaker, DeepLens, & Message-Driven Architecture (Amsterdam) - Tuesday, April 24


KSQL and Stream All Things with Gwen Shapira and Matthias J. Sax (Berlin) - Wednesday, April 25


Big Data Architecture 101 & Kafka 101 (Taguig) - Wednesday, April 25