Data Eng Weekly Issue #264

13 May 2018

Lots of great content this week, including a new project for MapReduce using AWS Lambda called Corral, coverage of the Onyx and Wallaroo stream processing systems, distributed tracing with Jaeger, and tips for working with Kafka Connect. In news, there's an article warning that the prevalence of open source could change and a podcast episode covering the Great Expectations library for testing data pipelines. In releases, Apache Impala 3.0.0 is out, and there are several other interesting projects to catch up on, like the Mara Python ETL framework.

Sponsors

At Datadog, we’re on a mission to build the best monitoring platform in the world. We're looking for Data Engineers to use modern tech to build and extend our core pipelines that process trillions of data points per day. More here:

https://dtdg.co/2FBYYTv

We’re Shopify. Our mission is to make commerce better for everyone http://bit.ly/shopify-careers and in the last few years we have built a data warehouse that is now ready to power data products and insights for all of our 500,000+ merchants! We thrive on change, operate on trust, and leverage the diverse perspectives of people on our team in everything we do. We solve problems at a rapid pace and at scale. In short, we get shit done. Join us!

http://bit.ly/join-us-shopify-data

Technical

Corral is a Golang implementation of MapReduce that runs on AWS Lambda. This post gives a good overview of the architecture and design tradeoffs—for example, due to Lambda timeouts, shuffle data is written to S3.

https://benjamincongdon.me/blog/2018/05/02/Introducing-Corral-A-Serverless-MapReduce-Framework/
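
The map, shuffle, and reduce phases that a framework like Corral spreads across Lambda invocations can be sketched in a single process. The Python word count below is an illustration only, not Corral's Go API: the function names are hypothetical, and an in-memory dictionary stands in for the shuffle data that Corral writes to S3.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key; the dict stands in for intermediate storage such as S3."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["to be or not to be", "to do"])
print(counts)
```

In a serverless setting each call to map_phase or reduce_phase would be a separate short-lived invocation, which is why the intermediate groups have to live in external storage rather than in memory.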

DynamoDB has a client that implements transparent client-side encryption of data. When data is read, the client also verifies a signature to ensure that the values weren't tampered with. Hopefully more systems in which data is accessed only by primary key will soon support this type of client.

https://aws.amazon.com/blogs/security/how-to-encrypt-and-sign-dynamodb-data-in-your-application/
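
The tamper check on read can be illustrated with a keyed signature over the item. This is a minimal sketch using only Python's standard library, not the API of the DynamoDB client: sign_item and verify_item are hypothetical names, the key handling is simplified, and the encryption half is left out entirely.

```python
import hashlib
import hmac
import json


def sign_item(item, key):
    """Return a hex HMAC-SHA256 signature over a canonical JSON encoding of the item."""
    payload = json.dumps(item, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_item(item, signature, key):
    """Recompute the signature and compare it in constant time."""
    return hmac.compare_digest(sign_item(item, key), signature)


# Hypothetical key and item, for illustration only.
key = b"example-signing-key"
item = {"pk": "user#1", "balance": 10}
signature = sign_item(item, key)

tampered = dict(item, balance=1000)
print(verify_item(item, signature, key), verify_item(tampered, signature, key))
```

The signature would be stored alongside the item at write time, so a reader holding the key can reject any item whose stored values no longer match it.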

This post covers the windowing features in the Onyx distributed stream processing framework. Windows can be fixed, sliding, global, or session-based. The author includes good examples and visualizations of each.

https://medium.com/formcept/analyzing-real-time-data-streams-using-onyx-windowing-options-formcept-explores-983b498b1cdd
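
To make the window types concrete, here is a small Python sketch of how a timestamp maps to fixed, sliding, and session windows. It is not Onyx code (Onyx is configured in Clojure). The function names are hypothetical, and timestamps are assumed to be non-negative integers. A global window is the trivial case in which every event lands in the same window.

```python
def fixed_window(ts, size):
    """Return the single [start, end) window of length `size` that contains `ts`."""
    start = (ts // size) * size
    return (start, start + size)


def sliding_windows(ts, size, slide):
    """Return every [start, end) window of length `size`, advancing by `slide`, that contains `ts`."""
    windows = []
    start = (ts // slide) * slide  # latest window start at or before ts
    while start > ts - size and start >= 0:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def session_windows(timestamps, gap):
    """Group timestamps into sessions that end after more than `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)  # still within the current session
        else:
            sessions.append([ts])  # inactivity exceeded the gap: start a new session
    return sessions


print(fixed_window(7, 5))
print(sliding_windows(7, 10, 5))
print(session_windows([1, 2, 3, 10, 11, 30], 5))
```

The key difference is that an event belongs to exactly one fixed window, to several overlapping sliding windows, and to a session whose boundaries depend on the data itself rather than on the clock.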

Another up-and-coming stream processing framework, Wallaroo, has posted a FAQ that covers architecture internals, licensing, scalability and production-readiness, and more. Notably, Wallaroo doesn't require the JVM and has native support for Python and Golang.

https://medium.com/wallaroo-labs-engineering/your-wallaroo-questions-answered-dce3ce97643e

This post provides a useful comparison between Apache Kafka's subscription semantics and the shared message queues and publish-subscribe models of classic message queue systems.

http://blog.cloudera.com/blog/2018/05/scalability-of-kafka-messaging-using-consumer-groups/
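
The two behaviours can be seen in a toy model: inside one consumer group each partition is read by exactly one member (queue-like), while separate groups each receive every partition (publish-subscribe-like). The Python below is a simplified sketch with hypothetical names and a plain round-robin assignment. It does not use a Kafka client and does not reproduce Kafka's actual assignors or rebalancing.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions across one group's consumers, round-robin.

    Each partition ends up with exactly one consumer in the group.
    """
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


def deliver(partitions, groups):
    """Return {group: {consumer: [partitions]}}; every group sees every partition."""
    return {name: assign_partitions(partitions, members) for name, members in groups.items()}


# Hypothetical group and consumer names.
layout = deliver([0, 1, 2, 3], {"billing": ["b1", "b2"], "audit": ["a1"]})
print(layout)
```

Adding a consumer to a group splits the work further, up to one consumer per partition, whereas adding a new group adds another full copy of the stream.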

Sematext has a three-part series on distributed tracing. It covers OpenTracing and two open-source implementations, Zipkin and Jaeger (the latter is the topic of the latest article).

https://sematext.com/blog/opentracing-jaeger-as-distributed-tracer/

The Mockaroo service can be a useful way to generate test data. This post shows how to use it with kafkacat to populate a Kafka topic with test data.

https://rmoff.net/2018/05/10/quick-n-easy-population-of-realistic-test-data-into-kafka-with-mockaroo-and-kafkacat/

This post introduces the notion of "scientific debt" in data science. There are several causes, and a few of them are data engineering-related: lack of monitoring, lack of data infrastructure, and the absence of a data engineering process.

http://varianceexplained.org/r/scientific-debt/

Localz has a walkthrough of building a log analysis system entirely from managed services on AWS: Glue for ETL, Athena for queries, and QuickSight for visualization. There's an analysis of how a Glue job that converts the data from JSON to Parquet pays for itself after only a couple of queries, because Athena (which is built on Presto and charged per TB scanned) scans drastically less data with Parquet storage.

https://medium.com/localz-engineering/serverless-big-data-start-here-aws-glue-athena-quicksite-4c70ecac9fe3
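
The break-even reasoning is simple arithmetic: a one-time conversion cost is recovered once the per-query savings from scanning less data add up to it. The Python sketch below uses a hypothetical helper, and every number in the example is an assumption chosen for illustration, not a figure from the article or a current AWS price.

```python
import math


def queries_to_break_even(conversion_cost, raw_tb, columnar_tb, price_per_tb):
    """Number of queries after which a one-time conversion job has paid for itself."""
    saving_per_query = (raw_tb - columnar_tb) * price_per_tb
    if saving_per_query <= 0:
        raise ValueError("conversion does not reduce the scan cost")
    return math.ceil(conversion_cost / saving_per_query)


# Assumed inputs: an $8.00 conversion job, a full-table query that scans 1.0 TB
# of JSON but only 0.1 TB of Parquet, and a price of $5.00 per TB scanned.
print(queries_to_break_even(8.0, 1.0, 0.1, 5.0))
```

With these assumed inputs each query saves $4.50, so the second query already covers the conversion cost.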

Good debug tips and lessons learned for working with Kafka Connect when consuming JSON and Apache Avro data out of Kafka.

https://medium.com/funding-circle/debugging-kafka-connect-sinks-c2c08f0de97f

Data Eng Jobs

There are three listings for data engineering jobs in New York, San Francisco, Mountain View, and Paris. Check them out or add your own!

https://dataengweekly.com/jobs

News

Pyrostore is a new product that provides Kafka-like semantics but uses object storage (Amazon S3 or Google Cloud Storage) to decouple the storage from compute.

http://pyrostore.io/blog/2018/05/10/kafka-potential-past-present.html

The Call for Presentations for Flink Forward Berlin, which takes place in September, is open now through June 4th.

https://flink-forward.org/call-for-presentations-submit-talk/

I've noticed recently that a number of newer distributed system projects/companies are using a combo of open-source and freemium software licenses. This article is a great analysis of the trend—it highlights that we are starting to take open-source licenses for granted, and we may need to accept new software built using a different model.

http://redmonk.com/sogrady/2018/05/11/taking-open-source-for-granted/

Podcast.init has an interview this week with the creators of Great Expectations, which is a Python library for modeling assumptions in your data pipelines and performing automated tests to verify them.

https://www.podcastinit.com/great-expectations-with-abe-gong-and-james-campbell-episode-161/
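
The general idea of declaring assumptions about data and testing them automatically can be sketched in plain Python. This is deliberately not the Great Expectations API: the helper names, the result format, and the sample rows are all hypothetical.

```python
def expect_not_null(rows, column):
    """Pass only if every row has a non-None value in `column`."""
    failures = [row for row in rows if row.get(column) is None]
    return {"expectation": f"{column} is not null", "success": not failures, "failures": failures}


def expect_between(rows, column, low, high):
    """Pass only if every non-null value in `column` lies in [low, high]."""
    failures = [
        row for row in rows
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]
    return {"expectation": f"{column} in [{low}, {high}]", "success": not failures, "failures": failures}


def validate(rows, checks):
    """Run every check against the rows; return (all_passed, per-check results)."""
    results = [check(rows) for check in checks]
    return all(result["success"] for result in results), results


# Hypothetical batch of pipeline output.
rows = [{"id": 1, "age": 30}, {"id": 2, "age": None}, {"id": 3, "age": 150}]
checks = [
    lambda batch: expect_not_null(batch, "id"),
    lambda batch: expect_between(batch, "age", 0, 120),
]
ok, results = validate(rows, checks)
print(ok, [result["expectation"] for result in results if not result["success"]])
```

Running a set of checks like this at each pipeline stage turns an implicit assumption ("ages are plausible") into a test that fails loudly when upstream data changes.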

Releases

Astronomer Enterprise Edition 0.2.0 is a SaaS product for Apache Airflow. With this release, you can now deploy Airflow on your own Kubernetes cluster. The release also includes updates to the API and CLI.

https://www.astronomer.io/blog/announcing-astronomer-enterprise-edition-0-2-0/

HMSClient is a new Python library for interacting with the Hive Metastore.

https://github.com/gglanzani/hmsclient

Mara is an ETL framework written in Python. It aims to be more powerful than scripting but less complex than a project like Airflow. It has support for Postgres and a web UI.

https://github.com/mara/data-integration

Amazon Aurora Backtrack is a new feature of the hosted database that provides access to the historic state of your database for up to 72 hours. It can be used to return to a previous state and undo a destructive command.

https://aws.amazon.com/blogs/aws/amazon-aurora-backtrack-turn-back-time/

Apache Impala 3.0.0 was released. There are a number of changes (some backwards incompatible), including several new reserved keywords (for SQL:2016), Decimal_V2, and array handling in Parquet files.

https://lists.apache.org/thread.html/a4f9cfab455b366e9ce388e45d8e49617b056d22749856253259152d@%3Cannounce.apache.org%3E

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

BASM @ Bloomberg (San Francisco) - Tuesday, May 15
https://www.meetup.com/spark-users/events/250221273/

Building Data Pipelines x2 + ML and Go (San Francisco) - Wednesday, May 16
https://www.meetup.com/golangsf/events/249469374/

Colorado

Building a Real-Time Streaming Platform (Englewood) - Tuesday, May 15
https://www.meetup.com/Front-Range-Apache-Kafka/events/250217845/

Utah

Monthly Spark, Big Data, and Data Engineering Meetup (Salt Lake City) - Monday, May 14
https://www.meetup.com/apache-spark-slc/events/248643902/

Virginia

High Speed Data Visualization: Kafka Meets Elasticsearch (Ashburn) - Wednesday, May 16
https://www.meetup.com/Apache-Kafka-DC/events/250336706/

BELGIUM

Intro to & Experiences Implementing Databricks (Zaventem) - Monday, May 14
https://www.meetup.com/Microsoft-Advanced-Analytics-User-Group/events/248414092/

BULGARIA

Uber Engineering Sofia (Sofia) - Thursday, May 17
https://www.meetup.com/Uber-Engineering-Events-Sofia/events/250449225/

ISRAEL

KSQL Deep Dive, with Kai Waehner of Confluent (Tel Aviv) - Monday, May 14
https://www.meetup.com/ApacheKafkaTLV/events/249561540/

Async with Akka, Spark Pitfalls (Tel Aviv-Yafo) - Tuesday, May 15
https://www.meetup.com/underscore/events/249827607/

RUSSIA

From Queues to Stream Processing (Moscow) - Thursday, May 17
https://www.meetup.com/Moscow-Kafka-Meetup/events/250138402/

AUSTRALIA

Stream All Things with Kafka and KSQL (Sydney) - Tuesday, May 15
https://www.meetup.com/apache-kafka-sydney/events/250052794/

Streaming ETL with Apache Kafka and KSQL/Apache Metron (Melbourne) - Wednesday, May 16
https://www.meetup.com/KafkaMelbourne/events/250373412/

Fast Data: Stream All the Things! (Sydney) - Thursday, May 17
https://www.meetup.com/Sydney-Apache-Spark-User-Group/events/249667854/

NEW ZEALAND

Streaming ETL with Apache Kafka and KSQL (Wellington) - Thursday, May 17
https://www.meetup.com/Kafka-Wellington/events/250078551/