13 May 2018
Lots of great content this week, including a new project for MapReduce using AWS Lambda called Corral, coverage of the Onyx and Wallaroo stream processing systems, distributed tracing with Jaeger, and tips for working with Kafka Connect. In news, there's an article warning that the prevalence of open source could change and a podcast episode covering the Great Expectations library for testing data pipelines. In releases, Apache Impala 3.0.0 is out, and there are several other interesting projects to catch up on like the Mara Python ETL framework.
At Datadog, we’re on a mission to build the best monitoring platform in the world. We're looking for Data Engineers to use modern tech to build and extend our core pipelines that process trillions of data points per day. More here:
We’re Shopify. Our mission is to make commerce better for everyone (http://bit.ly/shopify-careers), and in the last few years we have built a data warehouse that is now ready to power data products and insights for all of our 500,000+ merchants! We thrive on change, operate on trust, and leverage the diverse perspectives of people on our team in everything we do. We solve problems at a rapid pace and at scale. In short, we get shit done. Join us!
http://bit.ly/join-us-shopify-data
Corral is a Golang implementation of MapReduce that runs on AWS Lambda. This post gives a good overview of the architecture and design tradeoffs—for example, due to Lambda timeouts, shuffle data is written to S3.
https://benjamincongdon.me/blog/2018/05/02/Introducing-Corral-A-Serverless-MapReduce-Framework/
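For readers new to the model, here is a minimal in-memory sketch of the map, shuffle, and reduce phases using word count. It is not Corral's API (Corral is written in Go); the function names are hypothetical, and the shuffle step that Corral persists to S3 is stood in for by a plain dictionary.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word, as a word-count mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group emitted values by key.

    Assumption: this dict stands in for the intermediate shuffle data,
    which the post says Corral writes to S3 between Lambda invocations.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run the three phases end to end on an iterable of text lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))
```

Calling `word_count(["a b a", "b c"])` counts each word across both lines. In a serverless setting, each phase would run in separate short-lived functions, which is why the grouped data has to live somewhere durable in between.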
DynamoDB has a client that implements transparent client-side encryption of data. When data is read, the client also verifies a signature to ensure that the values weren't tampered with. Hopefully, more systems in which data is accessed only by primary key will soon support this type of client.
https://aws.amazon.com/blogs/security/how-to-encrypt-and-sign-dynamodb-data-in-your-application/
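The signature check can be illustrated with a small sketch using Python's standard library. This is not the DynamoDB encryption client and it covers only the tamper-detection half (no encryption); the helper names and the choice of HMAC-SHA256 over a canonical JSON encoding are assumptions made for illustration.

```python
import hashlib
import hmac
import json


def sign_item(item, key):
    """Return a hex HMAC-SHA256 signature over a canonical JSON encoding.

    Assumption: sorting keys gives a stable byte representation, so the
    same item always produces the same signature.
    """
    payload = json.dumps(item, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_item(item, signature, key):
    """Recompute the signature on read and compare in constant time."""
    return hmac.compare_digest(sign_item(item, key), signature)
```

A writer would store the signature next to the item, and a reader would call `verify_item` before trusting the values. Any change to the stored attributes makes the check fail.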
This post covers the windowing features in the Onyx distributed stream processing framework. Windows can be fixed, sliding, global, or session-based. The author includes good examples and visualizations of each.
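As a rough illustration of how these window types differ, the sketch below assigns `(timestamp, value)` events to fixed, sliding, and session windows. It is generic Python rather than Onyx's API (Onyx is a Clojure project), and the function names, integer timestamps, and the session gap rule are assumptions. A global window is the trivial case in which every event lands in a single window.

```python
def fixed_windows(events, size):
    """Assign each event to exactly one window [start, start + size)."""
    windows = {}
    for ts, value in events:
        start = (ts // size) * size
        windows.setdefault((start, start + size), []).append(value)
    return windows


def sliding_windows(events, size, slide):
    """Assign each event to every window [start, start + size) that
    contains it, where window starts are multiples of `slide`."""
    windows = {}
    for ts, value in events:
        start = (ts // slide) * slide  # latest window start at or before ts
        while start > ts - size:       # walk back while the window still covers ts
            windows.setdefault((start, start + size), []).append(value)
            start -= slide
    return windows


def session_windows(events, gap):
    """Group events into sessions; a new session starts when the time
    since the previous event exceeds `gap` (an assumed rule)."""
    sessions = []
    for ts, value in sorted(events, key=lambda event: event[0]):
        if sessions and ts - sessions[-1]["end"] <= gap:
            sessions[-1]["end"] = ts
            sessions[-1]["values"].append(value)
        else:
            sessions.append({"start": ts, "end": ts, "values": [value]})
    return sessions
```

With a size of 10 and a slide of 5, every event appears in two overlapping sliding windows, whereas fixed windows never overlap and session windows have no predetermined length.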
Another up-and-coming stream processing framework, Wallaroo, has posted a FAQ that covers architecture internals, licensing, scalability and production-readiness, and more. Notably, Wallaroo doesn't require the JVM and has native support for Python and Golang.
https://medium.com/wallaroo-labs-engineering/your-wallaroo-questions-answered-dce3ce97643e
This post provides a useful comparison between Apache Kafka's subscription semantics and the shared message queues and publish-subscribe models of classic message queue systems.
http://blog.cloudera.com/blog/2018/05/scalability-of-kafka-messaging-using-consumer-groups/
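The two models can be sketched with a small simulation: every consumer group sees all messages (publish-subscribe), while within a group each partition is read by exactly one consumer (shared queue). This is plain Python, not a Kafka client; the function names are hypothetical and the round-robin assignment is a simplification of Kafka's real partition assignors.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions across the consumers of one group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


def deliver(topic, groups):
    """Simulate delivery of a topic to several consumer groups.

    topic:  dict mapping partition id -> list of messages
    groups: dict mapping group name -> list of consumer names
    Returns group -> consumer -> messages that consumer would read.
    """
    result = {}
    for group, consumers in groups.items():
        assignment = assign_partitions(sorted(topic), consumers)
        result[group] = {
            consumer: [msg for part in parts for msg in topic[part]]
            for consumer, parts in assignment.items()
        }
    return result
```

For a topic `{0: ["a", "b"], 1: ["c"]}` and groups `{"billing": ["b1", "b2"], "audit": ["a1"]}`, the two billing consumers split the partitions between them while the single audit consumer reads everything.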
Sematext has a three part series on distributed tracing. It covers OpenTracing and two open-source implementations—Zipkin and Jaeger (which is the topic of the latest article).
https://sematext.com/blog/opentracing-jaeger-as-distributed-tracer/
The Mockaroo service can be a useful way to generate test data. This post shows how to use it with kafkacat to populate a Kafka topic with test data.
This post introduces the notion of "scientific debt" in data science. There are several causes, and a few of them are data engineering-related: lack of monitoring, lack of data infrastructure, and the absence of data engineering process.
http://varianceexplained.org/r/scientific-debt/
Localz has a walkthrough of building a log analysis system entirely from managed services on AWS: Glue for ETL, Athena for querying, and QuickSight for visualization. There's an analysis of how a Glue job that converts JSON to Parquet pays for itself after only a couple of queries, because Athena (which is charged per TB scanned) reads drastically less data from Parquet storage.
Good debug tips and lessons learned for working with Kafka Connect when consuming JSON and Apache Avro data out of Kafka.
https://medium.com/funding-circle/debugging-kafka-connect-sinks-c2c08f0de97f
There are three listings for data engineering jobs in New York, San Francisco, Mountain View, and Paris. Check them out or add your own!
https://dataengweekly.com/jobs
Pyrostore is a new product that provides Kafka-like semantics but uses object storage (Amazon S3 or Google Cloud Storage) to decouple the storage from compute.
http://pyrostore.io/blog/2018/05/10/kafka-potential-past-present.html
The Call for Presentations for Flink Forward Berlin, which takes place in September, is open now through June 4th.
https://flink-forward.org/call-for-presentations-submit-talk/
I've noticed recently that a number of newer distributed system projects/companies are using a combo of open-source and freemium software licenses. This article is a great analysis of the trend—it highlights that we are starting to take open-source licenses for granted, and we may need to accept new software built using a different model.
http://redmonk.com/sogrady/2018/05/11/taking-open-source-for-granted/
Podcast.init has an interview this week with the creators of Great Expectations, a Python library for modeling assumptions about the data in your pipelines and running automated tests to verify them.
https://www.podcastinit.com/great-expectations-with-abe-gong-and-james-campbell-episode-161/
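To give a flavor of what testing assumptions about data can look like, here is a minimal sketch in plain Python. It does not use the Great Expectations library; the helper names, the list-of-dicts row format, and the shape of the result are assumptions made for illustration.

```python
def expect_not_null(rows, column):
    """Check that no row has a missing value in `column`."""
    unexpected = [row for row in rows if row.get(column) is None]
    return {"success": not unexpected, "unexpected_count": len(unexpected)}


def expect_between(rows, column, min_value, max_value):
    """Check that every non-null value in `column` lies in the closed
    range [min_value, max_value]."""
    unexpected = [
        row for row in rows
        if row.get(column) is not None
        and not (min_value <= row[column] <= max_value)
    ]
    return {"success": not unexpected, "unexpected_count": len(unexpected)}


def validate(rows, checks):
    """Run a list of (function, kwargs) checks and report overall success."""
    results = [check(rows, **kwargs) for check, kwargs in checks]
    return all(result["success"] for result in results), results
```

A pipeline step could call `validate` on each batch and halt or alert when the overall result is `False`, so that bad data is caught before it reaches downstream consumers.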
Astronomer Enterprise Edition 0.2.0 is a SaaS product for Apache Airflow. With this release, you can now deploy Airflow on your own Kubernetes cluster. The release also includes updates to the API and CLI.
https://www.astronomer.io/blog/announcing-astronomer-enterprise-edition-0-2-0/
HMSClient is a new Python library for interacting with the Hive Metastore.
https://github.com/gglanzani/hmsclient
Mara is an ETL framework written in Python. It aims to be more powerful than scripting but less complex than a project like Airflow. It has support for Postgres and a web UI.
https://github.com/mara/data-integration
Amazon Aurora Backtrack is a new feature of the hosted database that provides access to the historic state of your database for up to 72 hours. It can be used to return to a previous state and undo a destructive command.
https://aws.amazon.com/blogs/aws/amazon-aurora-backtrack-turn-back-time/
Apache Impala 3.0.0 was released. There are a number of changes (some backwards incompatible), including several new reserved keywords (for SQL:2016), Decimal_V2, and array handling in Parquet files.
Curated by Datadog ( http://www.datadog.com )
BASM @ Bloomberg (San Francisco) - Tuesday, May 15
https://www.meetup.com/spark-users/events/250221273/
Building Data Pipelines x2 + ML and Go (San Francisco) - Wednesday, May 16
https://www.meetup.com/golangsf/events/249469374/
Building a Real-Time Streaming Platform (Englewood) - Tuesday, May 15
https://www.meetup.com/Front-Range-Apache-Kafka/events/250217845/
Monthly Spark, Big Data, and Data Engineering Meetup (Salt Lake City) - Monday, May 14
https://www.meetup.com/apache-spark-slc/events/248643902/
High Speed Data Visualization: Kafka Meets Elasticsearch (Ashburn) - Wednesday, May 16
https://www.meetup.com/Apache-Kafka-DC/events/250336706/
Intro to & Experiences Implementing Databricks (Zaventem) - Monday, May 14
https://www.meetup.com/Microsoft-Advanced-Analytics-User-Group/events/248414092/
Uber Engineering Sofia (Sofia) - Thursday, May 17
https://www.meetup.com/Uber-Engineering-Events-Sofia/events/250449225/
KSQL Deep Dive, with Kai Waehner of Confluent (Tel Aviv) - Monday, May 14
https://www.meetup.com/ApacheKafkaTLV/events/249561540/
Async with Akka, Spark Pitfalls (Tel Aviv-Yafo) - Tuesday, May 15
https://www.meetup.com/underscore/events/249827607/
RUSSIA
From Queues to Stream Processing (Moscow) - Thursday, May 17
https://www.meetup.com/Moscow-Kafka-Meetup/events/250138402/
Stream All Things with Kafka and KSQL (Sydney) - Tuesday, May 15
https://www.meetup.com/apache-kafka-sydney/events/250052794/
Streaming ETL with Apache Kafka and KSQL/Apache Metron (Melbourne) - Wednesday, May 16
https://www.meetup.com/KafkaMelbourne/events/250373412/
Fast Data: Stream All the Things! (Sydney) - Thursday, May 17
https://www.meetup.com/Sydney-Apache-Spark-User-Group/events/249667854/
Streaming ETL with Apache Kafka and KSQL (Wellington) - Thursday, May 17
https://www.meetup.com/Kafka-Wellington/events/250078551/