Data Eng Weekly Issue #272

08 July 2018

Lots of interesting topics this week, including Flink at Netflix, Facebook's embedded machine learning library, a look inside the core Spark codebase, and how Goodreads mirrors data from DynamoDB for analysis by Amazon Athena.

Sponsor

Dremio is an open source Data-as-a-Service platform, based on SQL and Apache Arrow. Accelerate your queries up to 1000x. Self-service experience for BI and data science users. Download at https://bit.ly/2t96fa7, or visit http://dremio.com to learn more.

Technical

Spiral is Facebook's embedded real-time machine learning library. It can be run in an embedded mode, which enables a few interesting use cases, like building a model to determine which entries to invalidate in a cache based on the parameters of an update.

https://code.fb.com/data-infrastructure/spiral-self-tuning-services-via-real-time-machine-learning/

Have you been bitten by running a job (or an entire workflow) with bad input? This post outlines a reusable Airflow operator that executes some SQL queries to sanity check data before running the next job.

https://medium.com/coinmonks/sql-based-sanity-check-in-airflow-before-kicking-off-machine-learning-model-2733868b1cf2
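To make the idea concrete, here is a minimal sketch of a SQL-based sanity check. It is not the operator from the post: it uses Python's built-in sqlite3 module rather than Airflow, and the `events` table, the check names, and the helpers `run_sanity_checks` and `build_demo_db` are all hypothetical. In a real workflow, logic along these lines would sit inside a task that fails and blocks downstream jobs.

```python
import sqlite3


class SanityCheckError(Exception):
    """Raised when a sanity-check query returns an unexpected value."""


def run_sanity_checks(conn, checks):
    """Run each (name, sql, predicate) check and raise on the first failure.

    Each query is assumed to return a single scalar value.
    """
    results = {}
    for name, sql, predicate in checks:
        value = conn.execute(sql).fetchone()[0]
        if not predicate(value):
            raise SanityCheckError(f"check {name!r} failed: got {value!r}")
        results[name] = value
    return results


def build_demo_db():
    """Create an in-memory table standing in for the job's input (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [(1, 9.5), (2, 3.0), (3, 12.25)],
    )
    return conn


# Hypothetical checks: the input must be non-empty and contain no NULL user ids.
CHECKS = [
    ("non_empty", "SELECT COUNT(*) FROM events", lambda n: n > 0),
    ("no_null_users", "SELECT COUNT(*) FROM events WHERE user_id IS NULL", lambda n: n == 0),
]

if __name__ == "__main__":
    print(run_sanity_checks(build_demo_db(), CHECKS))
```

Keeping each check as a query plus a predicate means new checks can be added without touching the runner.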

Consistency in distributed systems can be really hard to understand. This map provides a good overview of the various models. Each model links to a page where you can read a whole lot more about its semantics.

https://jepsen.io/consistency

Databases 101 aims to provide a good overview of databases, and to some extent, distributed systems. It covers a lot of ground, like types of databases, consistency models, and ETL. While it's aimed at a non-technical audience, it's a great reference even if you know this stuff pretty well.

https://thomaslarock.com/2018/07/databases-101/

A common data architecture problem, especially when integrating with legacy systems, is how to handle out-of-order events in Kafka. This post describes a solution in which events are cached in Redis.

https://developer.ibm.com/hadoop/2018/07/01/use-redis-as-a-cache-mechanism-to-handle-out-of-order-kafka-messages/
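One way to picture the caching approach is a small reorder buffer. This is only a sketch under assumptions, not the design from the post: it assumes each event carries a monotonically increasing sequence number, and a plain Python dict stands in for Redis so the example runs without a server. The `ReorderBuffer` class and its `accept` method are hypothetical names.

```python
class ReorderBuffer:
    """Hold events that arrive early and release them in sequence order."""

    def __init__(self, first_seq=0):
        self.next_seq = first_seq  # sequence number we are waiting for
        self.pending = {}          # stand-in for a Redis hash keyed by sequence number

    def accept(self, seq, payload):
        """Store one event and return the payloads that are now ready, in order."""
        if seq < self.next_seq:
            return []  # already processed (duplicate or late replay), so drop it
        self.pending[seq] = payload
        ready = []
        # Drain every consecutive event starting at the one we were waiting for.
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


if __name__ == "__main__":
    buf = ReorderBuffer()
    for seq, payload in [(1, "b"), (0, "a"), (2, "c")]:
        print(seq, buf.accept(seq, payload))
```

A production version would also need an eviction or timeout policy for gaps that never fill, which this sketch leaves out.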

The Databricks blog has a post describing a streaming application built with AWS Lambda, Amazon Kinesis, and the Databricks platform (including Databricks Delta).

https://databricks.com/blog/2018/07/02/build-a-mobile-gaming-events-data-pipeline-with-databricks-delta.html

The Morning Paper covered two data systems that aim to provide strong security and privacy guarantees. The first, EnclaveDB, implements a secure database that can be run in an untrusted environment (with some constraints, such as all queries having to be determined ahead of time). The second, Oblix, implements a search index over encrypted data. There's a lot to each of these, and they're probably a good peek at some techniques that might soon get commercial adoption.

https://blog.acolyer.org/2018/07/05/enclavedb-a-secure-database-using-sgx/
https://blog.acolyer.org/2018/07/06/oblix-an-efficient-oblivious-search-index/

An interesting use case of converting XML to RDF at relatively large scale (six billion RDF tuples) using Apache Beam (running on Google Cloud Dataflow) and coordinated by Apache NiFi.

https://medium.com/datafabric/apache-beam-and-apache-nifi-cooperation-bba42d2de36c

A great deep dive into the Spark codebase (including testing and build tools) by way of adding a new UDF to Spark SQL.

https://blog.insightdatascience.com/a-journey-through-spark-5d67b4af4b24

Netflix's stream processing system processes over 4 trillion events per day (36GB/sec) using thousands of servers. To enable teams within the organization, they've built a self-service infrastructure on Apache Flink and Apache Kafka. Much of this presentation centers on why and how they use Flink, including the scalability challenges they've faced (e.g. in checkpointing), fine-grained recovery, and how they implement reprocessing from scratch.

https://www.slideshare.net/mdaxini/flink-at-netflix-paypal-speaker-series-104453588

A look at how Goodreads transfers data from DynamoDB to S3 (in Parquet format) for query by Amazon Athena. Since DynamoDB is optimized for transactional queries, this enables some analytics use cases that are awkward directly against DynamoDB.

https://aws.amazon.com/blogs/big-data/how-goodreads-offloads-amazon-dynamodb-tables-to-amazon-s3-and-queries-them-using-amazon-athena/

Pandas on Ray is a project to implement a distributed processing backend to Pandas, the popular Python data analysis framework. This post describes the state of the implementation and provides some benchmarks with performance characteristics when run on a multi-core machine.

https://rise.cs.berkeley.edu/blog/pandas-on-ray-early-lessons/

Jobs

Etsy is hiring Senior Data Engineers in New York.

https://jobs.dataengweekly.com/jobs/0aa00df2-679e-4b35-b646-77ccbafd3451

Netflix is hiring a Senior Data Engineer, Playback in Los Gatos.

https://jobs.dataengweekly.com/jobs/08f4bef5-90fd-4a11-9249-15e7ae86023d

Job postings on the Data Eng Weekly job board are now just $99. Submit a job to reach your peers looking for something new!

https://jobs.dataengweekly.com/submit/job

News

Presto Summit is coming up in just over a week, on July 16th. It's free and takes place at Facebook.

https://www.starburstdata.com/technical-blog/announcing-the-first-ever-presto-summit/

Apache Cassandra is 10 years old, and DataStax has an infographic that highlights some key milestones in the project's history.

https://www.datastax.com/resources/infographic/apache-cassandra-version-history

California has enacted a new law, the California Consumer Privacy Act of 2018, that goes into effect in 2020. It has a lot of similarities to GDPR, including rights to know and delete your data.

https://www.datanami.com/2018/07/06/californias-new-data-privacy-law-takes-effect-in-2020/

Releases

Azure HDInsight has added updates to Apache Spark, Apache Kafka, ML Services, and more.

https://azure.microsoft.com/en-us/blog/enterprises-get-deeper-insights-with-hadoop-and-spark-updates-on-azure-hdinsight/

MapR 6.1 was recently released. This post describes some of the main features, including changes to MapR-DB (their database) and MapR-ES (their Kafka-compatible pub/sub system).

https://mapr.com/blog/mapr-6-1-simplifies-the-development-of-ai-and-analytics-applications/

The Apache Kafka team announced maintenance releases of 0.10 and 0.11. Both releases include a number of bug fixes and improvements.

https://lists.apache.org/thread.html/8e6b14fd49b2c7bf814c40a1139898d106634dbbae56c54ebc6fa219@%3Cannounce.apache.org%3E
https://lists.apache.org/thread.html/48bc8b4b11f1d6c4cc7d0cd2b622600de19f943e1c42239f739cba5a@%3Cannounce.apache.org%3E

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Fast and Easy Stream Processing (Irvine) - Wednesday, July 11
https://www.meetup.com/Orange-County-Java-Users-Group-OCJUG/events/250699963/

Washington

Seattle Apache Kafka Meetup (Bellevue) - Thursday, July 12
https://www.meetup.com/Seattle-Apache-Kafka-Meetup/events/251243101/

Colorado

Big Data at the Speed of Business with the Elastic Stack (Denver) - Wednesday, July 11
https://www.meetup.com/Denver-All-Things-Data/events/251736155/

Texas

Optimizing and Indexing Streams with Kafka for Low-Latency Queries (Austin) - Tuesday, July 10
https://www.meetup.com/Austin-Apache-Kafka-Meetup-Stream-Data-Platform/events/252179246/

Michigan

Event-Driven Architectures in Node.js Using Kafka (Detroit) - Wednesday, July 11
https://www.meetup.com/DetNode/events/250762697/

CANADA

Free Big Data Open Class: Getting Started with Apache Spark on AWS (Toronto) - Saturday, July 14
https://www.meetup.com/WeCloudData/events/252129469/

GUATEMALA

Intro to Big Data: Apache Kafka (Guatemala) - Tuesday, July 10
https://www.meetup.com/Big-Data-Guatemala/events/251744452/

UNITED KINGDOM

Big Data Bristol Evening Session (Bristol) - Wednesday, July 11
https://www.meetup.com/BigDataBristol/events/250574564/

ROMANIA

JavaScript Everywhere + Kafka: Bringing Unicorn Technology to the Enterprise (Bucharest) - Wednesday, July 11
https://www.meetup.com/Bucharest-A-D-C-E-S-Meetup/events/251636066/

RUSSIA

Moscow Apache Ignite Meetup #3 (Moscow) - Tuesday, July 10
https://www.meetup.com/Moscow-Apache-Ignite-Meetup/events/252272163/

AUSTRALIA

ML & Analytics Optimized by Cloudera Altus (Melbourne) - Wednesday, July 11
https://www.meetup.com/Melbourne-Azure-Nights/events/246957514/