Data Eng Weekly Issue #269

17 June 2018

Companies have shared lots of great posts this week: Pandora's web UI for Kafka, metadata management at Netflix, GraphQL at Airbnb, robust data pipelines at DataXu, and fronting Kafka at GO-JEK. There's also coverage of the new YARN long-running application scheduler, a high-performance single-server stream processing engine, and a recap of the recent Spark + AI Summit.

Sponsor

Dremio is an open source Data-as-a-Service platform, based on SQL and Apache Arrow. Accelerate your queries up to 1000x. Self-service experience for BI and data science users. Download at https://bit.ly/2rHK6iw, or visit dremio.com to learn more.

Technical

Airbnb has written about their experience implementing GraphQL as an API gateway atop Apache Thrift services. The post has a good mix of technical topics (their architecture, including Thrift/GraphQL translators) and non-technical ones (how to frame the conversation and seek compromise).

https://medium.com/airbnb-engineering/reconciling-graphql-and-thrift-at-airbnb-a97e8d290712

Originally in Chinese, this post analyzes a recent exploit of unsecured Apache Hadoop YARN clusters that was used for cryptocurrency mining. It also outlines how to secure a cluster with publicly accessible endpoints.

https://www.microsofttranslator.com/bv.aspx?from=&to=en&a=https%3A%2F%2Fcloud.tencent.com%2Fdeveloper%2Farticle%2F1142503

Amazon DynamoDB has a change data capture feature called DynamoDB Streams. It integrates easily with AWS Lambda for real-time processing. This article explains how to use these features to compute real-time aggregates. There's a good discussion of how to tune the system for correctness, for error handling, and for higher throughput.

https://medium.com/signiant-engineering/real-time-aggregation-with-dynamodb-streams-f93547cfb244
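
As a rough illustration of the aggregation idea (not the article's code), here is a minimal Python sketch of a Lambda-style function that folds a DynamoDB Streams batch into running totals. The record layout (Records, eventName, dynamodb.NewImage with typed attribute values) follows the stream event format. The attribute names customer_id and bytes, the helper make_record, and the in-memory totals are assumptions made purely for illustration.

```python
from collections import defaultdict


def aggregate_records(event, totals=None):
    """Fold INSERT records from a DynamoDB Streams event into per-key sums."""
    totals = defaultdict(int) if totals is None else totals
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # this sketch ignores MODIFY and REMOVE events
        image = record["dynamodb"]["NewImage"]
        # Stream images carry typed attribute values, e.g. {"N": "10"}.
        key = image["customer_id"]["S"]          # hypothetical attribute name
        totals[key] += int(image["bytes"]["N"])  # hypothetical attribute name
    return totals


def make_record(event_name, customer_id, size):
    """Build a stream record in the documented shape (illustration only)."""
    return {
        "eventName": event_name,
        "dynamodb": {
            "NewImage": {
                "customer_id": {"S": customer_id},
                "bytes": {"N": str(size)},
            }
        },
    }


sample_event = {
    "Records": [
        make_record("INSERT", "a", 10),
        make_record("INSERT", "b", 5),
        make_record("MODIFY", "a", 99),  # skipped by the sketch
        make_record("INSERT", "a", 20),
    ]
}
totals = aggregate_records(sample_event)
```

A real handler would write the totals back to a table rather than keep them in memory. Because a failed batch can be retried, additive updates like this one can double count unless they are made idempotent.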

It can be a challenge to share large research and government data sets (think atmospheric or satellite data). To make this type of data accessible, this post proposes that organizations "Place your Big Data in cloud object storage in a self-describing, cloud-optimized format." It goes into more detail about the challenges (and some solutions) that are unique to these types of data when adopting that practice.

https://medium.com/pangeo/step-by-step-guide-to-building-a-big-data-portal-e262af1c2977

Dataxu shares their solution to data synchronization—handing off data from one step in the pipeline to the next. Rather than relying on file system paths, they have a centralized "file feed" protocol that provides a number of benefits.

https://medium.com/dataxutech/synchronizing-data-pipelines-93443b501a4a

This post compares SABER, a single-server stream processing engine, to Apache Flink and Apache Spark. With modest hardware (20 cores, 32GB RAM), SABER outperforms a 5-node cluster of each. In some ways, this post is reminiscent of the "CLI tools are 235x faster than Hadoop" thread from a few years back.

https://lsds.doc.ic.ac.uk/blog/do-we-need-distributed-stream-processing

Qubole has a post about their new query optimizer feature that estimates the total amount of memory needed for a Presto query. There are details on the design and correctness results from the TPC-DS benchmark.

https://www.qubole.com/blog/memory-cost-model-qubole-presto/

Many organizations design microservices so that they each use their own data store to avoid the drawbacks of a multitenant database system. This post describes how Kafka as an event store is an interesting alternative architecture.

https://www.oreilly.com/ideas/microservices-events-and-upside-down-databases

The Morning Paper has coverage of the Medea scheduler, which implements scheduling for long-running applications atop Apache Hadoop YARN. Medea offers constraints like anti-affinity (to keep HBase region servers on separate nodes), global optimizations, and more. The authors compare it to other schedulers like Hadoop YARN's previous scheduler and a Java version of the Kubernetes scheduling algorithm. Medea is in use at Microsoft and is part of the Apache Hadoop 3.1.0 release (YARN-6592).

https://blog.acolyer.org/2018/06/13/medea-scheduling-of-long-running-applications-in-shared-production-clusters/
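
To make the anti-affinity constraint concrete, here is a small Python sketch. It is a naive first-fit check, not Medea's scheduling algorithm, and every name in it (the function, the tags, the node names) is hypothetical. It only shows what the constraint means: no two containers with the same tag land on the same node.

```python
def place_with_anti_affinity(containers, nodes):
    """Assign each (name, tag) container to the first node that does not
    already host a container with the same tag.

    Naive first-fit illustration of an anti-affinity constraint only.
    """
    tags_on_node = {node: set() for node in nodes}
    placement = {}
    for name, tag in containers:
        for node in nodes:
            if tag not in tags_on_node[node]:
                tags_on_node[node].add(tag)
                placement[name] = node
                break
        else:
            # No node is free of this tag, so the constraint cannot be met.
            raise ValueError(f"no node satisfies anti-affinity for {name}")
    return placement


containers = [("rs1", "hbase-rs"), ("rs2", "hbase-rs"), ("web1", "web")]
placement = place_with_anti_affinity(containers, ["n1", "n2"])
```

With two nodes, the two region servers end up on different nodes, and a third container with the same tag would be rejected.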

The GO-JEK team uses a fronting REST service for ingesting data into Kafka. That service in turn writes data to a fronting Kafka cluster, or it fails over to Redis if Kafka is down. This post explains more about the motivation and architecture.

https://blog.gojekengineering.com/kafka-4066a4ea8d0d
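
The failover behavior described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not GO-JEK's implementation: FrontingIngest and FakeSink are hypothetical names, and the in-memory sinks stand in for the real Kafka producer and Redis client.

```python
class FrontingIngest:
    """Send each message to a primary sink; fall back to a secondary sink
    when the primary is unavailable."""

    def __init__(self, primary_send, fallback_send):
        self.primary_send = primary_send
        self.fallback_send = fallback_send

    def ingest(self, message):
        try:
            self.primary_send(message)
            return "primary"
        except ConnectionError:
            # Primary is down: buffer the message in the fallback store.
            self.fallback_send(message)
            return "fallback"


class FakeSink:
    """In-memory stand-in for a Kafka producer or Redis client."""

    def __init__(self):
        self.up = True
        self.messages = []

    def send(self, message):
        if not self.up:
            raise ConnectionError("sink unavailable")
        self.messages.append(message)


kafka = FakeSink()
redis = FakeSink()
service = FrontingIngest(kafka.send, redis.send)

first = service.ingest({"id": 1})   # primary is up
kafka.up = False                    # simulate an outage
second = service.ingest({"id": 2})  # lands in the fallback
```

A production version would also need a way to drain the fallback buffer back into Kafka once it recovers, which this sketch leaves out.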

The Apache Hadoop YARN Service Framework makes it quite easy to deploy a long-lived application to Hadoop via a single Yarnfile definition. The Hortonworks blog has a brief overview of what it takes to migrate Apache Hive LLAP from Apache Slider to use the YARN Service Framework.

https://hortonworks.com/blog/apache-hive-llap-as-a-yarn-service/

This post introduces Metacat, Netflix's tool for data discovery, programmatic dataset metadata access, and more. It is a proxy to other backends (such as the Hive metastore), and it provides advanced features via an Elasticsearch index. Metacat is open source on GitHub.

https://medium.com/netflix-techblog/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520

Data Eng Jobs

Trovit (Barcelona) and Comcast (Philadelphia) are both hiring engineers. Check out their posts and add your own!

https://jobs.dataengweekly.com

News

This post has a great overview of the main themes from the recent Spark + AI Summit as well as brief recaps of a few presentations.

https://medium.com/@szelvenskiy/spark-ai-summit-2018-overview-7c5a8d7be296

This list of distributed systems papers has been updated with some new content from the past 4 years. If you're interested in learning the fundamentals of distributed systems theory, it's a great place to start.

http://www.the-paper-trail.org/post/2014-08-09-distributed-systems-theory-for-the-distributed-systems-engineer/

Dremio has announced a new initiative to bring LLVM support to Apache Arrow. They are targeting up to 100x speedups.

https://www.businesswire.com/news/home/20180614005638/en/Dremio-Announces-Gandiva-Initiative-Apache-Arrow

DataWorks Summit is this week in San Jose. Here's a preview of some of the talks.

https://hortonworks.com/blog/explore-latest-apache-hadoop-yarn-dataworks-summit-san-jose-2018/

Releases

Pandora has open sourced KBrowse, a web UI and search tool for Apache Kafka. This post walks through how they use KBrowse at Pandora to debug issues with new content.

https://engineering.pandora.com/kbrowse-kafka-search-d6ddb85a5961

Apache Crail 1.0-incubating was released. Crail is a distributed storage engine that's optimized for high-performance networking and storage with hooks for data processing frameworks.

https://lists.apache.org/thread.html/9f01a0a72abdd16b3e67ab1559158dd14465d9c05f1c510e7ea432e4@%3Cannounce.apache.org%3E

Apache Phoenix 4.14 was released. It adds support for HBase 1.4 and several CDH versions (in addition to many previous ones), resolves lots of bugs, supports GRANT and REVOKE, and more.

https://blogs.apache.org/phoenix/entry/announcing-phoenix-4-14-released

Version 0.3 of the Kafka Security Manager is out. It adds a gRPC/REST gateway service, a read-only mode, and support for Confluent 1.1.0.

https://github.com/simplesteph/kafka-security-manager/releases/tag/v0.3

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Apache NiFi @ DataWorks Summit (San Jose) - Monday, June 18
https://www.meetup.com/ApacheNiFi/events/250545732/

Apache Ambari 2.7 and Beyond Updates (San Jose) - Monday, June 18
https://www.meetup.com/Apache-Ambari-User-Group/events/250859836/

Deep Dive into Apache Metron and Big Data Security (San Jose) - Monday, June 18
https://www.meetup.com/siliconvalleysecurity/events/251005670/

All Things Spark: Machine Learning, Atlas Integration, ORC & Hive EDW Updates (San Jose) - Monday, June 18
https://www.meetup.com/futureofdata-siliconvalley/events/250805909/

Kafka and Microservices: Insights from Uber and Confluent (Mountain View) - Tuesday, June 19
https://www.meetup.com/microservices-apis-integration-meetup/events/251660189/

Birds of a Feather Sessions @ DataWorks Summit (San Jose) - Wednesday, June 20
https://www.meetup.com/futureofdata-siliconvalley/events/250629232/

New York

Stream Processing Double Presentation (New York) - Thursday, June 21
https://www.meetup.com/mysqlnyc/events/251296530/

Massachusetts

Intro to Spark Training (Framingham) - Saturday, June 23
https://www.meetup.com/Boston-Apache-Spark-User-Group/events/250581211/

GERMANY

Software Engineering with Spark (Herzogenaurach) - Monday, June 18
https://www.meetup.com/Nuernberg-Big-Data/events/251300799/

Let's Talk about Azure Databricks & Apache Spark! (Karlsruhe) - Tuesday, June 19
https://www.meetup.com/inovex-karlsruhe/events/251040178/

CYPRUS

PyData Cyprus Meetup #4 (Limassol) - Thursday, June 21
https://www.meetup.com/PyData-Cyprus/events/250888229/

INDIA

Kafka and Stream Processing Meetup at LinkedIn (Bangalore) - Saturday, June 23
https://www.meetup.com/Bangalore-Apache-Kafka-Group/events/251707854/

AUSTRALIA

Sydney Data Engineering Meetup (Sydney) - Wednesday, June 20
https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/250794750/