17 June 2018
Companies have shared lots of great posts this week: Pandora's web UI for Kafka, metadata management at Netflix, GraphQL at Airbnb, robust data pipelines at Dataxu, and fronting Kafka at GO-JEK. There's also coverage of the new YARN long-running application scheduler, a high-performance single-server stream processing engine, and a recap of the recent Spark + AI Summit.
Dremio is an open source Data-as-a-Service platform, based on SQL and Apache Arrow. Accelerate your queries up to 1000x. Self-service experience for BI and data science users. Download at https://bit.ly/2rHK6iw, or visit dremio.com to learn more.
Airbnb has written about their experiences implementing GraphQL as an API gateway atop Apache Thrift services. The post has a good mix of technical (their architecture, including Thrift/GraphQL translators) and non-technical (how to frame the conversation and seek compromise) topics.
https://medium.com/airbnb-engineering/reconciling-graphql-and-thrift-at-airbnb-a97e8d290712
Originally in Chinese, this post analyzes a recent exploit of unsecured Apache Hadoop YARN clusters that was used for cryptocurrency mining. It also outlines how to secure a cluster with publicly accessible endpoints.
Amazon DynamoDB has a change data capture feature called DynamoDB Streams. It integrates easily with AWS Lambda for real-time processing. This article explains how to use these features to compute real-time aggregates. There's a good discussion of how to tune the system for correctness, error handling, and throughput.
https://medium.com/signiant-engineering/real-time-aggregation-with-dynamodb-streams-f93547cfb244
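As a rough illustration of the pattern (not the code from the article), a Lambda function receives a batch of stream records and can fold them into per-key totals. The attribute names `customer_id` and `bytes` below are hypothetical, and the sketch only counts INSERT events; a real function would also persist the deltas, for example with an atomic update on an aggregate table.

```python
from collections import defaultdict
from decimal import Decimal


def aggregate_batch(event):
    """Sum the hypothetical `bytes` attribute per `customer_id` for one batch."""
    totals = defaultdict(Decimal)  # Decimal() starts each key at 0
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # this sketch ignores MODIFY and REMOVE events
        # Stream records carry typed attribute values, e.g. {"N": "10"}.
        image = record["dynamodb"]["NewImage"]
        customer = image["customer_id"]["S"]
        totals[customer] += Decimal(image["bytes"]["N"])
    return dict(totals)


def handler(event, context):
    # Lambda entry point: returns the per-customer deltas for this batch.
    return {key: str(value) for key, value in aggregate_batch(event).items()}
```

Because Lambda may retry a batch after a failure, the write that applies these deltas needs to be idempotent or otherwise tolerant of replays, which is the kind of correctness tuning the article discusses.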
It can be a challenge to share large research and government data sets (think atmospheric or satellite data). To make this type of data accessible, this post proposes that organizations "Place your Big Data in cloud object storage in a self-describing, cloud-optimized format." It goes into more detail about the challenges (and some solutions) that are unique to adopting that practice for these types of data.
https://medium.com/pangeo/step-by-step-guide-to-building-a-big-data-portal-e262af1c2977
Dataxu shares their solution to data synchronization—handing off data from one step in the pipeline to the next. Rather than relying on file system paths, they have a centralized "file feed" protocol that provides a number of benefits.
https://medium.com/dataxutech/synchronizing-data-pipelines-93443b501a4a
This post compares SABER, a single-server stream processing engine, to Apache Flink and Apache Spark. With modest hardware (20 cores, 32GB RAM), SABER outperforms a 5-node cluster of each. In some ways, this post is reminiscent of the "CLI tools are 235x faster than Hadoop" thread from a few years back.
https://lsds.doc.ic.ac.uk/blog/do-we-need-distributed-stream-processing
Qubole has a post about their new query optimizer feature that estimates the total amount of memory needed for a Presto query. There are details on the design and correctness results from the TPC-DS benchmark.
https://www.qubole.com/blog/memory-cost-model-qubole-presto/
Many organizations design microservices so that they each use their own data store to avoid the drawbacks of a multitenant database system. This post describes how Kafka as an event store is an interesting alternative architecture.
https://www.oreilly.com/ideas/microservices-events-and-upside-down-databases
The Morning Paper has coverage of the Medea scheduler, which implements scheduling for long-running applications atop Apache Hadoop YARN. Medea offers constraints like anti-affinity (to keep HBase region servers on separate nodes), global optimizations, and more. The authors compare it to other schedulers like Hadoop YARN's previous scheduler and a Java version of the Kubernetes scheduling algorithm. Medea is in use at Microsoft and is part of the Apache Hadoop 3.1.0 release (YARN-6592).
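To make the anti-affinity idea concrete, here is a toy greedy placement that never puts two containers with the same tag on one node. This is only a sketch of the constraint itself, not Medea's algorithm or the YARN API; the function name, tags, and node names are all hypothetical.

```python
def place_with_anti_affinity(containers, nodes):
    """Assign each (name, tag) container to the first node without that tag."""
    tags_on_node = {node: set() for node in nodes}
    placement = {}
    for name, tag in containers:
        for node in nodes:
            if tag not in tags_on_node[node]:
                tags_on_node[node].add(tag)
                placement[name] = node
                break
        else:
            # No node is free of this tag, so the constraint cannot be met.
            raise ValueError(f"no node satisfies anti-affinity for {name}")
    return placement
```

For example, two containers tagged `hbase-rs` on nodes `n1` and `n2` land on different nodes, while a third container with that tag would raise an error.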
The GO-JEK team uses a fronting REST service for ingesting data into Kafka. That service in turn writes data to a fronting Kafka cluster, or it fails over to Redis if Kafka is down. This post explains more about the motivation and architecture.
https://blog.gojekengineering.com/kafka-4066a4ea8d0d
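The failover idea can be sketched in a few lines. This is a minimal illustration, not GO-JEK's implementation: the two sender callables are hypothetical stand-ins for a Kafka producer and a Redis client, and a real service would also need to replay the fallback queue once Kafka recovers.

```python
from typing import Callable


def ingest(message: bytes,
           send_to_kafka: Callable[[bytes], None],
           send_to_fallback: Callable[[bytes], None]) -> str:
    """Try the primary sink first; on any failure, write to the fallback."""
    try:
        send_to_kafka(message)
        return "kafka"
    except Exception:
        # Assumption: any exception from the primary sender means Kafka is down.
        send_to_fallback(message)
        return "fallback"
```

Passing the senders in as callables keeps the sketch independent of any particular client library and makes the failover path easy to exercise in a test.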
The Apache Hadoop YARN Service Framework makes it quite easy to deploy a long-lived application to Hadoop via a single Yarnfile definition. The Hortonworks blog has a brief overview of what it takes to migrate Apache Hive LLAP from Apache Slider to use the YARN Service Framework.
https://hortonworks.com/blog/apache-hive-llap-as-a-yarn-service/
This post introduces Metacat, Netflix's tool for data discovery, programmatic dataset metadata access, and more. It is a proxy to other backends (such as the Hive metastore), and it provides advanced features via an Elasticsearch index. Metacat is open sourced on GitHub.
Trovit (Barcelona) and Comcast (Philadelphia) are both hiring engineers. Check out their posts and add your own!
https://jobs.dataengweekly.com
This post has a great overview of the main themes from the recent Spark + AI Summit as well as brief recaps of a few presentations.
https://medium.com/@szelvenskiy/spark-ai-summit-2018-overview-7c5a8d7be296
This list of distributed systems papers has been updated with some new content from the past 4 years. If you're interested in learning the fundamentals of distributed systems theory, it's a great place to start.
Dremio has announced a new initiative to bring LLVM support to Apache Arrow. They are targeting up to 100x speedups.
Dataworks Summit is this week in San Jose. Here's a preview of some of the talks.
https://hortonworks.com/blog/explore-latest-apache-hadoop-yarn-dataworks-summit-san-jose-2018/
Pandora has open sourced KBrowse, a web UI and search tool for Apache Kafka. This post walks through how they use KBrowse at Pandora to debug issues with new content.
https://engineering.pandora.com/kbrowse-kafka-search-d6ddb85a5961
Apache Crail 1.0-incubating was released. Crail is a distributed storage engine that's optimized for high-performance networking and storage with hooks for data processing frameworks.
Apache Phoenix 4.14 was released. It adds support for HBase 1.4 and several CDH versions (in addition to many previous ones), resolves lots of bugs, supports GRANT and REVOKE, and more.
https://blogs.apache.org/phoenix/entry/announcing-phoenix-4-14-released
Version 0.3 of the Kafka Security Manager is out. It adds a gRPC/REST gateway service, a read-only mode, and support for Confluent 1.1.0.
https://github.com/simplesteph/kafka-security-manager/releases/tag/v0.3
Curated by Datadog ( http://www.datadog.com )
Apache NiFi @ DataWorks Summit (San Jose) - Monday, June 18
https://www.meetup.com/ApacheNiFi/events/250545732/
Apache Ambari 2.7 and Beyond Updates (San Jose) - Monday, June 18
https://www.meetup.com/Apache-Ambari-User-Group/events/250859836/
Deep Dive into Apache Metron and Big Data Security (San Jose) - Monday, June 18
https://www.meetup.com/siliconvalleysecurity/events/251005670/
All Things Spark: Machine Learning, Atlas Integration, ORC & Hive EDW Updates (San Jose) - Monday, June 18
https://www.meetup.com/futureofdata-siliconvalley/events/250805909/
Kafka and Microservices: Insights from Uber and Confluent (Mountain View) - Tuesday, June 19
https://www.meetup.com/microservices-apis-integration-meetup/events/251660189/
Birds of a Feather Sessions @ DataWorks Summit (San Jose) - Wednesday, June 20
https://www.meetup.com/futureofdata-siliconvalley/events/250629232/
Stream Processing Double Presentation (New York) - Thursday, June 21
https://www.meetup.com/mysqlnyc/events/251296530/
Intro to Spark Training (Framingham) - Saturday, June 23
https://www.meetup.com/Boston-Apache-Spark-User-Group/events/250581211/
Software Engineering with Spark (Herzogenaurach) - Monday, June 18
https://www.meetup.com/Nuernberg-Big-Data/events/251300799/
Let's Talk about Azure Databricks & Apache Spark! (Karlsruhe) - Tuesday, June 19
https://www.meetup.com/inovex-karlsruhe/events/251040178/
PyData Cyprus Meetup #4 (Limassol) - Thursday, June 21
https://www.meetup.com/PyData-Cyprus/events/250888229/
Kafka and Stream Processing Meetup at LinkedIn (Bangalore) - Saturday, June 23
https://www.meetup.com/Bangalore-Apache-Kafka-Group/events/251707854/
Sydney Data Engineering Meetup (Sydney) - Wednesday, June 20
https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/250794750/