Data Eng Weekly


Data Eng Weekly Issue #311

12 May 2019

Lots of great stuff in this week's issue ranging from efficient processing of files in Java to a deep dive into Paxos. There's coverage of Apache Spark, Apache HDFS, and Apache Kafka as well as Dropbox's cold storage system and Twitter's hybrid cloud architecture for big data.

News

A fascinating article about the design and architecture of Dropbox's cold storage solution, which replicates data across regions to provide high durability. They've built a clever system that reduces storage requirements, and it even provides better 99th-percentile latency than their warm storage. There are several nuggets in here related to testing and deploying large-scale distributed systems.

https://blogs.dropbox.com/tech/2019/05/how-we-optimized-magic-pocket-for-cold-storage/
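
To make the core idea concrete—storing fragments in different regions such that a surviving subset can reconstruct the data—here's a minimal sketch (my own illustration, not Dropbox's design or code) using two halves plus an XOR parity fragment, where any two of the three fragments recover the original:

```python
def split_with_parity(blob: bytes) -> list:
    """Split a blob into two halves plus an XOR parity fragment."""
    half = (len(blob) + 1) // 2
    a = blob[:half]
    b = blob[half:].ljust(half, b"\0")  # pad so both halves match in length
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]

def reconstruct(fragments, original_len: int) -> bytes:
    """Recover the blob from any two of the three fragments (one may be None)."""
    a, b, p = fragments
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, p))
    if b is None:
        b = bytes(x ^ y for x, y in zip(a, p))
    return (a + b)[:original_len]

blob = b"cold storage example payload"
frags = split_with_parity(blob)
# Simulate losing any one "region" and recovering from the other two:
for lost in range(3):
    available = [f if i != lost else None for i, f in enumerate(frags)]
    assert reconstruct(available, len(blob)) == blob
```

Real erasure coding schemes use more fragments and tolerate more failures, but the durability-per-byte trade-off is the same shape.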

This three-part post looks at a technical report covering distributed consensus with Paxos. The posts serve as a good high-level summary as well as an index into the text (which comes in at nearly 150 pages).

https://blog.acolyer.org/2019/05/07/distributed-consensus-revised-part-i/
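
For intuition about the core protocol the report builds on, here's a toy single-decree Paxos round in Python (a sketch for learning, not a production implementation—no networking, failures, or retries):

```python
class Acceptor:
    def __init__(self):
        self.promised = -1    # highest proposal number promised
        self.accepted = None  # (number, value) last accepted, if any

    def prepare(self, n):
        """Phase 1b: promise not to accept proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        """Phase 2b: accept unless a higher-numbered proposal was promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, value):
    """Run both phases; return the chosen value, or None on failure."""
    # Phase 1: gather promises from a majority.
    promises = [a.prepare(n) for a in acceptors]
    granted = [acc for ok, acc in promises if ok]
    if len(granted) <= len(acceptors) // 2:
        return None
    # If any acceptor already accepted a value, we must re-propose it.
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]  # value of the highest-numbered acceptance
    # Phase 2: ask acceptors to accept (n, value); need a majority.
    accepts = sum(a.accept(n, value) for a in acceptors)
    return value if accepts > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(3)]
assert propose(acceptors, 1, "A") == "A"
# A later proposer with a different value learns of "A" and re-proposes it:
assert propose(acceptors, 2, "B") == "A"
```

The second assertion is the heart of Paxos safety: once a value is chosen, later proposals can only re-choose it.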

This post describes three common problems related to managing partitions in HDFS (or similar) storage: merging partitions (to generate optimally sized files), archiving cold data, and deleting old data. There's a good description of some mechanisms (including at the Hive/Impala metadata level) for merging partitions without downtime. As the post discusses, you have to take care to prevent a user query from reading duplicate data during the merge operation.

https://blog.cloudera.com/blog/2019/05/partition-management-in-hadoop/
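
The "no downtime" pattern boils down to: write the compacted output to a staging location, then swap it in so a query never sees old and new copies at once. Here's a plain-filesystem sketch of that shape (my own illustration—in Hive/Impala you'd instead repoint the partition's location in a single metastore operation, which is what makes the swap atomic for queries):

```python
import os
import shutil

def compact_partition(partition_dir: str) -> None:
    """Merge all files in a partition into one, then swap directories."""
    # Write the merged output to a staging directory alongside the partition.
    staging = partition_dir + ".staging"
    os.mkdir(staging)
    with open(os.path.join(staging, "part-00000"), "wb") as out:
        for name in sorted(os.listdir(partition_dir)):
            with open(os.path.join(partition_dir, name), "rb") as f:
                shutil.copyfileobj(f, out)
    # Swap: readers see either the old layout or the compacted one.
    old = partition_dir + ".old"
    os.rename(partition_dir, old)
    os.rename(staging, partition_dir)
    shutil.rmtree(old)  # clean up the superseded small files
```

The two renames aren't a single atomic step here, which is precisely why the metastore-level approach the post describes is preferable for production tables.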

The Apache Kafka Consumer APIs provide an easy-to-use, high-level interface that hides the complexity of the protocol. Even if you're a happy user of those high-level APIs, it's a good idea to understand the implementation (especially when things go wrong). This post provides a great introduction and overview of the Kafka Consumer protocol, including how consumers are assigned partitions, how clients maintain commit offsets, and how fault tolerance is achieved via consumer group rebalancing.

https://www.confluent.io/blog/apache-kafka-data-access-semantics-consumers-and-membership
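
As a taste of the partition-assignment piece, Kafka's default range assignor works roughly like this per topic—contiguous ranges, with the first consumers (in sorted member order) taking the leftover partitions (a simplified sketch, not the client's actual code):

```python
def range_assign(consumers, num_partitions):
    """Assign partitions 0..n-1 to consumers in contiguous ranges."""
    consumers = sorted(consumers)  # assignment depends on member ordering
    per, extra = divmod(num_partitions, len(consumers))
    assignment, start = {}, 0
    for i, c in enumerate(consumers):
        count = per + (1 if i < extra else 0)  # first `extra` get one more
        assignment[c] = list(range(start, start + count))
        start += count
    return assignment

# Three consumers sharing a 7-partition topic:
print(range_assign(["c1", "c2", "c3"], 7))
# → {'c1': [0, 1, 2], 'c2': [3, 4], 'c3': [5, 6]}
```

This also shows why, with the range assignor, subscribing to many topics with the same consumer count can skew load onto the first members of the group.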

spark-alchemy is a library that implements HyperLogLog (HLL) functions for Apache Spark. This post looks at how HLL is used to estimate the cardinality of a dataset, including its nice properties for incremental computation. spark-alchemy uses a serialization format that's compatible with the postgresql-hll extension and other HLL implementations.

https://databricks.com/blog/2019/05/08/advanced-analytics-with-apache-spark.html
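
To make the cardinality-estimation idea concrete, here's a minimal HyperLogLog in pure Python (a sketch of the algorithm itself, not spark-alchemy's implementation or its serialization format):

```python
import hashlib
import math

class HLL:
    def __init__(self, p: int = 12):
        self.p = p            # precision: 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)               # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "HLL") -> None:
        # The property that makes HLL great for incremental aggregation:
        # merging two sketches is just a register-wise max.
        self.registers = [max(a, b) for a, b in
                          zip(self.registers, other.registers)]

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        if raw <= 2.5 * self.m:  # small-range correction (linear counting)
            zeros = self.registers.count(0)
            if zeros:
                return self.m * math.log(self.m / zeros)
        return raw

hll = HLL()
for i in range(100_000):
    hll.add("user-%d" % i)
print(hll.estimate())  # approximate: typically within a few percent of 100000
```

The `merge` method is why HLL sketches compose so well across partitions and pre-aggregated rollups: partial sketches can be combined without re-reading raw data.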

One of the goals of merging partitions in HDFS, as discussed above, is to consolidate small files. This post looks at the overhead of storing inodes for small files (including how they impact the Impala metadata cache), common sources of so many files, how to find them, and some tools/mechanisms to reduce their numbers.

https://blog.cloudera.com/blog/2019/05/small-files-big-foils-addressing-the-associated-metadata-and-application-challenges/
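
Some back-of-the-envelope arithmetic on why this matters: the NameNode keeps an object per file and per block on its heap, commonly approximated at ~150 bytes each (a rough rule of thumb, not an exact figure):

```python
def namenode_heap_estimate(num_files: int, blocks_per_file: int = 1,
                           bytes_per_object: int = 150) -> int:
    """Rough NameNode heap usage: one object per file plus one per block."""
    return num_files * (1 + blocks_per_file) * bytes_per_object

# 100M single-block small files vs the same data packed into
# 1M large files of 100 blocks each:
small = namenode_heap_estimate(100_000_000, blocks_per_file=1)
large = namenode_heap_estimate(1_000_000, blocks_per_file=100)
print(small / 2**30, large / 2**30)  # roughly 28 GB vs 14 GB of heap
```

Same data volume, half the heap—and the small-file case also pays per-file costs in RPC load and the Impala catalog.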

I always enjoy a good distributed systems post-mortem/debugging story; it's useful to learn from folks and understand how they discovered (and will fix!) issues. In this case, there was a perfect storm of slow database queries plus unexpected behavior in a caching library and Go's SQL connection pooler.

https://www.honeycomb.io/blog/anatomy-of-a-cascading-failure/

While this post isn't exactly related to distributed systems, it looks at high performance file reading in golang and Java (which are two popular languages for implementing distributed systems). The author has to pull out a number of tricks in Java to get good performance—it's not that often that operations on your in-memory queue are the bottleneck you need to optimize (via batching in this case).

https://boyter.org/posts/file-read-challange/
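
The recurring trick in these challenges is amortizing per-operation overhead. As a small illustration of the same theme in Python (my own sketch, unrelated to the post's code): counting newlines by scanning large chunks, rather than iterating line by line, pays the call overhead once per megabyte instead of once per line:

```python
def count_lines_chunked(path: str, chunk_size: int = 1 << 20) -> int:
    """Count newlines by scanning megabyte-sized chunks of the file."""
    count = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            count += chunk.count(b"\n")  # bulk scan inside each chunk
    return count
```

The batching-the-queue optimization in the article is the same principle applied between threads: hand off buffers of work, not individual items.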

Twitter is migrating lots of data (we're talking 100s of PBs and 800Gbps) to Google Cloud Storage from their on-prem clusters for their "partly cloudy architecture." They're using dedicated Hadoop clusters (with access to the internet) to copy data, and they are running both Presto and Hadoop clusters in Google Cloud for analysis.

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/partly-cloudy-architecture.html

EdgeDB has its own query language, EdgeQL, that addresses a number of problems with SQL. While the new query language is interesting in itself, the post starts with a great critique of SQL. The sections on composability and NULL handling in SQL particularly resonated with me.

https://edgedb.com/blog/we-can-do-better-than-sql/
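
The NULL critique is easy to demonstrate yourself: under SQL's three-valued logic, a NULL row satisfies neither a predicate nor its negation, so rows can silently vanish from both sides of a condition. Using Python's built-in sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

eq = con.execute("SELECT COUNT(*) FROM t WHERE x = 1").fetchone()[0]
ne = con.execute("SELECT COUNT(*) FROM t WHERE x <> 1").fetchone()[0]
print(eq, ne, eq + ne)  # → 1 1 2: three rows, but the NULL row matches neither
```

That asymmetry—`x = 1 OR x <> 1` not covering all rows—is exactly the kind of footgun the post argues a better-designed language can eliminate.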

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.