19 August 2018
Qubole and Datadog open sourced new tools this week for Spark and Kafka, respectively. On the technical side, there are great articles from Pandora, Netflix, Instacart, JW Player, and Rezdy about how they're solving data challenges. A couple of technical deep dives and tutorials on KSQL UDFs and Airflow testing round things out.
From the creators of Apache Arrow, Dremio is an open source Data-as-a-Service platform. Accelerate your queries (up to 1,000x!) and make data truly self-service for your BI and data science users.
Visit https://bit.ly/about-dremio to learn more, or download for free.
Confluent writes about building a User Defined (Aggregate) Function for KSQL, which is a new feature of their latest release. The post contains example code and anticipates some development challenges and their solutions.
https://www.confluent.io/blog/build-udf-udaf-ksql-5-0
LocusDB is an experimental analytics database written in Rust and built on RocksDB. This post describes the internals of the system, including the columnar encoding and compression tricks that enable its impressive throughput.
Pandora has written about using MemSQL as their analytics database. The post covers the goals of their analytics db (which replaced Hadoop), some of the tools they evaluated, the data design, and the system configuration (e.g. using RAID 10). MemSQL supports both columnar and row-based storage, and there's an interesting discussion of the tradeoffs to consider when deciding how to store a dataset.
https://engineering.pandora.com/using-memsql-at-pandora-79a86cb09b57
The Netflix Data team supports Jupyter notebooks as a first-class component of their data pipeline. Notebooks are scheduled by data scientists, data engineers, software engineers, and data analysts. In this post, they describe how they support these use cases and the infrastructure that powers them.
https://medium.com/netflix-techblog/notebook-innovation-591ee3221233
Scylla is an open-source (AGPL) distributed database with Apache Cassandra compatibility. It's a different codebase, written in C++, and thus there are some architectural differences. This post describes the internals of its data replication (or streaming) and how they plan to improve the performance in an upcoming release.
https://www.scylladb.com/2018/08/14/upcoming-improvements-scylla-streaming/
This post describes how Instacart split up their databases for isolation and scalability. For coupled components, they implemented asynchronous data replication so that db joins still work. This new design also helped them find and eliminate places in which multiple components were writing to the same tables. The post describes the steps they took to roll out the new design and migrate data (including a nifty pgsync tool for data replication).
The Rezdy data team shares their core philosophies and the tools they're using to implement a new data platform. Lots of good tips for how to build a platform that serves an entire organization.
https://medium.com/rezdy-engineering/an-introduction-to-data-at-rezdy-53b12d9935f5
In this post, an Apache Impala power-user shares some pain points / missing features identified from running Impala at scale.
https://medium.com/@adirmashiach/5-main-missing-features-in-impala-imo-1343c767081f
The JW Player team, which analyzes hundreds of GBs of playback data per day, has written about their experiences tuning an expensive Spark job. The post has a good explanation of several tuning knobs, including shuffle partitions and broadcast joins.
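For readers new to the two knobs mentioned above, here is a minimal conceptual sketch in plain Python. It is not Spark's API and not JW Player's code; the function names and sample rows are hypothetical. It shows why a broadcast join avoids moving the large table (the small side becomes an in-memory lookup) and what a shuffle partition count controls (how many buckets keys are hashed into). In Spark itself, the corresponding settings are `spark.sql.autoBroadcastJoinThreshold` and `spark.sql.shuffle.partitions`.

```python
from collections import defaultdict
from typing import Any, Hashable, Iterable, Iterator, List, Tuple

Row = Tuple[Hashable, Any]


def broadcast_hash_join(large: Iterable[Row], small: Iterable[Row]) -> Iterator[Tuple[Hashable, Any, Any]]:
    """Inner join that copies the small side into a hash table.

    The large side is streamed once and never repartitioned, which is the
    idea behind a broadcast join (hypothetical helper, not a Spark call).
    """
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    for key, left in large:
        # .get() avoids creating empty entries for unmatched keys.
        for right in lookup.get(key, []):
            yield key, left, right


def shuffle_partition(rows: Iterable[Row], num_partitions: int) -> List[List[Row]]:
    """Hash rows into `num_partitions` buckets, as a shuffle would.

    Too few buckets means a handful of huge partitions; too many means
    lots of tiny tasks.
    """
    partitions: List[List[Row]] = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


if __name__ == "__main__":
    # Hypothetical sample data: (user_id, event) and (user_id, country).
    plays = [(1, "play"), (2, "pause"), (1, "seek"), (3, "play")]
    users = [(1, "US"), (2, "CA")]
    for joined in broadcast_hash_join(plays, users):
        print(joined)
    print([len(p) for p in shuffle_partition(plays, 4)])
```

The trade-off is memory: the small side must fit on every worker, which is why Spark only broadcasts tables below a size threshold.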
This post has a good overview of the features (and maturity of those features) of Azure Event Hubs. It also discusses some of the trade-offs vs. running your own Kafka cluster (Event Hubs supports the Kafka 1.x client protocol) and some performance tips.
https://medium.com/@yvescallaert/azure-event-hubs-the-good-the-bad-and-the-ugly-5b1120b8b9c2
A good primer on testing with Apache Airflow, this article covers testing DAG validity (e.g. no cycles), testing DAG definition (e.g. correct dependencies), and unit testing an operator.
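As a rough illustration of the first two kinds of test, here is a minimal sketch in plain Python. It does not use Airflow's API; the `find_cycle` helper, the dependency dict, and the task names are all hypothetical stand-ins for a real DAG object. A validity test asserts that the dependency graph has no cycles, and a definition test asserts that a task has the downstream tasks you expect.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of task ids, or None if the graph is acyclic.

    `deps` maps each task id to the task ids that run after it.
    Uses depth-first search with the usual three-state marking.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    for task, downstream in deps.items():
        state.setdefault(task, UNSEEN)
        for d in downstream:
            state.setdefault(d, UNSEEN)
    path: List[str] = []

    def visit(task: str) -> Optional[List[str]]:
        state[task] = IN_PROGRESS
        path.append(task)
        for nxt in deps.get(task, []):
            if state[nxt] == IN_PROGRESS:
                # Reached a task that is still on the current path: a cycle.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == UNSEEN:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[task] = DONE
        return None

    for task in list(state):
        if state[task] == UNSEEN:
            found = visit(task)
            if found:
                return found
    return None


# Hypothetical pipeline definition used by the tests below.
PIPELINE = {"extract": ["transform"], "transform": ["load"], "load": []}


def test_dag_is_valid() -> None:
    assert find_cycle(PIPELINE) is None


def test_dag_definition() -> None:
    assert PIPELINE["extract"] == ["transform"]
    assert PIPELINE["load"] == []


if __name__ == "__main__":
    test_dag_is_valid()
    test_dag_definition()
```

With a real Airflow project, the same assertions would be made against the loaded DAG objects rather than a hand-written dict.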
Astronomer helps organizations run Apache Airflow at scale. Easily deploy to a hosted service or private cloud with a full Kubernetes-based stack and ramp up your team with exclusive training and professional services.
Visit http://bit.ly/about-astronomer to learn more.
There's a new monthly Presto newsletter from the folks at Starburst. The first issue has a good collection of videos, posts, and presentations.
https://www.starburstdata.com/newsletter/presto-newsletter-1/
If you're looking for some good data systems papers to read, the August proceedings of the PVLDB are out. Among the papers are ones on building an Apache Beam runner for IBM Streams, data quality verification, streaming joins at Facebook, Google's F1 query engine, and Alibaba's distributed file system PolarFS.
http://www.vldb.org/pvldb/vol11.html
Mayo Clinic is looking for a Lead Hadoop System Admin based in Rochester, MN; Phoenix, AZ; or Jacksonville, FL.
https://jobs.dataengweekly.com/jobs/5c0e56d3-309f-4ff8-a3d4-640a64ff3bab
Submit a job to the Data Eng Weekly board at https://jobs.dataengweekly.com/
Qubole has open sourced Sparklens, their tool for profiling and predicting performance of Apache Spark jobs.
https://www.qubole.com/blog/sparklens-0-2-0-release-features-and-fixes/
Kafka-Kit is a new open source project from Datadog. It includes two tools: topicmappr, which is a rack-aware tool for reassigning partitions and updating replication factors, and autothrottle, which is a tool for preventing service degradation during a replication event by adjusting Kafka's replication throttle.
https://www.datadoghq.com/blog/engineering/introducing-kafka-kit-tools-for-scaling-kafka/
Version 0.5.7 of Scio, the Scala library for Apache Beam, has been released. This new version includes support for Beam 2.6.0 and other improvements and bug fixes.
https://github.com/spotify/scio/releases/tag/v0.5.7
FASTER is a new key-value store implementation from Microsoft with a design and architecture aimed at improving performance.
https://github.com/Microsoft/FASTER
Azure HDInsight has a new integration between Apache Phoenix and Apache Zeppelin.
Curated by Datadog ( http://www.datadog.com )
Big Data Ingest for Data Scientists (Phoenix) - Thursday, August 23
https://www.meetup.com/Data-Science-Phoenix/events/253737092/
Why's It Gotta Be Batch? + Lightning Talks! (Austin) - Tuesday, August 21
https://www.meetup.com/Austin-Apache-Kafka-Meetup-Stream-Data-Platform/events/253755551/
Gwen Shapira Talks Kafka and the Service Mesh (Chicago) - Tuesday, August 21
https://www.meetup.com/Chicago-Area-Kafka-Enthusiasts/events/253095489/
Streamline Streaming: Framework for Data Pipelines with Kafka (Chicago) - Wednesday, August 22
https://www.meetup.com/Women-Who-Code-Chicago/events/253660629/
Productionalizing Spark Streaming and Kafka Applications (Toronto) - Tuesday, August 21
https://www.meetup.com/tordatascience/events/253515454/
Cloudera Sessions Mexico 2018 (Mexico City) - Thursday, August 23
https://www.meetup.com/Meetup-de-Cloudera-en-Mexico/events/252940710/
Run Everything-as-a-Service Everywhere, with Mesosphere CTO Tobi Knaup (Sydney) - Tuesday, August 21
https://www.meetup.com/Sydney-Docker-User-Group/events/247969223/