02 June 2019
This week's issue has a couple of important non-technical posts on the principles of DataOps and Data Engineering. As usual, there's lots of great technical content, too: Qubole's HiveServer2 deployment, AWS' service log ingestion library, data analytics at Ada, and more.
Retina writes about their DataOps principles, which aim to empower individuals and teams working with data. These principles are great examples of best practices from both a cultural perspective (e.g. it's important to place data scientists and DataOps engineers on the same team) and a technical one (automation and tooling are key). Lots of great ideas for anyone working as part of a data team or building data-driven products.
https://retina.ai/blog/dataops-principles/
Qubole has written about their internal deployment of HiveServer2 (HS2), which has been customized with a proxy tier that enables horizontal scaling of HS2 workers. The load balancer, called Megamind, uses ZooKeeper to keep track of running HS2 workers. This approach seems to have lots of advantages, and something similar might make sense for other long-running job services.
https://www.qubole.com/blog/increase-scalability-of-hiveserver2/
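To make the proxy-tier idea concrete, here is a minimal sketch of a load balancer that picks from a registry of live workers. Everything in it is an assumption for illustration: the class names are hypothetical, an in-memory set stands in for ZooKeeper, and round-robin is just one possible selection policy (the summary above doesn't say which policy Megamind uses).

```python
class WorkerRegistry:
    """In-memory stand-in for the ZooKeeper-backed list of running workers."""

    def __init__(self):
        self._workers = set()

    def register(self, address: str) -> None:
        self._workers.add(address)

    def deregister(self, address: str) -> None:
        self._workers.discard(address)

    def live_workers(self) -> list:
        # Sorted so that selection is deterministic for a given membership.
        return sorted(self._workers)


class Proxy:
    """Hypothetical proxy that spreads sessions across registered workers."""

    def __init__(self, registry: WorkerRegistry):
        self._registry = registry
        self._counter = 0

    def pick_worker(self) -> str:
        workers = self._registry.live_workers()
        if not workers:
            raise RuntimeError("no workers registered")
        # Round-robin is an assumed policy, chosen only for simplicity.
        worker = workers[self._counter % len(workers)]
        self._counter += 1
        return worker
```

Because the proxy re-reads the registry on every call, adding or removing a worker takes effect on the next request, which is the property that makes horizontal scaling straightforward.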
AWS has open sourced a new library to ingest service logs into Athena from ELB, CloudTrail, CloudFront, S3, and VPC Flow Logs using their Glue service. The job takes care of converting data to Apache Parquet and partitioning it based on year/month/day. Once run, you can easily query the data via Athena.
https://aws.amazon.com/blogs/big-data/easily-query-aws-service-logs-using-amazon-athena/
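As a rough illustration of year/month/day partitioning, the sketch below derives a Hive-style partition path from a record's timestamp. The function name, the bucket prefix, and the `year=/month=/day=` layout are assumptions for the example, not necessarily the layout the AWS library produces.

```python
from datetime import datetime, timezone


def partition_path(prefix: str, event_time: datetime) -> str:
    """Build a Hive-style year/month/day partition path for one log record."""
    # Normalize to UTC so a record lands in the same partition regardless of
    # the time zone it was recorded in.
    t = event_time.astimezone(timezone.utc)
    return f"{prefix}/year={t.year:04d}/month={t.month:02d}/day={t.day:02d}"


# Hypothetical bucket and prefix, used only for the example.
example = partition_path(
    "s3://example-bucket/elb",
    datetime(2019, 6, 2, 13, 45, tzinfo=timezone.utc),
)
```

Laying data out this way lets a query engine skip whole directories when a query filters on date, which is the main reason to partition service logs by day.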
A team of folks from Apache Flink, Apache Calcite, Apache Beam, and Oak Ridge National Lab have published a paper proposing the addition of streaming data support to the SQL standard. The proposal covers three parts: "time-varying relations" (i.e. a table with historic values), the notion of event time, and some new keyword extensions. The appendix contains an overview of the state of streaming SQL in Calcite, Flink, and Beam.
https://arxiv.org/abs/1905.12133v1
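One way to picture a "time-varying relation" is as a table that remembers its contents at each point in time, so it can be queried as of any moment. The sketch below is only an illustration of that idea under simplifying assumptions (whole snapshots, integer timestamps, hypothetical class and method names); it is not the paper's formalism or how Calcite, Flink, or Beam implement it.

```python
from bisect import bisect_right


class TimeVaryingRelation:
    """Toy model: a sequence of table snapshots ordered by timestamp."""

    def __init__(self):
        self._times = []      # strictly increasing timestamps
        self._snapshots = []  # table contents as of each timestamp

    def record(self, time: int, rows: list) -> None:
        """Store the full contents of the table as of `time`."""
        if self._times and time <= self._times[-1]:
            raise ValueError("snapshots must be recorded in time order")
        self._times.append(time)
        self._snapshots.append(list(rows))

    def as_of(self, time: int) -> list:
        """Return the table contents that were current at `time`."""
        i = bisect_right(self._times, time)
        return [] if i == 0 else list(self._snapshots[i - 1])


orders = TimeVaryingRelation()
orders.record(1, [("a", 1)])
orders.record(5, [("a", 2)])
```

Here `orders.as_of(4)` returns the contents recorded at time 1, since the next change does not happen until time 5.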
Ada, which builds an AI chatbot, writes about their data analytics system built on Apache Airflow and Redshift. The post has some great stuff on monitoring workflows, deploying Airflow on Kubernetes, and why they have chosen to do ELT rather than ETL.
Nearly a year after its last release, Apache Storm 2.0.0 is out. It includes overhauls to its architecture (including a switch from Clojure to Java) and its Kafka integration, as well as new APIs, an upgrade to Java 8, and more.
https://storm.apache.org/2019/05/30/storm200-released.html
Nordstrom (the department store chain) has open sourced a tool for "data profiling," which is used to capture the characteristics of a dataset through statistics (e.g. counts, sums, distinct counts) and assertions (e.g. schema validation and custom data quality checks). This post provides an overview of the tool, how they deploy it, and how they integrate it with Datadog and PagerDuty for alerting.
https://medium.com/tech-at-nordstrom/data-profiling-in-the-age-of-big-data-7675d486c89c
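For a sense of what profiling statistics and assertions look like in practice, here is a small sketch over rows held as Python dictionaries. The function names and data shape are hypothetical and chosen for illustration; this is not the API of Nordstrom's tool.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def profile_column(rows: List[Row], column: str) -> Dict[str, Any]:
    """Compute simple statistics for one numeric column, ignoring nulls."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return {
        "count": len(values),
        "sum": sum(values),
        "distinct_count": len(set(values)),
    }


def check_schema(rows: List[Row], expected: Dict[str, type]) -> List[str]:
    """Return a list of schema violations; an empty list means all rows pass."""
    errors = []
    for i, row in enumerate(rows):
        for column, expected_type in expected.items():
            if column not in row:
                errors.append(f"row {i}: missing column {column!r}")
            elif not isinstance(row[column], expected_type):
                errors.append(
                    f"row {i}: column {column!r} is not {expected_type.__name__}"
                )
    return errors


rows = [{"sku": "a", "qty": 2}, {"sku": "b", "qty": 3}, {"sku": "a", "qty": 2}]
stats = profile_column(rows, "qty")
violations = check_schema(rows, {"sku": str, "qty": int})
```

In a deployment like the one described, the statistics would be emitted as metrics and a non-empty list of violations would trigger an alert.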
Good collection of articles, tools, blogs, and online courses for folks looking to learn data engineering.
https://github.com/adilkhash/Data-Engineering-HowTo
LiveRamp describes the data replication service they built to keep data transfer volumes to a reasonable level during their migration from an on-prem cluster to the cloud. They started by moving jobs from the end of their pipeline to the cloud, and the replication tool ensures that the inputs to those jobs are copied across. Some useful architecture patterns if you're facing a similar problem.
A good overview of what it means to be a data engineer, what tools from the ecosystem are important to know, and a theory as to why data engineers are in such high demand (lots of work to do across many different skillsets!).
Curated by Datadog ( http://www.datadog.com )
Airflow Meetup @ Google Cloud (Sunnyvale) - Wednesday, June 5
https://www.meetup.com/Bay-Area-Apache-Airflow-Incubating-Meetup/events/260712102/
Building Your Data Streams for All the IoT with Craig Hobbs (San Francisco) - Wednesday, June 5
https://www.meetup.com/Time-Series-SF/events/260291444/
TiDB Operator in OLTP and OLAP Workloads + Cassandra & Kafka (Mountain View) - Thursday, June 6
https://www.meetup.com/BigDataApps/events/261624701/
Spark and Cassandra: ETL, Analytics, and Streaming (Minneapolis) - Monday, June 3
https://www.meetup.com/Minneapolis-St-Paul-Cassandra-Meetup/events/261095992/
Java and Data (Sao Paulo) - Wednesday, June 5
https://www.meetup.com/SouJava/events/261354386/
Building Your Own In-House Multi-Tenant/Multi-Cluster Kafka for Dummies (Edinburgh) - Tuesday, June 4
https://www.meetup.com/Edinburgh-Kafka/events/261293248/
The Road to Apache Spark 3.0, Koalas, and Neptune Spark Meetup (London) - Thursday, June 6
https://www.meetup.com/Spark-London/events/261460906/
10 Ways to Deploy Apache Kafka and Have Fun Along the Way (Barcelona) - Wednesday, June 5
https://www.meetup.com/Barcelona-Kafka-Meetup/events/261449512/
Dutch Mesos User Group @ PVH (Amsterdam) - Monday, June 3
https://www.meetup.com/Dutch-Mesos-User-Group/events/260725176/
Let's Talk about Stream Processing with Apache Flink! (Munich) - Tuesday, June 4
https://www.meetup.com/inovex-munich/events/261643276/
IoT Data Storage: Apache Cassandra, MongoDB, Redis (Frankfurt) - Wednesday, June 5
https://www.meetup.com/IoT-Hessen/events/261099466/
Kick-Off Event @ Ververica, f.k.a. Data Artisans (Berlin) - Wednesday, June 5
https://www.meetup.com/Gateway-to-China-Berlin/events/259927762/
Orchestrate Apache Kafka on Kubernetes (Munich) - Thursday, June 6
https://www.meetup.com/Apache-Kafka-Germany-Munich/events/261464989/
Kafkafest Event: Tim Berglund & Viktor Gamov (Tel Aviv-Yafo) - Tuesday, June 4
https://www.meetup.com/ApacheKafkaTLV/events/261393073/
Melbourne Data Engineering Meetup (Melbourne) - Wednesday, June 5
https://www.meetup.com/Melbourne-Data-Engineering-Meetup/events/261283044/
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.