Data Eng Weekly


Data Eng Weekly Issue #289

12 November 2018

Lots of coverage this week of optimizing real-time systems: linear programming for Elasticsearch shard placement, Facebook's Akkio for moving data closer to end users, and two-phase commit on Dropbox's sharded MySQL cluster. There are also several great posts on architecture (like change data capture and time series databases) and using tools like Spark, Flink, Kafka, and Airflow.

Sponsor

Interested in extracting value from large quantities of data? That's what we do at Criteo and it's actually pretty much all we do. We have R&D job openings in data engineering, machine learning, and related areas. We use a large number of Big Data technologies, mainly open source, and like to contribute to the open-source community as much as possible. We are looking for people who are interested in tackling the challenges that come with ingesting and analyzing hundreds of terabytes of data per day. Never heard of Criteo? We are an international ad-tech company with offices all over the world. We have a pretty cool corporate culture: http://bit.ly/criteo-culture. We take Big Data seriously enough that we have our own conference, NABD http://bit.ly/NABD-Conf, which is guaranteed to be free of marketing talks.

Check out our R&D jobs at our offices in Palo Alto, Ann Arbor, Paris, and Grenoble: http://bit.ly/criteo-rd-jobs

Technical

Meltwater writes about how they use linear programming to balance data in their Elasticsearch clusters (which include over 750 nodes) in a way that is aware of query workload.

https://underthehood.meltwater.com/blog/2018/11/05/optimal-shard-placement-in-a-petabyte-scale-elasticsearch-cluster/
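The core idea is to treat shard placement as an optimization problem rather than relying on Elasticsearch's default allocator. As a toy illustration of the balancing objective only (sizes and names are invented, and the real formulation also weighs query load and uses an LP/MIP solver rather than brute force):

```python
from itertools import product

# Toy instance: five shards with sizes in GB, three nodes.
shards = {"s1": 40, "s2": 35, "s3": 25, "s4": 20, "s5": 10}
nodes = ["n1", "n2", "n3"]

def max_node_load(assignment):
    """Objective: the load of the most loaded node (to be minimized)."""
    loads = {n: 0 for n in nodes}
    for shard, node in assignment.items():
        loads[node] += shards[shard]
    return max(loads.values())

# Exhaustive search is only feasible for tiny instances; at 750+ nodes
# you need a real solver, which is the point of the post.
best = min(
    (dict(zip(shards, combo)) for combo in product(nodes, repeat=len(shards))),
    key=max_node_load,
)
print(max_node_load(best))
```

With 130 GB total across three nodes, the best achievable maximum load here is 45 GB, since the 40 GB shard cannot share a node with anything without exceeding it.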

This post covers the operational challenges and the assumptions that break down when implementing change data capture (CDC). Because CDC exposes data from your database outside of an API or service boundary, changes to a schema impact the CDC system. This increases the risk and complexity of some database schema changes in ways that many developers aren't used to. Lots of great insights and recommendations in this piece.

https://riccomini.name/kafka-change-data-capture-breaks-database-encapsulation
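The failure mode is easy to picture: a downstream consumer written against the source table's current columns breaks the moment a column is renamed or retyped, because there was never an API contract to protect it. A toy sketch (event shapes and field names are hypothetical):

```python
# A downstream consumer written against the original source schema.
def enrich(event):
    return {"user": event["user_id"], "total_cents": event["amount_cents"]}

before = {"user_id": 42, "amount_cents": 1999}   # original schema
after = {"user_id": 42, "amount_usd": 19.99}     # after a rename/retype upstream

enrich(before)  # works fine
try:
    enrich(after)
except KeyError as e:
    # The schema change "leaked" through CDC and broke the consumer.
    print(f"consumer broke on schema change: missing {e}")
```

This is why the post recommends treating the tables you capture as a public interface, with the same change discipline as an API.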

The morning paper has a good summary of Facebook's article on Akkio, their system for migrating data across data centers to improve latency and reduce data transfer cost. Akkio is implemented as a client, and its client library has a mechanism for fine-grained sharding of data, as defined by the application itself.

https://blog.acolyer.org/2018/11/05/sharding-the-shards-managing-datastore-locality-at-scale-with-akkio/
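The central abstraction is the application-defined micro-shard (μ-shard) plus a location service that migrates each one toward where it's accessed. A highly simplified sketch of that idea (class and policy are invented; real Akkio uses access counters, policies, and an eventually consistent location database):

```python
class MicroShardDirectory:
    """Toy stand-in for Akkio's location service: maps each
    application-defined micro-shard to its current data center."""

    def __init__(self, default_dc):
        self.location = {}
        self.default_dc = default_dc

    def locate(self, shard_id):
        return self.location.get(shard_id, self.default_dc)

    def record_access(self, shard_id, from_dc):
        # Naive policy: migrate on any remote access. Akkio's real
        # policies weigh access history against migration cost.
        if self.locate(shard_id) != from_dc:
            self.location[shard_id] = from_dc

d = MicroShardDirectory("us-east")
d.record_access("user:42:messages", "eu-west")
print(d.locate("user:42:messages"))  # now served from eu-west
```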

Netflix writes about their time series storage engine for user viewing data, which is built atop Apache Cassandra. They've recently rearchitected the system to scale further with techniques like dividing data across clusters based on the age and type of data (rather than having one big cluster).

https://medium.com/netflix-techblog/scaling-time-series-data-storage-part-ii-d67939655586
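The age-based split amounts to a routing decision at read/write time. A minimal sketch of that routing (cluster names and the cutoff are invented for illustration; Netflix's design also compresses the older tier):

```python
import datetime as dt

CUTOFF = dt.timedelta(days=30)  # hypothetical boundary between tiers

def cluster_for(record_time, now):
    """Route recent viewing data to a 'live' cluster, older data
    to a compressed cluster, instead of one monolithic cluster."""
    return "live" if now - record_time <= CUTOFF else "compressed"

now = dt.datetime(2018, 11, 12)
print(cluster_for(dt.datetime(2018, 11, 1), now))  # live
print(cluster_for(dt.datetime(2018, 1, 1), now))   # compressed
```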

Apache Flink has had SQL for a while, but it recently added an interactive CLI for writing and executing queries. This post covers how to define the schema of a data source, run a query, and create a view of streaming data.

https://data-artisans.com/blog/flink-sql-powerful-querying-of-data-streams

The Ele.me team is in the unique position to provide an overview of the benefits and tradeoffs of various stream processing systems because they use Apache Storm, Apache Spark, and Apache Flink in production. They write about these tradeoffs, and the benefits of Flink leading to their decision to use it going forward.

https://hackernoon.com/flink-or-flunk-why-ele-me-is-developing-a-taste-for-apache-flink-7d2a74e4d6c0

This post provides an introduction to Pulse, an open-source log aggregation and alerting framework built on Solr for CDH. Pulse is Apache 2.0 licensed with code on GitHub.

http://blog.cloudera.com/blog/2018/11/proactive-data-pipeline-alerting-with-pulse/

Lots of good tips for working with Spark: several hard-fought lessons on partitioning and joining (like how to deal with nulls), the importance of monitoring, how to decipher log messages, and a bit on file formats.

https://www.enigma.com/blog/things-i-wish-id-known-about-spark
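One of those lessons, how nulls behave in joins, comes straight from SQL semantics: NULL never equals NULL, so rows with null join keys silently vanish from inner joins. A quick demonstration using the standard library's sqlite3 as a stand-in for Spark SQL, which follows the same rule:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders(id INTEGER, customer_id INTEGER);
    INSERT INTO orders VALUES (1, 100), (2, NULL);
    CREATE TABLE customers(id INTEGER, name TEXT);
    INSERT INTO customers VALUES (100, 'ada'), (NULL, 'ghost');
""")
rows = conn.execute("""
    SELECT o.id, c.name FROM orders o
    JOIN customers c ON o.customer_id = c.id
""").fetchall()
print(rows)  # [(1, 'ada')] -- both NULL-keyed rows silently dropped out
```

In Spark the same effect shows up as mysteriously missing rows (and, with non-null skewed keys, as lopsided partitions), which is why the post recommends handling nulls explicitly before joining.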

Streamlio has a good comparison of the architectures of LogDevice (Facebook's recently open-sourced event bus) and Apache Pulsar. While the two systems have similar storage semantics and goals, their read paths and storage architectures differ.

https://streaml.io/blog/comparing-logdevice-and-apache-pulsar

The Apache Kafka team has made major improvements to the performance of controlled shutdown of a broker—from over 6 minutes to 3 seconds in one experiment. This post describes the graceful shutdown process and the new improvements.

https://blogs.apache.org/kafka/entry/apache-kafka-supports-more-partitions

The Cockroach Labs blog has a post on their effort to build a cost-based optimizer for their database. It includes a quick intro to cost based optimizers and describes some of the implementation details, including their DSL for transforming query plans.

https://www.cockroachlabs.com/blog/building-cost-based-sql-optimizer/

Outlyer shares the evolution of their time series database for metrics: originally built with Riak and Redis, it switched to an Erlang-based open-source solution that they grew and maintained, and has now migrated to Netflix's Atlas (plus Kafka). It's impressive how quickly they've been able to evolve their architecture, and they share several hard-earned lessons... the biggest of which is that it's probably not a good idea to build and maintain your own time series database.

https://www.outlyer.com/blog/why-not-to-build-a-time-series-database/

There are a lot of Spark posts that walk through a simple tutorial or cover runtime configuration/debugging. This post covers a different, important topic—how to architect your Scala code to work well with the Spark APIs. There are several code examples to demonstrate the key building blocks that the Adobe team uses.

https://medium.com/adobetech/spark-on-scala-adobe-analytics-reference-architecture-7457f5614b4c

An introduction to logging in Apache Airflow covering the basics and how to customize settings.

https://www.astronomer.io/guides/logging/

The dataArtisans blog highlights four things to consider when using Apache Flink's broadcast state.

https://data-artisans.com/blog/broadcast-state-pattern-flink-considerations

Dropbox stores metadata in a sharded MySQL cluster called Edgestore. They recently implemented multi-shard transactions using a two-phase commit protocol. They write about the design of the system (some interesting details about how they avoid write amplification using copy-on-write), how they tested it before launch, and how they launched in production in a "shadow" mode to test performance implications.

https://blogs.dropbox.com/tech/2018/11/cross-shard-transactions-at-10-million-requests-per-second/
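The two-phase commit protocol at the heart of the post can be sketched in a few lines. This toy coordinator (all names hypothetical) deliberately omits the failure handling, locking, durability, and recovery that make Edgestore's real implementation hard:

```python
class Shard:
    """Toy participant: stages writes in phase 1, applies them in phase 2."""

    def __init__(self, name):
        self.name = name
        self.staged = None
        self.committed = {}

    def prepare(self, writes):
        # Phase 1: stage the writes and vote yes/no.
        self.staged = writes
        return True

    def commit(self):
        # Phase 2: make the staged writes visible.
        self.committed.update(self.staged)
        self.staged = None

    def abort(self):
        self.staged = None

def two_phase_commit(shards_writes):
    # Phase 1: every participant must vote yes.
    if all(shard.prepare(w) for shard, w in shards_writes):
        for shard, _ in shards_writes:
            shard.commit()  # Phase 2: commit everywhere
        return True
    for shard, _ in shards_writes:
        shard.abort()       # any "no" vote aborts everywhere
    return False

a, b = Shard("a"), Shard("b")
ok = two_phase_commit([(a, {"k1": 1}), (b, {"k2": 2})])
print(ok, a.committed, b.committed)
```

The interesting parts of the Dropbox post are precisely what this sketch leaves out: avoiding write amplification with copy-on-write, and validating the protocol in shadow mode before launch.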

Best practices for using the native PostgreSQL client and server functionality to avoid SQL injection. If you're not using prepared statements, it turns out there's a chance you could be in a less-than-ideal place.

https://tapoueh.org/blog/2018/11/preventing-sql-injections/
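The core recommendation, let the driver bind parameters instead of splicing strings, is easy to demonstrate. Here sqlite3 from the Python standard library stands in for the PostgreSQL client the post covers; the mechanics are the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users(name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 0)")

malicious = "x' OR '1'='1"

# Unsafe: string interpolation lets the input rewrite the query.
unsafe = conn.execute(
    f"SELECT count(*) FROM users WHERE name = '{malicious}'"
).fetchone()[0]

# Safe: a parameterized (prepared) statement treats input as data only.
safe = conn.execute(
    "SELECT count(*) FROM users WHERE name = ?", (malicious,)
).fetchone()[0]

print(unsafe, safe)  # 1 0 -- the injected OR matched every row; the prepared form matched none
```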

Jobs

Post a job to the Data Eng Weekly job board for $99. https://jobs.dataengweekly.com/

News

Two Apache Hive vulnerabilities were disclosed this week. One is related to the EXPLAIN query and the other is related to HiveServer2 authorization.

https://lists.apache.org/thread.html/9d81948503adc4007d7cf47c4bd0e9cd9286d86fd7fd409fda8be551@%3Cannounce.apache.org%3E
https://lists.apache.org/thread.html/8503d127073204c15d66074459dbc5b2c822b646ed1769dc9a897a93@%3Cannounce.apache.org%3E

MapR announced version 6.1 of their platform and a new push to offer Cloudera and Hortonworks customers a free, comprehensive assessment of their current environment.

https://mapr.com/blog/the-mapr-clarity-program-is-your-clear-path-to-ai-hybrid-and-multi-cloud-containers-and-operational-analytics/

Datacoral has publicly launched and announced their Series A financing. The company's product aims to bring the data infrastructure of larger companies to everyone.

https://medium.com/datacoral/introducing-datacoral-a-secure-scalable-data-infrastructure-no-management-required-a7d832773838

This article is a comprehensive list of the core technologies, skills, and concepts (with links to more details on each) that are sufficient to become a data engineer. It also has a good contrast between the data scientist and data engineer roles, as well as the types of specialization within the data engineering profession.

https://www.analyticsvidhya.com/blog/2018/11/data-engineer-comprehensive-list-resources-get-started/

This post recaps the data engineering track of last week's DataEngConf. Themes include improving ETL processes, the rise of Kubernetes/containers (and what problems they help solve), and workflow engines.

https://medium.com/memory-leak/recapping-the-dataengconf-eba9d09f06ae

Another recap, this one not by an attendee but by someone who watched many of the online presentations, covers the recent Kafka Summit SF. It's a good overview that highlights a number of key themes and takeaways.

https://manuzhang.github.io/2018/11/10/kafka-summit-sf-2018.html

Sponsor

Interested in extracting value from large quantities of data? That's what we do at Criteo and it's actually pretty much all we do. We have R&D job openings in data engineering, machine learning, and related areas. We use a large number of Big Data technologies, mainly open source, and like to contribute to the open-source community as much as possible. We are looking for people who are interested in tackling the challenges that come with ingesting and analyzing hundreds of terabytes of data per day. Never heard of Criteo? We are an international ad-tech company with offices all over the world. We have a pretty cool corporate culture: http://bit.ly/criteo-culture. We take Big Data seriously enough that we have our own conference, NABD http://bit.ly/NABD-Conf, which is guaranteed to be free of marketing talks.

Check out our R&D jobs at our offices in Palo Alto, Ann Arbor, Paris, and Grenoble: http://bit.ly/criteo-rd-jobs

Releases

Apache Spark 2.4.0 was recently released. It adds support for Scala 2.12, improves the Kubernetes integration, speeds up parsing of Avro data, and includes a lot of other improvements.

https://spark.apache.org/releases/spark-release-2-4-0.html

Apache Hive 2.3.4 is out. It resolves two vulnerabilities and is a recommended upgrade.

https://lists.apache.org/thread.html/cbebd33cbc0b42d8eb5427310392665fcd69ccdb516f3ea3635cc31a@%3Cannounce.apache.org%3E

Version 2.5.1 of Apache Kylin, the OLAP service for Apache Hadoop, was released. It includes 30 bug fixes and improvements.

https://lists.apache.org/thread.html/fdd30d44f3ec7534939fc94ac6c9b572ccd391a39550eb706aab7d3c@%3Cannounce.apache.org%3E

Version 3.0 of ScyllaDB, which is a drop-in replacement for Apache Cassandra, was released. ScyllaDB touts its performance, since it's written in C++ rather than Java.

https://www.zdnet.com/article/scylladb-achieves-cassandra-feature-parity-adds-htap-cloud-and-kubernetes-support/

Apache Kafka 2.0.1 was released. It includes nearly 50 bug fixes and improvements.

https://lists.apache.org/thread.html/c5d448f850a751951aa1df8d82383eb335b792a42a7aad7341792176@%3Cannounce.apache.org%3E

Based on Apache Kafka 2.0.1, version 5.0.1 of the Confluent Platform is out.

https://docs.confluent.io/current/release-notes.html#cp-5-0-1-release-notes

Events

Curated by Datadog (http://www.datadog.com)

California

Next Generation Open Source Data Infra: Iceberg, Spark, and Databases (San Francisco) - Tuesday, November 13
https://www.meetup.com/SF-Big-Analytics/events/255803681/

Druid Bay Area Meetup @ Netflix (Los Gatos) - Wednesday, November 14
https://www.meetup.com/druidio/events/254994344/

Washington

Event Ingestion at Scale (Seattle) - Tuesday, November 13
https://www.meetup.com/Seattle-Scala-User-Group/events/253649644/

Texas

Streaming Data Using Kafka and NLP Use Case (Plano) - Monday, November 12
https://www.meetup.com/DFW-Data-Engineering-Meetup/events/255952404/

Minnesota

November Meetup: Apache Kafka (Minneapolis) - Tuesday, November 13
https://www.meetup.com/NodeMN/events/256146797/

Wisconsin

Zendesk Journey to Cloud Data Engineering/Science: Airflow, Tensorflow, and GCP (Madison) - Tuesday, November 13
https://www.meetup.com/BigDataMadison/events/255967610/

Ohio

Cleveland Big Data Meetup (Cleveland) - Monday, November 12
https://www.meetup.com/Cleveland-Hadoop/events/255117595/

Georgia

Druid: History and the Future of Operational Analytics Data Store (Atlanta) - Thursday, November 15
https://www.meetup.com/Apache-Druid-Atlanta/events/254819889/

North Carolina

Stream Processing Done Right Using Apache Kafka (Raleigh) - Tuesday, November 13
https://www.meetup.com/Raleigh-Apache-Kafka-Meetup-by-Confluent/events/256027605/

Pennsylvania

Enabling Analytics at Scale with Modern Data Pipelines (Philadelphia) - Tuesday, November 13
https://www.meetup.com/DataPhilly/events/255742822/

CHILE

Approaching the Cloud: Introduction to Apache Spark (Santiago) - Thursday, November 15
https://www.meetup.com/Cloud-Native-Chile/events/256240930/

UNITED KINGDOM

Julia and Big Data: Aviva/Spark/Hadoop/Hive (London) - Monday, November 12
https://www.meetup.com/London-Julia-User-Group/events/255797971/

Python Streaming Pipelines with Beam on Flink (London) - Tuesday, November 13
https://www.meetup.com/Apache-Flink-London-Meetup/events/255967796/

Exactly-Once Semantics in Apache Kafka, with Jay Kreps (London) - Tuesday, November 13
https://www.meetup.com/Apache-Kafka-London/events/255858267/

FRANCE

Flink SQL in Action, Low Latency Decision for Bot and Flink Goes Cloud Native (Paris) - Wednesday, November 14
https://www.meetup.com/Paris-Fast-Data-Meetup/events/256091493/

Beyond the Brokers: A Tour of the Kafka Environment (Talence) - Thursday, November 15
https://www.meetup.com/BordeauxJUG/events/256161168/

NETHERLANDS

Kafka, MQTT, Graph, KSQL, and More! (Amsterdam) - Monday, November 12
https://www.meetup.com/Amsterdam-Kafka-Meetup/events/255834445/

Real-Time Fast Data in a Microservice Architecture (Veldhoven) - Tuesday, November 13
https://www.meetup.com/MeraparTechnologies/events/255420185/

GERMANY

Connected Data Platform: Hive 3.0 and Kafka Management (Frankfurt) - Tuesday, November 13
https://www.meetup.com/futureofdata-frankfurt/events/254822321/

Ingest, Cleanse, Transform, and Provision Data on Hadoop with Talend (Frankfurt) - Thursday, November 15
https://www.meetup.com/Commerzbank-Big-Data-and-Advanced-Analytics-Events/events/256198211/

SWITZERLAND

Connected Data Platform: Containerization, Hive Optimization, and Ingest (Zurich) - Thursday, November 15
https://www.meetup.com/futureofdata-zurich/events/255777328/

ITALY

Upcoming Apache Spark 2.4 and MLflow: New Tools for Big Data and ML (Milano) - Thursday, November 15
https://www.meetup.com/Spark-More-Milano/events/255996001/

POLAND

Streaming ETL with Apache Kafka + More (Gdansk) - Wednesday, November 14
https://www.meetup.com/PyData-Trojmiasto/events/255711984/

ISRAEL

Kafka and DDS for Kubernetes: Connecting the Dots (Tel Aviv) - Wednesday, November 14
https://www.meetup.com/Kubernetes-Tel-Aviv/events/256141904/

PHILIPPINES

Apache Kafka and Airflow (Taguig) - Tuesday, November 13
https://www.meetup.com/Manila-BIG-DATA-Group/events/255511252/

INDONESIA

Introducing Apache Kafka (Jakarta) - Thursday, November 15
https://www.meetup.com/Jakarta-Kafka/events/255836745/