Data Eng Weekly Issue #287

28 October 2018

Several great illustrative articles this week on architecture components of PostgreSQL, Spark, MongoDB, Flink streaming, Pulsar, and more. Also, two posts on Kubernetes for Kafka and Flink, data center failover at Facebook, and a presentation that does a great job explaining Paxos for consensus. In news, there's a good peek at what's in store for the product merger of Hortonworks and Cloudera, an article on a botnet spreading across Hadoop clusters, and links to videos from a couple of recent conferences.

Sponsor

For data engineers who are frustrated with big bullshit data conferences and boring talks, we've got a solution: the first community-powered No-Bullshit Data event designed for and by data engineers and scientists. It takes place this year at Columbia University in NYC on Nov 8-9th. They have 4 dedicated tracks this year: Data Engineering, Data Science, AI Products and the brand new Hero Engineering. Join geeks from the west & east coast within companies like Facebook, Salesforce, Netflix, WeWork, MIT, Beeswax, Lyft, Stitch Data, Datadog, Segment, Starburst, Datacoral, Columbia University, Capital One, TapRecruit, Figure Eight, Dia&Co and many more.

We are giving every DataEngWeekly reader a 20% discount code "DataEngWeekly" which can be redeemed for tickets here: http://buytickets.at/hakkalabs/192354/r/dataengweekly

Technical

An interesting post explaining how PostgreSQL stores data on disk. It uses the ctid of a row to show where that row is physically located, describes how to repartition data by reading, sorting, and writing it back from Spark, and more.

https://lambda.grofers.com/why-physical-storage-of-your-database-tables-might-matter-74b563d664d9

This presentation describes the differences between classic and multi Paxos, and how they relate to Zab (ZooKeeper) and Raft. It's a really good introduction to consensus algorithms.

http://hh360.user.srcf.net/slides/liberatingconsensus.pdf

Datalog is a language used to query both graph and relational data. This post is about KDatalog, an implementation of Datalog atop Kafka Graphs using the Pregel algorithm.

https://yokota.blog/2018/10/23/kdatalog-kafka-as-a-datalog-engine/

If you're a visual learner like me, the great diagrams from this post provide a helpful supplement to the core Spark documentation. They cover how Spark interacts with YARN, how jobs relate to stages, and more.

https://medium.com/@pang.xin/spark-study-notes-core-concepts-visualized-5256c44e4090

Turnilo is a new open source business intelligence tool for Druid. It's based on Swiv, the open source fork of Pivot, with some changes to the build process, some bug fixes, and other changes. Allegro tells the story of their transition to Druid from Hadoop and the use cases they're trying to solve with the Druid/Turnilo combo.

https://allegro.tech/2018/10/turnilo-lets-change-the-way-people-explore-big-data.html

Adobe writes about their enterprise Kafka deployment, called Pipeline, that powers the Adobe Experience Platform. Pipeline is replicated across 13 data centers and processes tens of billions of events per day. They write about how they ingest data via RESTful APIs and replicate data across all of those data centers, and they describe their management console.

https://medium.com/adobetech/creating-the-adobe-experience-platform-pipeline-with-kafka-4f1057a11ef

This Jepsen post covers how MongoDB performs in the face of network and other failure scenarios. There's a good overview of Mongo's sharding strategy, a description of the types of failure scenarios tested, an overview of causal consistency, and the results of these tests (a few new issues were found with MongoDB's causal consistency). There's a thorough discussion of the various tuning and configuration options in MongoDB that relate to these tests.

http://jepsen.io/analyses/mongodb-3-6-4

Wallaroo has an introduction to their connector API, which is used to get data into and out of Wallaroo. They have open source examples for Kafka, Kinesis, Redis, RabbitMQ, S3, and UDP.

https://blog.wallaroolabs.com/2018/10/introducing-connectors-wallaroos-window-to-the-world/

Distributed systems often span multiple data centers, and applications are designed to tolerate a total data center failure. Facebook has built a system to move traffic off of a data center, and they've published a paper about building and testing it. The Morning Paper has a good overview of their paper.

https://blog.acolyer.org/2018/10/24/maelstrom-mitigating-datacenter-level-disasters-by-draining-interdependent-traffic-safely-and-efficiently/

The AWS Big Data blog has a post from Equinox on their analytics platform that's built on Amazon S3 with Amazon EMR and AWS Glue. They convert data from CSV to Apache Parquet for better performance.

https://aws.amazon.com/blogs/big-data/closing-the-customer-journey-loop-with-amazon-redshift-at-equinox-fitness-clubs/

The Confluent blog has a post discussing the trade-offs of running Apache Kafka on Kubernetes vs. not. It touches on technical items, such as the ability to try out new versions of Kafka and ease of scaling, as well as other practical items, such as potential bureaucratic overhead.

https://www.confluent.io/blog/apache-kafka-kubernetes-could-you-should-you

The data Artisans blog has a tutorial for running their Platform on Amazon EKS. There are instructions for how to deploy the EKS cluster both via the AWS web console and CLI. They then describe how to start the Flink Application Manager, run a job, and configure remote state checkpoints/savepoints in S3.

https://data-artisans.com/blog/how-to-get-started-with-data-artisans-platform-on-aws-eks

The Streamlio blog has an overview of writing Apache Pulsar Functions. It has examples for routing, filtering, transformations and the basics of windowing.

https://streaml.io/blog/eda-event-processing-design-patterns-with-pulsar-functions

This post covers the three types of backends for storing state in Apache Flink: memory, file system, and RocksDB. It discusses the trade-offs of each.

https://data-artisans.com/blog/stateful-stream-processing-apache-flink-state-backends

A good brief introduction to the analyze command for collecting statistics used by the PostgreSQL query optimizer.

https://www.smoothterminal.com/articles/analyze

Jobs

Senior Data Engineer, Wooga, Berlin https://jobs.dataengweekly.com/jobs/bd122b51-64ec-4a6a-a67e-73de9aa0fef3
Data Platform Engineer, Prezi, Budapest https://jobs.dataengweekly.com/jobs/06375e1a-6b74-4dba-b83d-d42c8386041b

Post a job to the Data Eng Weekly job board for $99. https://jobs.dataengweekly.com/

News

A good article on building a data science team, including how to communicate with the rest of the organization, making the team feel connected to the work, and more. While it's about data science, these ideas apply to a bunch of data teams.

https://hbr.org/2018/10/managing-a-data-science-team

Recap of the Apache Flink Bay Area Meetup (with links to slides), which had talks from MapR, Netflix, and Lyft.

https://data-artisans.com/blog/community-update-october-2018-apache-flink-bay-area-meetup

Datanami has a good article on what the Cloudera and Hortonworks merger means from a product consolidation perspective. They've pulled some slides out of the SEC filing that talk about the future "Unity" releases that they're planning.

https://www.datanami.com/2018/10/24/new-cloudera-plots-a-course-toward-a-unified-future/

ZDNet has coverage of a botnet that is spreading via unsecured Apache Hadoop clusters.

https://www.zdnet.com/article/new-ddos-botnet-goes-after-hadoop-enterprise-servers/

Videos from DataEngConf Barcelona have been posted. Catch up on the videos to get a good idea of what to expect for the upcoming conference in New York.

https://www.youtube.com/playlist?list=PLAesBe-zAQmHlmcRlXKT1beFQVaOyivqq

A roundup as well as the videos+slides from the recent Kafka Summit have been posted online. Slides are public but videos are behind an email-wall.

https://www.confluent.io/blog/kafka-summit-san-francisco-2018-roundup

Releases

Hortonworks announced the release of HDPSearch 4.0. It's based on Apache Solr 7.4, making it the first HDPSearch release to use Solr 7 (with autoscaling and two new types of replication).

https://hortonworks.com/blog/enterprise-search-hdp-search/

Apache Impala 3.0.1 was released. It includes security fixes related to two CVEs.

https://lists.apache.org/thread.html/fb6bd0d677eaf20a7c509f5a54daafc3a5e9c4c163977b533541c1b8@%3Cannounce.apache.org%3E

Version 2.2.0 of Apache Pulsar, the messaging platform, has been released. The release adds several major features including Pulsar SQL, several new connectors for HDFS, Flink, & JDBC, and other fixes.

https://lists.apache.org/thread.html/78eac458d7fbaa630c8f3dcdb1cfdb3450d06bf8d2928df661248d1c@%3Cannounce.apache.org%3E

Apache Kudu 1.8.0 is out with a new manual rebalancer and improved support for Pandas, Spark Streaming DataFrames, and the Kudu Python client.

https://lists.apache.org/thread.html/0b3fca3fcf9cbcb69b7e2513ce2dff4608d9ccd921de875901a179ad@%3Cannounce.apache.org%3E

Spring Cloud Data Flow 1.7 has been released with new audit trail support, a new dashboard, and support for native cron jobs in Kubernetes.

https://content.pivotal.io/blog/audit-trails-new-gui-kubernetes-cronjob-integration-streaming-application-dsl-and-more-spring-cloud-data-flow-1-7-is-ga

PipelineDB, which is a PostgreSQL extension for aggregating time series data, released version 1.0.0. This post introduces the SQL-based API. PipelineDB is GPLv3 licensed.

https://www.pipelinedb.com/blog/pipelinedb-1-0-0-high-performance-time-series-aggregation-for-postgresql

Events

Curated by Datadog ( http://www.datadog.com )

Texas

Apache Kafka Introduction and Use Cases (Houston) - Monday, October 29
https://www.meetup.com/Houston-Kafka/events/255247089/

Florida

Building Modern Data Lakes with Minio, Hadoop, Spark & Unified Data Architecture (Jacksonville) - Tuesday, October 30
https://www.meetup.com/jaxbigdata/events/255173140/

New York

Scaling Kafka in Kubernetes with and without Datadog Kafka-Kit (New York) - Tuesday, October 30
https://www.meetup.com/Apache-Kafka-NYC/events/244427142/

CANADA

Fast Data Applications with the Alpakka Kafka Connector (Toronto) - Thursday, November 1
https://www.meetup.com/scalator/events/255401491/

IRELAND

NewsWhip and Confluent Talk Kafka (Dublin) - Tuesday, October 30
https://www.meetup.com/Dublin-Apache-Kafka-Meetup-by-Confluent/events/254187520/

GERMANY

Apache Kafka & KSQL in Action: Use Cases & Demo (Wiesbaden) - Monday, October 29
https://www.meetup.com/Frankfurt-Apache-Kafka-Meetup-by-Confluent/events/254699081/

Apache Kafka: Lessons Learned (Koln) - Tuesday, October 30
https://www.meetup.com/REWE-Digital-Events-Cologne/events/255664731/

ITALY

Processing Streaming Data with KSQL w/ Special Guest (Milano) - Tuesday, October 30
https://www.meetup.com/Milano-Kafka-meetup/events/255094035/

HUNGARY

Big Data Meetup: Data, Data, Everywhere @ Morgan Stanley (Budapest) - Tuesday, October 30
https://www.meetup.com/Big-Data-Meetup-Budapest/events/255631119/

ISRAEL

Streaming Things with Kafka and Spark (Tel Aviv-Yafo) - Wednesday, October 31
https://www.meetup.com/Big-things-are-happening-here/events/255521836/

SINGAPORE

A Tour of Apache Kafka (Singapore) - Thursday, November 1
https://www.meetup.com/Singapore-Kafka-Meetup/events/255620191/

THAILAND

Our First Kafka Meetup in Bangkok with a Guest Speaker from Confluent (Bangkok) - Wednesday, October 31
https://www.meetup.com/Bangkok-Kafka/events/255619397/

AUSTRALIA

Automating Apache Kafka (Melbourne) - Thursday, November 1
https://www.meetup.com/melbourne-distributed/events/254783329/