08 April 2018
Lots of great content this week: stream processing with Apache Kafka, consistent hashing strategies, several posts on data infra architecture, and two posts related to Apache Hive performance. Flink Forward is this week and DataEngConf is next week (see News for a discount code). In releases, Apache Hadoop 3.1.0 is out, Apache Hive released a new version to fix some security issues, and ElastiCluster is a cluster provisioning tool that recently gained support for Hadoop and Spark.
Are you looking to elevate your data culture http://bit.ly/culture-at-netflix? Do you want to work with stunning colleagues, solve complex data problems and contribute to the best video viewing experience in the industry? Come press ‘play’ with Netflix. We’re hiring for many types of data roles. Look for us at the upcoming DataEngConf and say hello to learn more.
http://bit.ly/netflix-jobs-data
This post describes one company's journey with BigQuery: the advantages it provides in performance and usability, but also how that ease can be a double-edged sword, because it's easy to forget to optimize when it's so easy to throw more compute (and money) at a query.
https://labs.unacast.com/one-year-with-bigquery-e3ebd73749cd
While it's been around for a while, I only recently came across the WSO2 streaming SQL system. In addition to analyzing data from Kafka, the open-source engine can take in data from lots of other sources (like HTTP, MQTT, JMS, and email). This post provides a good introduction to its SQL syntax, including how to use it for lots of common operations.
If you want to test out an unreleased or custom build of Spark, this post walks through the build and deploy steps for running on a YARN cluster or Kubernetes. The tutorial is tailored to Google Cloud Platform, but most of the steps are broadly applicable.
This post is a great overview of consistent hashing algorithms, which are a common building block in a distributed system. The post starts off with the classic ring-based consistent hashing mechanism and goes into more recent inventions like Jump Hash, Multi-Probe Consistent Hashing, Rendezvous Hashing, and Maglev Hash (three of the four are from Google).
https://medium.com/@dgryski/consistent-hashing-algorithmic-tradeoffs-ef6b8e2fcae8
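As a rough illustration of the classic ring-based mechanism mentioned above, here is a minimal Python sketch. It is not code from the linked post: the `HashRing` class, the choice of MD5 for placing points, the 100 virtual nodes per server, and the `cache-a`/`cache-b`/`cache-c` node names are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a 64-bit integer position on the ring."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Ring-based consistent hashing with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        if not self._points:
            raise LookupError("ring is empty")
        # Walk clockwise to the first point past the key, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}

ring.remove("cache-b")
after = {k: ring.lookup(k) for k in keys}

# Only keys that lived on the removed node should change owner.
moved = [k for k in keys if before[k] != after[k]]
assert all(before[k] == "cache-b" for k in moved)
```

The property worth noticing is the last assertion: removing a node only remaps the keys that node owned, which is the main reason to prefer this over `hash(key) % n`.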
This post summarizes a recent presentation on building and deploying ML models at Booking.com. They deploy Docker containers using Kubernetes (which has some experimental support for GPUs) and, for serving, pull the pre-trained models from HDFS.
https://www.infoq.com/news/2018/04/booking-kubernetes-machine-learn
The Teads data analytics team writes about their data infrastructure, which is spread across Google Cloud Platform for BigQuery and Dataflow and Amazon Web Services for Kafka and Redshift. They describe a number of key architectural decisions that they made for ingesting, storing, and querying data in BigQuery, as well as some limitations. Lots of good tips if you're considering a similar architecture (including some pricing gotchas to keep an eye on).
https://medium.com/teads-engineering/give-meaning-to-100-billion-analytics-events-a-day-d6ba09aa8f44
MR3 is a new execution engine for Apache Hadoop. It claims feature parity with Tez along with improved performance. This two-part series describes the performance gains when running Hive on MR3 vs. Tez.
https://mr3.postech.ac.kr/blog/2018/04/02/performance-evaluation-sequential-tpcds/
Hortonworks has analyzed Hive LLAP runtime when data is stored in HDFS or Amazon S3 (tiered vs. decoupled storage strategies) for cloud workflows, and they found that there's a small fixed cost overhead of storing data in S3 rather than HDFS (at least for their test scenario).
https://hortonworks.com/blog/cloud-architectures-interactive-analytics-apache-hive/
A quick overview and performance evaluation of Spark RDDs and DataFrames. The performance eval uses the Wikipedia dataset and some simple aggregations both with Spark and PySpark.
https://mindfulmachines.io/blog/2018/4/3/spark-rdds-and-dataframes
Using the Syslog Apache Kafka Connect plugin, you can get syslog data into Kafka in Avro format for analysis. This post shows how to analyze that data with KSQL once you've collected it.
https://www.confluent.io/blog/real-time-syslog-processing-apache-kafka-ksql-part-1-filtering
Great overview of the stream and table concepts in stream processing. The post has lots of animations to help illustrate the relationship between the two, and it has some examples in Scala and KSQL for building a table from a stream.
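To make the stream/table relationship concrete, here is a small Python sketch (not taken from the post, which uses Scala and KSQL). The function names `aggregate` and `replay` and the sample `visits` data are hypothetical. It shows the two directions of the duality: aggregating a stream produces a table, and the table's updates form a changelog stream that can be replayed to rebuild the table.

```python
def aggregate(stream):
    """Stream -> table: count events per key, recording one changelog entry per update."""
    table = {}
    changelog = []
    for key in stream:
        table[key] = table.get(key, 0) + 1
        changelog.append((key, table[key]))
    return table, changelog


def replay(changelog):
    """Changelog stream -> table: keep only the latest value seen for each key."""
    table = {}
    for key, value in changelog:
        table[key] = value
    return table


visits = ["alice", "bob", "alice", "carol", "alice"]
table, changelog = aggregate(visits)

print(table)      # {'alice': 3, 'bob': 1, 'carol': 1}
print(changelog)  # [('alice', 1), ('bob', 1), ('alice', 2), ('carol', 1), ('alice', 3)]

# Replaying the changelog reconstructs the same table.
assert replay(changelog) == table
```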
This tutorial walks through running Apache Kafka on Microsoft HDInsight and hooking it up to a Spark cluster on Databricks for stream processing.
https://lenadroid.github.io/posts/kafka-hdinsight-and-spark-databricks.html
Flink Forward is this week in San Francisco. data Artisans has been previewing the sessions on its blog for the past few weeks; here's part 6 of that coverage.
https://data-artisans.com/blog/flink-forward-san-francisco-preview-part-6-of-6-technology-deep-dive
DataEngConf, as mentioned last week, is in just over a week in San Francisco. The conference is offering 25% off for readers of Data Eng Weekly with the code DEWEEKLY.
https://www.eventbrite.com/e/dataengconf-sf-18-tickets-42458685070?discount=DEWEEKLY
The InfoQ podcast has an interview with Uber's Danny Yuan on their streaming systems, which include Apache Kafka, Apache Flink, Apache HDFS, and more. They process around 1 million messages per second, which feeds into OLAP systems, ML models, and more.
https://www.infoq.com/podcasts/Danny-Yuan-uber
RedMonk has analyzed the state of purpose-built time series database engines as well as some general-purpose databases that are often used for time series data. They look at popularity on GitHub and Stack Overflow and discuss the large amount of fragmentation we're currently seeing.
https://redmonk.com/rstephens/2018/04/03/the-state-of-the-time-series-database-market/
Hue 4.2 is out. The release focuses on cloud, analytics, and supportability; the release notes have more details on those improvements.
http://gethue.com/hue-4-2-and-its-self-service-bi-improvements-are-out/
Apache Hadoop 3.1.0 was released with the caveat that it's not yet considered production-ready. Major enhancements include updates to S3 support/performance and new features in the capacity scheduler. YARN also gets GPU and FPGA support as well as support for long-running services.
Apache Hive 2.3.3 is out. It contains security fixes related to 3 CVEs, one of which has been around as far back as Hive 0.6.0.
https://lists.apache.org/thread.html/cc41d3b2e1c176b10cba7518edd968e84fc95927deb5225967602310@%3Cuser.hive.apache.org%3E
http://bit.ly/CVE-2018-1282
http://bit.ly/CVE-2018-1284
http://bit.ly/CVE-2018-1315
Scio, the Scala library for Apache Beam and Google Cloud Dataflow, has released version 0.5.2 with new JavaConvertors implicits, custom validation for BigQuery, and more.
https://github.com/spotify/scio/releases/tag/v0.5.2
ElastiCluster, a tool for provisioning compute clusters, has added support for Hadoop and Spark clusters by way of Apache Bigtop. It's built with Ansible, has great documentation, and is licensed under GPLv3.
http://elasticluster.readthedocs.io/en/latest/
Curated by Datadog ( http://www.datadog.com )
Airflow Meetup @ WePay (Redwood City) - Wednesday, April 11
https://www.meetup.com/Bay-Area-Apache-Airflow-Incubating-Meetup/events/247127561/
Cleveland Red Hat User Group Meeting (Cleveland) - Tuesday, April 10
https://www.meetup.com/Cleveland-Red-Hat-Meetup/events/247175740/
High-Performance Data Analytics & Visualizations for Volume, Variety, Velocity (New York) - Tuesday, April 10
https://www.meetup.com/mysqlnyc/events/249163161/
Implementing Microservices with Domain Events + Replacing MirrorMaker (Vancouver) - Tuesday, April 10
https://www.meetup.com/vancouver-kafka/events/248682368/
GCP Meetup at Google (Stockholm) - Tuesday, April 10
https://www.meetup.com/gcp-stockholm/events/249164223/
Apache Kafka Meetup @ Zalando (Helsinki) - Wednesday, April 11
https://www.meetup.com/Helsinki-Apache-Kafka-Meetup/events/249025327/
Creating a Big Data Architecture with Apache Spark (Barcelona) - Thursday, April 12
https://www.meetup.com/Spark-Barcelona/events/249359355/
Collaborative Music with Kafka! (Bordeaux) - Thursday, April 12
https://www.meetup.com/IppEvents/events/249380005/
Lugano Tech Talks (Lugano) - Tuesday, April 10
https://www.meetup.com/Lugano-Tech-Talks/events/248955346/
Deep Learning on Hadoop (Prague) - Thursday, April 12
https://www.meetup.com/CS-HUG/events/249386944/
Event-Driven Architecture with Kafka Streams (Katowice) - Friday, April 13
https://www.meetup.com/Silesia-JUG/events/249090295/
RUSSIA
Introduction to Hadoop and Spark (Moscow) - Thursday, April 12
https://www.meetup.com/BigAlgo/events/247865467/
Real-Time Data Pipelines with Apache Kafka & Java Futurity (Bangalore) - Saturday, April 14
https://www.meetup.com/Core-Java-Meetup-Bangalore/events/249219945/
Sydney Data Engineering Meetup (Surry Hills) - Thursday, April 12
https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/248326298/
ZA HUG #4 (Johannesburg) - Thursday, April 12
https://www.meetup.com/ZA-Hadoop-User-Group/events/247574002/