21 January 2018
Lots of great content this week, with an emphasis on stream processing (Kafka and Wallaroo) and data engineering patterns (an article on Functional Data Engineering and scaling analytics+data engineering at Wish). In news, there's coverage of Hadoop 3.0 and a podcast episode covering the last five years of Hadoop. And lots of releases—Apache Drill (from back in December), Apache Impala, Apache HBase 2.0 beta, and more.
This issue marks a key milestone—it's been five years since I started Hadoop Weekly. I've written a post about the milestone that also reflects on the past few years.
https://medium.com/@joecrobak/five-years-of-hadoop-weekly-7aa8994f140b
As stated in the above post, Hadoop Weekly will be renamed Data Eng Weekly in the next few weeks. It's fair to say that the content of this newsletter has far outgrown Apache Hadoop and its core related products (as the above overview perfectly exemplifies!). 2018 should be an exciting year for data engineering. I'm excited to cover that news, and I hope you stay along for the ride (and help spread the word!).
This post describes the functional approach to data engineering, which emphasizes repeatability, immutability, and decomposition to solve coordination and scaling problems. It also mentions Apache Airflow, which implements several of these patterns, and it describes patterns for handling dimensional data, late-arriving data, and more.
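As a quick illustration (not from the post; the DAG, task, and path names below are made up), here's roughly what the pattern looks like with Airflow 1.x: each scheduled run recomputes and overwrites exactly one date partition from immutable source data, so re-running or backfilling a day is idempotent.

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def build_partition(ds, **kwargs):
    # `ds` is the execution date (YYYY-MM-DD) that Airflow passes in.
    # Recompute this partition from the immutable raw data and overwrite
    # any previous output rather than appending to it.
    output_path = "s3://example-bucket/events/ds={}/".format(ds)  # hypothetical path
    # ... read raw events for `ds`, transform, and write to output_path ...


dag = DAG(
    "daily_events",                      # hypothetical DAG name
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

build = PythonOperator(
    task_id="build_partition",
    python_callable=build_partition,
    provide_context=True,
    dag=dag,
)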
Wish has written a thorough four-part series on scaling their analytics and data engineering teams. It's a good mix of technical and non-technical details, such as how they use Luigi for workflows and Prometheus for monitoring as well as how they built the team over time, including brief descriptions of several roles. There's quite a bit of good advice and lessons learned no matter where you are in building a data team.
https://medium.com/wish-engineering/scaling-analytics-at-wish-619eacb97d16
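For those who haven't seen Luigi, here's a hedged sketch (the class, parameter, and path names are hypothetical) of the kind of daily task it's built around. Luigi skips any task whose output target already exists, which gives cheap re-run safety for scheduled jobs.

import luigi


class DailyOrderReport(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        # If this target already exists, Luigi considers the task complete.
        return luigi.LocalTarget("reports/orders-{}.csv".format(self.date))

    def run(self):
        with self.output().open("w") as f:
            f.write("order_id,total\n")
            # ... aggregate the day's orders and write one row per order ...


if __name__ == "__main__":
    luigi.run()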
It may be counterintuitive at first, but there are some pretty compelling reasons to store multiple different types of events on the same Kafka topic. In particular, when implementing an event sourcing strategy, the order of events is key for correctness. This post lays out that and other use cases and describes some changes to the Confluent Schema Registry Client to better support heterogeneous schemas within a topic.
https://www.confluent.io/blog/put-several-event-types-kafka-topic/
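As a minimal sketch of the idea (using the kafka-python client rather than the Confluent tooling from the post, with made-up topic and event names): two different event types about the same customer are written to one topic and keyed by customer id, so their relative order is preserved within a partition.

import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumes a local broker
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

customer_id = "customer-42"
producer.send("customer-events", key=customer_id,
              value={"type": "AddressChanged", "city": "Berlin"})
producer.send("customer-events", key=customer_id,
              value={"type": "InvoiceSent", "amount": 99.0})
producer.flush()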
With lots of folks using GPUs for neural networks and other machine learning use cases, we'll need new tools to productionize this new type of data infrastructure. This post gives a good example of running TensorFlow via Kubernetes on AWS.
https://banzaicloud.com/blog/tensorflow-on-k8s/
The Wallaroo stream processing system has added support for Go. This post uses the canonical streaming example—word count—to demonstrate some of the core abstractions like state computation and state partitions.
https://blog.wallaroolabs.com/2018/01/go-go-go-stream-processing-for-go/
The distinction between queuing and streaming is a subtle one, and confusing the two can often lead to design mistakes. This post does a good job explaining the differences, with some extra commentary on how Apache Pulsar supports both use cases.
https://streaml.io/blog/unified-queuing-streaming/
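A small sketch of the two models using the pulsar-client Python library (the topic and subscription names are made up): a Shared subscription behaves like a work queue, with messages load-balanced across its consumers, while an Exclusive subscription reads the topic as an ordered stream.

import pulsar

client = pulsar.Client("pulsar://localhost:6650")  # assumes a local broker

# Queuing: messages on this subscription are distributed across all of its consumers.
queue_consumer = client.subscribe(
    "my-topic", subscription_name="work-queue",
    consumer_type=pulsar.ConsumerType.Shared)

# Streaming: a single consumer reads the topic in order.
stream_consumer = client.subscribe(
    "my-topic", subscription_name="ordered-stream",
    consumer_type=pulsar.ConsumerType.Exclusive)

msg = queue_consumer.receive()
queue_consumer.acknowledge(msg)

client.close()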
A brief overview of the main components of centralized logging—transport, storage, analysis, and alerting. There's a look at some of the options for transport, such as Kafka, as well as various tradeoffs when it comes to storage and analysis.
https://medium.com/eulercoder/part-1-building-a-centralized-logging-application-5a537033da0a
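To make the transport stage concrete, here's a hedged sketch (the handler class and topic names are hypothetical) of shipping application logs to Kafka with a custom logging handler; a downstream consumer would then handle the storage, analysis, and alerting stages.

import json
import logging

from kafka import KafkaProducer


class KafkaLogHandler(logging.Handler):
    """Publishes each log record to a Kafka topic (the transport stage)."""

    def __init__(self, topic, bootstrap_servers="localhost:9092"):
        super(KafkaLogHandler, self).__init__()
        self.topic = topic
        self.producer = KafkaProducer(
            bootstrap_servers=bootstrap_servers,
            value_serializer=lambda v: json.dumps(v).encode("utf-8"))

    def emit(self, record):
        self.producer.send(self.topic, {
            "logger": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
        })


logger = logging.getLogger("orders")
logger.addHandler(KafkaLogHandler("app-logs"))
logger.warning("payment retry exhausted")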
Confluent has a collection of demos, tutorials, technical blog posts, and more. Lots of great resources if you're getting started with Kafka.
https://docs.confluent.io/current/tutorials.html
This article provides a pretty good overview of the major challenges facing enterprises looking to get into big data, including those that persist even once a system is adopted. The hurdles include finding technical talent, managing metadata, integrating with legacy systems, and more.
https://dzone.com/articles/whats-preventing-big-data-success
An interview with Hortonworks' YARN and MapReduce lead, Vinod Kumar, covers topics like the new features in Apache Hadoop 3.0, how Hadoop fits into larger trends, and thoughts on the so-called "post-Hadoop" era.
It shouldn't be a surprise to anyone who's followed the trends in this newsletter that Kafka is driving a lot of the growth in stream processing.
https://www.datanami.com/2018/01/18/fueled-kafka-stream-processing-poised-growth/
Episode 70 of Roaring Elephant, the podcast on Apache Hadoop, covers the past five years of Hadoop by looking back at an article that captures the state of things in 2013.
https://roaringelephant.org/2018/01/16/episode-70-10-facts-about-hadoop-five-years-later/
Apache Drill 1.12.0 was released back in December. It now supports Kafka and OpenTSDB as sources of data, has improved throttling, has new functions for dealing with IP addresses, CIDR blocks, and more, and adds new security features.
https://drill.apache.org/blog/2017/12/15/drill-1.12-released/
Apache Impala 2.11.0 was announced. It includes improvements to S3 integration (IAM role support), code generation, and Kudu support. In all, over 200 tickets are included in the release.
Apache HBase 2.0.0 is now in beta. As the announcement notes, there are over 2,000 changes included in the release.
IBM has announced the general availability of Big Replicate 2.1.2. It provides active-active replication for the CDH and HDP distributions; full details are in the release announcement.
https://developer.ibm.com/hadoop/2018/01/18/announcing-big-replicate-v2-1-2/
MapR has published a docker container that contains the full complement of Drill, Spark, the MapR file system, MapR-DB, and more. It aims to be useful for local development and getting started with the platform.
https://mapr.com/blog/mapr-developer-container-demo/
Google Research has announced a free notebook system, Colaboratory, which supports TensorFlow with GPUs.
https://www.kaggle.com/getting-started/47096#post271139
Stream Reactor is a collection of open-source Apache Kafka connectors. The new release (there are different binaries for Kafka 1.0 and 0.11) adds support for Apache Pulsar and has fixes/new features for the Cassandra, FTP, MQTT, JMS, Redis, and InfluxDB connectors.
http://www.landoop.com/blog/2018/01/stream-reactor-kafka-connectors-04/
Curated by Datadog ( http://www.datadog.com )
Big Data Integration and Management with Apache Gobblin, Dali, and Friends (San Francisco) - Thursday, January 25
https://www.meetup.com/Big-Data-Meetup-LinkedIn/events/246858500/
Big Data 2018.1 (Santiago) - Monday, January 22
https://www.meetup.com/Big-Data-Chile/events/246788429/
Dataframes in Apache Spark (Bristol) - Monday, January 22
https://www.meetup.com/Apache-Spark-South-West-UK/events/244075647/
Scaling Hive via Mesos (Warsaw) - Thursday, January 25
https://www.meetup.com/warsaw-hug/events/246897996/
Stream Event Processing in Scale with Apache Flink and Couchbase (Herzliya) - Tuesday, January 23
https://www.meetup.com/Big-things-are-happening-here/events/246778549/
Apache Flink Meetup Tel-Aviv @ Clicktale (Ramat Gan) - Wednesday, January 24
https://www.meetup.com/meetup-group-Apache-Flink-Meetup-Tel-Aviv/events/246639392/
Kafka RDBMS Bidirectional Integration (Tel Aviv) - Wednesday, January 24
https://www.meetup.com/ApacheKafkaTLV/events/246603459/
Apache Spark Best Practices (Johannesburg) - Thursday, January 25
https://www.meetup.com/ZA-Hadoop-User-Group/events/246830307/
Streaming Data Platforms with Apache Kafka (Melbourne) - Thursday, January 25
https://www.meetup.com/melbourne-distributed/events/245242068/