28 January 2018
ICYMI last week, Hadoop Weekly is now Data Eng Weekly. Same great content but a new name to reflect the coverage of more than just Hadoop. You can read more about the switch, as well as my reflections on five years of Hadoop Weekly, in the link below.
https://medium.com/@joecrobak/five-years-of-hadoop-weekly-7aa8994f140b
This week's issue is full of technical content covering streaming systems, distributed systems, Spark, Cassandra, Redshift, and more. In news, there are year-enders from Apache Beam and StreamSets, and the team at Dremio raised a series B. Finally, there's a chance to win a free trip to Big Data Tech Warsaw if you complete a survey.
If you're new to stream processing and still trying to understand the main concepts, then check out this post, which offers a few metaphors for thinking about streams.
https://medium.com/capital-one-developers/three-ways-to-think-about-streaming-6cc39b99a56e
Landoop offers Lenses SQL, an alternative to Confluent's KSQL. It's integrated with Kubernetes for scaling pipelines up and down, and it has a rich web UI for creating, monitoring, and modifying SQL jobs. This post and accompanying video provide a brief overview of the main features.
https://medium.com/landoop/lenses-sql-kafka-stream-processors-scale-at-kubernetes-df697137685f
Small files have caused performance issues in big data processing systems for as long as I've used them. Unfortunately, this is no different with modern file formats like ORC and Parquet. To help alleviate these issues, Hive has a concatenate command and Parquet ships with a merge tool. This post describes these strategies for combining files as well as the kinds of performance improvements you might see with IBM Big SQL after compacting.
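The Hive command in question is ALTER TABLE ... CONCATENATE. As a rough sketch of the same compaction idea in PySpark (the paths and target file count below are placeholders, not from the post):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

# Hypothetical input: a partition directory full of small Parquet files.
input_path = "s3a://my-bucket/events/date=2018-01-28"
output_path = "s3a://my-bucket/events_compacted/date=2018-01-28"

df = spark.read.parquet(input_path)

# coalesce() merges partitions without a full shuffle; pick a target
# file count based on total data size (e.g. ~128-256 MB per file).
df.coalesce(8).write.mode("overwrite").parquet(output_path)

spark.stop()
```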
The Principles of Programming Languages conference was a few weeks ago, and The Morning Paper covered several of its papers this week. Two are of interest if you like distributed systems. The first is on Disel, which helps design, implement, and verify the correctness of distributed systems. The second shows why the randomized testing of distributed systems performed by Jepsen (of the Call Me Maybe series) has proven so effective at finding bugs.
https://blog.acolyer.org/2018/01/22/programming-and-proving-with-distributed-protocols/
https://blog.acolyer.org/2018/01/23/why-is-random-testing-effective-for-partition-tolerance-bugs/
With the caveat that it is from a vendor with an alternative product, this article has some valid technical criticisms (around loading current state and consistent writes) of Apache Kafka for the event sourcing pattern.
https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
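To make the "loading current state" criticism concrete, here's a toy Python sketch (no Kafka involved; a plain list stands in for a topic partition). Rebuilding one aggregate's state means folding over every event, since the log isn't indexed by entity:

```python
# Toy illustration of event sourcing: an account's current state is the
# left fold of all of its events. With a log like Kafka as the event
# store, fetching one entity's state means replaying the partition.

events = [  # stand-in for a topic partition's contents
    {"account": "a1", "type": "opened",    "amount": 0},
    {"account": "a1", "type": "deposited", "amount": 100},
    {"account": "a2", "type": "opened",    "amount": 0},
    {"account": "a1", "type": "withdrew",  "amount": 30},
]

def apply_event(balance, event):
    """Apply a single event to the running balance."""
    if event["type"] == "deposited":
        return balance + event["amount"]
    if event["type"] == "withdrew":
        return balance - event["amount"]
    return balance

def current_state(account_id):
    # Note the full scan: every event is read to rebuild one account.
    balance = 0
    for event in events:
        if event["account"] == account_id:
            balance = apply_event(balance, event)
    return balance

print(current_state("a1"))  # 70
```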
Magicpin has published a post describing their Golang-based stream processing system. The system is built on Kafka, with Go services handling stream data ingestion and serving an API layer. For data storage and processing, they use Cassandra, Spark, and Elasticsearch.
Qubole has written about their Spark Tuning Tool, which analyzes Spark performance in the driver and executors. For the latter, they have a heuristic to estimate cluster utilization and wall clock time with various numbers of executors, and they can analyze individual stages to determine skew, parallelism, and more. While the tool isn't open source, the post has a lot of good insights into understanding Spark performance.
https://www.qubole.com/blog/introducing-quboles-spark-tuning-tool/
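Qubole's actual heuristic isn't published, but the core idea of estimating wall-clock time under different executor counts can be sketched with a simple greedy scheduler over per-task durations:

```python
import heapq

def estimate_wall_clock(task_times, num_executors):
    """Estimate stage wall-clock time by assigning tasks (longest first)
    to the executor that frees up earliest. An illustration of the idea,
    not Qubole's actual model."""
    executors = [0.0] * num_executors  # time at which each executor is free
    heapq.heapify(executors)
    for t in sorted(task_times, reverse=True):  # longest-processing-time first
        finish = heapq.heappop(executors) + t
        heapq.heappush(executors, finish)
    return max(executors)

# Example: one skewed task dominates no matter how many executors are
# added -- exactly the kind of insight a tuning tool surfaces.
tasks = [120.0] + [5.0] * 200
for n in (4, 16, 64):
    print(n, estimate_wall_clock(tasks, n))
```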
Netflix has written about the evolution of their real-time time series database that's used to store viewing histories. Built on Apache Cassandra, the first performance enhancement was a caching layer. After that, they implemented a storage tiering strategy in which older data is compressed and stored in a separate table. The post goes through the design and strategy, and it describes what kinds of improvements they've realized.
https://medium.com/netflix-techblog/scaling-time-series-data-storage-part-i-ec2b6d44ba39
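The post doesn't include code, but the tiering idea itself is simple to sketch: recent records stay as individual rows, while older history is compressed into a single blob per member and moved to a separate table. A toy Python version (in-memory dicts stand in for Cassandra tables):

```python
import json
import zlib

live_table = {}        # member_id -> list of recent viewing records
compressed_table = {}  # member_id -> zlib-compressed JSON blob

def record_view(member_id, record):
    live_table.setdefault(member_id, []).append(record)

def read_compressed(member_id):
    blob = compressed_table.get(member_id)
    return json.loads(zlib.decompress(blob)) if blob else []

def roll_up(member_id):
    """Move a member's accumulated records into the compressed tier."""
    old = live_table.pop(member_id, [])
    prior = read_compressed(member_id)
    compressed_table[member_id] = zlib.compress(
        json.dumps(prior + old).encode("utf-8"))

def full_history(member_id):
    return read_compressed(member_id) + live_table.get(member_id, [])

record_view("m1", {"title": "show-1", "ts": 1})
record_view("m1", {"title": "show-2", "ts": 2})
roll_up("m1")
record_view("m1", {"title": "show-3", "ts": 3})
print(full_history("m1"))
```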
AWS has published a post describing eight best practices for Amazon Redshift. Most of the tips are straightforward with a couple of exceptions. For VACUUM, it has a good overview of how to get the most improvement out of this maintenance task. And on the topic of monitoring performance, it links to a collection of SQL queries for analyzing built-in metrics.
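As a small illustration of the VACUUM tip, here's a hedged Python sketch using psycopg2 (the cluster endpoint, credentials, and table name are placeholders). Note that VACUUM can't run inside a transaction, hence autocommit:

```python
import psycopg2

# Placeholder connection parameters -- substitute your cluster endpoint.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...")

# VACUUM cannot run inside a transaction block, so enable autocommit.
conn.autocommit = True

with conn.cursor() as cur:
    # Reclaim space and re-sort rows after heavy deletes/updates,
    # then refresh the planner's statistics.
    cur.execute("VACUUM FULL sales;")
    cur.execute("ANALYZE sales;")

conn.close()
```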
This post shows how to run the Spark History Server to explore event data stored in Amazon S3 for jobs run on a transient cluster.
https://banzaicloud.com/blog/spark-history-server/
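The key settings are writing event logs to S3 from the transient cluster and pointing the history server at the same path. A minimal job-side sketch (the bucket path is a placeholder):

```python
from pyspark.sql import SparkSession

# Job-side settings: write Spark event logs to S3 so they outlive the
# transient cluster.
spark = (SparkSession.builder
         .appName("event-log-demo")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "s3a://my-bucket/spark-event-logs/")
         .getOrCreate())

spark.range(1000).count()  # any work; its events land in the log dir
spark.stop()

# The history server is then started separately with
#   spark.history.fs.logDirectory=s3a://my-bucket/spark-event-logs/
# so it can render completed jobs after the cluster is gone.
```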
This site contains supplemental documentation for the Apache Airflow workflow system. It has tutorials, gotchas, tips, examples, and more.
https://gtoonstra.github.io/etl-with-airflow/
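For anyone who hasn't seen Airflow code, here's a minimal DAG of the kind those tutorials walk through (task names and schedule are arbitrary):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Minimal two-task DAG: a daily "extract" step followed by a "load" step.
default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="example_etl",
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
load = BashOperator(task_id="load", bash_command="echo loading", dag=dag)

extract >> load  # load runs only after extract succeeds
```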
This post gives a good intro to B-Tree indices and how they're used in Postgres to improve concurrency.
https://rcoh.me/posts/postgres-indexes-under-the-hood/
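As a toy illustration of why an index turns a full-table scan into a logarithmic lookup, here's a Python sketch using a sorted key array and binary search; a real B-tree spreads those keys across wide pages to minimize disk reads:

```python
import bisect

# The "table": every third key, paired with its row data.
rows = [(i, "row-%d" % i) for i in range(0, 1_000_000, 3)]
# The "index": sorted keys, positionally aligned with the rows.
keys = [k for k, _ in rows]

def lookup(key):
    """Binary-search the sorted keys instead of scanning every row."""
    pos = bisect.bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        return rows[pos][1]
    return None  # key not present

print(lookup(999999))  # found in ~20 comparisons, not ~333k row scans
```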
This article takes a look at TimescaleDB, which is a time series database built on Postgres. There's a brief overview of how TimescaleDB scales to large data volumes (by taking advantage of immutability) while continuing to offer SQL support.
https://www.nextplatform.com/2018/01/25/time-time-series-databases/
The Big Data Tech Warsaw conference is in just under a month. Participate in their survey for a chance to win a trip to the conference.
http://getindata.com/win-trip-warsaw-ticket-big-data-tech-2018
Apache Beam has published a year-end look back at 2017. It covers community growth and innovation in capabilities like cross-language portability and machine learning support. It also looks ahead to areas for improvement and reflects on the overall project culture.
https://beam.apache.org/blog/2018/01/09/beam-a-look-back.html
Attunity and Hortonworks have partnered on a new book, Apache NiFi for Dummies, which is available for download behind an email/phone-wall.
https://hortonworks.com/blog/introduction-apache-nifi/
In another look back at 2017, StreamSets has put out a press release about their company growth (5x increase in revenue) and product releases (Edge and Control Hub).
Dremio, makers of data analytics software that aims to eliminate ETL, have announced a $25M series B. They're looking to grow their staff and expand to new markets.
https://www.datanami.com/2018/01/23/dremio-accelerates-growth-plans-following-25m-series-b/
Qubole has announced a new Dashboard feature. It supports periodic refresh and various presentation themes.
http://www.qubole.com/blog/announcing-dashboards-qubole-data-service/
Version 0.4.0 of Apache NiFi MiNiFi was released. It has some improvements and adds support for Apache NiFi 1.5.0.
Apache Phoenix 4.13.2 was released. It has some fixes, but the main new feature is compatibility with CDH (including parcels for Cloudera Manager).
Curated by Datadog ( http://www.datadog.com )
Self-Service Data Lakes, Apache Spark & Ignite, and More (Palo Alto) - Wednesday, January 31
https://www.meetup.com/BigDataApps/events/244915185/
Data Pipelining and Refining at Scale (Vancouver) - Monday, January 29
https://www.meetup.com/vancouver-kafka/events/245503004/
Apache Kafka Meetup @ Rovio (Espoo) - Tuesday, January 30
https://www.meetup.com/Helsinki-Apache-Kafka-Meetup/events/246641528/
Apache NiFi: Best Practices and Experiences (Paris) - Wednesday, January 31
https://www.meetup.com/futureofdata-paris/events/245994331/
Apache Kafka at Our First Meetup (Rome) - Thursday, February 1
https://www.meetup.com/Roma-Kafka-meetup-group/events/246631436/
Introduction to Distributed Processing with Spark Core API (Wroclaw) - Wednesday, January 31
https://www.meetup.com/WroclawJUG/events/246278965/
Kafka Streams vs Spark Structured Streaming (Warsaw) - Thursday, February 1
https://www.meetup.com/ITAkademiaj-labsWarszawa/events/246828306/
Big Data & Analytics Meetup (Tokyo) - Wednesday, January 31
https://www.meetup.com/Think-Big-Analytics-Japan/events/246261316/