20 May 2018
Several posts this week by companies building interesting things with data systems—Disney's streaming analytics pipeline, Hulu's Hadoop data center migration, and Eventbrite's new data platform. There are also excellent technical posts on Kafka, YARN, Pulsar, and more. In news and releases, there's a new proposed Graph Query Language, a look at the rise of multi-model databases, the 1.0 release of a CEP tool for Kafka Streams, and more.
At Datadog, we’re on a mission to build the best monitoring platform in the world. We're looking for Data Engineers to use modern tech to build and extend our core pipelines that process trillions of data points per day. More here:
The Disney ABC Television Group has built a real-time analytics pipeline to track video streaming using Amazon Kinesis, AWS Lambda, MemSQL, and Looker. This article summarizes a recent presentation on the architecture, including some of the design decisions.
https://www.datanami.com/2018/05/14/how-disney-built-a-pipeline-for-streaming-analytics/
Python 3.7's Data Classes share some similarities with Scala's Case Classes. This post has a good overview of how to use them.
https://realpython.com/python-data-classes/
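For a quick taste of the feature, here is a minimal sketch that assumes Python 3.7 or later. The `Reading` class and its fields are hypothetical and are not taken from the linked post. It shows the generated `__init__`, `__repr__`, and `__eq__`, plus `replace`, which plays roughly the role of `copy` on a Scala case class.

```python
from dataclasses import FrozenInstanceError, asdict, dataclass, replace


@dataclass(frozen=True)
class Reading:
    """Hypothetical record type; the fields are illustrative only."""

    sensor: str
    value: float
    unit: str = "ms"  # default value, as with a case class parameter


first = Reading("api-latency", 12.5)

# replace() returns a new instance with selected fields changed.
second = replace(first, value=20.0)

print(first)                                   # uses the generated __repr__
print(first == Reading("api-latency", 12.5))   # field-by-field equality
print(asdict(second))                          # plain dict of the fields

# frozen=True makes instances immutable after construction.
try:
    first.value = 0.0
except FrozenInstanceError:
    print("instances are frozen")
```

Setting `frozen=True` is optional; without it the class stays mutable but still gets the generated methods.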
Hulu runs multiple Hadoop clusters supporting many different technologies and over 30,000 applications. For their recent data center move, they implemented several new components to minimize HDFS downtime during the migration. Among these were a data center-aware name node, a data center-aware balancer, and DCTunnel, an application that replicates blocks across data centers. The post is full of solutions to interesting challenges, such as prioritizing data for replication with a PID controller and sequencing the movement of data and the shipping of servers to a new location.
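For readers unfamiliar with PID controllers, the following is a generic textbook sketch, not Hulu's implementation. The gains, the setpoint, and the toy process in `simulate` (a level that simply accumulates the controller output) are all assumptions chosen for illustration.

```python
class PIDController:
    """Generic discrete PID controller (illustrative, not from the post)."""

    def __init__(self, kp: float, ki: float, kd: float, setpoint: float) -> None:
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self._integral = 0.0
        self._prev_error = None  # no derivative term on the first update

    def update(self, measurement: float, dt: float) -> float:
        if dt <= 0:
            raise ValueError("dt must be positive")
        error = self.setpoint - measurement
        self._integral += error * dt
        if self._prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self._prev_error) / dt
        self._prev_error = error
        # Output is the weighted sum of the three terms.
        return self.kp * error + self.ki * self._integral + self.kd * derivative


def simulate(steps: int = 100, dt: float = 1.0) -> float:
    """Drive a toy process toward the setpoint and return the final level."""
    controller = PIDController(kp=0.5, ki=0.1, kd=0.05, setpoint=100.0)
    level = 0.0
    for _ in range(steps):
        # Hypothetical process: the level changes by the controller output.
        level += controller.update(level, dt) * dt
    return level


print(round(simulate(), 3))
```

In a replication setting the measurement could be something like observed throughput or backlog, with the output used to adjust how aggressively data is copied, but that mapping is a guess rather than something the summary above states.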
The Hortonworks blog has a brief tutorial that provides the steps necessary to run a dockerized application on Apache Hadoop YARN.
https://hortonworks.com/blog/trying-containerized-applications-apache-hadoop-yarn-3-1/
The Streamlio blog covers using Apache Pulsar for message queuing. It describes the key differences between pub-sub messaging and queued topics and provides some example Pulsar code for the pub-sub use-case.
https://streaml.io/blog/pulsar-message-queue/
This presentation expands on the typical Kafka streaming application by layering in a REST API built with Spring Boot 2.0 and server-sent events to update a client-side dashboard in real time.
https://speakerdeck.com/hpgrahsl/stateful-and-reactive-streaming-applications-without-a-database
LinkedIn's open-source tool for monitoring Apache Kafka consumers, called Burrow, has just released version 1.1. In addition to major code quality improvements and bug fixes, it now has a more modular design and improved support for topic deletion.
https://engineering.linkedin.com/blog/2018/05/revisiting-burrow--burrow-1-1-
A great refresher on B-Trees, which are popular in many relational database storage systems, and LSM-trees, which are used by Apache Cassandra and RocksDB.
https://queue.acm.org/detail.cfm?id=3220266
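To make the LSM-tree idea concrete, here is a toy in-memory sketch. It is not how Cassandra or RocksDB are implemented, and the `TinyLSM` name and `memtable_limit` parameter are hypothetical. Writes go to a small mutable memtable, full memtables are flushed as immutable sorted runs, reads check the newest data first, and compaction merges the runs.

```python
import bisect
from typing import Dict, List, Optional, Tuple


class TinyLSM:
    """Toy LSM-style store: a memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable_limit = memtable_limit
        self.memtable: Dict[str, str] = {}
        self.runs: List[List[Tuple[str, str]]] = []  # oldest run first

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        """Write the memtable out as a sorted, immutable run."""
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key: str) -> Optional[str]:
        if key in self.memtable:
            return self.memtable[key]
        # Search runs from newest to oldest so recent writes win.
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self) -> None:
        """Merge all runs into one, keeping the newest value per key."""
        merged: Dict[str, str] = {}
        for run in self.runs:  # oldest first, so newer values overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
for k, v in [("a", "1"), ("b", "2"), ("a", "3"), ("c", "4")]:
    db.put(k, v)
print(db.get("a"), len(db.runs))
db.compact()
print(db.runs)
```

A B-tree, by contrast, updates pages in place, which is why the two structures trade off write amplification and read cost differently.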
This article covers a number of data encodings, looking at them through the lens of messaging and services. Those discussed include JSON, CSV, MessagePack, Protobuf, and Apache Thrift. The author discusses the trade-offs of each type of format, and why he thinks Apache Avro is the best bet for systems dealing with sequences of data.
http://vasters.com/blog/data-encodings-and-layout/
Eventbrite writes about their migration from a Hadoop cluster running on reserved instances with Apache Oozie for workflow management to Amazon EMR and several other tools (including Luigi and Presto). They pull in data from MySQL using Sqoop and write log data to S3 using Secor.
https://www.eventbrite.com/engineering/looking-under-the-hood-of-the-eventbrite-data-pipeline/
Here's a two-part tutorial for running Apache Airflow on AWS ECS: part one uses the EC2 backend and part two uses the Fargate backend. The setup also includes AWS ElastiCache for Redis and RDS for PostgreSQL.
https://medium.com/@fartashh/scalable-data-engineering-platform-on-cloud-a557026aa02e
https://medium.com/@fartashh/serverless-data-engineering-platform-on-cloud-59741b5627a5
There are four listings for data engineering jobs in New York, San Francisco, Mountain View, Paris, and remote. Check them out or add your own!
https://jobs.dataengweekly.com
The Neo4j team has started an effort to standardize the query language for graph databases as the new Graph Query Language (GQL). As the post describes, there are currently three closely related proprietary languages.
This article notes the rise of multi-model databases, i.e. those that can act as a SQL database, a key-value store, a graph database, and more. It names Microsoft's Azure Cosmos DB as the prime example.
https://www.zdnet.com/article/the-new-era-of-the-multi-model-database/
Cask is joining the Google Cloud team. There are a few FAQs covered in their announcement, including that they plan to continue the open source CDAP project.
http://blog.cask.co/2018/05/cask-is-joining-google-cloud/
Distributed Systems Observability is a free eBook (behind an email-wall). It's 30 pages, with chapters on monitoring and observability, coding and testing for observability, the three pillars of observability, and more.
http://distributed-systems-observability-ebook.humio.com/
The second edition of "Seven Databases in Seven Weeks" is out. It covers a number of distributed systems, including HBase and DynamoDB.
https://pragprog.com/book/pwrdata/seven-databases-in-seven-weeks-second-edition
DataStax Studio 6, which is a notebook interface for DataStax, has been released. The release adds support for Spark SQL among other updates.
https://www.datastax.com/2018/05/announcing-datastax-studio-6
Pentaho 8.1 was released with integration for Google Cloud (including BigQuery), Google Drive, and Spark on AWS EMR.
Version 1.5.0 of Apache ORC has been released. It adds a C++ Writer, a tool to convert CSV to ORC, updates to statistics, and more.
https://orc.apache.org/news/2018/05/14/ORC-1.5.0/
kafkastreams-cep is a Complex Event Processing library built on top of the Kafka Streams Processor API. It has just hit version 1.0.0 and has a straightforward API for defining CEP queries, which is outlined in the project README.
https://github.com/fhussonnois/kafkastreams-cep/releases/tag/1.0.0
Spring Cloud Data Flow 1.5 is out. It supports running locally as well as distributed on Cloud Foundry and Kubernetes. There are updates in the release to the metrics system, including support for Prometheus and InfluxDB.
Apache Accumulo has announced the 1.9.1 release, which includes a critical fix for bugs that could cause data loss during recovery. The bugs affect Accumulo 1.8.0 through 1.9.0.
https://accumulo.apache.org/release/accumulo-1.9.1/
Thanos is a new tool that provides a global view of data across Prometheus clusters, implements cold storage using an object store (like Amazon S3), and more.
https://improbable.io/games/blog/thanos-prometheus-at-scale
Curated by Datadog ( http://www.datadog.com )
Streaming Data Platform: Apache Kafka for Java Developers + the Java Puzzlers (San Francisco) - Monday, May 21
https://www.meetup.com/sfjava/events/250313995/
Introduction to Spark Streaming & Cryptocurrency Transactions (San Diego) - Thursday, May 24
https://www.meetup.com/San-Diego-Spark-and-Big-Data-Meetup/events/250593614/
Processing Streaming Data with KSQL (Downers Grove) - Wednesday, May 23
https://www.meetup.com/Chicago-Area-Kafka-Enthusiasts/events/249598104/
Apache Kafka: Empowering the Move to Data-Driven Architecture (Herndon) - Thursday, May 24
https://www.meetup.com/DC-Metro-DevOps-Professionals/events/248905466/
Building a Data Lake Solution (Burlington) - Wednesday, May 23
https://www.meetup.com/Boston_BI/events/250335997/
IRELAND
Lambda Architectures + More (Dublin) - Thursday, May 24
https://www.meetup.com/Dublin-Microservices-User-Group/events/250417676/
Hitting the Rooftop of Modern Data Architecture (London) - Tuesday, May 22
https://www.meetup.com/Data-Science-Festival-London/events/250644831/
Introduction to Machine Learning with Spark 2.x and Scala (Lyon) - Wednesday, May 23
https://www.meetup.com/Lyon-Data-Science/events/250769367/
Resilient Distributed Datasets (Amsterdam) - Thursday, May 24
https://www.meetup.com/papers-we-love-amsterdam/events/250332594/
Spark on AWS: Best Practices & Lessons Learned (Munchen) - Wednesday, May 23
https://www.meetup.com/Hadoop-User-Group-Munich/events/250771823/
Vespa: The Open Source Big Data Serving Engine (Berlin) - Thursday, May 24
https://www.meetup.com/eBay-Europe-Technology/events/250408294/
Integrating Databases with Kafka + Swisscom Firehose, aka Kafka aaS (Zurich) - Thursday, May 24
https://www.meetup.com/Zurich-Apache-Kafka-Meetup-by-Confluent/events/250372637/
Event-Driven Architecture with Kafka Streams (Bielsko-Biala) - Tuesday, May 22
https://www.meetup.com/Bielsko-Biala-JUG/events/249968647/
Top 10 Data Engineering Mistakes (Vilnius) - Tuesday, May 22
https://www.meetup.com/SEB-Technology-Talks-Vilnius/events/250571933/
Introduction to Kafka and Druid (Bangalore) - Saturday, May 26
https://www.meetup.com/opensourceblr/events/250570925/
Kafka in Kubernetes & Streaming ETL with Apache Kafka and KSQL (Singapore) - Tuesday, May 22
https://www.meetup.com/Singapore-Kafka-Meetup/events/249973031/