Data Eng Weekly


Data Eng Weekly Issue #266

27 May 2018

This week's issue has several great walkthroughs and code examples, including CQRS with Kafka Streams, Spark on YARN with Docker, an interesting StreamSets Data Collector example, and a data pipeline with Google Cloud Dataflow. There are also interesting looks at Pinterest's recommendation system and the adoption of Scala at Criteo. Several other great technical posts, two Kafka-as-a-Service announcements, and several releases round things out.

Sponsor

Ember AI http://www.ember.ai/careers is a young AI startup with a proprietary classification algorithm, and we have just won our first contract with a Fortune 500 company. On the back of this, we are expanding to offer scalable solutions for real-time Natural Language Processing (NLP). This will include a suite of products targeting multiple sectors, starting with real-time security and compliance solutions. Early employees will be able to help shape the direction of the company and the architecture of the technology. We are an entirely remote company.

We are looking for a Big Data Engineer who will work on collecting, storing, processing, and analyzing huge data sets. The primary focus will be on choosing optimal solutions for these purposes, then implementing, maintaining, and monitoring them. You will also be responsible for integrating them with the architecture used across the company.

http://bit.ly/emberai-data-engineer

Technical

This post has a great walkthrough of building a streaming and serving platform that uses CQRS to solve the scalability challenges common to a social network (fan-out, large aggregations, etc.). Lots of great Scala code examples and several recommended libraries.

https://medium.com/@joang/a-cqrs-approach-with-kafka-streams-and-scala-49bfa78e4295
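
To make the pattern concrete, here's a minimal sketch (not the post's code) of the read side in Scala with the Kafka Streams DSL: follow events keyed by the followed user are aggregated into per-user follower counts that a query service can serve. The topic names and event shape are assumptions for illustration.

```scala
import java.util.Properties

import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.kstream.Produced
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

// Read-model sketch: continuously count followers per user from the event log.
object FollowerCounts extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "follower-counts")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

  val builder = new StreamsBuilder()
  builder
    .stream[String, String]("follow-events") // key: followed user id, value: follower id
    .groupByKey()
    .count()     // KTable of follower counts, backed by a local state store
    .toStream()
    .to("follower-counts", Produced.`with`(Serdes.String(), Serdes.Long()))

  new KafkaStreams(builder.build(), props).start()
}
```

The write side only ever appends events; the count table is a derived, rebuildable view, which is the core of the CQRS split.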

Pixie is the system that powers recommendations at Pinterest. The paper discusses both the algorithms and the implementation details, including how Pinterest generates the graph data in Hadoop and serves recommendations from high-memory instances on AWS.

https://blog.acolyer.org/2018/05/23/pixie-a-system-for-recommending-3-billion-items-to-200-million-users-in-real-time/
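
At its core, Pixie biases random walks over the pin-board graph and recommends the most-visited nodes. Here's a toy, in-memory version of that primitive (the real system walks a graph with billions of edges held in RAM, with several refinements described in the paper):

```scala
import scala.collection.mutable
import scala.util.Random

// Random walk with restart over an adjacency map; the most-visited
// nodes become recommendation candidates for the start node.
def walkCounts(
    graph: Map[String, Vector[String]], // node id -> neighbor ids
    start: String,
    steps: Int,
    restartProb: Double = 0.15,
    rng: Random = new Random()
): Seq[(String, Int)] = {
  val visits = mutable.Map.empty[String, Int].withDefaultValue(0)
  var current = start
  for (_ <- 1 to steps) {
    val neighbors = graph.getOrElse(current, Vector.empty)
    current =
      if (neighbors.isEmpty || rng.nextDouble() < restartProb) start
      else neighbors(rng.nextInt(neighbors.size))
    visits(current) += 1
  }
  visits.toSeq.sortBy(-_._2) // most-visited first
}
```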

This post gives a good overview of the user-facing features of Dremio, the data service that makes it easy to analyze data across multiple sources. The Dremio web UI has lots of sharing and querying facilities, drawing a comparison to Google Docs for data. Dremio is open source, and there's an enterprise version with additional features.

https://medium.com/@tshiran_81544/google-docs-for-all-data-sources-an-approach-using-open-source-tools-26278a2fb86c

Scala seems to make inroads at many companies by way of the big data stack. At Criteo, Scala has grown to be used in other areas, too. This post talks about some of those growing pains (and successes!): the problems common to moving a team from imperative to functional programming, and how Criteo solved them.

https://medium.com/@CriteoEng/making-criteo-functional-5838d9ee9a2a
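
For readers who haven't made that shift, the flavor of it (illustrative only, not from the post) is roughly trading mutable state and explicit loops for pipelines of transformations:

```scala
// Imperative: mutable state, explicit iteration.
def totalImperative(prices: Seq[Double]): Double = {
  var total = 0.0
  for (p <- prices) if (p > 0) total += p
  total
}

// Functional: describe what to compute, not how to loop over it.
def totalFunctional(prices: Seq[Double]): Double =
  prices.filter(_ > 0).sum
```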

The Hortonworks blog has a post on running Spark on YARN in Apache Hadoop 3.1. They show how to use Docker to build images for SparkR and PySpark and how to access them through Apache Zeppelin.

https://hortonworks.com/blog/containerized-apache-spark-yarn-apache-hadoop-3-1/
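
Under Hadoop 3.1's Docker container runtime, the image and mounts are requested through YARN_CONTAINER_RUNTIME_* environment variables on the containers. A minimal sketch of what that looks like from Spark (the image name and mount path are placeholders, and the cluster must have the YARN Docker runtime enabled):

```scala
import org.apache.spark.sql.SparkSession

// Ask YARN to launch this job's executors inside a Docker image.
val spark = SparkSession.builder()
  .appName("spark-on-docker")
  .config("spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE", "docker")
  .config("spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", "example/spark:latest")
  // Make the host's Hadoop configuration visible inside the container.
  .config("spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS",
    "/etc/hadoop/conf:/etc/hadoop/conf:ro")
  .getOrCreate()

println(spark.range(1000000).count()) // runs on the Docker-ized executors
spark.stop()
```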

In many systems, SOAP and WSDL are still pretty common. This post shows how to build a functional, reactive program that interfaces with an XML web service using JAX-WS.

https://blog.godatadriven.com/jaxws-reactive-client
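
The basic move, sketched below, is to isolate the blocking generated SOAP client behind a Future on a dedicated thread pool so the rest of the program stays non-blocking. The WeatherService trait is a hypothetical stand-in for a JAX-WS-generated port type:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object SoapClient {
  // Hypothetical stand-in for a JAX-WS-generated service interface.
  trait WeatherService { def getForecast(city: String): String }

  // Dedicated pool so blocking SOAP calls can't starve the main execution context.
  val soapPool: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

  // Wrap the blocking call; callers compose the Future without ever blocking.
  def forecast(service: WeatherService, city: String): Future[String] =
    Future(service.getForecast(city))(soapPool)
}
```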

This post has an example of using the StreamSets Data Collector to ingest data from the Twitch streaming platform. While it doesn't require large-scale processing, it does cover a non-trivial use case that interacts with multiple API endpoints across two services.

https://streamsets.com/blog/ingest-game-streaming-data-twitch-api/

AWS Step Functions is a service for visually building data workflows. This post looks at using it with Apache Spark on Amazon EMR to run (and visualize) a multi-step workflow.

https://aws.amazon.com/blogs/big-data/orchestrate-apache-spark-applications-using-aws-step-functions-and-apache-livy/
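
The post drives Spark through Apache Livy's REST API. For a feel of that half, here's a bare-bones batch submission against Livy's POST /batches endpoint (the host, jar path, and class name are placeholders):

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Submit a Spark batch to an EMR cluster through the Livy REST API.
object SubmitBatch extends App {
  val payload =
    """{"file": "s3://example-bucket/jars/job.jar", "className": "com.example.Job"}"""

  val conn = new URL("http://emr-master:8998/batches")
    .openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setRequestProperty("Content-Type", "application/json")
  conn.setDoOutput(true)
  conn.getOutputStream.write(payload.getBytes(StandardCharsets.UTF_8))

  println(s"Livy responded with HTTP ${conn.getResponseCode}")
  conn.disconnect()
}
```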

This tutorial covers setting up a data platform that uses Debezium and Kafka to capture writes made in MySQL and propagate them to MongoDB. It uses a simple Kafka Streams application to filter and transform the change-data-capture events from Debezium into documents in Mongo.

http://iamninad.com/how-debezium-kafka-stream-can-help-you-write-cdc/
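
A minimal sketch of that filter-and-transform step (not the tutorial's code): keep only Debezium insert events and forward just the "after" image of each row, assuming Debezium's default JSON envelope and illustrative topic names:

```scala
import java.util.Properties

import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

object CdcTransform extends App {
  val mapper = new ObjectMapper()

  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "cdc-transform")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

  val builder = new StreamsBuilder()
  builder
    .stream[String, String]("dbserver1.inventory.customers") // Debezium topic
    .filter((_: String, v: String) =>
      mapper.readTree(v).path("payload").path("op").asText() == "c") // inserts only
    .mapValues((v: String) =>
      mapper.readTree(v).path("payload").path("after").toString) // new row image
    .to("mongo.customers") // consumed by a MongoDB sink

  new KafkaStreams(builder.build(), props).start()
}
```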

This post describes several different data pipeline architectures and walks through building a system based on Google Cloud Dataflow and Apache Beam. It includes sample code (full code on GitHub) and screenshots to make it easy to follow along.

https://towardsdatascience.com/data-science-for-startups-data-pipelines-786f6746a59a
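
As a taste of the Beam model, here's a tiny pipeline written in Scala against Beam's Java SDK (bucket paths are placeholders); with the Dataflow runner dependency on the classpath, passing --runner=DataflowRunner executes it on Google Cloud Dataflow:

```scala
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.TextIO
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.{MapElements, SimpleFunction}

object NormalizeLines extends App {
  val options = PipelineOptionsFactory.fromArgs(args: _*).create()
  val pipeline = Pipeline.create(options)

  pipeline
    .apply("ReadLines", TextIO.read().from("gs://example-bucket/input/*.txt"))
    .apply("Normalize", MapElements.via(new SimpleFunction[String, String]() {
      override def apply(line: String): String = line.trim.toLowerCase
    }))
    .apply("WriteLines", TextIO.write().to("gs://example-bucket/output/normalized"))

  pipeline.run().waitUntilFinish()
}
```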

Segment has written about their solution to a distributed systems problem with interesting delivery and scalability challenges: delivering messages to downstream systems with high failure rates. They have some good illustrations that describe the problem and why traditional solutions (like Kafka) aren't a great fit, and they describe the major components that make up Centrifuge, their system for solving the problem.

https://segment.com/blog/introducing-centrifuge/

Postgres 10 introduced declarative partitioned tables, which split data across tables or disks (and, combined with foreign data wrappers, across servers). The upcoming Postgres 11 release brings a lot of usability and performance improvements to partitioned tables. This post has a summary of those new features.

https://pgdash.io/blog/partition-postgres-11.html

Data Eng Jobs

There are four listings for data engineering jobs spanning New York, Philadelphia, Mountain View, Paris, and remote. Check them out or add your own!

https://jobs.dataengweekly.com

News

Confluent has announced the public availability of Confluent Cloud Professional, a self-service Apache Kafka service. Alongside that announcement, they're also adding support for Google Cloud Platform (in addition to AWS) for both Professional and their existing Confluent Cloud Enterprise edition.

https://www.confluent.io/blog/confluent-cloud-pro-gcp/

Instaclustr also announced their Kafka-as-a-Service product, which adds to an offering that already includes Apache Cassandra, Apache Spark, and Elasticsearch. The service is SOC 2-certified, and it supports Amazon Web Services, Google Cloud Platform, and Microsoft Azure.

https://siliconangle.com/blog/2018/05/21/instaclustr-delivers-apache-kafka-managed-service/

Spark + AI Summit is just over a week away. This post highlights several talks from the program.

https://databricks.com/blog/2018/05/23/a-guide-to-developer-apache-spark-use-cases-and-deep-dives-talks-at-spark-ai-summit.html

Okera is a new company with a data product that provides governance through a high-performance data access service that brokers across many different data sources (e.g., S3, HDFS, Ceph, and relational databases).

https://www.datanami.com/2018/05/23/okera-emerges-from-stealth-with-big-data-fabric/

The Apache ZooKeeper team disclosed a critical CVE related to quorum authentication. Versions prior to 3.4.10 and some alpha and beta releases are affected.

https://lists.apache.org/thread.html/63be245bc6140d0b0c7e8aadea1680786c285ce9e5e8511226635e46@%3Cannounce.apache.org%3E

Four CVEs (one severe, two moderate, one low) for Apache NiFi were also disclosed. Update to version 1.6.0 for fixes.

https://lists.apache.org/thread.html/1d1ac15fb1e7626420dbf05d42fd12af1691eb5073a6458afbd36771@%3Cannounce.apache.org%3E

"Versioning in an Event Sourced System" is a new book. Proceeds go to a good cause—Code Club.

https://leanpub.com/esversioning

Sponsor

Ember AI http://www.ember.ai/careers is a young AI startup with a proprietary classification algorithm, and we have just won our first contract with a Fortune 500 company. On the back of this, we are expanding to offer scalable solutions for real-time Natural Language Processing (NLP). This will include a suite of products targeting multiple sectors, starting with real-time security and compliance solutions. Early employees will be able to help shape the direction of the company and the architecture of the technology. We are an entirely remote company.

We are looking for a Big Data Engineer who will work on collecting, storing, processing, and analyzing huge data sets. The primary focus will be on choosing optimal solutions for these purposes, then implementing, maintaining, and monitoring them. You will also be responsible for integrating them with the architecture used across the company.

http://bit.ly/emberai-data-engineer

Releases

Databricks Runtime 4.1 is out with several performance improvements, including to S3 access and to generating the data statistics used by the query optimizer.

https://databricks.com/blog/2018/05/23/announcing-databricks-runtime-4-1.html

Hortonworks HDP 2.6.5 is GA. It includes Apache Kafka 1.0.0 and Apache Spark 2.3.

https://hortonworks.com/blog/announcing-general-availability-hortonworks-data-platform-hdp-2-6-5-apache-ambari-2-6-2-smartsense-1-4-5/

Apache Flink 1.5.0 was released. It includes a number of major new features: a new deployment and processing model that works better with Kubernetes, YARN, and Mesos; broadcast state in streaming applications; major performance improvements in the network stack; and a new CLI client for streaming SQL.

http://flink.apache.org/news/2018/05/25/release-1.5.0.html

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Enterprise Jupyter Deployment + Airflow in Production (San Francisco) - Wednesday, May 30
https://www.meetup.com/SF-Big-Analytics/events/249209851/

Uber Data Engineering Night (San Francisco) - Thursday, May 31
https://www.meetup.com/UberEvents/events/251029862/

New York

Faster Is Better: The Future of Analytics Is 1 Trillion Rows Per Second (New York) - Tuesday, May 29
https://www.meetup.com/mysqlnyc/events/250376806/

CANADA

Intro to Apache Spark Streaming with Apache NiFi and Kafka (Kanata) - Thursday, May 31
https://www.meetup.com/Big-Data-Developers-in-Ottawa/events/250642290/

UNITED KINGDOM

Kafka at the BBC + Kafka Ops at Sky B&G (Leeds) - Tuesday, May 29
https://www.meetup.com/Leeds-Kafka/events/250652097/

SWEDEN

IoT Stream Processing with Storm + Deep Learning for Human-Robot Collaboration (Stockholm) - Thursday, May 31
https://www.meetup.com/Knock-Data-Stockholm/events/250449202/

SPAIN

Zopa and Centrica Connected Homes Talk Kafka (Barcelona) - Thursday, May 31
https://www.meetup.com/Barcelona-Kafka-Meetup/events/250772611/

NETHERLANDS

Data Engineering and DevOps: Hadoop & Data Pipeline in a Cloud World (Utrecht) - Tuesday, May 29
https://www.meetup.com/ITNEXT/events/246830671/

ITALY

Credit Machine Learning + Hands-on HBase (Bologna) - Tuesday, May 29
https://www.meetup.com/Bologna-Big-Data-Meetup/events/250146859/

AUSTRIA

Hadoop Meets the Cloud @ Microsoft (Vienna) - Wednesday, May 30
https://www.meetup.com/Hadoop-User-Group-Vienna/events/250218135/

POLAND

Apache Spark vs the Rest of the World + Elephants in the Cloud (Warsaw) - Monday, May 28
https://www.meetup.com/warsaw-hug/events/250908036/

LITHUANIA

Big Data Vilnius Meetup #12 (Vilnius) - Tuesday, May 29
https://www.meetup.com/Vilnius-Hadoop-Meetup/events/250774467/

AUSTRALIA

Data Engineering Meetup 2 (Sydney) - Thursday, May 31
https://www.meetup.com/Big-Data-Sydney/events/249917592/