Data Eng Weekly


Data Eng Weekly Issue #267

03 June 2018

Lots of great technical content this week, including LinkedIn's real-time notification system, tuning Apache Spark jobs, bulk loading data into Apache Phoenix, and implementing z-indexing. There are a few quick articles with good tips, too, like using the WITH keyword in Postgres and Hadoop stack deploy guidelines for the cloud. In releases, Apache Pulsar (incubating) had a major release and Amazon Neptune (their graph DB) is now GA.

Sponsor

Ember AI http://www.ember.ai/careers is a young AI startup with a proprietary classification algorithm, and we have just won our first contract with a Fortune 500 company. On the back of this, we are expanding to offer scalable solutions for real-time Natural Language Processing (NLP). This will include a suite of products targeting multiple sectors, starting with real-time security and compliance solutions. Early employees will be able to help shape the direction of the company and the architecture of the technology. We are an entirely remote company.

We are looking for a Big Data Engineer to work on collecting, storing, processing, and analyzing huge data sets. The primary focus will be on choosing optimal solutions for these purposes, then implementing, maintaining, and monitoring them. You will also be responsible for integrating them with the architecture used across the company.

http://bit.ly/emberai-data-engineer

Technical

The WITH keyword can be really useful for some quick analysis. It also supports recursive data lookups (e.g. if you have some tree data), which enable some powerful use cases. Here’s a brief overview of how to use recursive common table expressions that power WITH queries.

https://www.citusdata.com/blog/2018/05/15/fun-with-sql-recursive-ctes/
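To see the recursive CTE idea in miniature, here's a sketch using Python's built-in sqlite3 module (SQLite's WITH RECURSIVE syntax matches Postgres for this example). The `employees` table and its data are invented for illustration.

```python
import sqlite3

# Hypothetical tree data: each employee row points at its manager.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
INSERT INTO employees VALUES
  (1, 'alice', NULL),
  (2, 'bob', 1),
  (3, 'carol', 1),
  (4, 'dave', 2);
""")

# The base case selects the root of the tree; the recursive case
# repeatedly joins children onto the rows found so far.
rows = conn.execute("""
WITH RECURSIVE reports(id, name, depth) AS (
  SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
  UNION ALL
  SELECT e.id, e.name, r.depth + 1
  FROM employees e JOIN reports r ON e.manager_id = r.id
)
SELECT name, depth FROM reports ORDER BY depth, name
""").fetchall()

print(rows)  # [('alice', 0), ('bob', 1), ('carol', 1), ('dave', 2)]
```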

LinkedIn has a great post on the design behind Concourse, their system for sending personalized notifications. It replaces a batch system with one built on Apache Kafka and Apache Samza. One particularly interesting detail describes how they localize the computation to each data center to improve throughput.

https://engineering.linkedin.com/blog/2018/05/concourse--generating-personalized-content-notifications-in-near

Using Apache NiFi and the MiNiFi agent on a Raspberry Pi, this walkthrough demonstrates an easy-to-set-up sensor data collection topology. MiNiFi supports warm deploys through a central command-and-control server, too.

https://medium.com/@abdelkrim.hadjidj/building-an-iiot-system-using-apache-nifi-mqtt-and-raspberry-pi-ce1d6ed565bc

Here's a quick walkthrough of using the Kafka .NET producer and consumer with protobuf.

https://www.matthowlett.com/2018-05-31-protobuf-kafka-dotnet.html

This post has a good set of tasks to run when burning in (or evaluating) new hardware (on-prem or in the cloud) for a big data workflow. It covers things like testing disk/network throughput, measuring network latency, and benchmarking Hadoop/MapReduce with TeraSort.

http://blog.cloudera.com/blog/2018/05/evaluating-partner-platforms/
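As a tiny illustration of one of the burn-in tasks above, here's a Python sketch that times a sequential disk write. A real burn-in would use purpose-built tools like fio or TestDFSIO; this function and its parameters are just illustrative.

```python
import os
import tempfile
import time

def disk_write_throughput(total_mb: int = 64, chunk_mb: int = 4) -> float:
    """Return rough sequential write throughput (MB/s) for a temp file."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    with tempfile.NamedTemporaryFile() as f:
        start = time.perf_counter()
        for _ in range(total_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # force the data to disk before stopping the clock
        elapsed = time.perf_counter() - start
    return total_mb / elapsed

print(f"{disk_write_throughput():.1f} MB/s")
```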

This article suggests that GDPR is a good wake-up call to get data teams to pay down tech debt. There are a number of suggestions for improving data management, including a documentation sprint, automated testing, implementing data provenance, and flagging data to be fixed (which may mean deleted, anonymized, etc.).

https://medium.com/@kjarmul/gdpr-a-call-to-remove-technical-debt-from-data-science-c103a01c3102

Cloudera has updated their deployment recommendations for running HDFS, ZooKeeper, Kafka, and more in the AWS, Microsoft Azure, and Google Cloud Platform clouds. There are good tips across storage, networking, and high availability.

http://blog.cloudera.com/blog/2018/05/deploy-cloudera-edh-clusters-like-a-boss-revamped-part-3-cloud-considerations/

The Teads engineering team has spent quite a bit of time with Spark for machine learning. In this post, they share Apache Spark 2.2.0 performance optimization tips, including how to get the most from the Tungsten execution engine, identifying data skew, when to use Spark's cache and broadcast functionality, and important configuration options for Amazon EMR and S3.

https://medium.com/teads-engineering/spark-performance-tuning-from-the-trenches-7cbde521cf60
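One of the tuning ideas the post covers, the broadcast join, can be sketched in plain Python: when one side of a join is small, ship a copy of it to every task as a hash map rather than shuffling both sides over the network. The data and the `partitions` list below are made up for illustration; in Spark this is what `broadcast()` plus a map-side join accomplishes.

```python
# The small side fits in memory, so every partition gets a full copy of it.
small_side = {"us": "United States", "fr": "France", "de": "Germany"}

# The large side stays partitioned; each partition joins locally against
# the broadcast map, so no rows move between partitions.
partitions = [
    [("us", 100), ("fr", 7)],
    [("de", 42), ("us", 3)],
]

def join_partition(partition, broadcast):
    # Local hash join: one dict lookup per row, no network shuffle.
    return [(code, count, broadcast[code])
            for code, count in partition if code in broadcast]

joined = [row for part in partitions for row in join_partition(part, small_side)]
print(joined)
```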

Banzai Cloud has updated their tutorial for running Kafka (a customized version with etcd support) on Kubernetes with Pipeline. It now covers how to set up Kubernetes Persistent Volumes, which are currently in beta.

https://banzaicloud.com/blog/kafka-on-kubernetes/

This tutorial describes how to use Apache Spark to generate HFiles compatible with Apache HBase and Apache Phoenix. These HFiles can then be bulk loaded much more efficiently than with other load schemes.

https://medium.com/hashmapinc/3-steps-for-bulk-loading-1m-records-in-20-seconds-into-apache-phoenix-99b77ad87387
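The crux of HFile generation is that cells must be written in sorted row-key order, with each file falling inside a single region's key range. Here's a plain-Python sketch of that partition-then-sort step (the tutorial does it with Spark's repartition-and-sort machinery); the region boundaries and records below are invented for illustration.

```python
from bisect import bisect_right

# Hypothetical region boundaries: keys < "g", keys in ["g", "p"), keys >= "p".
region_splits = ["g", "p"]

records = [("zebra", 1), ("apple", 2), ("mango", 3), ("kiwi", 4), ("pear", 5)]

# Route each record to the region whose key range contains its row key.
partitions = [[] for _ in range(len(region_splits) + 1)]
for key, value in records:
    partitions[bisect_right(region_splits, key)].append((key, value))

# Sort within each partition -- the order HFile writers require.
partitions = [sorted(p) for p in partitions]
print(partitions)
```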

Available both in transcript and podcast form, Software Engineering Daily has an interview with Zhenxiao Luo of Uber about their big data platform. The interview covers their usage of Kafka, HDFS, Presto, Parquet, scalability challenges with schema management in such a large organization, using Presto to query across heterogeneous data sets, and more.

https://softwareengineeringdaily.com/2018/05/24/ubers-data-platform-with-zhenxiao-luo/

The AWS database blog has a post on implementing z-indexing, which is a strategy for using a single index to efficiently query over multiple attributes. The post is for DynamoDB but is broadly applicable to data engines that support range-based queries.

https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-2/
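The core of z-order indexing is Morton encoding: interleave the bits of each attribute so a single scalar key preserves locality across all of them. A minimal two-attribute sketch (the bit width and inputs are illustrative):

```python
def z_order(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single z-order (Morton) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # even bit positions hold x's bits
        z |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions hold y's bits
    return z

# Nearby (x, y) points map to nearby z values, so range scans on the single
# z key can efficiently narrow a query that filters on both attributes.
print(bin(z_order(0b0011, 0b0101)))  # 0b100111
```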

This tutorial covers how to use Apache Spark's JDBC support and Landoop's Kafka JDBC driver to 1) query Kafka data from a Spark Context and 2) write data from Spark back to Kafka.

http://www.landoop.com/blog/2018/06/spark-jdbc-kafka/

Data Eng Jobs

There are four listings for data engineering jobs in Barcelona, Philadelphia, Mountain View, and remote. Check them out or add your own!

https://jobs.dataengweekly.com

News

Datanami has coverage of Apache Flink 1.5, including its new CLI for SQL.

https://www.datanami.com/2018/05/29/apache-flink-gets-an-sql-client/

Releases

Talend Data Streams is a new product, built on Apache Beam, for stream processing in the cloud. It includes a web UI for building pipelines, which includes a live preview, support for lots of data sources, and an embedded Python component. There's a free edition of Talend that you can try out on AWS.

https://www.talend.com/blog/2018/05/08/introducing-talend-data-streams-self-service-streaming-data-integration-for-everyone/

Amazon Neptune is now generally available. Neptune is a managed graph database for AWS.

https://aws.amazon.com/blogs/aws/amazon-neptune-generally-available/

Apache Pulsar 2.0.0-rc1-incubating was released. The Streamlio blog has an overview of new features, including Pulsar functions, a baked-in schema registry, and topic compaction. There are also a number of performance improvements and other changes in the release.

https://github.com/apache/incubator-pulsar/releases/tag/v2.0.0-rc1-incubating
https://streaml.io/blog/pulsar-2.0/


Events

Curated by Datadog ( http://www.datadog.com )

California

Spark + AI Summit: Bay Area Apache Spark Meetup (San Francisco) - Monday, June 4
https://www.meetup.com/spark-users/events/250659328/

Women in Big Data Luncheon at Spark Summit (San Francisco) - Wednesday, June 6
https://www.meetup.com/Women-in-Big-Data-Meetup/events/251037866/

Real-Time, Micro-Batch, Reactive Programming with Kafka and Event Hubs (San Francisco) - Wednesday, June 6
https://www.meetup.com/bayazure/events/251208296/

Joint SF Spark, Global Advanced Spark and TensorFlow, and Bay Area AI Megameetup (San Francisco) - Wednesday, June 6
https://www.meetup.com/SF-Spark-and-Friends/events/251030715/

Missouri

NiFi: The Good Parts (St. Louis) - Wednesday, June 6
https://www.meetup.com/St-Louis-Big-Data-IDEA/events/250077941/

Illinois

Automating the Deployment and Management of Apache Kafka (Chicago) - Tuesday, June 5
https://www.meetup.com/Chicago-SQL/events/250146945/

Georgia

Modernizing a Hadoop Database with Spark, DC/OS, and a Read-Only Cassandra Database (Atlanta) - Tuesday, June 5
https://www.meetup.com/BigData-Atlanta/events/250819301/

BRAZIL

Intro to Big Data: Hadoop Ecosystem (Sao Paulo) - Thursday, June 7
https://www.meetup.com/Big-Data-Like-a-Boss/events/251255168/

UNITED KINGDOM

Apache Beam Meetup 5: Talend Use Case + Portability and Schema Support in Beam (London) - Thursday, June 7
https://www.meetup.com/London-Apache-Beam-Meetup/events/251029990/

FRANCE

Implementation of a Big Data Architecture (Toulouse) - Thursday, June 7
https://www.meetup.com/bigdata-Toulouse/events/250873060/

GERMANY

Microservices & Events: Architecture with Kafka and Atom (Karlsruhe) - Wednesday, June 6
https://www.meetup.com/Java-User-Group-Karlsruhe/events/251039199/

Event-Driven Microservices with Apache Kafka (Hamburg) - Thursday, June 7
https://www.meetup.com/jug-hamburg/events/250726305/

ROMANIA

June Meetup: Distributed Search Engines and a Use Case for Hadoop (Bucharest) - Thursday, June 7
https://www.meetup.com/Bucharest-Big-Data-Meetup/events/250657845/

LATVIA

DevOps for Big Data (Riga) - Tuesday, June 5
https://www.meetup.com/Riga-Data-Advanced-Analytics-AI-Meetup/events/250607458/