Data Eng Weekly Issue #252

18 February 2018

Lots this week on stream processing including coverage of the Pravega streams system, exactly-once in Apache Flink, new features in Hortonworks Data Flow, getting started with Spring Cloud Data Flow on Pivotal Cloud Foundry, and building an application with Confluent KSQL. Qubole also has a great post on some optimizations they've made to query performance in Presto. In releases, Apache Oozie, Apache Storm, and Apache Flink all have new versions out this week.

Technical

The Pravega project came across my radar for the first time. Open-source and from Dell EMC, it's a distributed system implementing streams with similarities to Kafka and Apache Pulsar. Key differentiators are automatic movement of cold data to HDFS or other tier two storage and auto scaling of segments. That auto scaling functionality is one of the topics of the following post, which also looks at the API for sending to and consuming from Pravega.

http://blog.pravega.io/2018/02/12/streams-in-and-out-of-pravega/

This is a good article on how to pick the right tech stack to quickly stand up a data warehouse. AWS provides the plumbing with Kinesis, Redshift, Glue, Lambda, and more. Lots of good tips if you go down a similar route or are using these (or related) technologies in AWS.

https://aws.amazon.com/blogs/big-data/how-i-built-a-data-warehouse-using-amazon-redshift-and-aws-services-in-record-time/

Qubole has implemented two optimizations for Presto—join reordering and dynamic filtering. The post describes how these improvements are implemented and how they improve performance in certain situations. The article also details performance results from an analysis with TPC-DS queries (2.8-14x speedup and several more queries run to completion than before). While these optimizations are available in Qubole Presto, they're also working with the community to get them into the main branch.

https://www.qubole.com/blog/sql-join-optimizations-qubole-presto/
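
To make the dynamic filtering idea concrete, here is a minimal sketch in Python. It is not Qubole's or Presto's implementation; the function names, the in-memory "tables", and the sample data are all hypothetical. The idea it illustrates: the join keys seen on the small (build) side are collected at run time and pushed into the scan of the large (probe) side, so rows that cannot match are skipped before they reach the join.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

Row = Dict[str, object]


def scan(table: Iterable[Row],
         predicate: Optional[Callable[[Row], bool]] = None) -> Iterator[Row]:
    """Simulated table scan; the predicate stands in for a filter pushed down to the scan."""
    for row in table:
        if predicate is None or predicate(row):
            yield row


def join_with_dynamic_filter(build_rows: Iterable[Row],
                             probe_table: Iterable[Row],
                             key: str) -> Tuple[List[Row], int]:
    """Hash join that filters the probe-side scan with keys collected from the build side.

    Returns the joined rows and the number of probe rows that left the scan.
    """
    # Build phase: index the (small) build side by join key.
    build_index: Dict[object, List[Row]] = {}
    for row in build_rows:
        build_index.setdefault(row[key], []).append(row)

    # The dynamic filter is only known at run time, once the build side is read.
    build_keys = frozenset(build_index)

    rows_read = 0
    joined: List[Row] = []
    for row in scan(probe_table, lambda r: r[key] in build_keys):
        rows_read += 1
        for match in build_index[row[key]]:
            joined.append({**match, **row})
    return joined, rows_read


# Hypothetical data: a dimension table already filtered to one region,
# and a larger fact table that mostly refers to other stores.
west_stores = [{"store_id": 1, "region": "west"}]
sales = [
    {"store_id": 1, "amount": 10},
    {"store_id": 2, "amount": 20},
    {"store_id": 1, "amount": 5},
]

joined, rows_read = join_with_dynamic_filter(west_stores, sales, "store_id")
print(rows_read, joined)
```

In a single process the filter looks redundant with the hash lookup; the benefit in a distributed engine comes from dropping rows at the scan, before they are read in full or sent over the network.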

If you're doing any SQL database programming from Scala, Doobie looks like a useful library for writing your JDBC queries. It lets you write raw SQL queries but adds a bunch of functionality, including converting results to case classes (with type checks) and executing prepared statements. This post has a good overview of how to get started.

https://blog.godatadriven.com/doobie-monix-jdbc-example

Another open-source project that's new to me, ArangoDB, is a NoSQL database that supports different types of data, including graph, key/value, and document storage. This post describes the results of some recent benchmarking against PostgreSQL, MongoDB, Neo4j, and OrientDB. While there are good disclaimers in the post, it's always important to benchmark with your own use cases and data. With that said, for a single-node use case the results are impressive.

https://www.arangodb.com/2018/02/nosql-performance-benchmark-2018-mongodb-postgresql-orientdb-neo4j-arangodb/

If you're using Pivotal Cloud Foundry or Spring, Spring Cloud Data Flow might be a great way to get started with stream processing. This post gives a brief tour of how to get set up (installing the various components with the cf tools) and build a simple log parsing application with the Data Flow shell.

https://content.pivotal.io/blog/building-flexible-data-pipelines-with-spring-cloud-data-flow-for-pcf

Apache Flink's checkpointing has provided exactly-once semantics within a Flink application for some time now. With the 1.4.0 release, they've also added the ability to ensure exactly-once delivery to an Apache Kafka or Pravega data sink. This post details the two-phase commit implementation that powers the exactly-once delivery.

https://data-artisans.com/blog/end-to-end-exactly-once-processing-apache-flink-apache-kafka
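
As a rough illustration of how a two-phase commit sink can line up with checkpoints, here is a small Python sketch. It is a toy model, not Flink's API: the class, its method names, and the integer checkpoint ids are assumptions made for the example. Records written between checkpoints are buffered in an open transaction, pre-committed when a checkpoint starts, and only made visible once that checkpoint is confirmed complete. On recovery, anything that was not part of a completed checkpoint is discarded and expected to be replayed from the source.

```python
from typing import Dict, List


class TwoPhaseCommitSink:
    """Toy sink that only exposes records belonging to completed checkpoints."""

    def __init__(self) -> None:
        self.committed: List[str] = []          # visible to downstream readers
        self._open: List[str] = []              # current, uncommitted transaction
        self._pending: Dict[int, List[str]] = {}  # checkpoint id -> pre-committed records

    def write(self, record: str) -> None:
        self._open.append(record)

    def pre_commit(self, checkpoint_id: int) -> None:
        # Phase 1: seal the current transaction and start a new one.
        self._pending[checkpoint_id] = self._open
        self._open = []

    def commit(self, checkpoint_id: int) -> None:
        # Phase 2: the checkpoint completed everywhere, so publish its records.
        # Using pop with a default keeps a repeated commit harmless.
        self.committed.extend(self._pending.pop(checkpoint_id, []))

    def recover(self, last_completed_checkpoint: int) -> None:
        # Finish commits for checkpoints known to be complete, abort the rest.
        for checkpoint_id in sorted(self._pending):
            if checkpoint_id <= last_completed_checkpoint:
                self.commit(checkpoint_id)
        self._pending.clear()
        self._open = []


sink = TwoPhaseCommitSink()
sink.write("a")
sink.write("b")
sink.pre_commit(1)
sink.commit(1)

sink.write("c")
sink.pre_commit(2)   # checkpoint 2 starts but never completes
sink.write("d")
sink.recover(last_completed_checkpoint=1)   # simulated failure and restart

# The source rewinds to checkpoint 1 and replays "c" and "d".
for record in ("c", "d"):
    sink.write(record)
sink.pre_commit(2)
sink.commit(2)
print(sink.committed)
```

The point of the design is that a failure between pre-commit and commit never leaves half a transaction visible: the replayed records arrive in a fresh transaction, so each record appears downstream once.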

Hortonworks Data Flow (HDF) includes the open-source Streaming Analytics Manager for UI-driven definition of streaming applications. HDF 3.1 added a test mode (with fake data) and the ability to unit test these applications. This post describes how to do both.

https://hortonworks.com/blog/hortonworks-dataflow-hdf-3-1-blog-series-part-4-unit-testing-continuous-integration-delivery-streaming-analytics-apps/

Confluent has a post that describes (complete with lots of code examples) consuming change data capture events from an Oracle database, applying a number of transformations and aggregations with KSQL, and storing the resulting data in Elasticsearch for analysis with Kibana. There's lots of great stuff in here, but it does a particularly good job of demonstrating the differences between streams and tables as well as between event and processing time.

https://www.confluent.io/blog/ksql-in-action-real-time-streaming-etl-from-oracle-transactional-data
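
The two distinctions called out above can be sketched in a few lines of plain Python. This is not KSQL and does not use any Kafka client; the event layout, field order, and function names are hypothetical. A stream is treated as an append-only list of events, a table as the latest value per key, and the same events are windowed once by when they happened (event time) and once by when they arrived (processing time).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (event_time_s, arrival_time_s, user, amount), listed in arrival order.
Event = Tuple[int, int, str, int]

events: List[Event] = [
    (0, 1, "alice", 10),
    (65, 66, "bob", 5),
    (50, 70, "alice", 7),   # happened in the first minute, arrived in the second
]


def as_table(stream: List[Event]) -> Dict[str, int]:
    """Table view: each new event for a key overwrites the previous value."""
    table: Dict[str, int] = {}
    for _, _, user, amount in stream:
        table[user] = amount
    return table


def window_totals(stream: List[Event], time_index: int, size_s: int = 60) -> Dict[int, int]:
    """Sum amounts per tumbling window, keyed by window start.

    time_index 0 windows by event time, 1 by processing (arrival) time.
    """
    totals: Dict[int, int] = defaultdict(int)
    for event in stream:
        window_start = (event[time_index] // size_s) * size_s
        totals[window_start] += event[3]
    return dict(totals)


latest = as_table(events)
by_event_time = window_totals(events, time_index=0)
by_processing_time = window_totals(events, time_index=1)
print(latest, by_event_time, by_processing_time)
```

The stream keeps all three events while the table keeps only one row per user, and the late event moves between windows depending on which timestamp is used.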

Sponsor

Hey, Data Eng readers: Which big data company’s location tech is embedded in 125K services and apps, from Apple to Uber? The answer—Foursquare. We have a 16 PB cluster that runs 10k jobs a day on Spark, Scalding and Presto. And we’re looking for engineers.

Our website: http://bit.ly/foursquare-location-intelligence-data-eng-weekly

News

dotScale, a conference on distributed systems and scalability, takes place in Paris from May 31 to June 1. Readers of Data Eng Weekly can get 20% off a ticket with promo code DATAENGWEEKLY.

https://2018.dotscale.io/tickets?promocode=DATAENGWEEKLY

Datanami predicts that 2018 will be the year of the data engineer (I guess I was right to rename the newsletter!). It notes some relevant stats, such as job postings for data engineers outnumbering those for data scientists by 4-5x.

https://www.datanami.com/2018/02/05/2018-will-year-data-engineer/

Sponsor

Hello Fresh: Change the way people eat forever. Work with our data technology to deliver healthy meals to millions of customers, with a cutting-edge tech stack (Hadoop, Kafka, Impala, pyspark, AWS, Airflow) and time for personal and engineering development. Click the link for more info on becoming a Data Engineer at Hello Fresh in Berlin!

http://bit.ly/hello-fresh-data-eng-weekly

Releases

Apache Oozie, the workflow engine, disclosed a vulnerability this week. There's a new 4.3.1 release out with a mitigation along with several other bug fixes and minor improvements.

https://lists.apache.org/thread.html/66f80fc772c309a0b9423c4cd634e10ff31bbd55e4a250772f0d774e@%3Cannounce.apache.org%3E
https://lists.apache.org/thread.html/aa370b9e87b92ea987be1d4f12bcf11170611f7e1af67affab580ab9@%3Cannounce.apache.org%3E

Qubole and Snowflake have a new integration for using Qubole to query data from and write data to Snowflake. It leverages the Snowflake Spark connector and is integrated into the Qubole console.

https://www.qubole.com/blog/qubole-snowflake-getting-started-machine-learning-big-data-cloud-data-warehouses-1-3/

Version 1.4.1 of Apache Flink, the big data stream processing system, was released. It contains 60 fixes/minor improvements.

http://flink.apache.org/news/2018/02/15/release-1.4.1.html

Apache Storm 1.2.0 was released with improved Kafka integration and a new metrics system built on Dropwizard Metrics. It also adds a new HBase state backend.

http://storm.apache.org/2018/02/15/storm120-released.html

Apache Storm 1.0.6 and 1.1.2 were also released. Both releases include improvements to the Kafka integration and several other bug fixes.

http://storm.apache.org/2018/02/14/storm106-released.html
http://storm.apache.org/2018/02/15/storm112-released.html

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Morning Workshop with Confluent and IBM (San Francisco) - Tuesday, February 20
https://www.meetup.com/KafkaBayArea/events/247433783/

Monitoring Apache Kafka with Gwen Shapira (Foster City) - Tuesday, February 20
https://www.meetup.com/KafkaBayArea/events/247434305/

Engineering Real-Time Event-Driven Processing (Palo Alto) - Thursday, February 22
https://www.meetup.com/UberEvents/events/247776619/

Kubernetes Day: Running Apache Spark, Apache Pulsar & Heron (Santa Clara) - Saturday, February 24
https://www.meetup.com/datariders/events/244891085/

Colorado

Messaging, Storage, or Both: The Real-Time Story of Pulsar & Apache DistributedLog (Boulder) - Thursday, February 22
https://www.meetup.com/Boulder-Denver-Big-Data/events/247263308/

Minnesota

Big Data Monthly Meetup (Minnetonka) - Wednesday, February 21
https://www.meetup.com/TwinCities-Bigdata-Analytics/events/247327789/

Wisconsin

Ebb & Flow: Data in Motion (Milwaukee) - Thursday, February 22
https://www.meetup.com/Milwaukee-Internet-of-Things/events/247386654/

Georgia

Macy’s Omni-Catalog: A Real-Time Fast Data Story Using Spark & Cassandra (Johns Creek) - Wednesday, February 21
https://www.meetup.com/BigData-Atlanta/events/246839720/

Intro to Building a Distributed Pipeline for Real-Time Analysis of Uber's Data (Atlanta) - Thursday, February 22
https://www.meetup.com/Atlanta-Apache-Spark-User-Group/events/245913352/

UNITED KINGDOM

Microservices and End-To-End Topologies with Kafka (London) - Wednesday, February 21
https://www.meetup.com/Apache-Kafka-London/events/247649422/

Panta Rhei: Designing Applications with Distributed Streams (London) - Wednesday, February 21
https://www.meetup.com/Apache-Flink-London-Meetup/events/246965765/

SPAIN

Let's Start the Season with the Spark Frameless Library (Barcelona) - Thursday, February 22
https://www.meetup.com/Spark-Barcelona/events/247575307/

NETHERLANDS

From a Little Spark May Burst a Flame (Amsterdam) - Tuesday, February 20
https://www.meetup.com/Reactive-Amsterdam/events/247356587/

Data Processing Mayhem (Amsterdam) - Thursday, February 22
https://www.meetup.com/Software-Circus/events/247614977/

LITHUANIA

Discuss EU GDPR and Spark on AWS (Vilnius) - Thursday, February 22
https://www.meetup.com/Vilnius-Hadoop-Meetup/events/247160553/

TURKEY

Create Data Pipeline Demo: MySQL-Kafka-Elastic Stack (Istanbul) - Tuesday, February 20
https://www.meetup.com/Turkey-Elastic-Fantastics/events/247707990/

PHILIPPINES

Manila Big Data Tech Meetup (Manila) - Wednesday, February 21
https://www.meetup.com/Manila-BIG-DATA-Group/events/247596076/

AUSTRALIA

Enterprise Grade Hadoop & High-Performance Applications with HPE and Aerospike (Melbourne) - Wednesday, February 21
https://www.meetup.com/Big-Data-Analytics-Meetup-Group/events/247749961/

NEW ZEALAND

Easy and Fast Stream Processing (Christchurch) - Monday, February 19
https://www.meetup.com/Christchurch-Big-Data-Meetup/events/246278422/