Data Eng Weekly Issue #253

25 February 2018

Lots of variety this week, from streaming data systems Pulsar, Kafka, and Hortonworks DataFlow to more general articles on data infrastructure for machine learning, db replication, and batch jobs. In news, there's an interesting take on DataTorrent (including some coverage of their latest release), a couple of good interviews, and a link out to the vote to add Druid to the Apache incubator. A couple of releases round out the newsletter—should be something here for everyone.

Sponsor

Hi, all. Foursquare here. We’ve used our location data & cluster to predict iPhone sales, the impact of McDonald’s all-day bkfst, and more. Hedge funds and global brands are clamoring for it. Intrigued & want a cool job?

Check out http://bit.ly/foursquare-jobs-data-eng-weekly and send your resume to recruiting@foursquare.com.

Technical

One of the first lessons in productionizing a machine learning-powered feature is that deployment and other infrastructure tasks are often more challenging than the data science. This post from Cloudera advocates for focusing on building a smooth deployment pipeline before investing in advanced models. It provides an example of this using the Cloudera Oryx project, which implements the Lambda architecture.

http://blog.cloudera.com/blog/2018/02/production-recommendation-systems-with-cloudera/

This post describes how Hortonworks DataFlow integrates Apache NiFi with Apache Atlas for tracking lineage of data across Kafka, Hive, and more.

https://hortonworks.com/blog/hdf-3-1-blog-series-part-6-introducing-nifi-atlas-integration/

The Streamlio team has written about exactly-once semantics in Apache Pulsar: the idempotent producer and reader APIs that it provides, as well as how it leverages BookKeeper for Total Order Atomic Broadcast. It's a good overview of the building blocks for exactly-once processing and how they all fit together.

https://streaml.io/blog/pulsar-effectively-once/
https://streaml.io/blog/pulsar-effectively-once-end-to-end/
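
The idempotent-producer idea can be sketched in a few lines. This is a toy model, not Pulsar's API: the DedupLog class and its append method are hypothetical, and the sketch assumes each producer stamps messages with increasing sequence IDs and reuses the same ID when it retries a send.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DedupLog:
    """Toy append-only log that drops redelivered messages.

    Hypothetical class for illustration only; not Pulsar's API.
    """
    entries: List[str] = field(default_factory=list)
    # Highest sequence ID accepted so far, keyed by producer name.
    last_seq: Dict[str, int] = field(default_factory=dict)

    def append(self, producer: str, seq: int, payload: str) -> bool:
        """Store payload unless this producer already sent seq or later."""
        if seq <= self.last_seq.get(producer, -1):
            return False  # duplicate from a retry: do not store it again
        self.entries.append(payload)
        self.last_seq[producer] = seq
        return True


log = DedupLog()
log.append("orders-producer", 0, "a")
log.append("orders-producer", 1, "b")
# Assume the ack for seq 1 was lost, so the producer resends it.
retry_accepted = log.append("orders-producer", 1, "b")
log.append("orders-producer", 2, "c")
```

Because the retry carries a sequence ID the log has already seen, it is dropped and each payload is stored once, even though one of them was sent twice.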

For recovery in case of failure, a long-running Spark Streaming job requires checkpoints of both data and metadata. The metadata is needed by the driver and is typically snapshotted to a file system. This post describes how to configure checkpointing, along with some additional settings that are needed when running Spark on Kubernetes.

https://banzaicloud.com/blog/spark-checkpointing/
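
The general checkpoint-and-recover pattern can be illustrated without Spark. The sketch below is a toy model, not Spark's API: run_job, its fail_at parameter, and the JSON snapshot format are all assumptions made for illustration. It snapshots job progress to a file after each record and resumes from that snapshot after a simulated failure.

```python
import json
import os
import tempfile


def run_job(records, checkpoint_path, fail_at=None):
    """Sum records, snapshotting progress to checkpoint_path after each one.

    Hypothetical helper: on start it resumes from an existing snapshot.
    fail_at simulates a crash just before processing that index.
    """
    state = {"next_index": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    for i in range(state["next_index"], len(records)):
        if i == fail_at:
            raise RuntimeError("simulated driver failure")
        state["total"] += records[i]
        state["next_index"] = i + 1
        # Write to a temp file, then rename, so a crash mid-write
        # does not leave a partial snapshot behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["total"]


records = [1, 2, 3, 4, 5]
checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
try:
    run_job(records, checkpoint, fail_at=3)  # dies after three records
except RuntimeError:
    pass
total = run_job(records, checkpoint)  # resumes at index 3
```

The second run picks up where the first one stopped rather than starting over, which is the role the metadata checkpoint plays for a restarted driver.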

Confluent has a post on the settings required to use KSQL with a secure Kafka cluster. While the post is just a brief overview, there are links out to videos and other documentation about securing a Kafka cluster and related components.

https://www.confluent.io/blog/securing-ksql/

CitusData has a good post on data replication and backup strategies. While the post focuses on Postgres, you can also apply these strategies to other databases, and the discussion of the tradeoffs is useful on that front.

https://www.citusdata.com/blog/2018/02/21/three-approaches-to-postgresql-replication/

Version 0.3.0 of Apache Arrow's JavaScript package just came out. This post has a good introduction to those bindings, which make data analysis and transformations from JavaScript much more practical. To that end, the post covers loading some data from a file or over the internet, computing column-level statistics, running queries with indices, and more. There's also a discussion of what's coming next in the SDK.

https://beta.observablehq.com/@theneuralbit/introduction-to-apache-arrow

Last year, Google Cloud moved the Cloud Storage metadata backend from Megastore to Cloud Spanner. In doing so, they've enabled consistent reads and listings, including for multi-regional buckets. If you've ever dealt with the problems created by eventual consistency in a blob store, this is an exciting development.

https://cloudplatform.googleblog.com/2018/02/how-Google-Cloud-Storage-offers-strongly-consistent-object-listing-thanks-to-Spanner.html

The Powerspace team has written about their usage of Kafka alongside Google PubSub. The post has good advice for running Kafka Connect (they recommend the Landoop Kafka UI for monitoring Connect jobs) with PubSub (there are some nuances when it comes to keys and serializers). They also have a few pieces of general advice, including a recommendation against enabling automatic creation of topics, when to use unclean leader election, and how to detect lagging brokers.

https://powerspace.tech/how-to-stream-data-from-google-pubsub-to-kafka-with-kafka-connect-dbef1c340a76

Sponsor

Every week you receive news about Apache Kafka, but do you have the knowledge of how it works and the expertise to use it? Stephane Maarek has created the Apache Kafka Series (online courses) and you can find the first volume below with a special discount for Data Eng Weekly subscribers. Packing over 4 hours of video with theory and practice lessons, it is the best way to learn Kafka.

http://bit.ly/learn-apache-kafka

News

Flink Forward is in April in San Francisco. There's a new blog series that's previewing talks. For readers of Data Eng Weekly, the Flink Forward team is offering a 15% discount on conference and training. Use promo code: DataEng15.

https://data-artisans.com/blog/flink-forward-san-francisco-preview-part-1-6-apache-flink-use-cases-applications
https://sf-2018.flink-forward.org/

ZDNet has coverage of DataTorrent, a vendor in the real-time stream processing space (more about a recent release of their product below). The article describes their philosophy as a vendor—although they're aligned with Apache Apex, they aim to be flexible in tools when supporting their customers.

http://www.zdnet.com/article/datatorrents-hard-code-around-streaming-data-philosophy-in-90-days/

Syncsort has a three-part interview with Cloudera's Mike Olson. The discussion covers topics like Cloudera's investment in machine learning, how Cloudera and Gartner don't have the same conclusions on big data, and diversity in tech.

http://blog.syncsort.com/2018/02/big-data/cloudera-machine-learning-cloud/

The Software Engineering Daily podcast has an interview with Gwen Shapira of Confluent. The interview, which is also available as a transcript, covers data synchronization/reconciliation across Kafka clusters, topic hierarchy design, and some example use cases.

https://softwareengineeringdaily.com/2018/02/20/kafka-design-patterns-with-gwen-shapira/

This week's Roaring Elephant podcast discusses the five-year anniversary of this newsletter, the Strava heat maps, and an analysis of tweets about Star Wars.

https://roaringelephant.org/2018/02/20/episode-75-roaring-news/

There's a vote underway to accept Druid into the Apache Incubator. You can also see the full proposal in the vote announcement email.

https://lists.apache.org/thread.html/a9186930d46eba8136e6f2d29646efab33390b213515b76deea71c57@%3Cgeneral.incubator.apache.org%3E

Releases

The 3.10 release of DataTorrent RTS has a bunch of new features for the stream processing system. These include the addition of Druid for OLAP, Drools integration, and the new Apoxi framework, along with applications built on Apoxi such as fraud prevention and a real-time personalized recommender.

https://www.datatorrent.com/blog/technical-details-for-datatorrent-rts-3-10-release/

Apache Beam, the programming SDK for stream and batch, has announced version 2.3.0. The new release moves to Java 8 and Spark 2.x, improves the AWS S3 integration, better supports the Python SDK, and contains a number of other fixes.

https://beam.apache.org/blog/2018/02/19/beam-2.3.0.html

There's a new Java SDK for the Cloudera Altus PaaS system for building big data clusters and executing jobs. This post gives a brief tour of the SDK, including some code snippets.

http://blog.cloudera.com/blog/2018/02/altus-sdk-for-java-blog/

Stitch Fix has open sourced Flotilla, which is a job scheduler for Amazon ECS clusters. It's built for data scientists with an emphasis on self-service via either an API or a web UI.

https://multithreaded.stitchfix.com/blog/2018/02/22/flotilla/

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Building a Real-Time Streaming Platform, with Gwen Shapira (San Diego) - Thursday, March 1
https://www.meetup.com/Data-Engineering-San-Diego/events/247817112/

Wisconsin

Timeseries Analytics for Big Fast Data (Madison) - Tuesday, February 27
https://www.meetup.com/BigDataMadison/events/245245642/

North Carolina

Processing Streaming Data with KSQL (Raleigh) - Thursday, March 1
https://www.meetup.com/Raleigh-Apache-Kafka-Meetup-by-Confluent/events/247968121/

Virginia

Use KSQL to Process Your Apache Kafka Streams (Richmond) - Wednesday, February 28
https://www.meetup.com/Richmond-Java-Users-Group/events/247631162/

New York

Building Modern Data Pipelines by Unifying Apache Pulsar, Heron, and BookKeeper (New York) - Monday, February 26
https://www.meetup.com/New-York-City-Stream-Processing-User-Group/events/247841891/

MEXICO

Our First Kafka Meetup (Mexico City) - Thursday, March 1
https://www.meetup.com/Mexico-Kafka/events/247063963/

UNITED KINGDOM

LeedsDevops: February Meetup (Leeds) - Tuesday, February 27
https://www.meetup.com/LeedsDevops/events/247715625/

FRANCE

From Relational to Hadoop HDFS (Paris) - Thursday, March 1
https://www.meetup.com/Data-ScienceTech-Institute-Meetups/events/247573404/

HUNGARY

Meet Hadoop 3! (Budapest) - Wednesday, February 28
https://www.meetup.com/futureofdata-budapest/events/248022437/

TURKEY

Building Real-Time Streaming Pipelines with Kafka and Stream Analytics (Istanbul) - Thursday, March 1
https://www.meetup.com/Istanbul-Spark-Meetup/events/248054656/

ISRAEL

A Glance at Big Data & Spark (Tel Aviv) - Monday, February 26
https://www.meetup.com/she-codes-tlv/events/247936594/

INDIA

Structured Streaming with Kafka (Bangalore) - Saturday, March 3
https://www.meetup.com/Bangalore-Apache-Spark-Meetup/events/247417960/

Building a Serverless Data Lake @ Amazon (Bangalore) - Saturday, March 3
https://www.meetup.com/meetup-group-mdhldGoH/events/247731514/

AUSTRALIA

Data Engineering Meetup: 2018 Kickoff! (Sydney) - Thursday, March 1
https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/247597754/