Data Eng Weekly


Data Eng Weekly Issue #261

22 April 2018

Lots of great content this week, including a couple of system migration stories (Hive+Sqoop to Spark, Cron to Airflow), paradigms of stream processing, DynamoDB at Nike, and LinkedIn's Aeon system for latency tracking. In news, there's a great post on data engineering vs. data science roles and video interviews from DataWorks. In releases, Apple open sourced FoundationDB and Apache Hadoop 2.7.6 is out.

Sponsor

Data Eng Weekly is starting a job board! For the next month, postings are discounted at $149 (regularly $249) for 31 days. Hopefully this will be a useful service for both job seekers and companies hoping to reach the data engineering community. Questions or comments? info@dataengweekly.com

https://dataengweekly.seeker.company

Technical

This is a great, Azure-focussed whirlwind tour of Hadoop (and briefly MapReduce), Pig (on Tez), Storm (with Azure Event Hubs), and Spark. It uses PowerShell and the Azure UI to deploy clusters to crunch data from the Global Database of Events, Language, and Tone (GDELT) dataset.

https://medium.com/@ylashin/big-data-using-hdinsight-a-journey-in-the-zoo-ecosystem-c78b913a5ed9

Wayfair has written a post on the scalability and reliability issues they faced with a large Graphite deploy. They're moving to a new system built on InfluxDB with Kafka as the transport.

https://tech.wayfair.com/2018/04/time-series-data-at-wayfair/

Qubole's AIR platform analyzes data about Hive, Spark, and Presto clusters. The system is powered by Apache Airflow for job orchestration, which is the topic of this post. Qubole discusses why they chose Airflow and what some of the pitfalls have been as they've worked with it.

https://www.qubole.com/blog/hood-building-air-qubole/

This post describes how to use Scala for data prep in Apache Spark. Once that's done, Spark SQL and Apache Zeppelin can be used to query and visualize the results. This type of hybrid solution seems like a great way to make sure you're using the best tool at each step of your analysis.

https://medium.com/ml2vec/data-transformation-and-visualization-on-the-youtube-dataset-using-spark-f23e8abfaa14

TechTarget has coverage of the Flink Forward conference talks by Capital One and Comcast. There are some interesting insights into how the companies are supporting data science and machine learning—e.g. both are using Jython to bridge the gap between data science and production systems.

https://searchenterpriseai.techtarget.com/feature/Getting-to-machine-learning-in-production-takes-focus

Here's a story of migrating from a Hive+Sqoop setup to a Spark-based one. The system is running in AWS, integrates with S3 and Redshift, and uses Zeppelin notebooks for Spark.

https://medium.com/@servatj/migrating-from-hive-to-spark-251ef74925ab

This post covers a few paradigms for stream processing. The "Kafka abstraction funnel" is a new one to me—it describes the fallback approach of using KSQL first, then the Streams DSL, then the Processor API, and then the raw Producer/Consumer APIs. There's also a new project that demos "hello world" in a bunch of different stream processing frameworks, to give you a good flavor of each.

https://yokota.blog/2018/04/19/the-hitchhikers-guide-to-stream-processing/
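
To make the funnel idea concrete, here is a minimal sketch in plain Python (no Kafka client involved). It only models the ordering described above: try the most abstract layer first and fall back one level at a time. The `pick_level` helper and the `needs_custom_state` predicate are hypothetical names made up for illustration, and the assumption about which layers can express the example job is invented, not a statement about what each Kafka API supports.

```python
from typing import Callable, Optional, Sequence

# Levels of the funnel, most abstract first, in the order given in the blurb.
FUNNEL: Sequence[str] = (
    "KSQL",
    "Streams DSL",
    "Processor API",
    "Producer/Consumer APIs",
)


def pick_level(can_express: Callable[[str], bool],
               funnel: Sequence[str] = FUNNEL) -> Optional[str]:
    """Return the first (most abstract) level that can express the job.

    `can_express` is supplied by the caller; this sketch makes no claim
    about what each layer actually supports.
    """
    for level in funnel:
        if can_express(level):
            return level
    return None


# Hypothetical job: assume, for illustration only, that it can be
# expressed only by the two lowest levels of the funnel.
def needs_custom_state(level: str) -> bool:
    return level in {"Processor API", "Producer/Consumer APIs"}


if __name__ == "__main__":
    print(pick_level(needs_custom_state))
```

The point of the ordering is that you only pay for lower-level control when a higher-level layer cannot express what you need.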

VideoAmp migrated from Cron to Apache Airflow, and they have a lot of lessons learned (both good things and some pitfalls!) to share about the transition.

https://medium.com/videoamp/what-we-learned-migrating-off-cron-to-airflow-b391841a0da4

This presentation introduces Pachyderm, a data management system for Kubernetes that includes data versioning, data provenance, and more.

https://docs.google.com/presentation/d/e/2PACX-1vQiecupZEIq2q7SV6tUSjAflOP6ifwWUVP5SgdhFetbmvVbM7HFVTdoe4kgchlsew9Os3pD0X-Ow4Mo/pub

Apache Heron (incubating) supports running streaming jobs via HashiCorp Nomad for cluster scheduling. This post walks through the steps to get that setup up and running.

https://streaml.io/blog/apache-heron-on-nomad/

The LinkedIn engineering blog has a post about Aeon, their event and latency tracking system built on Apache Kafka and Apache Samza.

https://engineering.linkedin.com/blog/2018/04/samza-aeon--latency-insights-for-asynchronous-one-way-flows

Apple open-sourced FoundationDB this week, which got a lot of attention. Snowflake is a happy user of FoundationDB, which powers their metadata store. They share some more details in this post.

https://www.snowflake.net/how-foundationdb-powers-snowflake-metadata-forward/

Nike is using DynamoDB as the data store for many of their microservices. DynamoDB replaces Couchbase and Cassandra; the main advantages are lower operational overhead and additional features like secondary indexes and encryption at rest. It's not without its pitfalls, though: hot-spotting (in which case requests are throttled) and big launch events require careful consideration.

https://medium.com/nikeengineering/becoming-a-nimble-giant-how-dynamo-db-serves-nike-at-scale-4cc375dbb18e
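
For readers unfamiliar with hot-spotting: it happens when too many requests land on the same partition key. One commonly described mitigation is write sharding, where a calculated suffix spreads a hot key across several logical keys. The sketch below is not taken from the Nike post and does not call any AWS API; it only shows the key-derivation step in plain Python. `NUM_SHARDS`, `sharded_key`, `all_shard_keys`, and the `launch-2018` key are hypothetical names chosen for illustration.

```python
import hashlib
from typing import List

NUM_SHARDS = 8  # hypothetical shard count; tune to the expected write rate


def sharded_key(partition_key: str, item_id: str,
                num_shards: int = NUM_SHARDS) -> str:
    """Derive a partition key with a deterministic shard suffix.

    The suffix is computed from the item id, so the same item always maps
    to the same shard and can be read back without scanning every shard.
    """
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % num_shards
    return f"{partition_key}#{shard}"


def all_shard_keys(partition_key: str,
                   num_shards: int = NUM_SHARDS) -> List[str]:
    """List every sharded key, for reads that must cover the whole hot key."""
    return [f"{partition_key}#{i}" for i in range(num_shards)]


if __name__ == "__main__":
    # Writes for one hot logical key are spread over several physical keys.
    keys = {sharded_key("launch-2018", f"order-{i}") for i in range(1000)}
    print(sorted(keys))
```

The tradeoff is on the read side: fetching everything for the logical key means querying each entry from `all_shard_keys` and merging the results.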

If you have a lot of small, time-sensitive data tasks, then Apache Airflow might not be the best fit. This post describes that situation and the tradeoffs of switching to Celery for task scheduling.

https://medium.com/@manuelmourato25/when-airflow-isnt-fast-enough-distributed-orchestration-of-multiple-small-workloads-with-celery-afb3daebe611

ScaleFlux has published a whitepaper that describes how they've achieved performance speedups for HBase. Using hardware acceleration, their solution achieves GZIP compression ratios with Snappy speed and throughput.

http://scaleflux.com/downloads/ScaleFlux_HBase_Solution_Brief.pdf

News

This is a great article on the core competencies of data engineers and data scientists, several negative scenarios that might occur when a data scientist is spending their time on data engineering, and the growing role of machine learning engineer, which sits in the middle.

https://www.oreilly.com/ideas/data-engineers-vs-data-scientists

theCUBE was at DataWorks Summit EU last week, and they have posted a number of interviews with folks from Hortonworks, IBM, Accenture, and more.

https://www.thecube.net/dw-ew-2018

The ARCHITECHT podcast has an interview this week with Jay Kreps. Topics covered include the rise of Kafka alternatives, big data IPOs, and open source business models.

https://architecht.io/jay-kreps-talks-cloud-native-kafka-competitors-and-a-resurgence-of-enterprise-it-innovation-4011a483ed84

Videos and slides from the Women in Big Data meetup have been posted. There are talks on Python+JVM, DevOps for Data Scientists, and visualizations with big data.

https://databricks.com/blog/2018/04/17/women-in-big-data-and-apache-spark-bay-area-spark-meetup-summary.html

As mentioned above, Apple has open sourced FoundationDB, which is a distributed key-value store with ACID transactions. The project is getting off the ground and building a community.

https://www.foundationdb.org/blog/foundationdb-is-open-source/

This post recaps a number of distributed systems talks from QCon London, which took place earlier this year. There's a summary and a link to slides for each.

https://sap1ens.com/blog/2018/03/12/qcon-london-2018/

Releases

Luigi 2.7.5 was released. It includes all the features of the 2.7.4 release (some fixes and new features), as well as a fix for a cross-site scripting vulnerability in the visualizer UI.

https://github.com/spotify/luigi/releases/tag/2.7.5

At DataWorks Summit, Hortonworks announced Data Steward Studio. It's a security and governance focused product with applications to GDPR. SiliconANGLE has more coverage.

https://siliconangle.com/blog/2018/04/19/hortonworks-data-steward-studio-release-is-both-timely-and-reassuring-dws18/

Version 4.1 of the Confluent Platform includes the GA of KSQL (less than a year after the developer preview started). The release also includes enhanced clients and Apache Kafka 1.1.

https://www.confluent.io/blog/confluent-platform-4-1-with-production-ready-ksql-now-available/

Apache Oozie 5.0.0 was released. Highlights include JDK 8 support, a new YARN-based launcher, and several updates.

https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces32

This tool for running a Kafka cluster via Docker Compose has been updated to support version 4.1 of the Confluent Platform.

https://github.com/simplesteph/kafka-stack-docker-compose/releases/tag/v4.1.0

Apache Hadoop 2.7.6 is out. It includes a total of 46 bug fixes and optimizations.

https://lists.apache.org/thread.html/3d19e0cd7b03bd60aca3b8f185c00109e2a9efd805edd03ac20d94f4@%3Cgeneral.hadoop.apache.org%3E

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

First StreamSets User Group Meetup: Scale Out with StreamSets (San Francisco) - Tuesday, April 24
https://www.meetup.com/San-Francisco-StreamSets-User-Group-Meetup/events/249354276/

Colorado

First Denver Data Engineering Meetup (Denver) - Thursday, April 26
https://www.meetup.com/Denver-Data-Engineering/events/249162937/

Georgia

Spark 2.3 Update, Machine Learning Pipelines Intro, and CI/CD How-to (Atlanta) - Thursday, April 26
https://www.meetup.com/Atlanta-Apache-Spark-User-Group/events/249393678/

Virginia

Spark 2.3 and Azure Databricks (Reston) - Wednesday, April 25
https://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/248999491/

CANADA

2 Billion Messages in Kafka (Montreal) - Tuesday, April 24
https://www.meetup.com/Big-Data-Montreal/events/249848242/

UNITED KINGDOM

Apache Beam Meetup 4: Use Case on Beam + Becoming a Committer + Discussions (London) - Tuesday, April 24
https://www.meetup.com/London-Apache-Beam-Meetup/events/249647193/

Recap and Summary from Flink Forward SF 2018 (London) - Tuesday, April 24
https://www.meetup.com/Apache-Flink-London-Meetup/events/249385788/

Streaming with KSQL + Monitoring Kafka Like a Pro (London) - Wednesday, April 25
https://www.meetup.com/Apache-Kafka-London/events/249701805/

SPAIN

Processing Hierarchical Tables with Spark, by Jose Luis Sanchez from Zurich (Barcelona) - Thursday, April 26
https://www.meetup.com/Spark-Barcelona/events/249825814/

FRANCE

Disaster Recovery Solutions for Hadoop Clusters (Neuilly-Sur-Seine) - Tuesday, April 24
https://www.meetup.com/futureofdata-paris/events/249071275/

NETHERLANDS

SageMaker, DeepLens, & Message-Driven Architecture (Amsterdam) - Tuesday, April 24
https://www.meetup.com/aws-ams/events/246899169/

GERMANY

KSQL and Stream All Things with Gwen Shapira and Matthias J. Sax (Berlin) - Wednesday, April 25
https://www.meetup.com/Berlin-Apache-Kafka-Meetup-by-Confluent/events/249424820/

PHILIPPINES

Big Data Architecture 101 & Kafka 101 (Taguig) - Wednesday, April 25
https://www.meetup.com/Manila-BIG-DATA-Group/events/249591183/