Data Eng Weekly


Data Eng Weekly Issue #262

29 April 2018

This week's content runs the gamut from Apache Hadoop+Docker to Apache Spark Streaming+MQ to monitoring Apache Kafka to getting started with Apache Airflow. There are two interesting new tools to check out—a Scala DSL for AWS Data Pipeline and a deployment tool for Apache Flink. In news, there are lots of conference presentation and keynote videos to catch up on and an interesting analysis from Qubole of the migration to Hadoop 2.0.

Sponsor

We’re Shopify. Our mission is to make commerce better for everyone http://bit.ly/shopify-careers and in the last few years we have built a data warehouse that is now ready to power data products and insights for all of our 500,000+ merchants! We thrive on change, operate on trust, and leverage the diverse perspectives of people on our team in everything we do. We solve problems at a rapid pace and at scale. In short, we get shit done. Join us!

http://bit.ly/join-us-shopify-data

Technical

This tutorial walks through how to run Hadoop inside of Docker on a single machine by leveraging Docker networks.

https://medium.com/@rubenafo/some-tips-to-run-a-multi-node-hadoop-in-docker-9c7012dd4e26
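
To give a flavor of the approach, here's a minimal sketch of the core idea—a user-defined bridge network that lets the Hadoop containers resolve each other by hostname—using the Docker SDK for Python. The image name and container roles are placeholders for illustration, not taken from the article.

    # Sketch: a user-defined bridge network is the key trick behind running a
    # "multi-node" Hadoop cluster on one machine; containers on the network can
    # reach each other by hostname. "my-hadoop-image" is a placeholder image.
    import docker

    client = docker.from_env()

    # One bridge network shared by all Hadoop daemons.
    client.networks.create("hadoop-net", driver="bridge")

    # Start a NameNode and two DataNodes on that network; each container can
    # reach the others by hostname (e.g. hdfs://namenode:8020).
    client.containers.run("my-hadoop-image", name="namenode", hostname="namenode",
                          network="hadoop-net", detach=True)
    for i in (1, 2):
        client.containers.run("my-hadoop-image", name="datanode{}".format(i),
                              hostname="datanode{}".format(i),
                              network="hadoop-net", detach=True)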

This is a fantastic article on event-driven architectures. In the process of busting four myths, it covers a lot of ground, including event-driven programming models, managing shared data, and the role of a central event bus in enabling an event-driven architecture.

https://www.infoworld.com/article/3269207/enterprise-architecture/busting-event-driven-myths.html

With the speed at which the big data ecosystem is evolving, it can be hard to stay on top of all the software systems. This post is a good summary of what tools are available (and a few data points about each) across batch processing, SQL batch processing, data warehousing, and RDBMS. It doesn't cover everything, but it seems to capture the popular choices.

https://mindfulmachines.io/blog/2018/4/24/series-big-data-batch-processing

Nearly all of the articles about stream processing focus on Apache Kafka or architecturally similar services for data transport. In many cases, though, it might make sense to build on MQ. This four-part series looks at integrating Spark Streaming with IBM MQ, including an overview of delivery semantics and the Spark connector code.

https://medium.com/@srnghn/processing-data-from-mq-with-spark-streaming-part-1-introduction-to-messaging-jms-mq-7d30d9beb003

Dremio is a system for speeding up SQL queries on and improving access to a data lake. This post describes some of its core concepts—virtual datasets and materialized caches—as well as the new learning engine that's part of Dremio 2.0. That engine is able to detect relationships between data sets, including characteristics like snowflake schemas.

https://www.dremio.com/introduction-to-starflake-data-reflections/

Skyway provides a new memory model for distributed systems that avoids the overhead of serializing and deserializing data during network transfer. The implementation requires some changes to the JVM, but shows significant performance improvements across evaluations on Spark and Flink.

https://blog.acolyer.org/2018/04/26/skyway-connecting-managed-heaps-in-distributed-big-data-systems/

Kafka Summit EU was last week in London. This presentation from the conference covers the three most important Kafka metrics to monitor—under-replicated partitions, request handlers, and request timing. There's an overview of each metric and what its graphs look like when something goes wrong.

https://www.slideshare.net/ToddPalino/urp-excuse-you-the-three-kafka-metrics-you-need-to-know
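
For reference, those three metrics correspond to well-known broker MBeans. The sketch below polls them over HTTP, assuming a Jolokia agent is attached to the broker's JVM (the host and port are placeholders); plain JMX or your usual metrics pipeline works just as well.

    # Sketch: poll the broker MBeans behind the "big three" Kafka metrics.
    # Assumes a Jolokia agent exposes the broker's JMX beans over HTTP on :8778;
    # the host/port are placeholders for illustration.
    import requests

    JOLOKIA = "http://broker1:8778/jolokia/read/"

    MBEANS = {
        "under_replicated_partitions":
            "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
        "request_handler_avg_idle_percent":
            "kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent",
        "produce_total_time_ms":
            "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce",
    }

    for label, mbean in MBEANS.items():
        # Jolokia returns all attributes of the MBean (Value, rates, percentiles, ...)
        value = requests.get(JOLOKIA + mbean, timeout=5).json().get("value")
        print(label, value)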

This post describes how to build a multi-data-center API service, complete with DNS-based failover, using DynamoDB cross-region replication, AWS Lambda, and a few other AWS services.

https://read.acloud.guru/building-a-serverless-multi-region-active-active-backend-36f28bed4ecf
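
To give a flavor of the backend piece, here's a minimal boto3-based Lambda handler writing to a DynamoDB table; with cross-region replication enabled on the table, the same function can be deployed in every region behind the DNS failover. The table and attribute names are made up for illustration.

    # Sketch: a region-local Lambda handler backed by DynamoDB. Deploy the same
    # function in each region; cross-region replication on the table keeps the
    # copies in sync. Table and attribute names are illustrative placeholders.
    import json
    import boto3

    table = boto3.resource("dynamodb").Table("api-items")

    def handler(event, context):
        item = json.loads(event["body"])  # API Gateway proxy-style payload
        table.put_item(Item={"id": item["id"], "payload": item})
        return {"statusCode": 200, "body": json.dumps({"id": item["id"]})}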

This presentation covers the key concepts of Apache Airflow, some examples, and describes how to get started. On that last note, the Astro CLI is a tool to bootstrap an Airflow project.

http://blog.tedmiston.com/momentum-2018-airflow-talk/
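
If you want a concrete starting point to go with the talk, an Airflow pipeline is just a Python file defining a DAG and its tasks. Here's a minimal sketch using the stock BashOperator; the DAG name and task commands are placeholders.

    # Sketch: a minimal Airflow DAG with two dependent tasks. The bash commands
    # are placeholders; other operators (Python, Hive, Spark, ...) slot in the
    # same way.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {
        "owner": "data-eng",
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }

    dag = DAG(
        dag_id="example_pipeline",
        default_args=default_args,
        start_date=datetime(2018, 4, 1),
        schedule_interval="@daily",
    )

    extract = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
    load = BashOperator(task_id="load", bash_command="echo loading", dag=dag)

    extract >> load  # load runs only after extract succeeds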

In Loco has written about their data infrastructure, which ingests over 15 TB/day and powers their analytics and business intelligence platform. They use Kafka, Presto, Airflow, Spark, and more.

https://medium.com/inlocotech/data-infrastructure-at-in-loco-5d954cb69b98

Shazam uses AWS Data Pipeline for data processing. Rather than configuring jobs through JSON files, they've moved to a custom Scala DSL, which provides a lot of convenience and correctness advantages. This post introduces the DSL and links to the GitHub repo for the new project.

https://blog.shazam.com/announcing-a-scala-dsl-for-aws-data-pipeline-3797ba7fa79

Job Board

Data Eng Weekly is starting a job board! For the next month, postings are discounted at $149 (regularly $249) for 31 days. Questions or comments? info@dataengweekly.com

Check out our first posting! https://dataengweekly.seeker.company

News

Qubole, as a big data-as-a-service vendor, is able to see some interesting trends based on their customers' usage. In this article, they looked at the migration to Hadoop 2. Since December 2016, Hadoop 2 usage is up 364% while Hadoop 1 usage is down 308%. They've also found that customers are using spot instances much more with Hadoop 2.

https://www.qubole.com/blog/evolution-of-hadoop/

ZDNet has a deeper look at the takeaways and trends from Dataworks Summit, including Hortonworks' new DataPlane project, Hadoop 3.0, and the IBM/Hortonworks partnership.

https://www.zdnet.com/article/modernizing-hadoop-reaching-the-plateau-of-productivity/

This post summarizes the Women in Big Data Panel that took place at Dataworks Summit.

https://dataworkssummit.com/blog/women-big-data-lunch-panel-dataworks-summit-berlin/

Videos of the keynotes from last week's Kafka Summit have been posted.

https://kafka-summit.org/events/kafka-summit-london-2018/

Also posted are the videos and slides from Flink Forward.

https://data-artisans.com/flink-forward-san-francisco-2018

The Call for Papers for Spark+AI Summit Europe closes on May 6th. That conference takes place this October in London.

https://databricks.com/sparkaisummit/eu

Sponsor

We’re Shopify. Our mission is to make commerce better for everyone http://bit.ly/shopify-careers and in the last few years we have built a data warehouse that is now ready to power data products and insights for all of our 500,000+ merchants! We thrive on change, operate on trust, and leverage the diverse perspectives of people on our team in everything we do. We solve problems at a rapid pace and at scale. In short, we get shit done. Join us!

http://bit.ly/join-us-shopify-data

Releases

ING's Wholesale Banking Advanced Analytics team has open sourced a tool for managing Flink jobs. Written in Go, it automates some common tasks. For instance, its update functionality creates a savepoint, cancels the existing job, and starts the new one.

https://medium.com/@ingwbaa/flink-deployer-8c0db4c94fe4
https://github.com/ing-bank/flink-deployer
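
That update flow is roughly the standard cancel-with-savepoint / resubmit dance. For context, here's a rough sketch of doing it by hand with the stock Flink CLI, driven from Python; the job ID, jar path, savepoint directory, and output-parsing regex are illustrative placeholders, and the deployer itself (written in Go) does more than this.

    # Sketch: the manual version of the deployer's update step, using the stock
    # Flink CLI. Job ID, jar path, and savepoint directory are placeholders, and
    # scraping the savepoint path out of the CLI output is fragile by design here.
    import re
    import subprocess

    JOB_ID = "d8a2b1c3deadbeef"            # placeholder job ID
    NEW_JAR = "/jobs/my-job-2.0.jar"       # placeholder jar for the new version
    SAVEPOINT_DIR = "hdfs:///flink/savepoints"

    # Steps 1+2: trigger a savepoint and cancel the running job in one command.
    out = subprocess.check_output(
        ["flink", "cancel", "-s", SAVEPOINT_DIR, JOB_ID]).decode()

    # Step 3: resubmit the new jar, restoring state from that savepoint.
    savepoint_path = re.search(r"(hdfs://\S+|file://\S+)", out).group(1)
    subprocess.check_call(["flink", "run", "-d", "-s", savepoint_path, NEW_JAR])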

Landoop announced version 2.0 of Lenses, their management platform for Apache Kafka. The release adds a JDBC driver for Kafka, a CLI, clients for Python & Go, features for data governance and multi-tenancy, and more. On data governance, there's a new feature for obfuscating sensitive fields. The post has an overview of all the new features.

http://www.landoop.com/blog/2018/04/lenses-2-0/

Apache Accumulo 1.9.0 is out. While the changes are relatively small, the minor version was bumped due to some API changes.

https://lists.apache.org/thread.html/1fe89f7249d1634d5001d3befb0d52cd2fbac3a0e9dcf29d2c8e94e3@%3Cannounce.apache.org%3E

Dremio 2.0 is out (see the post above about one of the main new features). Other new features include an expanded REST API and performance optimizations.

https://www.dremio.com/2.0-announce/

Google Cloud announced a number of changes to their database services—Cloud SQL for PostgreSQL is now GA, Cloud Spanner now includes commit timestamps, and there are two new betas—Cloud Bigtable replication and Cloud Memorystore for Redis.

https://cloudplatform.googleblog.com/2018/04/Accelerating-innovation-for-cloud-native-managed-databases.html

Job Board

Data Eng Weekly is starting a job board! For the next month, postings are discounted at $149 (regularly $249) for 31 days. Questions or comments? info@dataengweekly.com

Check out our first posting! https://dataengweekly.seeker.company

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Data Engineering SD Meetup (San Diego) - Wednesday, May 2
https://www.meetup.com/Data-Engineering-San-Diego/events/249986985/

Colorado

Streaming Data with Apache Kafka + Processing Streaming Data with KSQL, with Tim Berglund (Centennial) - Tuesday, May 1
https://www.meetup.com/DOSUG1/events/248893361/

Texas

Streaming Analytics and the Internet of Things (Plano) - Monday, April 30
https://www.meetup.com/futureofdata-dallas/events/249269993/

Missouri

Data Modeling in Hadoop (Saint Louis) - Wednesday, May 2
https://www.meetup.com/St-Louis-Hadoop-Users-Group/events/250077921/

Minnesota

Kafka as a Service and KSQL for Apache Kafka (Eden Prairie) - Wednesday, May 2
https://www.meetup.com/TwinCities-Apache-Kafka/events/249668155/

Georgia

Fire & Forget: How to Build IoT Message-Based Microservices Apps (Atlanta) - Wednesday, May 2
https://www.meetup.com/Atlanta-Hadoop-Users-Group/events/247349389/

Pennsylvania

Develop Smarter Event-Driven Apps with Fast Data Ingestion (Philadelphia) - Tuesday, May 1
https://www.meetup.com/Big-Data-Developers-in-Philadelphia/events/249881470/

ISRAEL

AWS Big Data Demystified #1 (Tel Aviv) - Sunday, May 6
https://www.meetup.com/AWS-Big-Data-Demystified/events/248921720/