Data Eng Weekly


Hadoop Weekly Issue #233

17 September 2017

Lots of great technical content this week, including posts on Kafka, SparkR, and Amazon EMR. And if you're looking for more, the Kafka Summit videos and slides are online. In releases, Kudu, Impala, Kafka, and Storm all have new versions out this week.

Technical

While most of the articles I highlight target distributed systems, this one covers some Python tools for training a model and serving results. It focuses on the data engineering aspects of that work, from getting started with a simple example to scaling up in a production environment.

https://content.pivotal.io/blog/automated-machine-learning-deploying-automl-to-the-cloud

Confluent has written about the testing of Apache Kafka. They highlight that before any code is written, the design is "tested" through an open Kafka Improvement Process. From there, code undergoes unit tests, integration tests in a single process, and system tests that check performance and correctness across multiple instances, including runs with injected faults.

https://www.confluent.io/blog/apache-kafka-tested/

This post provides an interesting overview of the types of IO supported by the Linux kernel (vectored IO, memory mapping, and async IO) and the trade-offs between them. While not necessarily useful every day, the material is relevant because many of the data systems covered in this newsletter rely on advanced features such as mmap and fadvise.

https://medium.com/@ifesdjeen/on-disk-io-part-2-more-flavours-of-io-c945db3edb13
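To make two of those IO flavours concrete, here is a minimal sketch using only Python's standard library on Linux. It is not taken from the linked post: the function name and sample bytes are invented for illustration, and async IO is left out because the standard library does not expose the kernel's async IO interface directly.

```python
import mmap
import os
import tempfile


def demo_io_flavours():
    """Write with vectored IO, then read back via mmap and via pread.

    Hypothetical helper for illustration; assumes a Unix-like system.
    """
    fd, path = tempfile.mkstemp()
    try:
        # Vectored IO: a single system call writes several buffers in order.
        header, body = b"HEAD", b"payload-bytes"
        written = os.writev(fd, [header, body])

        # Advisory hint about the expected access pattern; the kernel may
        # ignore it. Guarded because not every platform provides the call.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)

        # Memory mapping: the file contents are exposed as a bytes-like view.
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as view:
            mapped = bytes(view[:])

        # Positional read for comparison; it does not move the file offset.
        tail = os.pread(fd, len(body), len(header))
        return written, mapped, tail
    finally:
        os.close(fd)
        os.remove(path)
```

Calling `demo_io_flavours()` returns the number of bytes written, the full contents as seen through the mapping, and the body as read by `pread`. The trade-offs the post discusses (page cache behaviour, copy overhead, blocking) are not measured here.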

Good tutorial walking through how to use R's dplyr library to analyze data about bike trips in NYC. For scalability, the application makes use of Amazon EMR and Spark.

https://content.pivotal.io/blog/using-sparkr-to-analyze-citi-bike-data

Hortonworks has done a performance analysis of Apache HBase and Apache Cassandra using the Yahoo Cloud Serving Benchmark. For the testing, the services were configured to read and write data on AWS attached SSD storage. Unsurprisingly, HBase was faster for reads and Cassandra performed better on write-heavy workloads.

https://hortonworks.com/blog/hbase-cassandra-benchmark/

Confluent and Kafka co-founder Jay Kreps writes about several use-cases in which Apache Kafka is a good choice for your data's source of truth. These include a centralized log of changes, powering in-memory caches for online systems, kappa architecture use-cases, and change data capture. Jay argues that Kafka is the commit log for the datacenter, but at the same time it won't replace traditional databases because Kafka doesn't plan to support arbitrary queries.

https://www.confluent.io/blog/okay-store-data-apache-kafka/
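The "log of changes powering an in-memory cache" idea can be sketched without any Kafka client. In the snippet below, a plain Python list stands in for an ordered topic partition, and a `None` value is assumed to mark a delete; the record shape and the `rebuild_cache` name are hypothetical and not taken from the post.

```python
from typing import Dict, Iterable, Optional, Tuple

# A change record is (key, value); a value of None marks a delete.
# This shape is an assumption made for the sketch.
Change = Tuple[str, Optional[str]]


def rebuild_cache(log: Iterable[Change]) -> Dict[str, str]:
    """Replay an ordered changelog from the beginning to get current state."""
    cache: Dict[str, str] = {}
    for key, value in log:
        if value is None:
            cache.pop(key, None)  # delete; ignore keys that were never set
        else:
            cache[key] = value    # insert or overwrite with the latest value
    return cache


# A list stands in for the durable, ordered log.
changelog = [
    ("user:1", "alice"),
    ("user:2", "bob"),
    ("user:1", "alice-renamed"),
    ("user:2", None),
]
```

Replaying `changelog` leaves only the latest value for `user:1`, since `user:2` was deleted. Because the log is the source of truth, a service that loses its cache can rebuild it by replaying from the start, which is the property the use-cases above depend on.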

Amazon EMR now supports building a cluster from a custom-built AMI. This post walks through what's needed to build and use a custom AMI for EMR, as well as some of the benefits.

https://aws.amazon.com/blogs/big-data/create-custom-amis-and-push-updates-to-a-running-amazon-emr-cluster-using-amazon-ec2-systems-manager/

News

Spark Summit Europe takes place October 24-26 in Dublin. The schedule is available on the conference website. It features speakers from a number of startups and even more from larger companies.

https://spark-summit.org/eu-2017/

Slides and videos from the recent Kafka Summit have been posted on the Confluent website. Access is behind an email/phone form.

https://www.confluent.io/kafka-summit-sf17/resource/

Releases

Qubole has announced general availability of their AIR (Alerts, Insights, Recommendations) service. AIR offers usage-based ranking and context-aware suggestions when writing a query, search across column and table names, usage reports, statistics, and data previews, as well as actionable recommendations for improving data models.

https://www.qubole.com/blog/air-data-intelligence-qubole/

IBM has released a sandbox for trying out its Big SQL platform, which is based on HDP 2.6.2, using a single Docker image.

https://developer.ibm.com/hadoop/2017/09/13/announcing-ibm-big-sql-sandbox/

Apache Kafka 0.11.0.1 was released, which contains a number of bug fixes and minor improvements.

https://lists.apache.org/thread.html/bece1560aedfd104167ac31a19d49650f828058a1afb57a421db72fe@%3Cannounce.apache.org%3E

Version 1.5.0 of Apache Kudu has been released. While it's a minor release, there are new features like the ability to tolerate disk failures at startup and improvements to client tools (e.g. exporting CSV files and a tablet move operation). There are also a number of optimizations and bug fixes. The release notes contain some details for anyone considering the upgrade.

https://lists.apache.org/thread.html/2abfaaf95f86c42e4e99c9b432711222d70d3cbac788b81a2e4cf0cb@%3Cannounce.apache.org%3E

Apache Impala (incubating) has released version 2.10.0. The release contains over 250 tickets for new features, improvements, bug fixes, and more.

https://lists.apache.org/thread.html/481b0b828b1364fb617e6f5aefc70eb6db4450244b71765e87c8344b@%3Cannounce.apache.org%3E

Apache Storm 1.0.5 was released with seven bug fixes.

http://storm.apache.org/2017/09/15/storm105-released.html

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

BDAM: Rules Engine, Apache Airflow & Exactly-Once Processing with Apache Kafka! (Palo Alto) - Wednesday, September 20
https://www.meetup.com/BigDataApps/events/241658466/

Apache Spark, Apache Flink, and Apache Ignite: Where Fast Data Meets the IoT (San Francisco) - Wednesday, September 20
https://www.meetup.com/SF-Spark-and-Friends/events/242935496/

Washington

Spark Structured Streaming: Introduction and Internals (Bellevue) - Wednesday, September 20
https://www.meetup.com/Seattle-Data-Science-and-Data-Engineering/events/241418432/

Illinois

Spark ML with Holden Karau + Building a Recommendation Engine with Cars.com (Chicago) - Thursday, September 21
https://www.meetup.com/Chicago-Spark-Users/events/243264992/

Pennsylvania

Intro to Alluxio and Spark (Philadelphia) - Thursday, September 21
https://www.meetup.com/PhillyBigData/events/241601891/

New York

Best Practices Building Enterprise Data Infrastructure with WeWork (New York) - Tuesday, September 19
https://www.meetup.com/Analytics-Data-Science-by-Dataiku-NY/events/243022414/

UNITED KINGDOM

Streaming Data Pipelines and Kafka as a Message Queue (London) - Wednesday, September 20
https://www.meetup.com/Apache-Kafka-London/events/242981989/

SPAIN

Workshop: Linking Hadoop with Classic DBMS (Madrid) - Monday, September 18
https://www.meetup.com/Big-Data-Madrid/events/243112407/

SWITZERLAND

Our First Kafka Meetup with 2 Amazing Speakers from Confluent (Zurich) - Tuesday, September 19
https://www.meetup.com/Zurich-Apache-Kafka-Meetup-by-Confluent/events/242063921/

ITALY

Back2Spark Meetup: AgileRai and SparkSearch! (Milan) - Wednesday, September 20
https://www.meetup.com/Spark-More-Milano/events/242903730/

FINLAND

Open Source Tools for Big Data (Helsinki) - Tuesday, September 19
https://www.meetup.com/Exove-Extends/events/242923059/

HUNGARY

Deep Dive into Apache Ranger and Atlas (Budapest) - Tuesday, September 19
https://www.meetup.com/futureofdata-budapest/events/242953630/

ROMANIA

September Meetup: Real-Time Transaction Streaming and Big Data in Bioinformatics (Bucharest) - Tuesday, September 19
https://www.meetup.com/Bucharest-Big-Data-Meetup/events/242216817/

GREECE

Riding the Streaming Wave with Kafka (Athens) - Tuesday, September 19
https://www.meetup.com/Athens-Big-Data/events/242856317/

TURKEY

Why Apache Kafka? (Ankara) - Wednesday, September 20
https://www.meetup.com/Ankara-Cloud-Meetup/events/242640036/

AUSTRALIA

Spark Meetup @ DataWorks Summit (Sydney) - Tuesday, September 19
https://www.meetup.com/Sydney-Apache-Spark-User-Group/events/242809823/

Apache NiFi and MiNiFi: Edge to Core (Sydney) - Tuesday, September 19
https://www.meetup.com/futureofdata-sydney/events/243289719/