Data Eng Weekly


Hadoop Weekly Issue #228

13 August 2017

Great articles this week on Apache Metron, Apache Airflow, and building an Apache Kafka connector. There are also some good papers from VLDB on Kafka and Apache Flink, and new releases of Apache Hadoop and Apache Drill.

Technical

Stellar is a scripting language built into the Apache Metron infosec analytics project. It has a number of functions for common use cases, supports basic data structures, and provides a REPL. The post gives a brief intro to Metron, motivates the need for Stellar, walks through the core features with examples, and describes how to build custom functions.

https://hortonworks.com/blog/apache-metron-streaming-architecture-stellar-excel-cybersecurity/

Qubole has a post that summarizes a talk from the Data Platforms conference on migrating from a physical data center to the cloud. Lessons learned include the importance of embracing technology and finding the right tool for the right job (especially in the cloud world where vendors offer a number of tools with different strengths). The post also includes a video of the presentation.

http://www.qubole.com/blog/dont-fear-new-technology-future-moves-fast/

Ambari 2.5 has a bunch of new features, including improved log management/search, improved config management, and new monitoring features (including high availability of the metrics collector). These and other new features are detailed on the Hortonworks blog.

https://hortonworks.com/blog/ambari-2-5/

While Apache Kafka is typically used as the message bus for Spark streaming, Spark also supports reading data from Amazon Kinesis. This post looks at how to get started with Kinesis: defining the schema, connecting to a Kinesis stream, and doing some basic processing.

https://databricks.com/blog/2017/08/09/apache-sparks-structured-streaming-with-amazon-kinesis-on-databricks.html

Qubole has a post on configuring a Jupyter notebook to talk to a remote Spark cluster using the Livy Job Server (server side) and the sparkmagic package (installed locally). A few parts of the post are Qubole-specific, but it's generally a good tutorial for doing this type of setup.

https://www.qubole.com/blog/connecting-jupyter-remote-qubole-spark-cluster-aws-ms-azure-oracle-bmc/

This is a good tutorial for getting started with Apache Airflow. It describes how to install and set up Airflow (including a few gotchas) and walks through defining a DAG and a few tasks.

https://blog.godatadriven.com/practical-airflow-tutorial

The Proceedings of the VLDB Endowment (PVLDB) has published several great papers over the past year. The August papers were just released, and of particular interest to this group may be the one on a modified "Queryable Kafka" and another on how Apache Flink does state management.

http://www.vldb.org/pvldb/vol10.html
http://www.vldb.org/pvldb/vol10/p1646-falk.pdf
http://www.vldb.org/pvldb/vol10/p1718-carbone.pdf

Hadoop and several other tools are inspired by similar systems from Google. Given that papers are published several years into development, Google tends to be far ahead of open-source alternatives. While I don't necessarily agree with the premise of the question in this article, it does have some good points. In particular, we're seeing Google (and others) come full circle on certain design trade-offs (such as removing SQL and then bringing it back).

https://medium.com/@garyorenstein/did-google-send-the-big-data-industry-on-a-10-year-head-fake-9c94d553925a

Great post on the considerations of building a Kafka Connect Source connector. It looks at several of the trade-offs and design challenges in implementing a Kafka connector, including polling vs. change data capture, initial data bootstrapping, extracting keys for topics, and schema evolution.

https://medium.com/@criccomini/so-you-want-to-build-a-kafka-connector-source-edition-357dce7d55f4
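
To make the polling side of that trade-off concrete, here is a minimal sketch in plain Python. It is not Kafka Connect's Java API and not the article's code: the `poll_once` function, the in-memory `table`, and the `id` column used as both offset and record key are all hypothetical, chosen only to illustrate incremental polling with a saved offset.

```python
# Hypothetical sketch of a polling-based source. Nothing here is a real
# Kafka Connect API; the table is an in-memory stand-in for a database.
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def poll_once(
    table: List[Row], last_offset: int, batch_size: int = 2
) -> Tuple[List[Tuple[Any, Row]], int]:
    """Return up to batch_size (key, row) records with id > last_offset,
    along with the offset to persist for the next poll."""
    new_rows = sorted(
        (row for row in table if row["id"] > last_offset),
        key=lambda row: row["id"],
    )[:batch_size]
    # Use the id column as the record key (an assumption for this sketch).
    records = [(row["id"], row) for row in new_rows]
    new_offset = new_rows[-1]["id"] if new_rows else last_offset
    return records, new_offset


table = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]

# Starting from offset 0 doubles as the initial bootstrap of existing rows.
offset = 0
records, offset = poll_once(table, offset)
```

A sketch like this never sees updates or deletes to rows it has already passed, which is one reason a connector author might look at change data capture instead.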

Releases

A couple of weeks back, Apache Drill 1.11.0 was released. Major new features include crypto functions, support for PCAP files, and support for network encryption. The release also includes a number of bug fixes and improvements.

https://drill.apache.org/docs/apache-drill-1-11-0-release-notes/

Apache Hadoop 2.7.4 was released. It includes over 250 bug fixes and improvements to HDFS, YARN, and MapReduce.

https://lists.apache.org/thread.html/329648dfa6a1f878eb16b584e12a18116eb97a4c61e4dc7906850654@%3Cgeneral.hadoop.apache.org%3E

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

#SD BigData Meetup #22 (San Diego) - Tuesday, August 15
https://www.meetup.com/sdbigdata/events/241648428/

Kafka Connect Implementation at GumGum and More (Santa Monica) - Tuesday, August 15
https://www.meetup.com/Los-Angeles-Apache-Kafka-Meetup/events/241311940/

Ask the Expert: Data Science Workbench (Irvine) - Wednesday, August 16
https://www.meetup.com/OCBigData/events/239639531/

Washington

eBay’s Billion-Item Inventory in the Public Cloud (Bellevue) - Wednesday, August 16
https://www.meetup.com/Big-Data-Bellevue-BDB/events/241747664/

Spark ML at Scale + Building Continuous Applications with SnappyData (Bellevue) - Thursday, August 17
https://www.meetup.com/Seattle-Spark-Meetup/events/234726939/

Utah

Big Data Utah Meeting @ Overstock (Midvale) - Wednesday, August 16
https://www.meetup.com/BigDataUtah/events/240352994/

Missouri

Apache Nifi Deep Dive (Columbia) - Wednesday, August 16
https://www.meetup.com/devcomo/events/240923760/

Wisconsin

Sensors, Spark, and Kafka: Applied Machine Learning (West Allis) - Tuesday, August 15
https://www.meetup.com/MKE-Big-Data/events/239664827/

New York

How to Process 50 Billion Monthly Messages with Full Availability & Performance (New York) - Tuesday, August 15
https://www.meetup.com/mysqlnyc/events/242062843/

CANADA

Log Ingestion & Analysis w/ Kafka, Apache NiFi, and Hadoop (Vaughan, ON) - Saturday, August 19
https://www.meetup.com/Everything-Big-Data-Hadoop-Google-Bigquery-Azure-Warehouse/events/242263151/

CHINA

Shanghai Apache Spark (Shanghai) - Saturday, August 19
https://www.meetup.com/Shanghai-Apache-Spark-Meetup/events/242431501/

If you didn't receive this email directly and you'd like to subscribe to weekly emails, please visit https://hadoopweekly.com