Data Eng Weekly


Hadoop Weekly Issue #217

21 May 2017

This week's issue has several great technical posts, including coverage of Spark, Kafka, Druid, and Hive. There's also a link to the recent paper on Google's Spanner and a new tool for writing streaming applications against data in Kafka with Go. Finally, there are several new releases this week, including Samza, Beam, and Spring Cloud DataFlow.

Technical

This post provides a walkthrough of using Amazon EMR with Apache Spark. It shows how to use both the command line and the AWS console to spin up a cluster and run some Spark jobs.

https://blog.sqreen.io/amazon-emr-spark/

The Hortonworks blog has a post that motivates the need for a shared schema registry, especially for streaming applications. They are planning on shipping, as part of the next HDF release, their own schema registry that will eventually integrate with Apache Atlas and Apache Ranger in addition to Kafka.

https://hortonworks.com/blog/part-2-hdf-blog-series-shared-schema-registry-important/

Cloudera has recently integrated Apache Kafka, Apache Spark, and Apache Ranger to provide encryption and authorization for Spark jobs interacting with Kafka. This post describes the implementation and explains the reasoning behind some of the design decisions.

http://blog.cloudera.com/blog/2017/05/reading-data-securely-from-apache-kafka-to-apache-spark/

The Hortonworks blog has a post that demos the row and column-level access control and data masking support in Hive and Spark SQL via Apache Ranger.

https://hortonworks.com/blog/row-column-level-control-apache-spark/

After last week's paper from the Amazon Aurora team, this week I'm including a link to the Google Spanner paper. Published as part of the SIGMOD '17 proceedings, this paper focuses on the query planning and execution components of Spanner (vs. the original paper, which focused on its fault tolerance, scalability, etc.).

https://dl.acm.org/citation.cfm?id=3056103

The Pyroclast blog has a post on event stream processing. It describes some key insights into the reasoning behind the push toward an event log, rather than a relational database, for capturing state. The post introduces the notion of "Data Loss by Design" and points out that a traditional database doesn't let you scale reads independently from writes (among other limitations). With a log-centric architecture, though, "each query gets it own schema" and reads can be decoupled and scaled separately from writes.

http://pyroclast.io/blog/2017/05/17/the-future-of-event-stream-processing.html
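
As a rough illustration of the log-centric idea in the item above, here is a minimal Python sketch. It is not taken from the linked post: the names `EventLog`, `View`, `total_by_user`, and `count_by_item` are hypothetical, and an in-memory list stands in for a durable log such as Kafka. Writers only append events, and each read model folds the same events into its own schema at its own pace.

```python
class EventLog:
    """Hypothetical append-only log; an in-memory stand-in for a durable log."""

    def __init__(self):
        self._events = []

    def append(self, event):
        # Writers never update in place; they only add new events.
        self._events.append(event)
        return len(self._events) - 1  # offset of the appended event

    def read_from(self, offset):
        # Return a copy of all events at or after the given offset.
        return self._events[offset:]


class View:
    """Hypothetical read model that derives its own schema from the log."""

    def __init__(self, log, fold, initial):
        self._log = log
        self._fold = fold
        self.state = initial
        self.offset = 0  # each view tracks its own position in the log

    def catch_up(self):
        # Apply every event this view has not seen yet, in log order.
        for event in self._log.read_from(self.offset):
            self.state = self._fold(self.state, event)
            self.offset += 1
        return self.state


def total_by_user(state, event):
    # Schema 1: total amount spent per user.
    new_state = dict(state)
    new_state[event["user"]] = new_state.get(event["user"], 0) + event["amount"]
    return new_state


def count_by_item(state, event):
    # Schema 2: number of purchases per item.
    new_state = dict(state)
    new_state[event["item"]] = new_state.get(event["item"], 0) + 1
    return new_state


log = EventLog()
log.append({"user": "ann", "item": "book", "amount": 12})
log.append({"user": "bob", "item": "book", "amount": 12})
log.append({"user": "ann", "item": "pen", "amount": 3})

# Two independent readers over the same log, each with its own schema.
spend = View(log, total_by_user, {})
popularity = View(log, count_by_item, {})
```

Because each view keeps its own offset, one reader can lag behind or be rebuilt from offset 0 without affecting writers or other readers. In this sketch, that is what decoupling reads from writes amounts to.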

The Databricks blog has a post on some of the operational concerns for Spark's Structured Streaming, particularly those related to observability. There are new APIs to get the status of a query, to get its recent progress (including input and processing rates), and to track progress via callback. The post also has a quick overview of alerting, failure recovery, and updates.

https://databricks.com/blog/2017/05/18/taking-apache-sparks-structured-structured-streaming-to-production.html

This post describes how to build an OLAP table in Druid from data in Apache Hive. It's the second part in a series on integrating Hive with Druid; the first part has more context on when and why this might be a good idea.

https://hortonworks.com/blog/sub-second-analytics-hive-druid/

News

The Data Platforms 2017 conference is this week in Phoenix. A press release on the Qubole blog mentions several of the speakers.

http://www.qubole.com/company/data-platforms-2017-expands-speaker-line-up/

Releases

Goka is a new project for writing stream processing applications in Go that are backed by data in Kafka.

https://github.com/lovoo/goka

Apache Flink has published Docker images to Docker Hub. The companion documentation has examples of multi-node clusters using Docker Compose and Docker Swarm.

http://flink.apache.org/news/2017/05/16/official-docker-image.html

Version 0.13 of Apache Samza has been released. Samza was an early entry into the distributed stream processing space, but it has had a more low-level programming model than other systems. This release adds a new high-level API, support for rolling upgrades, improved failure detection, and more.

https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces11

Spring Cloud DataFlow is a data processing system that supports both streaming and batch. There are a number of connectors and features, including support for the reactive programming model and new experimental support for Kafka Streams in the 1.2 release. The release notes cover many more features.

https://content.pivotal.io/blog/announcing-spring-cloud-data-flow-1-2

Version 1.1.0 of Apache CarbonData, the columnar data format, has been released. Major features of the release include a new V3 data format that improves scan performance and a batch sort operation to speed up data loading.

https://lists.apache.org/thread.html/19ef95243c245bfbc2f55ade3e34df75b1c0d8375508be73bae27856@%3Cuser.carbondata.apache.org%3E

Apache Beam 2.0.0 was released this week. Highlights of the release include support for stateful data processing, support for file systems (including built-in support for HDFS), and a metrics system. The major version also signifies API stability for all future releases on the 2.x branch.

https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces12

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Apache Kafka and the Rise of Real-Time, with Neha Narkhede (Hollywood) - Monday, May 22
https://www.meetup.com/Los-Angeles-Apache-Kafka-Meetup/events/239693106/

Stream Processing with Apache Kafka and Apache Samza (Sunnyvale) - Wednesday, May 24
https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/events/238303422/

Unified, Efficient, and Portable Data Processing with Apache Beam (Santa Clara) - Wednesday, May 24
https://www.meetup.com/futureofdata-siliconvalley/events/239369704/

Netflix's Performance Optimization of Recommendation Pipeline Using Spark SQL (Los Gatos) - Thursday, May 25
https://www.meetup.com/SF-Big-Analytics/events/239642830/

North Carolina

Building Next-Generation Systems of Record with MapR Streams (Charlotte) - Thursday, May 25
https://www.meetup.com/CharlotteHUG/events/235107622/

UNITED KINGDOM

Stream Analytics with SQL on Apache Flink (London) - Tuesday, May 23
https://www.meetup.com/Apache-Flink-London-Meetup/events/239930950/

DevOps & Data (Glasgow) - Wednesday, May 24
https://www.meetup.com/DevOpsGlasgow/events/239901246/

Kubernetes and Kafka for Fun and Profit (London) - Wednesday, May 24
https://www.meetup.com/London-MongoDB-User-Group/events/239379888/

TensorFlow + Kubernetes + Spark + ElasticSearch + Beam + BigQuery + Dataflow (London) - Wednesday, May 24
https://www.meetup.com/gdgcloud/events/237703920/

Spark Strata Highlights! (London) - Thursday, May 25
https://www.meetup.com/Spark-London/events/240107606/

FRANCE

The Discovery of Kafka (Toulouse) - Tuesday, May 23
https://www.meetup.com/Toulouse-DevOps/events/239479265/

NETHERLANDS

"Fast Data in Supply Chain Planning" by Jeroen Soeters (Rotterdam) - Tuesday, May 23
https://www.meetup.com/Domain-Driven-Design-Nederland/events/240051904/

BELGIUM

"Fast Data in Supply Chain Planning" by Jeroen Soeters (Halle) - Wednesday, May 24
https://www.meetup.com/dddbelgium/events/239877651/

ISRAEL

Big Data Analytics in Real Time (Tel Aviv-Yafo) - Wednesday, May 24
https://www.meetup.com/Big-Data-Israel/events/240037171/

Kafka & Mongo: The Master of Clusters (Tel Aviv-Yafo) - Wednesday, May 24
https://www.meetup.com/Big-things-are-happening-here/events/239606130/

INDIA

Kafka Meetup @ Linkedin (Bangalore) - Saturday, May 27
https://www.meetup.com/Bangalore-Apache-Kafka-Group/events/239902829/

Spark Streaming (Bangalore) - Saturday, May 27
https://www.meetup.com/Real-Time-Data-Processing-and-Cloud-Computing/events/239849479/

If you didn't receive this email directly and you'd like to subscribe to weekly emails, please visit https://hadoopweekly.com