Data Eng Weekly


Hadoop Weekly Issue #192

30 October 2016

This week's issue has posts on new features in several recently released projects—Apache Kafka, Apache Ambari, and Apache NiFi. In news, there's a recap of the recent Spark Summit EU and a new partnership between Qubole and IBM.

Technical

The Confluent blog has an overview of interactive queries, a new feature in Apache Kafka 0.10.1. By exposing a new set of APIs, servers participating in a Kafka Streams application can now serve interactive queries based on their local state. This feature can greatly simplify the process of serving up the results of a streaming application, since an intermediate data store like HBase, Redis, or Cassandra is no longer needed.

http://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
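
For a feel of the idea, here is a minimal Python sketch. It does not use the Kafka Streams API. The class and function names (StreamInstance, partition_for, owner_of, interactive_query) and the word-count workload are hypothetical, chosen only to show how a lookup can be routed to whichever instance holds the local state for a key.

```python
import zlib

NUM_PARTITIONS = 4  # assumed partition count for this toy example


class StreamInstance:
    """A hypothetical application instance that owns some partitions."""

    def __init__(self, name, partitions):
        self.name = name
        self.partitions = set(partitions)
        self.store = {}  # local state: key -> running count

    def process(self, key):
        # Update local state as records arrive (a running count per key).
        self.store[key] = self.store.get(key, 0) + 1

    def query(self, key):
        # Serve a read directly from local state; None if the key is unknown.
        return self.store.get(key)


def partition_for(key):
    # Deterministic hash so the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def owner_of(key, instances):
    # Find the instance that owns the partition this key hashes to.
    partition = partition_for(key)
    return next(i for i in instances if partition in i.partitions)


instances = [StreamInstance("a", [0, 1]), StreamInstance("b", [2, 3])]

# "Stream processing": each record is handled by the partition owner.
for word in ["kafka", "nifi", "kafka", "ambari", "kafka"]:
    owner_of(word, instances).process(word)


def interactive_query(key):
    # "Interactive query": route the lookup to the owner's local state,
    # with no intermediate data store involved.
    return owner_of(key, instances).query(key)


print(interactive_query("kafka"))
```

In a real deployment the routing step would be a network call between application instances rather than a local function call.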

This presentation gives a thorough overview of Apache Ambari, which is a tool for managing clusters running Hadoop (and much more). Topics covered include metrics, alerts, log search, security setup, and cluster upgrades.

http://www.slideshare.net/hortonworks/apache-ambari-past-present-future

This article on DZone describes several open-source Online Analytical Processing (OLAP) systems—Apache Kylin, Druid, and Apache Lens. There are a number of links to background reading on each.

https://dzone.com/articles/olap-for-big-data

Whenever I post a benchmark, I caveat that the results aren't generally applicable. In fact, two vendors can often paint their own system as the best under similar constraints. Amazon Redshift and Google BigQuery seem to be undergoing one such horse race right now. In this benchmark, Redshift is shown to be 6x faster than BigQuery on a TPC-DS workload.

https://aws.amazon.com/blogs/big-data/fact-or-fiction-google-big-query-outperforms-amazon-redshift-as-an-enterprise-data-warehouse/

One of the biggest limitations in distributed databases is the reliance on a primary key to distribute data. While some systems have come up with solutions to handle queries on other data elements, Replex is a fresh take on how to efficiently query on other columns. As usual, The Morning Paper has a great overview of the highlights of this USENIX ATC 2016 best paper award winner.

https://blog.acolyer.org/2016/10/27/replex-a-scalable-highly-available-multi-index-data-store/

Hortonworks recently ran a webinar on new features in Hortonworks DataFlow 2.0 (built on Apache NiFi 1.0). They've posted a summary of major highlights, the slides from the webinar, and several answers to questions from the audience. The post is a great way to get familiar with the main features of NiFi.

http://hortonworks.com/blog/guide-new-features-hortonworks-dataflow-2-0/

pgpool is a reverse proxy for PostgreSQL that is compatible with Amazon Redshift. This post describes how to use its caching features to reduce the latency of repeated queries to a Redshift cluster by taking advantage of Amazon ElastiCache.

https://aws.amazon.com/blogs/big-data/using-pgpool-and-amazon-elasticache-for-query-caching-with-amazon-redshift/
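
The general pattern of query caching in front of a warehouse can be sketched in a few lines of Python. This is not pgpool's implementation or configuration. The QueryCache class, the in-process dict standing in for an external cache, the fake_backend function, and the clear-everything-on-write invalidation are all assumptions made for illustration.

```python
import hashlib


class QueryCache:
    """Hypothetical caching proxy: repeated SELECTs are served from a cache."""

    def __init__(self, run_query):
        self._run_query = run_query  # callable that talks to the real backend
        self._cache = {}             # stand-in for an external cache service
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(user, database, sql):
        # Key on user, database, and exact query text.
        raw = f"{user}|{database}|{sql}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def execute(self, user, database, sql):
        if not sql.lstrip().upper().startswith("SELECT"):
            # Crude invalidation (an assumption): any write drops all entries.
            self._cache.clear()
            return self._run_query(sql)
        key = self._key(user, database, sql)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        result = self._run_query(sql)
        self._cache[key] = result
        return result


backend_calls = []


def fake_backend(sql):
    # Pretend warehouse: records each call and returns a fixed row.
    backend_calls.append(sql)
    return [("row",)]


cache = QueryCache(fake_backend)
cache.execute("analyst", "dw", "SELECT count(*) FROM sales")
cache.execute("analyst", "dw", "SELECT count(*) FROM sales")  # cache hit
print(len(backend_calls), cache.hits, cache.misses)
```

The second identical query never reaches the backend, which is where the latency savings for repeated dashboard-style queries come from.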

MapR has a tutorial describing how to get started with the MapR Sandbox by launching it on a Microsoft Azure VM. It looks like a great way to get familiar with Azure, MapR, or both.

https://www.mapr.com/blog/7-steps-deploy-mapr-sandbox-microsoft-azure

News

This post introduces the notion of "lean big data" as a way to invest wisely in big data technologies, and it describes five common pitfalls that can lead to a failed project. These include deploying big data tech when you don't have big data, separating application and platform roles too soon, and building without a use case in mind.

http://getindata.com/blog/post/lean-big-data-how-to-avoid-wasting-money-with-big-data-technologies-and-get-some-roi/

Spark Summit EU was this week in Brussels. The Databricks blog has a recap of keynotes and technical presentations from the Databricks team. The major themes seem to be real-time processing and advanced data science (such as deep learning). The post also mentions that videos from the conference are expected to be published online this week.

https://databricks.com/blog/2016/10/26/day-1-databricks-voices-spark-summit-eu.html
https://databricks.com/blog/2016/10/27/day-2-databricks-voices-spark-summit-eu-2016.html

IBM and Qubole announced a partnership in which the Watson Data Platform can leverage the Qubole Data Service for public cloud compute and Apache Spark.

https://www.qubole.com/blog/product/ibm-and-qubole-take-data-science-and-apache-spark-to-the-public-cloud/

Releases

Databricks announced that they're adding support for Machine Learning with GPUs. Compared to pure Spark/Scala code, running TensorFlow on GPUs via Databricks can save time and money.

https://databricks.com/blog/2016/10/27/gpu-acceleration-in-databricks.html

Qubole announced that its support for the Airflow workflow engine is now generally available.

https://www.qubole.com/blog/product/airflow-as-a-service-on-qds-is-generally-available/

Apache Bahir, which is a library providing extensions for Apache Spark, announced release 2.0.1. Built to run on Spark 2.0.1, it adds support for Akka, MQTT, Twitter, and ZeroMQ streaming.

https://lists.apache.org/thread.html/5ed2fca71d8482c60e795798c790ce320ebcea651a3078e308bf2468@%3Cdev.bahir.apache.org%3E

Version 1.9.0 of Apache Parquet MR was released. It includes a number of bug fixes and improvements. There are also some small new features, such as support for delta encoding of 64-bit integers.

https://www.mail-archive.com/announce@apache.org/msg03488.html
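
For readers unfamiliar with delta encoding, here is a minimal Python sketch of the core idea only: store the first value and then the differences between neighbors. It is not Parquet's on-disk format, which also involves blocking and bit-packing. The function names and the sample timestamps are made up for illustration.

```python
def delta_encode(values):
    """Return the first value followed by successive differences."""
    if not values:
        return []
    encoded = [values[0]]
    for previous, current in zip(values, values[1:]):
        encoded.append(current - previous)
    return encoded


def delta_decode(encoded):
    """Rebuild the original values by accumulating the differences."""
    values = []
    running = 0
    for index, delta in enumerate(encoded):
        running = delta if index == 0 else running + delta
        values.append(running)
    return values


# Large, slowly increasing 64-bit values (for example, millisecond timestamps)
# turn into one large number followed by small deltas that pack tightly.
timestamps = [1477785600000, 1477785600005, 1477785600009, 1477785600020]
encoded = delta_encode(timestamps)
print(encoded)  # [1477785600000, 5, 4, 11]
```

The small deltas need far fewer bits than the original values, which is why this kind of encoding suits sorted or slowly changing integer columns.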

Amazon EMR 5.0.3 was released with updates to Hadoop, Presto, and Spark.

http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-whatsnew.html#d0e1201

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Stream Processing Meetup at LinkedIn (Sunnyvale) - Wednesday, November 2
http://www.meetup.com/Stream-Processing-Meetup-LinkedIn/events/234454163/

Ingest at Intuit + An Intro to StreamSets Dataflow Performance Manager (Mountain View) - Thursday, November 3
http://www.meetup.com/SF-Bay-Area-Data-Ingest-Meetup/events/234859565/

Colorado

Integrating Real-Time Data Streams with Spark and Kafka (Centennial) - Tuesday, November 1
http://www.meetup.com/DOSUG1/events/233399900/

Texas

Hands-On Intro to Apache Spark for Data Engineers, Data Scientist, and Developers (Farmers Branch) - Tuesday, November 1
http://www.meetup.com/Big-Data-Developers-in-Dallas/events/235157450/

SQL on Hadoop Meetup: Open Source Presto Query Engine (Austin) - Wednesday, November 2
http://www.meetup.com/Austin-SQL-on-Hadoop-Meetup-Group/events/234782680/

UNITED KINGDOM

Meet with Scott Gnau, CTO of Hortonworks, on the Future of Data (London) - Tuesday, November 1
http://www.meetup.com/futureofdata-london/events/235174621/

November HUGUK Meetup (London) - Thursday, November 3
http://www.meetup.com/hadoop-users-group-uk/events/234099911/

Apache Flink: State of the Union and What's Next (London) - Thursday, November 3
http://www.meetup.com/Apache-Flink-London-Meetup/events/235075480/

29th Big Data London Meet-Up at Big Data LDN (London) - Thursday, November 3
http://www.meetup.com/big-data-london/events/234348517/

NORWAY

Big Data, No Fluff: Let’s Get Started with Hadoop #10 (Oslo) - Thursday, November 3
http://www.meetup.com/Oslo-Hadoop-Big-Data-Meetup/events/231886473/

FRANCE

Spark Meetup with Databricks, Criteo, Qucit, Talend (Paris) - Wednesday, November 2
http://www.meetup.com/Paris-Spark-Meetup/events/235148583/

LITHUANIA

Spark Streaming + Testable ETL (Vilnius) - Wednesday, November 2
http://www.meetup.com/Vilnius-Hadoop-Meetup/events/234912123/

UNITED ARAB EMIRATES

Cloudera Sessions (Dubai) - Tuesday, November 1
http://www.meetup.com/UAE-Big-Data-Group/events/234236120/

CHINA

Flink Meetup (Hangzhou) - Saturday, November 5
http://www.meetup.com/Apache-Flink-Hangzhou-Meetup/events/234730280/

AUSTRALIA

Rethink SQL for Big Data with Apache Drill (Sydney) - Thursday, November 3
http://www.meetup.com/Sydney-Big-Data-Converged-SQL-NoSQL-and-Real-Time/events/233462437/