Data Eng Weekly


Hadoop Weekly Issue #182

14 August 2016

Welcome to a double-issue of Hadoop Weekly. There's tons of content this week—lots of technical posts covering Impala, Spark, Metron, Kafka, Hadoop, and more. In news, Hortonworks announced earnings, MapR announced Series E financing, and there are several upcoming conferences. Finally, there are eight releases (including two new open-source projects from Blizzard and Uber).

Technical

The Cloudera blog has an overview of speed-ups in the recently released Apache Impala (incubating) 2.6. The largest jump in performance is with encryption enabled, and there are also speedups due to improvements in runtime filters, dynamically sized bloom filters, improved codegen for top-N queries, faster decimal arithmetic, and more.

http://blog.cloudera.com/blog/2016/08/bi-and-sql-analytics-with-apache-impala-incubating-in-cdh-5-8-3x-faster-on-secure-clusters/

Another big part of the Impala 2.6 release is better support for data stored in Amazon S3. This post describes some of the optimizations in and architectural changes to Impala's read and write pipelines to make the most of S3. There are some benchmarks showing that doubling the size of an Impala cluster roughly doubles throughput.

http://blog.cloudera.com/blog/2016/08/analytics-and-bi-on-amazon-s3-with-apache-impala-incubating/

Apache Spark has the momentum of a large developer community and many big companies behind it. But there are several other big data systems that are emerging with technical differentiators—Apache Apex, Apache Flink, Heron (from Twitter), and Onyx.

http://www.infoworld.com/article/3101729/big-data/big-data-brawlers-4-challengers-to-spark.html

Mobius is a .NET framework for Apache Spark. This post gives an introduction to the library and provides an example of using the API from F#.

https://databricks.com/blog/2016/08/03/developing-apache-spark-applications-in-net-using-mobius.html

While stream processing is getting a lot of attention, real-time data is not always needed. Sometimes near real-time (i.e. > 5 minutes latency) is good enough, which is the case for many applications at Uber. With relaxed latency requirements, you can build a much more efficient (both in terms of CPU resources and developer time) pipeline by using short-lived batch processing applications that operate on "mini batches." Given that Uber is operating at tremendous scale, this post offers an interesting take grounded in practical experience.

https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop
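
To make the "mini batch" idea concrete, here is a minimal sketch of an incremental run that only touches records newer than the last checkpoint. It is not Uber's implementation; the function name, the (timestamp, key, value) record shape, and the in-memory state are assumptions made purely for illustration.

```python
from collections import defaultdict


def run_mini_batch(records, state, checkpoint):
    """Fold records newer than `checkpoint` into `state`; return the new checkpoint.

    `records` is an iterable of (timestamp, key, value) tuples (hypothetical shape).
    """
    new_checkpoint = checkpoint
    for ts, key, value in records:
        if ts <= checkpoint:
            continue  # already handled by an earlier run
        state[key] += value
        new_checkpoint = max(new_checkpoint, ts)
    return new_checkpoint


# Two short-lived runs over a growing input: the second run skips old records.
state = defaultdict(int)
records = [(1, "a", 2), (2, "b", 3)]
checkpoint = run_mini_batch(records, state, 0)
records.append((3, "a", 5))
checkpoint = run_mini_batch(records, state, checkpoint)
```

A real pipeline would persist the checkpoint and state between runs and schedule a run every few minutes; the sketch keeps both in memory.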

Confluent's monthly "Log Compaction" post has an overview of Apache Kafka and related news. It describes a number of features (e.g. APIs to manage ACLs, queryable state for Kafka Streams) that are in discussion or under development for future Kafka releases.

http://www.confluent.io/blog/log-compaction-highlights-in-the-apache-kafka-and-stream-processing-community-august-2016

Paige Roberts has published an interview with Ryan Merriman, an architect of the Apache Metron project. Metron is a real-time information security platform built on open-source tools like Apache Kafka and Apache Storm. The first part of the interview contains a good high-level overview of Metron's architecture, and the second part includes a discussion of why the project chose Apache Storm instead of other tools like Apache Flink and Apache Spark.

http://bigdatapage.com/cyber-security-apache-metron-and-storm/

CPU scaling is a mechanism for making processors more power efficient. It's often a great idea, but it can cause mysterious performance problems when working with a big data system like Hadoop (speaking from experience). This post describes how to determine if CPU scaling is enabled and how to disable it.

https://developer.ibm.com/hadoop/2016/08/03/is-cpu-scaling-impacting-your-hadoop-performance/
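
As a rough illustration of the "is it enabled?" check, the sketch below reads the scaling governor that the Linux cpufreq subsystem exposes for each CPU under sysfs. The linked post may use a different method; the function names are made up for this example, and treating any governor other than "performance" as "scaling active" is a simplifying assumption.

```python
import glob
import os


def read_governors(sysfs_root="/sys/devices/system/cpu"):
    """Map each CPU directory name (e.g. 'cpu0') to its cpufreq scaling governor."""
    governors = {}
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    for path in sorted(glob.glob(pattern)):
        cpu = path.split(os.sep)[-3]  # .../cpuN/cpufreq/scaling_governor
        with open(path) as f:
            governors[cpu] = f.read().strip()
    return governors


def scaling_active(governors):
    """Assumption: any governor other than 'performance' means scaling is active."""
    return any(g != "performance" for g in governors.values())


if __name__ == "__main__":
    found = read_governors()
    if not found:
        print("No cpufreq information found (not Linux, or no cpufreq driver loaded).")
    else:
        for cpu, governor in found.items():
            print(f"{cpu}: {governor}")
        print("scaling active:", scaling_active(found))
```

Running this on each worker node gives a quick view of which hosts are not pinned to the performance governor.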

This post describes how to detect similar items across two lists using TF-IDF and cosine-similarity. From there, it explains how to use numpy and pyspark to do matrix multiplication and find the common items. There's some example code for this pipeline.

https://labs.yodas.com/large-scale-matrix-multiplication-with-pyspark-or-how-to-match-two-large-datasets-of-company-1be4b1b2871e
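
For readers who want the gist before clicking through, here is a small single-machine sketch of the same idea using only numpy: build L2-normalized TF-IDF vectors for both lists, then multiply the two matrices to get every pairwise cosine similarity. This is not the code from the post (which distributes the multiplication with pyspark); the function names, the whitespace tokenizer, and the smoothed IDF formula are assumptions chosen for illustration.

```python
import math
from collections import Counter

import numpy as np


def tfidf_matrix(docs, vocab, idf):
    """Build an L2-normalized TF-IDF matrix with one row per document."""
    index = {term: i for i, term in enumerate(vocab)}
    mat = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for term, count in Counter(doc.lower().split()).items():
            mat[row, index[term]] = count * idf[term]
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for empty documents
    return mat / norms


def best_matches(list_a, list_b):
    """For each item in list_a, return (item_a, best item_b, cosine similarity)."""
    all_docs = list_a + list_b
    tokenized = [set(d.lower().split()) for d in all_docs]
    vocab = sorted(set().union(*tokenized))
    n = len(all_docs)
    # Smoothed inverse document frequency computed over both lists combined.
    idf = {
        t: math.log((1 + n) / (1 + sum(t in doc for doc in tokenized))) + 1
        for t in vocab
    }
    a = tfidf_matrix(list_a, vocab, idf)
    b = tfidf_matrix(list_b, vocab, idf)
    sims = a @ b.T  # rows are unit vectors, so each dot product is a cosine
    best = sims.argmax(axis=1)
    return [(list_a[i], list_b[j], float(sims[i, j])) for i, j in enumerate(best)]


matches = best_matches(
    ["acme corp", "globex industries"],
    ["globex industries inc", "acme corporation", "initech"],
)
```

The dense matrices here only work for small inputs; at the scale the post targets, the vectors would be sparse and the multiplication split across a cluster.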

Apache Commons Crypto is a library that takes advantage of the AES-NI (Advanced Encryption Standard New Instructions) instruction set on modern hardware to improve the performance of encrypting and decrypting data. A new initiative is using this library to add encryption to data shuffled by Spark. The post includes some performance benchmarks that show little overhead compared to unencrypted shuffle.

http://blog.cloudera.com/blog/2016/08/securing-apache-spark-shuffle-using-apache-commons-crypto/

The AWS Big Data Blog has a crash course in Apache Bigtop (the packaging system for the Hadoop ecosystem) with an eye towards adding another application to an Amazon EMR cluster via bootstrap actions.

http://blogs.aws.amazon.com/bigdata/post/TxNJ6YS4X6S59U/Building-and-Deploying-Custom-Applications-with-Apache-Bigtop-and-Amazon-EMR

Hadoop clusters have traditionally married compute with storage to increase performance by taking advantage of data locality. But with improvements in network throughput and the adoption of the cloud, more deployments are decoupling compute and storage. The Pivotal blog has a post about this topic, including some recommendations on when to use network-attached storage rather than direct-attached storage.

https://blog.pivotal.io/big-data-pivotal/products/the-das-vs-nas-debate-for-apache-hadoop

This post describes the Hadoop daemon code that is part of the 3.x branch, and the flexibility that the new shell code offers. It includes an example of adding support for launching the daemons with cgexec (for Linux cgroups).

https://effectivemachines.com/2016/08/10/taking-control-of-daemons-in-apache-hadoop/

Some posts end up being too hypothetical or not grounded in real-world examples. To that end, I'm not entirely sure what to make of this post that's based on a hypothetical Facebook feature from an Onion News article. It does seem to be an interesting application of Apache Kafka Streams, though: it illustrates a number of features like KTables, joins, and windowing.

http://timothyrenner.github.io/engineering/2016/08/11/kafka-streams-not-looking-at-facebook.html

News

The DBMS2 blog has two posts covering Databricks, the company's relationship with the Spark project, and Spark. The first captures some recent progress: Spark as a replacement for MapReduce, the success of Databricks' cloud platform, and Databricks' role in the Apache Spark community. The second captures details of recent improvements to Spark and the Databricks cloud.

http://www.dbms2.com/2016/07/31/notes-on-spark-and-databricks-generalities/
http://www.dbms2.com/2016/07/31/notes-on-spark-and-databricks-technology/

The Apache Beam (incubating) project recently celebrated its sixth month of incubation. This post highlights the progress of the project, including accomplishments like contributions from 45 contributors, a new Flink runner, and work on a Spark 2.0 runner.

http://beam.incubator.apache.org/blog/2016/08/03/six-months.html

Hortonworks announced Q2 results last week. They reported a loss of $64.2 million, but a 46% year-over-year jump in revenues. They have published two posts reflecting on their momentum and describing how some of their customers are using Hortonworks.

http://siliconangle.com/blog/2016/08/04/big-data-leader-hortonworks-shares-plunge-25-percent-on-q2-sales-miss/
http://hortonworks.com/blog/the-journey-ahead/
http://hortonworks.com/blog/the-opportunity-ahead/

MapR announced that they've raised $50 million in a Series E round of financing.

http://fortune.com/2016/08/09/mapr-series-e-50-million/

Doug Meil, co-founder of the Cleveland Big Data meetup, has written about the evolution of the meetup over the years. The post is full of tips for how to organize good talks and engage the meetup community.

https://themeildeal.blogspot.com/2016/08/how-to-create-successful-technical.html

The schedule for FlinkForward, which takes place September 12-14 in Berlin, has been posted.

http://flink-forward.org/program/sessions/

Mesosphere and Confluent announced a partnership to bring the Confluent Platform to the Mesosphere Enterprise Datacenter Operating System. Mesosphere is calling this "container 2.0" because it's defined with stateful systems (like Kafka) in mind.

http://www.zdnet.com/article/mesosphere-datastax-confluent-lightbend-container-2-0-but-is-it-complicated/

A half-day HBaseCon East has been announced. It takes place in New York on September 26th. The call for proposals is open until September 4th.

http://www.meetup.com/HBase-NYC/events/233024937/

Hadoop Summit Melbourne takes place in just over two weeks, on August 31st and September 1st.

http://hortonworks.com/blog/fun-facts-figures-3-weeks-hadoop-summit-melbourne/

Releases

Uber has open-sourced their Kafka replication system, uMirrorMaker. The system is built on Apache Helix and addresses some of the shortcomings of Apache Kafka's MirrorMaker by offering faster rebalancing, dynamic topic addition/removal, and no restarts. uMirrorMaker has been running inside Uber for eight months.

https://eng.uber.com/umirrormaker/

Amazon EMR 5.0 was released. This version upgrades to Spark 2.0 and Hive 2.1, makes Tez the default execution engine for Hive and Pig, and more.

http://blogs.aws.amazon.com/bigdata/post/Tx3KG7STXIZV5QZ/Use-Spark-2-0-Hive-2-1-on-Tez-and-the-latest-from-the-Hadoop-ecosystem-on-Amazon

Apache Flink released version 1.1.0 (and a fast follow-up 1.1.1 to fix a dependency issue). The release has a number of fixes, improvements, and new features. They include support for Amazon Kinesis, an Apache Cassandra sink, a table API (and SQL support), a Scala version of the CEP library, and more.

http://flink.apache.org/news/2016/08/08/release-1.1.0.html
http://flink.apache.org/news/2016/08/11/release-1.1.1.html

Apache Kafka 0.10.0.1 was released. It includes a number of bug fixes and improvements.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201608.mbox/%3CCAD5tkZboAB_J8rE-zACc32ug=6dFL2NBZfdQtiVxQzQQuM9oAA@mail.gmail.com%3E

Blizzard has open sourced node-rdkafka, a Node.js library for Apache Kafka built on librdkafka. The readme has a quick introduction to building and using the library.

https://github.com/Blizzard/node-rdkafka

Apache Samza 0.10.1 was released this week. It includes a number of enhancements (including host affinity when restoring a worker) and performance improvements. The release announcement also describes some future enhancements (REST API, disk quotas, high-level language for Samza).

https://blogs.apache.org/samza/entry/announcing_the_release_of_apache5

Apache Storm 1.0.2 was released. It includes bug fixes and performance enhancements.

https://storm.apache.org/2016/08/10/storm102-released.html

Amazon Web Services announced Amazon Kinesis Analytics, a new hosted stream processing system built with SQL. This post details how to get started by writing and deploying a streaming query.

http://blogs.aws.amazon.com/bigdata/post/Tx2D4GLDJXPKHOY/Writing-SQL-on-Streaming-Data-with-Amazon-Kinesis-Analytics-Part-1

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Getting Hot and Cold with Spark and Big Data (San Diego) - Tuesday, August 16
http://www.meetup.com/San-Diego-Spark-and-Big-Data-Meetup/events/232712937/

53rd Bay Area Hadoop User Group Meetup (Sunnyvale) - Wednesday, August 17
http://www.meetup.com/hadoop/events/232789468/

Interactive Applications on Spark: Big and Small Lessons Learned (Redwood City) - Wednesday, August 17
http://www.meetup.com/RWC-Next-Gen-Data-Preparation-Meetup/events/232353772/

Oregon

Performance in Spark 2.0 (Portland) - Thursday, August 18
http://www.meetup.com/Portland-Spark-User-Group/events/232705430/

Colorado

Kudu: New Apache Hadoop Storage for Fast Analytics on Fast Data with Mike Percy (Greenwood Village) - Wednesday, August 17
http://www.meetup.com/Denver-Cloudera-User-Group/events/232782782/

Kudu: New Apache Hadoop Storage for Fast Analytics on Fast Data with Mike Percy (Broomfield) - Thursday, August 18
http://www.meetup.com/Boulder-Denver-Big-Data/events/232056701/

Ohio

Spark Meetup 2016 #3 (Mason) - Wednesday, August 17
http://www.meetup.com/Cincinnati-Apache-Spark-Meetup/events/232538903/

Florida

Streaming Design Patterns, Revolutionizing Architectures Using Kafka (Jacksonville) - Wednesday, August 17
http://www.meetup.com/Jacksonville-JAVA-User-Group-JaxJUG/events/233016933/

Georgia

Cutting Edge with HBase (Atlanta) - Wednesday, August 17
http://www.meetup.com/Atlanta-Hadoop-Users-Group/events/230344766/

District of Columbia

Cloudera on Microsoft Azure for Real-Time Workloads (Washington) - Wednesday, August 17
http://www.meetup.com/DC-Cloudera-User-Group/events/232929253/

New York

Apache NiFi - MiNiFi: Taking Dataflow Management to the Edge (New York) - Thursday, August 18
http://www.meetup.com/futureofdata-newyork/events/231042461/

Massachusetts

Apache NiFi - MiNiFi: Taking Dataflow Management to the Edge (Boston) - Wednesday, August 17
http://www.meetup.com/futureofdata-boston/events/231196482/

CANADA

A Voldemort Introduction by a LinkedIn Engineer (Montreal) - Tuesday, August 16
http://www.meetup.com/Big-Data-Montreal/events/233197738/

MEXICO

Big Data Science Applications (Mexico City) - Wednesday, August 17
http://www.meetup.com/Spark-Ciencia-de-Datos-BigData-Analytics-y-Matematicas/events/232977069/

NORWAY

Big Data, No Fluff: Let’s Get Started with Hadoop #8 (Oslo) - Thursday, August 18
http://www.meetup.com/Oslo-Hadoop-Big-Data-Meetup/events/231886288/

UNITED ARAB EMIRATES

Big Data Infrastructure (Dubai) - Tuesday, August 16
http://www.meetup.com/UAE-Big-Data-Group/events/233010786/

INDIA

Spark & Solr Integration (Bangalore) - Saturday, August 20
http://www.meetup.com/Bangalore-Big-Data-Search/events/231670022/