Data Eng Weekly

Hadoop Weekly Issue #155

31 January 2016

Stream processing remains a hot topic this week with a proof of concept message queue built on Kudu, an update on Samza at LinkedIn, a post about delivery semantics from Spark streaming to Kafka, and more. There are also a few posts this week about machine learning with Spark (including Google's TensorFlow and H2O). And last but not least, this week marks Hadoop's 10th birthday, and there are a couple of articles to mark the occasion.


Apache Kudu (incubating) is a new distributed storage engine with similarities to HBase and Cassandra. This post demonstrates a proof of concept Kafka-like queue system built using Kudu. Kudu has a few architectural differences with Cassandra/HBase that make it a better fit for this use-case. The post contains a few performance numbers and discusses next steps (the POC requires a patch to Kudu, and there are some other potential gotchas).

Google recently open sourced TensorFlow, its framework for machine learning and data flow graphs, to much fanfare. The Databricks blog has an example of using TensorFlow with Spark via TensorFlow's Python bindings. Parallelizing the computation (e.g. evaluating many hyperparameter settings at once) both speeds up training and lowers the resulting error rates.
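The kind of parallel model selection the post describes can be sketched without Spark or TensorFlow at all. The snippet below is a minimal stand-in: `train_and_evaluate` and its error surface are made up, and a Spark cluster would replace the thread pool in the real setup.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for a training run: returns an "error rate" for one
# hyperparameter setting. In the Databricks post, each evaluation is a
# TensorFlow training job farmed out to a Spark cluster.
def train_and_evaluate(learning_rate):
    # Hypothetical error surface with a minimum at learning_rate = 0.1.
    return (learning_rate - 0.1) ** 2

candidates = [0.001, 0.01, 0.1, 0.5, 1.0]

# Evaluate all candidate settings in parallel, then keep the best one.
with ThreadPoolExecutor() as pool:
    errors = list(pool.map(train_and_evaluate, candidates))

best_rate = candidates[errors.index(min(errors))]
print(best_rate)  # 0.1
```

Trying more candidates in the same wall-clock time is what drives the error rate down.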

The LinkedIn engineering blog has a post about Samza at LinkedIn. It shares some Kafka numbers (1.3 trillion events per day), describes how they use Databus (their DB change capture system) with Kafka and Samza, and gives an overview of several use-cases for Samza at LinkedIn. With these use-cases in mind, the post walks through the features that enable large-scale stream processing with Samza (e.g. local state with RocksDB). Finally, the post describes several new features from the Samza 0.10 release: host affinity, the broadcast stream, the coordinator stream, RocksDB TTL, and more.

The Altiscale blog has an in-depth post on dynamic partitioning in Hive. Dynamic partitioning (i.e. loading data into partitions via a Hive query/job) is useful when loading existing unpartitioned data, when adding or removing partition columns, and when the partition values aren't known ahead of time. The post describes these use-cases, some relevant settings that might need to be tweaked, and provides a full example of dynamically partitioning data as it's loaded into a new Hive table.
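As a rough sketch of the query shape involved, the snippet below assembles the HiveQL statements as strings (the table and column names are hypothetical; `hive.exec.dynamic.partition` and its `nonstrict` mode are the standard Hive settings).

```python
# Settings the post's examples rely on, as HiveQL strings.
settings = [
    "SET hive.exec.dynamic.partition = true;",
    # nonstrict mode allows every partition column to be dynamic
    "SET hive.exec.dynamic.partition.mode = nonstrict;",
]

# The partition column (dt) is left out of the SELECT's ordinary column
# list and selected last; Hive routes each row to a partition by value.
insert = (
    "INSERT OVERWRITE TABLE events_partitioned PARTITION (dt) "
    "SELECT user_id, action, dt FROM events_raw;"
)

print("\n".join(settings + [insert]))
```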

The Cloudera blog has a post demoing Spark MLlib and H2O for training a linear model from SparkR and PySpark. The code is pretty straightforward, and there's a discussion of each of the four examples.
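For reference, the underlying fit is ordinary least squares; a tiny dependency-free sketch of the one-variable case is below (the data points are made up, and the post uses MLlib and H2O rather than this closed form).

```python
# Fit y = slope * x + intercept by closed-form least squares.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = covariance(x, y) / variance(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(round(slope, 2), round(intercept, 2))  # 1.94 1.15
```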

The MapR blog has an in-depth look at large scale stream processing. In addition to looking at stream processing platforms, it discusses processing goals (e.g. volume and timeliness), delivery semantics, and some areas we're likely to see progress soon (e.g. disaster recovery, security, administration).

The AWS Big Data Blog has an example of using the Campanile framework, which combines boto and MapReduce streaming to orchestrate large data copies in S3. The post describes how the framework, whose code is available on GitHub, combines these tools to automate the operations.

With the Super Bowl next weekend, MapR has a post showing how to use Spark to predict the over/under for the big game based on historical data. Using a handful of features (such as time of day, temperature, roof type), the post uses Spark's k-nearest neighbors implementation to find the most similar games. Based on the similarity data, they can predict how many points will be scored in next week's game.
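The core of the approach can be sketched in a few lines of plain Python. Everything below is illustrative: the feature values and point totals are invented, while the post works from real historical games (with features like kickoff time, temperature, and roof type) using Spark.

```python
import math

# (features: kickoff hour, temperature, dome flag), total points scored.
historical_games = [
    ((13.0, 72.0, 0.0), 44),
    ((16.0, 35.0, 0.0), 37),
    ((18.0, 70.0, 1.0), 51),
    ((13.0, 40.0, 0.0), 41),
    ((18.0, 68.0, 1.0), 48),
]

def predict_total(features, k=3):
    # Rank historical games by Euclidean distance in feature space,
    # then average the point totals of the k most similar games.
    by_distance = sorted(
        historical_games,
        key=lambda game: math.dist(features, game[0]),
    )
    return sum(points for _, points in by_distance[:k]) / k

# Predicted total for a hypothetical evening dome game at 70 degrees.
print(predict_total((18.0, 70.0, 1.0)))
```

With real data, the average over the k nearest games is the predicted total, which can then be compared to the posted over/under line.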

Most posts about Spark Streaming and Kafka focus on pulling data from Kafka. This post looks at the reverse process: sending data to Kafka from Spark with the new Producer API. The post gets into several concrete details, such as the settings that matter for reliable message delivery (e.g. acks=all, min.insync.replicas=2). The code, including an implicit class that adds a sendToKafka method to a Spark RDD, is available on GitHub.
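The reliability settings the post highlights look roughly like this (shown kafka-python style since the post's code is Scala; the broker addresses are hypothetical, and note that min.insync.replicas is a topic/broker-side setting, not a producer one):

```python
# Producer-side settings: acks="all" makes the producer wait for every
# in-sync replica to acknowledge a write before considering it sent.
producer_config = {
    "bootstrap_servers": ["broker1:9092", "broker2:9092"],  # hypothetical
    "acks": "all",
    "retries": 5,  # retry transient send failures
}

# Topic/broker-side setting: how many in-sync replicas "all" must
# include before a write is accepted.
topic_config = {
    "min.insync.replicas": "2",
}
```

Together these ensure a message isn't considered delivered until at least two replicas have it.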


Splice Machine, the RDBMS-on-Hadoop company, has announced that they've raised an additional $9 million in financing.

The DBMS2 blog has a post about what the recently released Cloudera Director 2 means for Cloudera and the cloud. It notes a number of technical challenges (mostly related to object stores like Amazon S3), describes what Cloudera sees as its competitive advantage, and discusses the chief competitors in each of the main cloud environments (Amazon, Azure).

Hortonworks is opening a new European office in Cork, Ireland. They plan to expand to 50 people across technical, sales, and administrative roles.

Two additional posts on the DBMS2 blog cover Kafka and Confluent. The first post has a quick introduction to Kafka and its semantics/guarantees. There are some details that can help to level-set expectations when starting out with Kafka: typical throughput, message sizes, and message formats. There's also a discussion of the origins of the name "Kafka," and the second post has some details on the value Confluent adds atop open source Kafka.

HBaseCon 2016 has been announced. It will take place in San Francisco on May 24, 2016. The call for papers is open until February 28, and early bird registration is now available. A post on the Cloudera blog links to presentations and photos from previous years.

This week marks the 10th birthday of Hadoop. The Cloudera blog has a post by Hadoop co-creator Doug Cutting about the past and future of Hadoop. And Datanami has an article that includes interviews with a number of Hadoop developers about the early days of Hadoop at Yahoo.


Apache Hadoop 2.7.2 was released this week. It includes a number of bug fixes across HDFS, YARN, and the other Hadoop components.

CVE-2015-7521, an authorization bug in Apache Hive, was announced this week. The Hive team has released a supplemental jar for Hive 1.0, 1.1, and 1.2 that can be deployed to mitigate the vulnerability.

Version 0.5.6-incubating of Apache Zeppelin, the web-based notebook for data analytics, was released this week. The release contains new backend support for Spark versions through 1.6.0, Elasticsearch, Hive, and Scalding. In addition, there are fixes and improvements to pyspark, YARN, Cassandra, and MapR support. There is also new support for importing and exporting notebooks, search, and storing notebooks with Git.


Curated by Datadog



Simplifying Hadoop with RecordService (San Francisco) - Tuesday, February 2

The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Co-founder Hortonworks (Los Angeles) - Wednesday, February 3

Panel Discussion: Upcoming First Release of the ODPi! (Palo Alto) - Thursday, February 4

H2O Rains with Databricks Cloud (San Francisco) - Thursday, February 4


Bigger, Faster, Easier: Using Hadoop at Scale (Bellevue) - Thursday, February 4


Protecting Sensitive Data in Hadoop (Tempe) - Wednesday, February 3


Different Data Sources and NoSQL, Featuring Jim Bates of MapR (Oklahoma City) - Thursday, February 4


Spark as Part of a Hybrid RDBMS Architecture (Saint Louis) - Wednesday, February 3


Join Doug Cutting, the Creator of Hadoop, for a Look into "The Future of Data" (Atlanta) - Monday, February 1


Integration of Apache Flink and Apache NiFi (Vienna) - Thursday, February 4


Learning Apache Spark: Practically Speaking (Philadelphia) - Thursday, February 4


Apache Flink: What, How, Why, Who, Where? (New York) - Tuesday, February 2


Apache NiFi Introduction (Toronto) - Wednesday, February 3


Real-Time Analytics with Spark and Cassandra & Running Cassandra on Amazon's ECS (Manchester) - Thursday, February 4


Spark (Madrid) - Thursday, February 4


Couchbase & Spark (Munich) - Thursday, February 4


Dive into Hadoop (HDInsight): Common Big Data Analysis Scenarios on Microsoft Azure (Krakow) - Wednesday, February 3


SQL on Hadoop at Scale (Ra'anana) - Tuesday, February 2


Streaming in Big Data World & Bufferserver (Pune) - Wednesday, February 3

Big Data Processing with Apache Spark (Hyderabad) - Saturday, February 6

Interactive Analytics Using Apache Spark (Bangalore) - Saturday, February 6


Azure Data Lake: Analytics + Storage (Melbourne) - Tuesday, February 2