Data Eng Weekly

Hadoop Weekly Issue #116

12 April 2015

This week, there are two articles on integrating Kafka: the first describes how Druid uses Samza and Kafka, and the second is a tutorial for building a multi-armed bandit application with Kafka and Hadoop. In news, there are articles on using Hadoop for cybersecurity, how two companies are using Spark Streaming, and the history of Hadoop. And in releases, Ambari and Mahout both had big releases with lots of new features this week.


This tutorial describes how to run Hue on a Mac to interface with a Hadoop cluster running inside of a VM. In addition to Hue proper, it walks through setting up all the prerequisites, such as installing Maven and MySQL and creating a host alias in /etc/hosts.

Metamarkets, makers of the open-source distributed analytics system Druid, have a blog post about their experiences switching Druid to use Apache Samza for stream processing. The post describes how they typically process ad-tracking data (impression and click events) in three stages (which map to three Samza jobs): shuffle, join, output. The post describes some of the advantages of Samza (e.g. operational consistency, local state API), some of the downsides (e.g. it can be slow to replay local state), and some future plans (e.g. network isolation, exactly-once semantics).
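The three stages described above can be sketched in plain Python. This is only an illustration of the shuffle, join, output idea and not Metamarkets' Samza code: the event fields (`type`, `impression_id`, `ad`), the function names, and the in-memory partitions are all assumptions made for the example.

```python
from collections import defaultdict


def shuffle(events, num_partitions):
    """Stage 1: route each event to a partition chosen by its join key."""
    partitions = defaultdict(list)
    for event in events:
        index = hash(event["impression_id"]) % num_partitions
        partitions[index].append(event)
    return partitions


def join(partition_events):
    """Stage 2: mark each impression with whether a matching click was seen."""
    impressions = {}
    clicked = set()
    for event in partition_events:
        if event["type"] == "impression":
            impressions[event["impression_id"]] = event
        elif event["type"] == "click":
            clicked.add(event["impression_id"])
    return [dict(imp, clicked=(key in clicked)) for key, imp in impressions.items()]


def output(joined_rows):
    """Stage 3: keep only the fields a downstream consumer would need."""
    return [
        {"impression_id": r["impression_id"], "ad": r["ad"], "clicked": r["clicked"]}
        for r in joined_rows
    ]


def run_pipeline(events, num_partitions=2):
    """Run the three stages in order and return rows sorted by impression id."""
    rows = []
    for partition_events in shuffle(events, num_partitions).values():
        rows.extend(output(join(partition_events)))
    return sorted(rows, key=lambda r: r["impression_id"])
```

Because the shuffle stage sends every event with the same key to the same partition, the join stage only ever needs local state, which is the property the local state API mentioned in the post is meant to support.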

The GoDataDriven blog has a look at building a multi-armed bandit application using Kafka, Hadoop, and the open-source Divolte Collector. The post walks through building a website using Python and Redis, creating a Kafka consumer in Python to process Avro-encoded data, and writing a model-evaluator using numpy.
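For readers new to bandits, here is a minimal sketch of the core idea using Thompson sampling. It is not the code from the GoDataDriven post: the choice of Thompson sampling, the class and function names, and the simulated click rates are assumptions, and it uses only the Python standard library rather than Kafka, Redis, or numpy.

```python
import random


class BetaBandit:
    """Thompson sampling with a Beta(1, 1) prior for each arm."""

    def __init__(self, arms, seed=None):
        self.rng = random.Random(seed)
        self.successes = {arm: 0 for arm in arms}
        self.failures = {arm: 0 for arm in arms}

    def choose(self):
        """Sample a plausible click rate per arm and pick the highest sample."""
        samples = {
            arm: self.rng.betavariate(self.successes[arm] + 1, self.failures[arm] + 1)
            for arm in self.successes
        }
        return max(samples, key=samples.get)

    def update(self, arm, clicked):
        """Record whether showing this arm led to a click."""
        if clicked:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


def simulate(true_rates, rounds, seed=0):
    """Simulate visitors with hypothetical click rates; return pulls per arm."""
    rng = random.Random(seed)
    bandit = BetaBandit(list(true_rates), seed=seed + 1)
    pulls = {arm: 0 for arm in true_rates}
    for _ in range(rounds):
        arm = bandit.choose()
        pulls[arm] += 1
        bandit.update(arm, rng.random() < true_rates[arm])
    return pulls
```

In a setup like the one the post describes, the `update` calls would be driven by click events consumed from Kafka, and the counts would live in a shared store instead of in process memory.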

Hue 3.8 is switching from Django 1.4.5 to 1.6.0, which will result in some breaking changes for existing apps. The team has highlighted the API changes that app developers need to be aware of when upgrading.

The AWS Big Data Blog has a guest post about how Nasdaq uses EMR and S3 for ad hoc analysis. The post describes some of the advantages of EMR (e.g. multiple clusters, easy experimentation) and how Nasdaq has implemented client-side encryption. The overview of encryption describes how they use custom software to integrate with a key management server in their own data center via EMR bootstrap actions.

The Hortonworks blog has two posts this week on rolling upgrades. The first post describes the rolling upgrade process for HDFS, which consists of upgrading the standby NameNode, failing over to the standby, upgrading the other NameNode, upgrading the DataNodes, and finalizing the upgrade. The second post describes rolling upgrades in Apache Ambari 2.0, which fully automates the process.
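The ordering of the HDFS steps listed above can be written down as a small plan builder. This is only a sketch of the sequence as summarized here: the function name, step labels, and host names are hypothetical, and it does not call any real Hadoop or Ambari API.

```python
def hdfs_rolling_upgrade_plan(active_nn, standby_nn, datanodes):
    """Return the upgrade steps in order as (action, target) tuples."""
    steps = [
        ("upgrade_namenode", standby_nn),  # upgrade the standby first
        ("failover", standby_nn),          # the upgraded standby becomes active
        ("upgrade_namenode", active_nn),   # upgrade the former active NameNode
    ]
    # DataNodes are upgraded only after both NameNodes are on the new version.
    steps.extend(("upgrade_datanode", dn) for dn in datanodes)
    steps.append(("finalize", None))
    return steps
```

For example, `hdfs_rolling_upgrade_plan("nn1", "nn2", ["dn1", "dn2"])` yields six steps, starting with the upgrade of `nn2` and ending with finalization.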


The Apache Flink community has made a lot of progress over the past month. This article highlights all the changes—from new wiki articles about Flink internals to presentations to changes that have landed in master but aren't yet part of a release (e.g. Flink on Tez, exactly-once for streaming jobs).

Big data startup AtScale has come out of stealth and announced details of their Intelligence Platform for data in Hadoop. The tool sits on a gateway node to a Hadoop cluster and exposes the features and interface of an OLAP cube. This cube is computed dynamically by leveraging existing SQL-on-Hadoop engines.

The online edition of "Advanced Analytics with Spark" is now available. The book covers topics such as the Alternating Least Squares recommendation algorithm, decision trees, and k-means clustering with Spark.

This week, the Cloudera blog has a preview of the operations track at the upcoming HBaseCon. Presentations from this track include speakers from Flurry, Yahoo, Pinterest, Arista Networks, Xiaomi, Adobe, and Rocket Fuel.

Datanami has a post on two systems that are leveraging Hadoop to build security tools. First, Platfora and MapR have announced a joint solution called Big Data Analytics for Security. The system is aimed at providing the tools to monitor threats and perform security analyses. Second, Niara is building a network security tool (currently in beta). Niara just raised $20 million in Series B funding.

Spark Summit East was held last month in NYC. The Databricks blog has a recap of the keynotes, links to slides from several talks at the event, and the course material for Spark training.

SearchBusinessAnalytics has an article profiling two ad companies, Altitude Digital and Sharethrough, that are moving to Spark and Spark Streaming. Altitude Digital is excited about the improvements in speed (since rerunning a failed Hive job can cause huge backlogs), and Sharethrough plans to use Spark Streaming to optimize click-through rates on variations of ads in near real-time.

Spark in Action author Marko Bonaci has published a post entitled "The history of Hadoop," which was originally written to be part of his book. The article describes the origins, early years, evolution of Hadoop (notably YARN), and much more. It's a great telling of the Hadoop story.


Apache Ambari 2.0 was released this week. Hortonworks has more details on the release, which includes several new features. The most notable are support for rolling upgrades, integration with Kerberos and Apache Ranger for security, and a new alerts framework.

MapR has announced the availability of version 4.0.2 of their distribution as part of Amazon EMR. This new release includes new versions of YARN and Hive.

Cloudera released version 5.3.3 of Cloudera Enterprise. The update includes bug fixes for HDFS, Hive, Hue, YARN, Cloudera Manager, and more.

Apache Mahout 0.10.0 was released earlier today. With this new version, Mahout is aiming to be a library and interactive environment for scalable linear algebra and machine learning algorithms. There's an Apache Spark backend, and H2O and Apache Flink support are planned for the future. In addition to the official announcement, there's a good post describing the goals and features of the new release and how it relates to Spark's MLlib.


Curated by Datadog



Predictive Analytics for Sales: A Use Case of Scala and Spark (Mountain View) - Monday, April 13

Machine Learning on Streaming Data: H2O Storm (Mountain View) - Tuesday, April 14

Running Production Hadoop Clusters in Docker Containers (San Francisco) - Wednesday, April 15


Sparkly Notebook: Interactive Analysis and Visualization with Spark (Seattle) - Wednesday, April 15


Cassandra and Apache Spark (Scottsdale) - Thursday, April 16


Roman Shaposhnik on the Open Data Platform Alliance (Austin) - Wednesday, April 15


Hadoop Needs Devops! (Columbus) - Tuesday, April 14

Delve Deeper into Spark (Mason) - Wednesday, April 15

New York

Elastic Analytics on Mesos and Docker (New York) - Tuesday, April 14

Big BI: How to Analyze Hadoop Data with Today’s BI Tools (New York) - Thursday, April 16


April 2015 Meetup (London) - Monday, April 13


Putting Apache Spark to life (Espoo) - Thursday, April 16


Pre-Hadoop Summit Meetup (Brussels) - Tuesday, April 14

Data Science and Hadoop (Brussels) - Tuesday, April 14

Pre-Hadoop Summit Gathering (Brussels) - Tuesday, April 14

Birds of Feather Sessions: YARN, HDFS, Tez, Storm, Hive, Ranger (Brussels) - Thursday, April 16


Second Prague Hadoop Meetup (Prague) - Thursday, April 16


Hadoop: Looking to the Future, and YARN: Past, Present and Future (Budapest) - Friday, April 17


Spark & Python Sessions (Ramat-gan) - Monday, April 13


Productionalizing Spark (Bangalore) - Saturday, April 18