Data Eng Weekly

Hadoop Weekly Issue #7

03 March 2013

With ApacheCon and Strata this week, there were lots of product announcements. I was on vacation the first part of the week, so I might have missed some... but hopefully not! Lots of interesting articles to share this week.


Apache Pig 0.11 added a ton of new and exciting features and improvements. I mentioned some of them in last week's newsletter, including the DateTime datatype and the RANK, CUBE, and ROLLUP operators. This article goes into much more detail about the new features.
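
For anyone who hasn't used CUBE and ROLLUP before, the difference comes down to which combinations of grouping columns get aggregated. Here is a minimal Python sketch of that idea. It is not Pig code, and the function names are made up for illustration: ROLLUP walks a hierarchy of prefixes, while CUBE covers every subset of the dimensions.

```python
from itertools import combinations


def rollup_sets(dims):
    """Grouping sets for a ROLLUP: each prefix of dims, longest first."""
    return [tuple(dims[:i]) for i in range(len(dims), -1, -1)]


def cube_sets(dims):
    """Grouping sets for a CUBE: every subset of dims, largest first."""
    return [
        combo
        for size in range(len(dims), -1, -1)
        for combo in combinations(dims, size)
    ]


if __name__ == "__main__":
    dims = ["region", "product"]
    print(rollup_sets(dims))  # n + 1 grouping sets for n dimensions
    print(cube_sets(dims))    # 2 ** n grouping sets for n dimensions
```

With n dimensions, a rollup yields n + 1 grouping sets and a cube yields 2^n, which is why CUBE output grows quickly as dimensions are added.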

MapR has claimed the new minute sort record by sorting 1.5TB in 59 seconds using 3,003 instances in Google Compute Engine. They made a number of interesting tweaks and code changes to get the running time down to 59 seconds from ~70 seconds. They claim that the tweaks are specific to MapR's Hadoop implementation, but the details (e.g. multithreaded reduce-side sorts) are broadly interesting.

Redshift is Amazon's MPP database-as-a-service aimed at analytics workflows. The folks at Airbnb evaluated Redshift against Hive on EMR. They provide some impressive numbers, along with advice on loading data into Redshift.

With all the Hadoop distributions out there (and a few more announced this week!), it can be difficult to keep track of them all or decide which one to use for a new project. Monash Research has an overview and a list of the strengths of each distribution.

Apache Hadoop is typically used for batch processing, which can leave gaps if your products require a tight feedback loop or computation for a cold-start. Yahoo! has started using Storm, and they are planning on integrating Storm and Hadoop, including adding the ability to run Storm and MapReduce on the same cluster via YARN.

Steve Watt presented on profiling and tuning Hadoop. He talks about some strategies for collecting performance data like Linux SAR, and then using that information to decide on what your baseline hardware should look like.

Apache Cassandra is a distributed key-value store. The latest version has a bunch of new features, including a beta version of a Netty-based transport protocol. Aaron Morton presented on Cassandra internals at ApacheCon, and the slides cover a lot of interesting information.

Pig and Hive are sometimes considered two sides of the same coin, but switching between SQL and Pig Latin syntax isn't automatic. This post includes an overview of the similarities and differences between Pig Latin and SQL, and it has a few example queries written in both languages.

Apache Hadoop 2.0.3 includes an interesting new feature -- pluggable map and reduce-side sort implementations. In some cases, a full sort of the data isn't required on the reduce side -- e.g. a hash-based grouping could be enough. The folks at Syncsort are bullish on improved performance for certain types of applications. Their blog post details use cases in which they anticipate performance improvements.
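
To see why a full sort isn't always needed, here is a small Python sketch that groups key/value pairs two ways. It is only an illustration of the idea, not Hadoop's API or Syncsort's implementation, and the function names are invented for this example.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def group_by_sort(pairs):
    """Sort-based grouping: order all pairs by key, then scan runs of equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))  # O(n log n) comparisons
    return {
        key: [value for _, value in run]
        for key, run in groupby(ordered, key=itemgetter(0))
    }


def group_by_hash(pairs):
    """Hash-based grouping: one pass over the pairs, no ordering of keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


if __name__ == "__main__":
    pairs = [("b", 1), ("a", 2), ("b", 3)]
    # Both approaches produce the same groups; only the sort orders the keys.
    print(group_by_sort(pairs) == group_by_hash(pairs))
```

Both functions hand each key's values to the "reducer" together. The hash version skips the comparison sort entirely, which is the kind of saving a pluggable implementation could exploit when the job doesn't depend on key order.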


R integration with Hadoop has made a lot of progress in the past 18 months. Revolution Analytics, a company focusing on the commercial needs of R users, offers an enterprise version of R. They recently announced support for the Hortonworks Data Platform and Intel's Distribution for Hadoop (in addition to support for Cloudera's Distribution including Apache Hadoop and IBM BigInsights 2).

Intel has announced its own distribution of Hadoop and "Project Rhino". Their distribution includes all the standard components of the Hadoop stack, but they've patched Hive, YARN, HDFS, and HBase to include optimizations for Xeon processors. Project Rhino is a codename for some changes to Hadoop to include public-key encryption/decryption in MapReduce and HBase. Intel has published design docs and seems to have a number of developers working on it.

Greenplum has released HAWQ, which is an HDFS-based version of Pivotal HD. Pivotal HD is Greenplum's MPP database that typically runs on Isilon OneFS Storage. The new release provides the ability to read and store data in HDFS, including accessing data that's already stored as Avro, Sequence Files, and more. Like other similar systems, HAWQ appears to use a daemon on each node in the cluster to perform queries.

Continuuity has announced a public beta of its Developer Suite and Developer Sandbox. Continuuity's AppFabric is a PaaS system that sits atop Apache Hadoop. Their suite includes an Eclipse plugin, command line tools, and a REST API. The public beta gives developers access to 8 cores, 8GB of RAM, and 240GB of storage.

LinkedIn has open-sourced Databus, its system for capturing "data change events" from OLTP databases such as MySQL or Oracle for replay on search indexes and read replicas. It has some interesting features such as low latency, support for thousands of consumers, and "infinite lookback" (i.e. the ability to bootstrap a new consumer from scratch).


Hortonworks and Microsoft partnered a year ago to work on Apache Hadoop for Microsoft Windows. They announced beta support in Hortonworks Data Platform 1.1. This version supports Windows Server 2008 and 2012.

Spring for Hadoop 1.0 has been released. The release has a number of utilities and base classes for running and building Hadoop workflows (MR, Hive, Pig, etc.), including integration with Spring Batch. Since EMC owns both Greenplum and SpringSource, the release has also been tested against the new Greenplum Hadoop release.


South West Big Data: "Hadoop Operations at LinkedIn", Friday, March 15, 2013, Bristol, UK