Data Eng Weekly

Hadoop Weekly Issue #213

23 April 2017

Some great technical posts this week on Kafka, StreamSets, Flink, and more. And with all the posts on stream processing, there's a well-timed "Making Sense of Stream Processing" post that offers a taxonomy for streaming projects.


The Confluent blog has an article that describes three common issues that tend to occur when running an Apache Kafka cluster in production. Each is told as a narrative story with its own moral.

Red Pill Analytics has a tutorial that walks through the steps necessary to install and configure StreamSets, with the end goal of copying a CSV file to Amazon S3.

The Rittman Mead blog has an overview of the architectures behind Apache Impala (incubating) and Apache Drill. The post describes the key components (such as impalad and the Drillbit daemon), gives a high-level overview of the query processing mechanism, and provides a comparison between the two engines.

In the data infrastructure ecosystem, the term "streaming" has many different meanings. This post includes a proposal for classifying projects into processing frameworks (like Storm and Spark Streaming), stream processing APIs (such as Apache Beam and Kafka Streams), and streaming data systems (like Flume, NiFi, and StreamSets).

This post describes an effort to introduce modern data infrastructure tools and practices into a big data environment. The system was composed of a custom ETL tool (with a common framework and set of tools to do things like run a Hive query), standardized "edge nodes" for running ETL jobs, and JupyterHub for ad hoc analysis.

The Hortonworks blog has an overview of the features that have been added to the Spark HBase Connector over the last year. These include support for Apache Phoenix, connection caching, support for duplicate column families, and a filter optimization that implements the UnhandledFilters API.

The Cloudera blog has a post with instructions for configuring and running deep learning frameworks (cuDNN, TensorFlowOnSpark, CaffeOnSpark, DL4J, and BigDL) with CDH and the Cloudera Data Science Workbench.

The AWS Big Data blog has a tutorial, complete with code on GitHub and CloudFormation templates, for running Apache Flink on Amazon EMR to process data from Amazon Kinesis and write the output to Amazon Elasticsearch. The post includes a number of code snippets to show relevant configuration and key pieces of the Flink application.


Cloudera is getting closer to its IPO, and the expected price range and number of shares have been set. The offering will raise on the order of $200 million.

The Data Intelligence Conference takes place June 23-25 in McLean, VA. The call for papers is open through May 1st.


Cloudera Enterprise 5.11 is released. The highlights include a new S3Guard for consistent reads for clients using Amazon S3, support for the Azure Data Lake Store, support for S3 encryption at rest, Spark lineage, and speed improvements to Hive-on-S3. More details of the release are available on the Cloudera blog.

StreamSets Data Collector 2.5 was released with new support for MQTT and WebSocket, improved throughput for JDBC imports, improved Spark integration, and more.

Apache Kudu 1.3.1 was released to fix some critical bugs in the 1.3.0 release.


Curated by Datadog



HBase Meetup @ Visa (Palo Alto) - Tuesday, April 25

Big Data Meetup @ LinkedIn (Sunnyvale) - Wednesday, April 26

Bay Area Apache Spark Meetup @ Grammarly and Adobe (San Francisco) - Thursday, April 27


V Is for Veracity: Securing and Governing Hadoop (Madison) - Tuesday, April 25


Integrating Real-Time Video Data Streams with Spark and Kafka (Ann Arbor) - Thursday, April 27


Distributed Graph Processing Using Spark and GraphX (Jacksonville) - Tuesday, April 25


Getting Started With Structured Streaming (Roswell) - Tuesday, April 25

PCI Compliance with Hadoop (Dunwoody) - Wednesday, April 26

North Carolina

Kafka Architecture and Design (Raleigh) - Wednesday, April 26


Kafka Connect and Kafka Streams with Jay Kreps (Tysons) - Tuesday, April 25


Introduction to Kudu (Philadelphia) - Wednesday, April 26


Apache Spark (Mexico City) - Friday, April 28


Big Data Meetup #2 (Buenos Aires) - Wednesday, April 26

IRELAND

The Data Analytics Platform at Indeed (Dublin) - Wednesday, April 26


April Hadoop Users Group Meetup (London) - Tuesday, April 25


Big Data, No Fluff: Let’s Get Started With Hadoop #11 (Oslo) - Thursday, April 27


Stratio: Spark Early Adopters (Barcelona) - Thursday, April 27


Apache Kafka (Nuremberg) - Thursday, April 27

Flink 1.3, Queryable State & Spark Streaming (Unterfohring) - Thursday, April 27


Extreme Apache Spark: How to Build a Pipeline for Processing 2.5B Rows/Day in 3m (Krakow) - Tuesday, April 25


Azure Taiwan Meetup #6 ft. Hadoop (Taipei) - Wednesday, April 26
