Hadoop Weekly Issue #213

23 April 2017

Some great technical posts this week on Kafka, StreamSets, Flink, and more. And with all the posts on stream processing, there's a well-timed "Making Sense of Stream Processing" post that offers a taxonomy for streaming projects.

Technical

The Confluent blog has an article that describes three common issues that tend to occur when running an Apache Kafka cluster in production. Each is told as a story with its own moral.

https://www.confluent.io/blog/stories-front-lessons-learned-supporting-apache-kafka/

Red Pill Analytics has a tutorial that walks through the steps necessary to install and configure StreamSets, with the end goal of copying a CSV file to Amazon S3.

http://redpillanalytics.com/streammeup/

The Rittman Mead blog has an overview of the architectures behind Apache Impala (incubating) and Apache Drill. The post describes the key components (such as impalad and the Drillbit daemon), gives a high-level overview of the query processing mechanism, and provides a comparison between the two engines.

https://www.rittmanmead.com/blog/2017/04/sql-on-hadoop-impala-vs-drill/

In the data infrastructure ecosystem, the term "streaming" has many different meanings. This post includes a proposal for classifying projects into processing frameworks (like Storm and Spark Streaming), stream processing APIs (such as Apache Beam and Kafka Streams), and streaming data systems (like Flume, NiFi, and StreamSets).

https://streamsets.com/blog/making-sense-stream-processing/

This post describes an effort to introduce modern data infrastructure tools and practices into a big data environment. The system was composed of a custom ETL tool (with a common framework and set of tools to do things like run a Hive query), standardized "edge nodes" for running ETL jobs, and JupyterHub for ad hoc analysis.

https://www.svds.com/data-pipelines-hadoop/

The Hortonworks blog has an overview of the features that have been added to the Spark HBase Connector over the last year. These include support for Apache Phoenix, connection caching, support for duplicate column families, and a filter optimization implemented via the UnhandledFilters API.

https://hortonworks.com/blog/spark-hbase-connector-a-year-in-review/

The Cloudera blog has a post with instructions for configuring and running deep learning frameworks (cuDNN, TensorFlowOnSpark, CaffeOnSpark, DL4J, and BigDL) with CDH and the Cloudera Data Science Workbench.

http://blog.cloudera.com/blog/2017/04/deep-learning-frameworks-on-cdh-and-cloudera-data-science-workbench/

The AWS Big Data blog has a tutorial, complete with code on GitHub and CloudFormation templates, for running Apache Flink on Amazon EMR to process data from Amazon Kinesis and write the output to Amazon Elasticsearch. The post includes a number of code snippets that show the relevant configuration and key pieces of the Flink application.

https://aws.amazon.com/blogs/big-data/build-a-real-time-stream-processing-pipeline-with-apache-flink-on-aws/

News

Cloudera is getting closer to its IPO, and the expected price range and number of shares have been set. The offering will raise on the order of $200 million.

http://www.cbronline.com/news/big-data/cloudera-sets-ipo-price-range/

The Data Intelligence Conference takes place June 23-25 in McLean, VA. The call for papers is open through May 1st.

http://data-intelligence.ai/

Releases

Cloudera Enterprise 5.11 is released. The highlights include a new S3Guard for consistent reads for clients using Amazon S3, support for the Azure Data Lake Store, support for S3 encryption at rest, Spark lineage, and speed improvements to Hive-on-S3. More details of the release are available on the Cloudera blog.

http://blog.cloudera.com/blog/2017/04/cloudera-enterprise-5-11-is-now-available/

StreamSets Data Collector 2.5 was released with new support for MQTT and WebSocket, improved throughput for JDBC imports, improved Spark integration, and more.

https://streamsets.com/blog/sdc-2-5

Apache Kudu 1.3.1 was released to fix some critical bugs in the 1.3.0 release.

http://kudu.apache.org/releases/1.3.1/docs/release_notes.html

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

HBase Meetup @ Visa (Palo Alto) - Tuesday, April 25
https://www.meetup.com/hbaseusergroup/events/239291716/

Big Data Meetup @ LinkedIn (Sunnyvale) - Wednesday, April 26
https://www.meetup.com/Big-Data-Meetup-LinkedIn/events/239052766/

Bay Area Apache Spark Meetup @ Grammarly and Adobe (San Francisco) - Thursday, April 27
https://www.meetup.com/spark-users/events/238830244/

Wisconsin

V Is for Veracity: Securing and Governing Hadoop (Madison) - Tuesday, April 25
https://www.meetup.com/BigDataMadison/events/234268453/

Michigan

Integrating Real-Time Video Data Streams with Spark and Kafka (Ann Arbor) - Thursday, April 27
https://www.meetup.com/AABigDataMeetup/events/238926522/

Florida

Distributed Graph Processing Using Spark and GraphX (Jacksonville) - Tuesday, April 25
https://www.meetup.com/jaxbigdata/events/238798341/

Georgia

Getting Started With Structured Streaming (Roswell) - Tuesday, April 25
https://www.meetup.com/Atlanta-Apache-Spark-User-Group/events/238680046/

PCI Compliance with Hadoop (Dunwoody) - Wednesday, April 26
https://www.meetup.com/Atlanta-Hadoop-Users-Group/events/238318045/

North Carolina

Kafka Architecture and Design (Raleigh) - Wednesday, April 26
https://www.meetup.com/futureofdata-triangle/events/239232163/

Virginia

Kafka Connect and Kafka Streams with Jay Kreps (Tysons) - Tuesday, April 25
https://www.meetup.com/Apache-Kafka-DC/events/238571005/

IRELAND

The Data Analytics Platform at Indeed (Dublin) - Wednesday, April 26
https://www.meetup.com/hadoop-user-group-ireland/events/238467904/

UNITED KINGDOM

April Hadoop Users Group Meetup (London) - Tuesday, April 25
https://www.meetup.com/hadoop-users-group-uk/events/238368183/

NORWAY

Big Data, No Fluff: Let’s Get Started With Hadoop #11 (Oslo) - Thursday, April 27
https://www.meetup.com/Oslo-Hadoop-Big-Data-Meetup/events/238931071/

SPAIN

Stratio: Spark Early Adopters (Barcelona) - Thursday, April 27
https://www.meetup.com/Spark-Barcelona/events/239303416/

GERMANY

Apache Kafka (Nuremberg) - Thursday, April 27
https://www.meetup.com/Nuernberg-Big-Data/events/238411761/

Flink 1.3, Queryable State & Spark Streaming (Unterfohring) - Thursday, April 27
https://www.meetup.com/Apache-Flink-Meetup-Munich/events/237883833/

POLAND

Extreme Apache Spark: How to Build a Pipeline for Processing 2.5B Rows/Day in 3m (Krakow) - Tuesday, April 25
https://www.meetup.com/datakrk/events/238923425/

TAIWAN

Azure Taiwan Meetup #6 ft. Hadoop (Taipei) - Wednesday, April 26
https://www.meetup.com/Azure-Taiwan/events/238613102/

If you didn't receive this email directly and you'd like to subscribe to weekly emails, please visit https://hadoopweekly.com

Data Eng Weekly