Data Eng Weekly


Hadoop Weekly Issue #165

10 April 2016

This week, there were a number of big releases, including new open source projects from LinkedIn and Airbnb. There's quite a bit of technical content covering stream processing: Spark, Flink, Kafka, and more. In news, the conference programs for both Spark Summit and HBaseCon have been released.

Technical

Zalando has published a post about how they chose Apache Flink as their stream processing framework. The post talks about the evaluation criteria and the proofs of concept built to inform the decision, and it describes the major reasons: consistently low latencies at high throughput, true stream processing, and developer support.

https://tech.zalando.com/blog/apache-showdown-flink-vs.-spark/

The Cloudera blog has a post from developers of Wargaming.net, where they describe their real-time infrastructure built on Kafka, HBase, Drools, and Spark. In addition to describing the flow of data, they describe how they optimized HBase lookups and serialization, data locality between HBase and Spark, and Spark computation.

http://blog.cloudera.com/blog/2016/04/inside-wargamings-data-driven-real-time-rules-engine/

InfoQ has a presentation and video about streaming at scale with the SMACK (Spark, Mesos, Akka, Cassandra, and Kafka) stack. Among the topics discussed, the presentation describes why a stack like this solves the same problems as the Lambda Architecture much more simply.

http://www.infoq.com/presentations/stream-analytics-scalability

The Confluent "Log Compaction" blog series has an update on what's happened with the Kafka project in March. There are a number of interesting developments, including progress on rack awareness, Kerberos support, and time-based indexes in Kafka. Lots of great content if you (like me) don't have time to keep up with the latest development efforts.

http://www.confluent.io/blog/log-compaction-highlights-in-the-kafka-and-stream-processing-community-april-2016

Apache Flink 1.0 introduced a new complex event processing (CEP) library. For those who aren't familiar, CEP offers a way to (among other things) detect patterns of events. This post introduces Flink's CEP Pattern APIs through a potential use case of anomaly detection based on sensor readings from servers in a data center.

http://flink.apache.org/news/2016/04/06/cep-monitoring.html
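
For readers new to CEP, here is a minimal plain-Python sketch of the kind of pattern such a library detects. It does not use Flink or its Pattern API. The event fields, the threshold, and the time window are all assumptions made up for illustration: it flags two consecutive over-threshold temperature readings from the same rack within a short window.

```python
from collections import namedtuple

# Hypothetical event shape; the field names are assumptions for illustration.
Reading = namedtuple("Reading", ["rack_id", "timestamp", "temperature"])


def detect_warnings(readings, threshold=100.0, within_seconds=10):
    """Yield (first, second) pairs where two consecutive readings from the
    same rack both exceed the threshold within the time window."""
    last_hot = {}  # rack_id -> previous reading, if it was over the threshold
    for r in sorted(readings, key=lambda r: r.timestamp):
        prev = last_hot.get(r.rack_id)
        if r.temperature > threshold:
            if prev is not None and r.timestamp - prev.timestamp <= within_seconds:
                yield (prev, r)
            last_hot[r.rack_id] = r
        else:
            # A cool reading breaks the "consecutive" requirement.
            last_hot.pop(r.rack_id, None)


if __name__ == "__main__":
    sample = [
        Reading(1, 0, 101.0),
        Reading(1, 5, 105.0),
        Reading(1, 8, 90.0),
    ]
    for first, second in detect_warnings(sample):
        print("warning for rack", first.rack_id, "at", second.timestamp)
```

A real CEP engine would evaluate this continuously over an unbounded stream and handle out-of-order events, which this batch-style sketch ignores.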

The Genome Analysis Toolkit (GATK) recently announced that its next release (currently in alpha) will support Apache Spark. This post gives a brief introduction to the toolkit and shows how Spark is leveraged to detect duplicate DNA fragments.

http://blog.cloudera.com/blog/2016/04/genome-analysis-toolkit-now-using-apache-spark-for-data-processing/

InfoWorld has an overview of the plans for structured streaming, which is part of Spark 2.0. While microbatch will still be around, there are useful new primitives like infinite data frames and first-class support for repeated queries.

http://www.infoworld.com/article/3052924/analytics/what-sparks-structured-streaming-really-means.html

The AWS big data blog has a post on loading data into S3 and Redshift using encryption keys stored in the AWS Key Management Service (KMS). In addition to the required steps, the post describes how encryption with KMS keys works for data in AWS S3.

http://blogs.aws.amazon.com/bigdata/post/Tx2Q3ZBOZO9DHVQ/Encrypt-Your-Amazon-Redshift-Loads-with-Amazon-S3-and-AWS-KMS

The Confluent blog describes how to use Kafka Connect and Kafka Streams for a non-trivial "hello world" program. Specifically, the example program pulls Wikipedia data from IRC, parses the messages, and computes various statistics. The post has a number of code snippets showing how the entire process is implemented.

http://www.confluent.io/blog/hello-world-kafka-connect-kafka-streams

This post walks through converting some simple schemas from Postgres to Cassandra, and it describes several of the major differences—replication, data types (no JSON support in Cassandra), primary keys, and eventual consistency.

http://neovintage.org/2016/04/07/data-modeling-in-cassandra-from-a-postgres-perspective/

News

The ESG blog has a recap of the recent Strata+Hadoop World conference. It notes some themes of the conference, such as building momentum for Spark, machine learning, and cloud services.

http://blog.esg-global.com/riding-high-at-stratahadoop-world

InformationWeek also has a recap from Strata, focusing on keynotes from MapR, from Pivotal, on artificial intelligence, and more.

http://www.informationweek.com/big-data/ai-public-data-sets-real-time-strata-+-hadoop-keynote-sampling/d/d-id/1324943?

The agenda for Spark Summit 2016, which will be held June 6-8 in San Francisco, has been announced. The conference has two days of sessions spread across five tracks.

https://databricks.com/blog/2016/04/04/agenda-announced-for-sparksummit-2016-in-san-francisco.html

Forbes has an interview with Cloudera CEO Tom Reilly, in which he discusses the company's biggest opportunity, the competitive market, plans to take the company public, and more.

http://www.forbes.com/sites/roberthof/2016/04/06/ceo-tom-reilly-makes-the-case-for-cloudera-and-its-ipo/

Datanami has an article on the rise of Apache Kafka as the backbone for stream processing. It includes an interview with Confluent co-founder and CTO Neha Narkhede in which she discusses the recently launched Kafka Connect and Kafka Streams.

http://www.datanami.com/2016/04/06/real-time-rise-apache-kafka/

HBaseCon takes place in San Francisco on May 24th, and the agenda has just been announced. There are 20+ sessions across three tracks.

http://blog.cloudera.com/blog/2016/04/hbasecon-2016-speaker-lineup-announced/

Releases

Apache HBase 0.98.18 and 1.1.4 were both recently released. The 1.1.4 release has a number of fixes including nine or so correctness fixes. The 0.98.18 release has just shy of 50 resolved issues (bugs, improvements, and two new features).

http://mail-archives.apache.org/mod_mbox/hbase-user/201603.mbox/%3CCANZa%3DGu-mAxKEtfoRjctHcE0KD7z52oE010Fgsf6AMmW2tDZLA%40mail.gmail.com%3E
http://mail-archives.apache.org/mod_mbox/hbase-user/201603.mbox/%3CCA%2BRK%3D_CtZ1L07nS6Og2ekfVwet0qTE7jw-bmyD2pp5UPweUehQ%40mail.gmail.com%3E

Apache Lens, the unified analytics interface with support for execution engines and data stores from the Hadoop ecosystem (and many others), released 2.5.0-beta. This release resolves 87 tickets, with a focus on bug fixes and improvements over new features.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201604.mbox/%3CCAL3kmZj60kpopRPpOVEs9o7oTg7YuaC_=c8zncBeMyUESrZsmQ@mail.gmail.com%3E

Airbnb has open-sourced Caravel, their data exploration system. Caravel supports a number of features found in commercial products and can be hooked up to any system that supports a SQL dialect (via SQLAlchemy). Notably, it supports Druid for real-time analytics.

https://medium.com/airbnb-engineering/caravel-airbnb-s-data-exploration-platform-15a72aa610e5

MapR has announced support for Apache Drill 1.6 for their distribution. Highlights of the release include a new storage plugin for MapR-DB, new SQL window function support, and end-to-end security. The introduction has some examples of using the MapR-DB API to load data and then querying it with Drill.

https://www.mapr.com/blog/apache-drill-16-mapr-converged-platform-gearing-new-generation-stack-json-enabled-big-data

Apache Flink has announced a bugfix release for the 1.0.x line. The release resolves 23 issues and is recommended for all users of 1.0.0.

http://flink.apache.org/news/2016/04/06/release-1.0.1.html

Cloudera Enterprise 5.7 was released with updates to Spark, HBase, Impala, Kafka, and more. Highlights include the promotion of Hive-on-Spark and HBase-Spark from Cloudera Labs, major performance improvements for Impala, and support for the HBase WAL on SSD.

http://blog.cloudera.com/blog/2016/04/cloudera-enterprise-5-7-is-released/

Apache Tajo, the data warehouse system built on Hadoop, released version 0.11.2. The new version adds support for Kerberos, fixes ORC table support for Hive, and more.

http://tajo.apache.org/releases/0.11.2/announcement.html

LinkedIn has open-sourced Dr. Elephant, their tool for diagnosing performance issues with Hadoop and Spark jobs. Based on metrics collected from the YARN resource manager on completed jobs, Dr. Elephant evaluates heuristics to generate diagnostic reports for things like data skew, GC overhead, and more. LinkedIn reports that Dr. Elephant solves around 80 percent of problems.

https://engineering.linkedin.com/blog/2016/04/dr-elephant-open-source-self-serve-performance-tuning-hadoop-spark
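
To give a feel for what a heuristic over task metrics can look like, here is a small Python sketch of a data skew check. It is not Dr. Elephant's actual heuristic or API. The function name, the max-to-median comparison, and the thresholds are assumptions chosen for illustration.

```python
from statistics import median


def skew_severity(task_input_bytes, moderate=2.0, severe=4.0):
    """Classify skew by comparing the largest task input to the median.

    The thresholds are illustrative assumptions, not values from any tool.
    """
    if not task_input_bytes:
        return "none"
    largest = max(task_input_bytes)
    mid = median(task_input_bytes)
    if mid == 0:
        # Most tasks read nothing; any non-empty task dominates the job.
        return "severe" if largest > 0 else "none"
    ratio = largest / mid
    if ratio >= severe:
        return "severe"
    if ratio >= moderate:
        return "moderate"
    return "none"


if __name__ == "__main__":
    # One task reads far more input than its peers.
    print(skew_severity([100, 110, 90, 500]))
```

Comparing against the median rather than the mean keeps a single oversized task from hiding its own skew by inflating the average.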

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Mohammed Guller: Demystifying Big Data and Apache Spark (Redwood City) - Monday, April 11
http://www.meetup.com/Scala-Bay/events/229524842/

IOT Big Data Ingestion and Processing in Hadoop by Silver Spring Networks (San Jose) - Thursday, April 14
http://www.meetup.com/Apex-Bay-Area-Chapter/events/228787336/

Washington

Seattle Apache Kafka Meetup (Bellevue) - Friday, April 15
http://www.meetup.com/Seattle-Apache-Kafka-Meetup/events/229844894/

Minnesota

Hadoop Operations for Production Systems (Eden Prairie) - Wednesday, April 13
http://www.meetup.com/Twin-Cities-Hadoop-User-Group/events/229794585/

Illinois

Apache Kafka and the Confluent Platform: Overview and Roadmap, with Jay Kreps (Chicago) - Thursday, April 14
http://www.meetup.com/Chicago-Area-Kafka-Enthusiasts/events/228620608/

Michigan

Managing Automotive Sensor Big Data Using Hadoop (Ann Arbor) - Tuesday, April 12
http://www.meetup.com/Predictive-Analytics-S-E-Michigan/events/229662378/

Pennsylvania

Understanding Spark Streaming (Philadelphia) - Thursday, April 14
http://www.meetup.com/Philadelphia-Spark-Meetup/events/229627071/

New Jersey

Real-Time Aggregation, Approximation, Similarities, and Recommendations at Scale (Princeton) - Thursday, April 14
http://www.meetup.com/nj-datascience/events/229736128/

IRELAND

OrientDB: Unlock the Value of Document Data Relationships + Apache Spark & GraphX (Dublin) - Monday, April 11
http://www.meetup.com/hadoop-user-group-ireland/events/229509552/

Hadoop Summit Dublin: Hops.io Distro and ALOJA Big Data Benchmarking (Dublin) - Tuesday, April 12
http://www.meetup.com/BDOOP-BigData-Operations-On-Perfomance-Barcelona/events/229695026/

Data Flow Using Apache NiFi (Dublin) - Tuesday, April 12
http://www.meetup.com/futureofdata-london/events/229827779/

Hands-On Introduction to Spark & Zeppelin (Dublin) - Tuesday, April 12
http://www.meetup.com/futureofdata-dublin/events/229793869/

Hadoop and MongoDB Scaling at Datahug and RAFTlike MongoDB Elections (Dublin) - Tuesday, April 12
http://www.meetup.com/DublinMUG/events/228251875/

Hadoop Summit Night (Dublin) - Tuesday, April 12
http://www.meetup.com/hadoop-user-group-ireland/events/229862323/

UNITED KINGDOM

Real-time Search and Insights with Apache Kafka (London) - Wednesday, April 13
http://www.meetup.com/Apache-Kafka-London/events/229636395/

GERMANY

Spark Kick Off Meetup (Munich) - Thursday, April 14
http://www.meetup.com/Hadoop-User-Group-Munich/events/228725964/

SWITZERLAND

18th Swiss Big Data User Group Meeting (Zurich) - Monday, April 11
http://www.meetup.com/swiss-big-data/events/229098258/

ISRAEL

Data Processing @SCALE (Tel Aviv-Yafo) - Monday, April 11
http://www.meetup.com/Tech-Talk-Teach/events/229659531/

INDIA

Ingesting Unbounded File Data + Streaming Log Analysis Using Apex (Pune) - Wednesday, April 13
http://www.meetup.com/Apache-Apex-incubating-Meetup-Pune/events/230194517/

SINGAPORE

Hadoopy Birthday: Hadoop Turns 10, with Doug Cutting, Father of Hadoop (Singapore) - Monday, April 11
http://www.meetup.com/BigData-Hadoop-SG/events/230062826/

NEW ZEALAND

First Organizational Meeting (Christchurch) - Thursday, April 14
http://www.meetup.com/Christchurch-Apache-Spark-Meetup/events/229207479/