Data Eng Weekly


Hadoop Weekly Issue #175

19 June 2016

Hadoop Summit is a little over a week away, and we're already seeing a number of product announcements timed to coincide with the event. On the technical side, there are great posts on Hadoop Kerberos authentication and Avro at Salsify. And in terms of releases, there were several announcements including a new open-source columnar database from Yandex.

Technical

The OpenCore blog has an article that demonstrates a number of debugging tools for Hadoop's Kerberos authentication. Specifically, it shows how to use the main() method of UserGroupInformation to dump a bunch of useful debug information.

http://www.opencore.com/blog/2016/5/user-name-handling-in-hadoop/

In part four of a series on YARN, the Cloudera blog looks at how to configure a fair scheduler queue. Specifically, the post describes settings for resource constraints, queue placement policies, and preemption.

http://blog.cloudera.com/blog/2016/06/untangling-apache-hadoop-yarn-part-4-fair-scheduler-queue-basics/

Salsify is building an asynchronous microservices architecture on Apache Kafka with Apache Avro for data serialization. Their application is built in Ruby, and they've created several new tools to make Avro easier to work with in that language. This post describes the tools and their value: avro-builder for defining records, a Postgres-based schema registry, and avromatic for generating models from Avro schemas.

http://blog.salsify.com/engineering/adventures-in-avro
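
For readers who haven't seen an Avro schema, the following is a minimal Python sketch of building a record schema programmatically and emitting the JSON form that Avro uses. It is not the avro-builder API (which is a Ruby DSL); the `record` helper, the `UserCreated` name, and the namespace are all hypothetical and chosen only for illustration.

```python
import json


def record(name, namespace, fields):
    """Build an Avro record schema (as a dict) from (field_name, type) pairs."""
    return {
        "type": "record",
        "name": name,
        "namespace": namespace,
        "fields": [{"name": fname, "type": ftype} for fname, ftype in fields],
    }


# Hypothetical event schema; ["null", "string"] is Avro's union for an optional string.
user_created = record(
    "UserCreated",
    "com.example.events",
    [("id", "long"), ("email", "string"), ("referrer", ["null", "string"])],
)

# Avro schemas are plain JSON documents, so json.dumps is enough to serialize one.
schema_json = json.dumps(user_created, indent=2)
print(schema_json)
```

A builder like this mainly saves typing and keeps field definitions consistent; the resulting JSON is what a schema registry would store.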

Apache Drill can infer schemas on the fly and also supports data with multiple (but compatible) schemas. This combination enables some interesting use cases, such as querying across multiple JSON files with differing schemas. A post on the MapR blog explores these features and includes several examples.

https://www.mapr.com/blog/sql-query-mixed-schema-data-using-apache-drill
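
To make the mixed-schema idea concrete, here is a minimal Python sketch (not Drill itself) that reads JSON records with differing but compatible fields and projects them onto the union of their columns, filling missing fields with None. The sample records and the `union_columns` helper are hypothetical, and Drill's actual behavior for missing fields may differ in detail.

```python
import json

# Hypothetical records, as if read from two JSON files with different fields.
file_a = ['{"id": 1, "name": "alpha"}', '{"id": 2, "name": "beta"}']
file_b = ['{"id": 3, "name": "gamma", "tags": ["x", "y"]}']


def union_columns(records):
    """Return every field name seen across the records, in first-seen order."""
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)
    return columns


records = [json.loads(line) for line in file_a + file_b]
columns = union_columns(records)

# Project each record onto the combined schema; absent fields become None.
rows = [tuple(rec.get(col) for col in columns) for rec in records]

print(columns)
for row in rows:
    print(row)
```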

This tutorial shows how to use Apache Kafka with Druid to build a streaming analytics application, with visualization provided by Pivot, a web UI for Druid.

http://www.confluent.io/blog/building-a-streaming-analytics-stack-with-apache-kafka-and-druid

The Apache Beam (incubating) blog has a post describing some of the work that was done to make the Beam connector for Apache Flink work with Flink's batch runner. Beam is the open-source SDK, originally from Google, that exposes a backend-agnostic data pipeline API.

http://beam.incubator.apache.org/blog/2016/06/13/flink-batch-runner-milestone.html

Cask Hydrator is a tool for building data pipelines using a drag and drop UI. This tutorial shows how to use Hydrator to export data from MySQL into HDFS.

http://blog.cask.co/2016/06/bringing-relational-data-into-data-lakes/

Databricks has a post on the new SQL subquery support in the upcoming Apache Spark 2.0 release. Interestingly, the post is written as a notebook, which is a straightforward way to present code and example data.

https://databricks.com/blog/2016/06/17/sql-subqueries-in-apache-spark-2-0.html
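
As a rough illustration of what subquery support means in general, the sketch below runs a scalar subquery and an IN-predicate subquery using Python's built-in sqlite3 module rather than Spark. The tables and data are made up, and the exact syntax and subquery forms supported by Spark 2.0 may differ; see the post for the details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept_id INTEGER, salary INTEGER);
    CREATE TABLE departments (id INTEGER, city TEXT);
    INSERT INTO employees VALUES ('ann', 1, 120), ('bob', 1, 80), ('cat', 2, 100);
    INSERT INTO departments VALUES (1, 'Boston'), (2, 'Austin');
""")

# Scalar subquery: compare each row against a single computed value.
above_average = conn.execute(
    "SELECT name FROM employees "
    "WHERE salary > (SELECT AVG(salary) FROM employees) "
    "ORDER BY name"
).fetchall()

# Predicate subquery: filter on membership in another query's result.
in_boston = conn.execute(
    "SELECT name FROM employees "
    "WHERE dept_id IN (SELECT id FROM departments WHERE city = 'Boston') "
    "ORDER BY name"
).fetchall()

print(above_average)
print(in_boston)
conn.close()
```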

The Apache Kudu (incubating) blog has a post on their use of Raft in single-node clusters, which allows for dynamic scaling to multi-master clusters.

http://getkudu.io/2016/06/17/raft-consensus-single-node.html
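
One reason a single-node configuration fits naturally into Raft is that commits require a strict majority of voters, and a majority of one is one. A small Python sketch of that arithmetic follows; the function names are illustrative and are not Kudu APIs.

```python
def majority(num_voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    if num_voters < 1:
        raise ValueError("a Raft configuration needs at least one voter")
    return num_voters // 2 + 1


def tolerated_failures(num_voters: int) -> int:
    """How many voters can be down while a majority is still reachable."""
    return num_voters - majority(num_voters)


# Columns: voters, majority size, failures tolerated.
for n in (1, 3, 5):
    print(n, majority(n), tolerated_failures(n))
```

A single voter can commit on its own but tolerates no failures, which is why growing the configuration to three or more nodes is what buys fault tolerance.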

News

This article points out that the Apache Spark community, if not careful, could suffer the same type of fragmentation that has caused confusion in the Apache Hadoop ecosystem. For example, the latest versions of CDH and HDP support different versions of Spark.

https://techcrunch.com/2016/06/12/spark-fragmentation-undermines-community/

The New Stack has an article on Concord, which is a new stream processing framework (in public beta) built on Apache Mesos. Concord is written in C++ and supports dynamic topologies (scaling up/down parts of the pipeline without downtime).

http://thenewstack.io/concord-leverages-mesos-high-performance-stream-processing/

On the heels of announcing general availability of Databricks Community Edition, Databricks has announced the first in a series of tutorials on writing Apache Spark applications using Databricks.

https://databricks.com/blog/2016/06/15/an-introduction-to-writing-apache-spark-applications-on-databricks.html

Hadoop Summit San Jose, which takes place in a few weeks, will feature a Women in Big Data Lunch and Panel. The Hortonworks blog features an interview with the panel moderator and Hortonworks CMO Ingrid Burton.

http://hortonworks.com/blog/summer-hortonworks-part-2-wibd-assertive-innovative-take-risks/

Releases

Apache SystemML (incubating) recently released version 0.10.0. SystemML is a machine learning framework built with multiple backend support, including Apache Spark and Apache Hadoop. This release includes new Spark Matrix Block types, support for deep learning, several performance enhancements, a new KNN algorithm, and much more.

http://systemml.apache.org/0.10.0-incubating/release_notes.html

Apache Mahout, another machine learning framework, has released version 0.12.2. This release makes some improvements towards the goal of integrating with Apache Zeppelin for visualization and notebook support.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201606.mbox/%3CCAOtpBjgBAuQs5FiX5X_5A+Rd-A1fVz0R7SKttGe4cJuCLRiGww@mail.gmail.com%3E

Qubole has announced that their HBase-as-a-Service offering is now generally available on AWS. The offering has a number of nice features for a long-running cluster, supports Hannibal and other monitoring tools, integrates with Apache Zeppelin, and can be configured with OpenTSDB and Apache Phoenix through node bootstrap actions.

https://www.qubole.com/blog/product/quboles-hbase-as-a-service-is-generally-available-on-aws/

Altiscale has announced Altiscale Insight Cloud Real-Time Edition. The system is backed by Apache HBase and Spark Streaming.

https://www.altiscale.com/blog/announcing-the-altiscale-insight-cloud-real-time-edition/

hs2client is a new C++ library for Apache Hive and Apache Impala (incubating). In addition to C++, the library has Python bindings that support reading results into pandas DataFrames.

http://blog.cloudera.com/blog/2016/06/announcing-hs2client-a-fast-new-c-python-thrift-client-for-impala-and-hive/

MapR has announced an Apache Spark 2.0 Developer Preview for their distribution.

https://www.mapr.com/blog/spark-20-now-developer-preview-mode-mapr-platform

Apache Beam has announced the 0.1.0-incubating release, the project's first since joining the Apache Incubator.

http://beam.incubator.apache.org/beam/release/2016/06/15/first-release.html

Yandex has open sourced ClickHouse, a columnar analytics database. The system is built for both horizontal and vertical scaling. It supports complex data types (e.g. arrays) and can do approximate queries. The team has also published benchmark results in comparison to several other databases.

https://clickhouse.yandex/

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Apache Bigtop & Apache Apex (San Jose) - Tuesday, June 21
http://www.meetup.com/Big-Data-native-Hadoop-Ingest-and-Transform-Bay-Area/events/231799257/

Options for Ingest: Elasticsearch Ingest Node and Apache Airflow (Mountain View) - Wednesday, June 22
http://www.meetup.com/SF-Bay-Area-Data-Ingest-Meetup/events/231024947/

Open House: Big Data Processing with Apache Spark (San Francisco) - Thursday, June 23
http://www.meetup.com/Metis-San-Francisco-Data-Science/events/231743873/

Washington

Ted Dunning on Anomaly Detection (Redmond) - Thursday, June 23
http://www.meetup.com/Seattle-DAML/events/231426676/

Colorado

Big Data at Twitter Scale (Boulder) - Thursday, June 23
http://www.meetup.com/Boulder-Denver-Big-Data/events/231339134/

Texas

June Tech Talk: PySpark (Austin) - Monday, June 20
http://www.meetup.com/PyLadies-ATX/events/231494470/

Using Splunk for Your Big Data Along with Hadoop (Carrollton) - Tuesday, June 21
http://www.meetup.com/DFW-BigData/events/231582306/

Invited Speaker Series: Cody Koeninger on Fundamentals of Spark and Kafka (Austin) - Thursday, June 23
http://www.meetup.com/Austin-ACM-SIGKDD/events/231377005/

Illinois

Top 5 Mistakes When Writing Spark Applications (Chicago) - Monday, June 20
http://www.meetup.com/Chicago-Spark-Users/events/231373252/

North Carolina

Apache Metron Overview and Codelab: Building the next Generation Cyber Security (Durham) - Thursday, June 23
http://www.meetup.com/futureofdata-triangle/events/231440309/

Georgia

High Concurrency, Low Latency Reporting Within Hadoop (Roswell) - Wednesday, June 22
http://www.meetup.com/Atlanta-Hadoop-Users-Group/events/230344816/

Virginia

Introduction to Spark In-Memory Computing (McLean) - Wednesday, June 22
http://www.meetup.com/Big-Data-Developers-in-DC/events/231023283/

Overview of Open Source Fast Data Platforms and Future Plans (Reston) - Wednesday, June 22
http://www.meetup.com/Fast-Data-DC/events/230711519/

Interactive Data Analytics with Flink and Zeppelin (Reston) - Thursday, June 23
http://www.meetup.com/DCFlinkMeetup/events/231718607/

New York

Distributed ML in Spark (New York) - Friday, June 24
http://www.meetup.com/Spark-NYC/events/231796695/

Massachusetts

Solving Analytics Problems in the Cloud w/ Spark, Presto, Hive (Boston) - Tuesday, June 21
http://www.meetup.com/bostonhadoop/events/231534340/

UNITED KINGDOM

Big Data Show and Tell (Reading) - Wednesday, June 22
http://www.meetup.com/Big-Data-Thames-Valley-Meetup/events/227108997/

GERMANY

How Spark Can Improve Your Hadoop Cluster (Hamburg) - Wednesday, June 22
http://www.meetup.com/Scala-Hamburg/events/231414246/

CZECH REPUBLIC

Apache Flink Crash Course: Meet the Squirrel (Prague) - Thursday, June 23
http://www.meetup.com/CS-HUG/events/231711892/

INDIA

Third Spark Meetup (Pune) - Wednesday, June 22
http://www.meetup.com/Pune-Apache-Spark-Meetup/events/231618172/

Understanding and Building Big Data Architectures, Part 3: Messaging/Kafka (Hyderabad) - Saturday, June 25
http://www.meetup.com/hyderabad-scalability/events/229886391/