19 June 2016
Hadoop Summit is a little over a week away, and we're already seeing a number of product announcements timed to coincide with the event. On the technical side, there are great posts on Hadoop Kerberos authentication and Avro at Salsify. And in terms of releases, there were several announcements including a new open-source columnar database from Yandex.
The OpenCore blog has an article that demonstrates a number of debugging tools for Hadoop's Kerberos authentication. Specifically, it shows how to use the main() method of UserGroupInformation to dump useful debug information.
In part four of a series on YARN, the Cloudera blog looks at how to configure a fair scheduler queue. Specifically, the post describes settings for resource constraints, queue placement policies, and preemption.
Salsify is building an asynchronous microservices architecture on Apache Kafka, with Apache Avro for data serialization. Their application is written in Ruby, and they've created several new tools to make Avro easier to work with in that language. This post describes the tools and their value: avro-builder for defining records, a Postgres-based schema registry, and avromatic for generating models from Avro schemas.
Apache Drill can infer schemas on the fly and also supports data with multiple (but compatible) schemas. This combination enables some interesting use cases, such as querying across multiple JSON files with differing schemas. A post on the MapR blog explores these features and includes several examples.
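To make the idea of compatible-but-different schemas concrete, here is a minimal Python sketch of schema-on-read across two JSON inputs. It is an illustration only and not how Drill is implemented: the helper names (infer_schema, project) and the sample records are hypothetical, and the choice to fill missing fields with None simply mimics returning a null for a column that a given file does not contain.

```python
import json


def infer_schema(records):
    """Union the field names seen across all records, with the value types observed."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def project(records, fields):
    """Select the requested fields from every record; absent fields become None."""
    return [{field: record.get(field) for field in fields} for record in records]


# Two hypothetical JSON "files" whose records share some fields but not all.
file_a = '[{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]'
file_b = '[{"id": 3, "name": "carol", "city": "Oslo"}]'

records = json.loads(file_a) + json.loads(file_b)
schema = infer_schema(records)          # discovered while reading, not declared up front
rows = project(records, ["id", "city"])  # one query spanning both inputs

print(schema)
print(rows)
```

The point of the sketch is that no schema is declared before the data is read: the set of fields is discovered from the records themselves, and a single query can still span both inputs.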
This tutorial shows how to use Apache Kafka with Druid to build a streaming analytics and visualization application. The visualization uses Pivot, a web UI for Druid.
The Apache Beam (incubating) blog has a post describing some of the work that was done to make the Beam connector for Apache Flink work with Flink's batch runner. Beam is the open-source SDK, originally from Google, that exposes a backend-agnostic data pipeline API.
Cask Hydrator is a tool for building data pipelines using a drag-and-drop UI. This tutorial shows how to use Hydrator to export data from MySQL into HDFS.
Databricks has a post on the new SQL subquery support in the upcoming Apache Spark 2.0 release. Interestingly, the post is written as a notebook, which is a straightforward way to present code and example data.
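For readers who have not used SQL subqueries, the sketch below shows three common forms: an uncorrelated scalar subquery, an IN predicate subquery, and a correlated scalar subquery. It runs against SQLite through Python's standard sqlite3 module rather than Spark, so it illustrates the SQL constructs only. The tables and data are hypothetical, and whether Spark 2.0 accepts each form exactly as written is not verified here; the Databricks notebook is the authority on that.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('ann', 'eng', 120), ('bo', 'eng', 100), ('cy', 'ops', 90), ('di', 'ops', 70);
    CREATE TABLE active_depts (dept TEXT);
    INSERT INTO active_depts VALUES ('ops');
""")


def names(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]


# Uncorrelated scalar subquery: the inner query yields one value (the average).
above_avg = names("""
    SELECT name FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
    ORDER BY name
""")

# IN predicate subquery: the inner query yields a set of values to match against.
in_active = names("""
    SELECT name FROM employees
    WHERE dept IN (SELECT dept FROM active_depts)
    ORDER BY name
""")

# Correlated scalar subquery: the inner query refers to the outer row (e.dept).
top_paid = names("""
    SELECT name FROM employees e
    WHERE salary = (SELECT MAX(i.salary) FROM employees i WHERE i.dept = e.dept)
    ORDER BY name
""")

print(above_avg, in_active, top_paid)
```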
The Apache Kudu (incubating) blog has a post on their use of Raft in single-node clusters, which allows for dynamic scaling to multi-master clusters.
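The reason Raft works on a single node follows from its majority rule: a configuration of n voters needs floor(n/2) + 1 of them to agree, so a lone voter is its own majority. The sketch below works through that arithmetic. It is a generic illustration of the Raft quorum rule, not Kudu's code, and the function names are hypothetical.

```python
def majority(voters: int) -> int:
    """Smallest number of voters that forms a Raft majority."""
    if voters < 1:
        raise ValueError("a Raft configuration needs at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority is still reachable."""
    return voters - majority(voters)


for n in (1, 3, 5):
    print(f"voters={n} majority={majority(n)} tolerated_failures={tolerated_failures(n)}")
```

With one voter the majority is one and no failures are tolerated; growing the configuration to three voters keeps the same protocol but allows one failure, which is what makes the single-node case a natural starting point for scaling up.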
This article points out that the Apache Spark community, if not careful, could suffer the same type of fragmentation that has caused confusion in the Apache Hadoop ecosystem. For example, the latest versions of CDH and HDP support different versions of Spark.
The New Stack has an article on Concord, which is a new stream processing framework (in public beta) built on Apache Mesos. Concord is written in C++ and supports dynamic topologies (scaling up/down parts of the pipeline without downtime).
On the heels of announcing general availability of Databricks Community Edition, Databricks has announced the first in a series of tutorials on writing Apache Spark applications using Databricks.
Hadoop Summit San Jose, which takes place in a little over a week, will feature a Women in Big Data Lunch and Panel. The Hortonworks blog features an interview with the panel moderator and Hortonworks CMO Ingrid Burton.
Apache SystemML (incubating) recently released version 0.10.0. SystemML is a machine learning framework that supports multiple backends, including Apache Spark and Apache Hadoop. This release includes new Spark Matrix Block types, support for deep learning, several performance enhancements, a new KNN algorithm, and much more.
Apache Mahout, another machine learning framework, has released version 0.12.2. This release makes some improvements towards the goal of integrating with Apache Zeppelin for visualization and notebook support.
Qubole has announced that their HBase-as-a-Service offering is now generally available on AWS. The offering has a number of nice features for a long-running cluster, supports Hannibal and other monitoring tools, integrates with Apache Zeppelin, and can be configured with OpenTSDB and Apache Phoenix through node bootstrap actions.
Altiscale has announced Altiscale Insight Cloud Real-Time Edition. The system is backed by Apache HBase and Spark Streaming.
hs2client is a new C++ library for Apache Hive and Apache Impala (incubating). In addition to C++, the library has Python bindings that support reading results into pandas DataFrames.
MapR has announced an Apache Spark 2.0 Developer Preview for their distribution.
Apache Beam has announced the 0.1.0-incubating release, the project's first since joining the Apache incubator.
Yandex has open sourced ClickHouse, a columnar analytics database. The system is built for both horizontal and vertical scaling. It supports complex data types (e.g. arrays) and can do approximate queries. The team has also published benchmark results in comparison to several other databases.
Curated by Datadog ( http://www.datadog.com )
Apache Bigtop & Apache Apex (San Jose) - Tuesday, June 21
Options for Ingest: Elasticsearch Ingest Node and Apache Airflow (Mountain View) - Wednesday, June 22
Open House: Big Data Processing with Apache Spark (San Francisco) - Thursday, June 23
Ted Dunning on Anomaly Detection (Redmond) - Thursday, June 23
Big Data at Twitter Scale (Boulder) - Thursday, June 23
June Tech Talk: PySpark (Austin) - Monday, June 20
Using Splunk for Your Big Data Along with Hadoop (Carrollton) - Tuesday, June 21
Invited Speaker Series: Cody Koeninger on Fundamentals of Spark and Kafka (Austin) - Thursday, June 23
Top 5 Mistakes When Writing Spark Applications (Chicago) - Monday, June 20
Apache Metron Overview and Codelab: Building the next Generation Cyber Security (Durham) - Thursday, June 23
High Concurrency, Low Latency Reporting Within Hadoop (Roswell) - Wednesday, June 22
Introduction to Spark In-Memory Computing (McLean) - Wednesday, June 22
Overview of Open Source Fast Data Platforms and Future Plans (Reston) - Wednesday, June 22
Interactive Data Analytics with Flink and Zeppelin (Reston) - Thursday, June 23
Distributed ML in Spark (New York) - Friday, June 24
Solving Analytics Problems in the Cloud w/ Spark, Presto, Hive (Boston) - Tuesday, June 21
Big Data Show and Tell (Reading) - Wednesday, June 22
How Spark Can Improve Your Hadoop Cluster (Hamburg) - Wednesday, June 22
Apache Flink Crash Course: Meet the Squirrel (Prague) - Thursday, June 23
Third Spark Meetup (Pune) - Wednesday, June 22
Understanding and Building Big Data Architectures, Part 3: Messaging/Kafka (Hyderabad) - Saturday, June 25