Data Eng Weekly


Hadoop Weekly Issue #188

25 September 2016

Lots of releases this week: CouchDB, Accumulo, Kylin, and Osso (a new OSS project from Rocana), but most notably Apache Kudu hit version 1.0. There's a bit less technical content and general news than usual, but that's to be expected. With Strata + Hadoop World taking place this week in NYC, get ready for tons of news in the next issue.

Technical

The Cloudera blog has a post on the recently released Apache Hadoop 3.0.0-alpha1. It describes several of the features of the release, including HDFS erasure coding, v.2 of the YARN Timeline Service, and the shell script rewrite.

http://blog.cloudera.com/blog/2016/09/getting-to-know-the-apache-hadoop-3-alpha/
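
For readers new to erasure coding, here is a minimal sketch of the idea in Python. It uses plain XOR parity, the simplest erasure code, as a stand-in; it is not the codec HDFS uses, and the function names (xor_blocks, encode, recover) are made up for illustration. One parity block lets any single lost data block be rebuilt, at a storage cost well below keeping full replicas.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def encode(data_blocks):
    """Compute a single parity block for a stripe of data blocks."""
    return xor_blocks(data_blocks)

def recover(surviving_blocks, parity):
    """Rebuild the one missing data block from the survivors plus parity."""
    return xor_blocks(surviving_blocks + [parity])

# Illustrative stripe of three equal-length blocks (hypothetical data).
data = [b"hado", b"op-3", b"demo"]
parity = encode(data)

# Pretend the middle block was lost and rebuild it.
rebuilt = recover([data[0], data[2]], parity)
assert rebuilt == data[1]
```

Here three data blocks plus one parity block take 4/3 of the raw size, compared with 3x for triple replication, though XOR alone only tolerates a single loss per stripe.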

MapR has posted a whiteboard walkthrough on how Apache Flink handles event time for stream processing. In addition to the video, there's a transcript of the presentation.

https://www.mapr.com/blog/event-time-apache-flink-stream-processing-whiteboard-walkthrough
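
As a rough illustration of event time (this is plain Python, not Flink's API), the sketch below assigns out-of-order events to tumbling windows by the timestamp carried in each event, and fires a window only once a watermark passes its end. The window size, the allowed delay, and the function name run are all assumptions chosen for the example.

```python
from collections import defaultdict

WINDOW_MS = 10_000     # tumbling window size (assumed for the example)
MAX_DELAY_MS = 5_000   # how far out of order events may arrive (assumed)

def run(events):
    """Process (event_time_ms, value) pairs in arrival order.

    Returns (fired, still_open): fired is a list of (window_start, sum)
    in firing order; still_open maps window_start to its partial sum.
    """
    open_windows = defaultdict(int)
    fired = []
    watermark = float("-inf")
    for ts, value in events:
        start = ts - ts % WINDOW_MS
        if start + WINDOW_MS <= watermark:
            continue  # window already fired; this sketch drops late events
        open_windows[start] += value
        # The watermark trails the largest event time seen so far.
        watermark = max(watermark, ts - MAX_DELAY_MS)
        for w in sorted(open_windows):
            if w + WINDOW_MS <= watermark:
                fired.append((w, open_windows.pop(w)))
    return fired, dict(open_windows)

# Arrival order differs from event-time order.
events = [(1000, 1), (12000, 1), (3000, 1), (16000, 1), (4000, 1), (27000, 1)]
fired, still_open = run(events)
```

The event at 3000 arrives after the one at 12000 but still lands in the first window, because the watermark has not yet passed 10000. The event at 4000 arrives after that window fired and is dropped.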

This post is a great walkthrough of Apache Drill. It covers a bunch of topics, including: quoting reserved keywords, interpreting/fixing JSON parse errors, use of subqueries, conveniences for querying CSV, a basic overview of Drill's web interface, plugin configuration, querying an RDBMS, and analyzing a query plan.

https://www.mapr.com/blog/how-guide-getting-started-apache-drill

Cloudera has published a post comparing Apache Impala and Amazon Redshift. There's an overview of key differences, but the main focus is a performance and cost comparison. As always, these results shouldn't be viewed as necessarily representative (each dataset is different). With that said, using a TPC-DS-derived workload, they show that Impala can often beat Redshift in cost and performance.

http://blog.cloudera.com/blog/2016/09/apache-impala-incubating-vs-amazon-redshift-s3-integration-elasticity-agility-and-cost-performance-benefits-on-aws/

The StreamSets blog has a post arguing that Apache Kudu's support for efficient real-time access and atomic updates provides an alternative to the lambda architecture.

https://streamsets.com/blog/post-lambda-world-apache-kudu/

This post describes some of the challenges of moving a data science research project into a production data pipeline. The author argues that it's important for developers and data scientists to work together to integrate quickly.

https://www.oreilly.com/ideas/what-is-hardcore-data-science-in-practice

News

IBM Power systems are getting support for Apache Hadoop through an IBM partnership with Hortonworks.

http://www.prnewswire.com/news-releases/hortonworks-ibm-collaborate-to-offer-open-source-distribution-on-power-systems-300330299.html

dataArtisans have announced the dA Platform, which is a distribution of Apache Flink with enterprise support.

http://data-artisans.com/announcing-the-da-platform-our-distribution-of-apache-flink/

Oracle and Qubole announced a partnership to bring the Qubole big data as a service offering to the Oracle Cloud Platform.

https://www.qubole.com/blog/qubole-and-oracle/

Omid is a transaction manager for Apache HBase that was recently accepted into the Apache Incubator after a proposal from Yahoo. It provides snapshot isolation guarantees and is suited to high-performance environments (supporting over 100k transactions/second).

http://yahoohadoop.tumblr.com/post/150821732246/omids-first-step-in-the-apache-community
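
To make "snapshot isolation" concrete, here is a toy in-memory sketch in Python. It is not Omid's implementation or API; the Store, Txn, and ConflictError names are hypothetical. Each transaction reads the versions committed before it started, and a commit is rejected if another transaction has committed a write to the same key in the meantime.

```python
class ConflictError(Exception):
    """Raised when a write-write conflict is detected at commit time."""

class Store:
    def __init__(self):
        self.clock = 0       # single logical clock for start/commit timestamps
        self.versions = {}   # key -> list of (commit_ts, value), ascending

    def begin(self):
        self.clock += 1
        return Txn(self, self.clock)

class Txn:
    def __init__(self, store, start_ts):
        self.store, self.start_ts, self.writes = store, start_ts, {}

    def read(self, key):
        if key in self.writes:            # read your own uncommitted writes
            return self.writes[key]
        for commit_ts, value in reversed(self.store.versions.get(key, [])):
            if commit_ts < self.start_ts:  # newest version in the snapshot
                return value
        return None

    def write(self, key, value):
        self.writes[key] = value          # buffered until commit

    def commit(self):
        for key in self.writes:
            history = self.store.versions.get(key, [])
            if history and history[-1][0] > self.start_ts:
                raise ConflictError(key)  # someone committed this key first
        self.store.clock += 1
        commit_ts = self.store.clock
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((commit_ts, value))
        return commit_ts

store = Store()
setup = store.begin()
setup.write("x", 1)
setup.commit()

t1, t2 = store.begin(), store.begin()
t1.write("x", 2)
t1.commit()
seen_by_t2 = t2.read("x")   # t2 still sees its snapshot, not t1's write
```

In this sketch t2 keeps reading the value from its snapshot even after t1 commits, and if t2 then tries to write the same key its commit raises ConflictError.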

Releases

Rocana has open sourced Osso, which is a new semi-structured event format. Built on Avro, the standard is meant to be easy, intuitive, efficient, and complementary to existing solutions.

http://www.osso-project.org/

The Google Cloud Platform blog has highlighted three integrations related to Kafka. The Google Cloud Pub/Sub connectors offer a mechanism for moving data between Pub/Sub and Kafka, the KafkaIO connector for Apache Beam allows Beam systems to consume from Kafka, and the Kafka to BigQuery connector can be used to mirror data to BigQuery.

https://cloud.google.com/blog/big-data/2016/09/apache-kafka-for-gcp-users-connectors-for-pubsub-dataflow-and-bigquery

Version 2.0 of Apache CouchDB was released this week. Highlights of the release include new clustering, a new query language, and a rewritten admin interface.

https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces99

Apache Kudu announced version 1.0 this week. The release includes support for HA Kudu Master, a rewritten Apache Spark integration, an official client library for Python, and more. To mark the occasion, the Cloudera blog has an overview of the history of the project and a look at its future.

https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces100
http://vision.cloudera.com/apache-kudu-1-0-is-released/

Apache Accumulo 1.6.6 includes a data loss fix, a fix for DataNode decommission, dependency upgrades, and more.

https://accumulo.apache.org/release_notes/1.6.6

Amazon EMR now supports security configurations to enable encryption for data at rest and in transit. The post has an example of configuring the encryption providers.

http://blogs.aws.amazon.com/bigdata/post/Tx31P2UUJKR4ONF/Encrypt-Data-At-Rest-and-In-Flight-on-Amazon-EMR-with-Security-Configurations

Version 1.5.4 of Apache Kylin, the OLAP engine for Hadoop, was released.

http://mail-archives.us.apache.org/mod_mbox/www-announce/201609.mbox/%3CCANfpUctGUDjsNVoe_Pd1CJF4Ebh8ne2NSzZBaaYsj2d7M4rq6Q@mail.gmail.com%3E

Amazon Web Services has open-sourced the Amazon EMR-DynamoDB connector.

http://blogs.aws.amazon.com/bigdata/post/Tx1LFQWRADHKT44/Amazon-EMR-DynamoDB-Connector-Repository-on-AWSLabs-GitHub

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Apache Spark Meetup (San Francisco) - Tuesday, September 27
http://www.meetup.com/spark-users/events/233723499/

Azure 101: Hadoop on Cloud (Mountain View) - Wednesday, September 28
http://www.meetup.com/Microsoft-Azure-Open-Group/events/234105376/

Washington

Scaling Recommenders + Content Embeddings at Facebook (Seattle) - Wednesday, September 28
http://www.meetup.com/Seattle-Scalability-Meetup/events/231640322/

Colorado

Apache Nifi (Lafayette) - Monday, September 26
http://www.meetup.com/Lafayette-CO-Tech/events/234001885/

New York

Hadoop Security and Governance with Apache Ranger and Apache Atlas (Manhattan) - Wednesday, September 28
http://www.meetup.com/futureofdata-newyork/events/234153727/

Texas

Big Data & Data Science Workshop Using Apache Spark (Houston) - Monday, September 26
http://www.meetup.com/Houston-Spark-Meetup/events/234198876/

Georgia

Diving Into Big Data Technologies: Hadoop, Hive, and Apache NiFi (Atlanta) - Thursday, September 29
http://www.meetup.com/Technologists/events/231068842/

District of Columbia

“Data Analytics with Hadoop” Book Release Celebration (Washington) - Monday, September 26
http://www.meetup.com/Data-Community-DC/events/234075049/

New York

HBaseCon East 2016 (New York) - Monday, September 26
http://www.meetup.com/HBase-NYC/events/233024937/

Intro to Apache Kudu: Fast Analytics on Fast Data (New York) - Tuesday, September 27
http://www.meetup.com/mysqlnyc/events/233599664/

The Stream Processor as a Database (New York) - Wednesday, September 28
http://www.meetup.com/NYCRealTimeStreamingAnalytics/events/234329394/

NORWAY

Let’s Get Started with Hadoop #9 (Oslo) - Thursday, September 29
http://www.meetup.com/Oslo-Hadoop-Big-Data-Meetup/events/231886409/

FRANCE

Criteo Labs Tech Talks Session 3 (Paris) - Wednesday, September 28
http://www.meetup.com/Criteo-Labs-Tech-Talks/events/234001806/

NETHERLANDS

Introduction to Apache Flink (Amsterdam) - Thursday, September 29
http://www.meetup.com/Apache-Flink-Meetup-Amsterdam/events/233817119/

GERMANY

Data Engineering on AWS by Thorsten Greiner (Dusseldorf) - Thursday, September 29
http://www.meetup.com/Dusseldorf-Data-Science-Meetup/events/234016369/

POLAND

Hands-On Introduction to Apache Spark & Apache Zeppelin (Gdansk) - Wednesday, September 28
http://www.meetup.com/futureofdata-gdansk/events/233501975/

ISRAEL

Practical Distributed Stream Processing with Akka Streams (Tel Aviv-Yafo) - Tuesday, September 27
http://www.meetup.com/underscore/events/234017005/

INDIA

Discuss Key Emerging Big Data Technologies (Bangalore) - Thursday, September 29
http://www.meetup.com/Emerging-Big-Data-Technologies-Meetup/events/231614230/

Introduction to Hadoop, Yarn, HDFS Students Only (Pune) - Friday, September 30
http://www.meetup.com/Apache-Apex-Pune/events/234087397/