23 October 2016
This week's issue is short and sweet with a few technical posts, two interesting news articles, and several exciting releases (including Apache Kafka 0.10.1.0). With Spark Summit Europe this week, expect lots of great content in the next issue. And if you're attending, please send interesting slides/talks my way!
Cloudera's CDH has supported intra-node disk balancing since version 5.8.2 (the feature is also part of the Apache Hadoop 3.0.0 alpha release). With this feature, a data node can rebalance data blocks across its disks using the hdfs diskbalancer command. This post describes how the tool works and shows how to run it.
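To make the idea of intra-node balancing concrete, here is a minimal Python sketch. It is a toy model, not the HDFS implementation: the Disk class, the function names, and the greedy planning strategy are all assumptions made for illustration. It computes, for each disk, how far its used ratio is from the node-wide ideal, then lists the moves that would level the disks.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    # Hypothetical model of one data node volume (not an HDFS class).
    name: str
    capacity_gb: float
    used_gb: float


def ideal_ratio(disks):
    """Node-wide used ratio that every disk would have if perfectly balanced."""
    return sum(d.used_gb for d in disks) / sum(d.capacity_gb for d in disks)


def volume_densities(disks):
    """Map disk name -> ideal ratio minus actual used ratio.

    Under this convention a negative value marks an over-used disk and a
    positive value marks an under-used one.
    """
    ideal = ideal_ratio(disks)
    return {d.name: ideal - d.used_gb / d.capacity_gb for d in disks}


def plan_moves(disks):
    """Greedy plan: (source, destination, gigabytes) moves that level the disks."""
    ideal = ideal_ratio(disks)
    # Surplus in GB: positive means the disk holds more than its ideal share.
    surplus = {d.name: d.used_gb - ideal * d.capacity_gb for d in disks}
    donors = sorted(([n, s] for n, s in surplus.items() if s > 0),
                    key=lambda x: -x[1])
    receivers = sorted(([n, -s] for n, s in surplus.items() if s < 0),
                       key=lambda x: -x[1])
    moves = []
    while donors and receivers:
        src, dst = donors[0], receivers[0]
        amount = min(src[1], dst[1])
        moves.append((src[0], dst[0], round(amount, 2)))
        src[1] -= amount
        dst[1] -= amount
        # Drop any disk that has reached its ideal share.
        if src[1] < 1e-9:
            donors.pop(0)
        if dst[1] < 1e-9:
            receivers.pop(0)
    return moves


disks = [
    Disk("disk1", 1000, 800),
    Disk("disk2", 1000, 200),
    Disk("disk3", 1000, 500),
]
print(plan_moves(disks))  # [('disk1', 'disk2', 300.0)]
```

The real tool works from a plan as well: as far as I know, the command's -plan, -execute, and -query options generate a plan for a data node, run it, and report progress. The linked post is the authoritative source for the exact workflow.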
This post demonstrates the capabilities of the spark.ml library by building a logistic regression model to predict malignancy of cases from the Wisconsin Diagnostic Breast Cancer data set. The example code covers parsing, exploring a dataset with built-in statistics, extracting features from the input dataset, training the model, and evaluating the model.
The Amazon Big Data blog has a tutorial for running RStudio with sparklyr on EMR. Thanks to a bootstrap action, a cluster complete with RStudio running on the master can be launched with a single command.
The Databricks blog features a list of seven tips for debugging Apache Spark code on Databricks. Most of the suggestions, like "Scale up Spark jobs slowly for really large datasets" and "Examine the partitioning for your dataset," are generally applicable to all Spark users.
InfoQ has an interview with Yahoo VP of Engineering, Peter Cnudde. Topics covered include Hadoop, Spark adoption at Yahoo (mostly for in-memory computing, not for ETL), and Caffe-on-Spark for deep learning.
ZDNet contributor Tony Baer has read between the lines when it comes to recent benchmarks by Cloudera and Hortonworks. The takeaways are as follows: 1) "SQL's the gateway drug to Hadoop," 2) Cloudera is trying to challenge Amazon (in this case Redshift), and 3) Hortonworks (via Hive's LLAP, short for Live Long and Process) has caught up on the investment Cloudera made in Impala.
Apache Kafka 0.10.1.0 was released this week. It contains improvements from over 500 pull requests and the implementation of 15 Kafka Improvement Proposals. The Confluent blog has the highlights of additions/improvements to Kafka Server (time-based indexes, replication quotas, and improved log compaction), improvements to Kafka client APIs (interactive queries for Kafka Streams, improved memory management, secure quotas, and more), and bug fixes.
Apache Fluo (incubating) recently had its first release since entering the incubator. Fluo is a tool for making "incremental updates to large data sets stored in Apache Accumulo" à la Google's Percolator.
Apache Flume 1.7.0 was released. It adds support for a taildir source and includes a number of improvements and bug fixes. Many of these are around Flume's integration with Apache Kafka.
Apache NiFi 0.7.1 was released as a follow-up to July's 0.7.0 release (version 1.0.0 was also released recently, in August). This release adds a number of improvements and bug fixes.
Apache Giraph 1.2.0 was released. Highlights of the release include a new blocks API, support for graphs that don't fit in memory, and the addition of a new set of default configuration options based on Facebook's experience with Giraph.
deeplearning4j is a deep learning implementation that integrates with Hadoop and Spark and supports GPUs. Version 0.6.0 was recently released.
Curated by Datadog ( http://www.datadog.com )
Uber Engineering Tech Talk Series (San Francisco) - Monday, October 24
Real-Time Streaming and Exactly-Once Semantics with Kafka (San Francisco) - Tuesday, October 25
Building Your First Spark & C* App + SMACK Stack + The Cassandra Odyssey (San Francisco) - Wednesday, October 26
Apache YARN Committers/Contributors Meetup #4 (Sunnyvale) - Thursday, October 27
Kafka Palooza: LinkedIn, Microsoft Azure, MapR (Bellevue) - Monday, October 24
PixieDust: Making Python Visualizations Easier for Jupyter Notebooks with Spark (Las Vegas) - Monday, October 24
O&G Big Data Use Cases, by Hortonworks (Houston) - Thursday, October 27
Using Data Quality to Support Analytics in Hadoop (Overland Park) - Tuesday, October 25
Using Data Quality to Support Analytics in Hadoop (Kansas City) - Tuesday, October 25
Big Data Streaming Platform Ecosystem (Chicago) - Tuesday, October 25
Apache Spark 101 (Chicago) - Tuesday, October 25
October Edition of MOHUG (Dublin) - Tuesday, October 25
Apache Spark (Miami) - Wednesday, October 26
Lambda-in-a-Box: Merging Apache Spark & HBase into an Open-Source Database (New York) - Thursday, October 27
October Data Engineering Meetup (New York) - Thursday, October 27
Toronto Apache Spark #14 (Toronto) - Wednesday, October 26
Introduction to MapR (Toronto) - Thursday, October 27
Why SMACK for Fast Data (London) - Monday, October 24
Building Scalable Systems in a Changing Data Landscape (London) - Tuesday, October 25
Spark Structured Streaming in Practice (London) - Wednesday, October 26
Season Premiere with Reynold Xin, Co-Founder & Chief Architect at Databricks (Barcelona) - Thursday, October 27
Introduction to Kafka (Malaga) - Friday, October 28
Spark Pre-Summit Meetup (Brussels) - Tuesday, October 25
Meeting on Streamsets, Datameer and Kudu (Kontich) - Tuesday, October 25
Spark & Machine Learning Meetup (Brussels) - Thursday, October 27
Introduction to Spark & Use Cases (Hyderabad) - Monday, October 24
Rethink SQL for Big Data with Apache Drill (Barton) - Tuesday, October 25
Spark Meetup October (Sydney) - Wednesday, October 26
Rethink SQL for Big Data with Apache Drill (Melbourne) - Thursday, October 27
Big Data: Spark and TensorFlow (Tallinn) - Monday, October 24