23 October 2016
This week's issue is short and sweet with a few technical posts, two interesting news articles, and several exciting releases (including Apache Kafka 0.10.1.0). With Spark Summit Europe this week, expect lots of great content in the next issue. And if you're attending, please send interesting slides/talks my way!
Cloudera's CDH has supported intra-node disk balancing since version 5.8.2 (it's also part of the Apache Hadoop 3.0.0 alpha release). Using this feature, a data node can rebalance data blocks across its disks using the hdfs diskbalancer command. This post describes how the tool works and shows how to run it.
This post demonstrates the capabilities of the spark.ml library by building a logistic regression model to predict malignancy of cases from the Wisconsin Diagnostic Breast Cancer data set. The example code covers parsing, exploring a dataset with built-in statistics, extracting features from the input dataset, training the model, and evaluating the model.
The Amazon Big Data blog has a tutorial for running RStudio with sparklyr on EMR. Thanks to a bootstrap action, a cluster complete with RStudio running on the master can be launched with a single command.
https://aws.amazon.com/blogs/big-data/running-sparklyr-rstudios-r-interface-to-spark-on-amazon-emr/
The Databricks blog features a list of seven tips for debugging Apache Spark code on Databricks. Most of the suggestions, like "Scale up Spark jobs slowly for really large datasets" and "Examine the partitioning for your dataset," are generally applicable to all Spark users.
https://databricks.com/blog/2016/10/18/7-tips-to-debug-apache-spark-code-faster-with-databricks.html
InfoQ has an interview with Yahoo VP of Engineering, Peter Cnudde. Topics covered include Hadoop, Spark adoption at Yahoo (mostly for in-memory computing, not for ETL), and Caffe-on-Spark for deep learning.
https://www.infoq.com/articles/peter-cnudde-yahoo-big-data
ZDNet contributor Tony Baer has read between the lines when it comes to recent benchmarks by Cloudera and Hortonworks. The takeaways are as follows: 1) "SQL's the gateway drug to Hadoop," 2) Cloudera is trying to challenge Amazon (in this case Redshift), and 3) Hortonworks (via Hive's LLAP, short for Live Long and Process) has caught up on the investment Cloudera made in Impala.
http://www.zdnet.com/article/sql-on-hadoop-benchmarks-get-serious/
Apache Kafka 0.10.1.0 was released this week. It contains improvements from over 500 pull requests and the implementation of 15 Kafka Improvement Proposals. The Confluent blog has the highlights of additions/improvements to Kafka Server (time-based indexes, replication quotas, and improved log compaction), improvements to Kafka client APIs (interactive queries for Kafka Streams, improved memory management, secure quotas, and more), and bug fixes.
http://mail-archives.apache.org/mod_mbox/kafka-users/201610.mbox/%3CCAJL4t_oz9q4T9vn6Z-EBoazWJFyqHw4Y0L-PTowD%2BpFhcPv0VQ%40mail.gmail.com%3E
http://www.confluent.io/blog/announcing-apache-kafka-0-10-1-0/
Apache Fluo (incubating) recently had its first release since entering the incubator. Fluo is a tool for making "incremental updates to large data sets stored in Apache Accumulo" à la Google's Percolator.
https://fluo.apache.org/release/fluo-1.0.0-incubating/
Apache Flume 1.7.0 was released. It adds support for a taildir source and includes a number of improvements and bug fixes. Many of these are around Flume's integration with Apache Kafka.
http://flume.apache.org/releases/1.7.0.html
Apache NiFi 0.7.1 was released as a follow-up to July's 0.7.0 release (version 1.0.0 was also released recently, in August). This release adds a number of improvements and bug fixes.
https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version0.7.1
Apache Giraph 1.2.0 was released. Highlights of the release include a new blocks API, support for graphs that don't fit in memory, and the addition of a new set of default configuration options based on Facebook's experience with Giraph.
https://blogs.apache.org/giraph/entry/giraph_1_2_0_release
deeplearning4j is a deep learning implementation that integrates with Hadoop and Spark and supports GPUs. Version 0.6.0 was recently released.
https://github.com/deeplearning4j/deeplearning4j
Curated by Datadog ( http://www.datadog.com )
Uber Engineering Tech Talk Series (San Francisco) - Monday, October 24
http://www.meetup.com/UberEvents/events/234789134/
Real-Time Streaming and Exactly-Once Semantics with Kafka (San Francisco) - Tuesday, October 25
http://www.meetup.com/MemSQL/events/234405914/
Building Your First Spark & C* App + SMACK Stack + The Cassandra Odyssey (San Francisco) - Wednesday, October 26
http://www.meetup.com/SF-Spark-and-Friends/events/234932979/
Apache YARN Committers/Contributors Meetup #4 (Sunnyvale) - Thursday, October 27
http://www.meetup.com/Hadoop-Contributors/events/234971372/
Kafka Palooza: LinkedIn, Microsoft Azure, MapR (Bellevue) - Monday, October 24
http://www.meetup.com/Seattle-Apache-Kafka-Meetup/events/234836624/
PixieDust: Making Python Visualizations Easier for Jupyter Notebooks with Spark (Las Vegas) - Monday, October 24
http://www.meetup.com/Data-Science-Las-Vegas/events/234557659/
O&G Big Data Use Cases, by Hortonworks (Houston) - Thursday, October 27
http://www.meetup.com/Houston-Hadoop-Meetup-Group/events/234282996/
Using Data Quality to Support Analytics in Hadoop (Overland Park) - Tuesday, October 25
http://www.meetup.com/Kansas-City-Big-Data-Projects-Group/events/234597551/
Using Data Quality to Support Analytics in Hadoop (Kansas City) - Tuesday, October 25
http://www.meetup.com/Kansas-City-Big-Data-Projects-Group/events/234597347/
Big Data Streaming Platform Ecosystem (Chicago) - Tuesday, October 25
http://www.meetup.com/ChicagoRealTimeStreamingAnalytics/events/234676872/
Apache Spark 101 (Chicago) - Tuesday, October 25
http://www.meetup.com/Chicago-Spark-Users/events/233999667/
October Edition of MOHUG (Dublin) - Tuesday, October 25
http://www.meetup.com/MOHUG-Mid-Ohio-Hadoop-User-Group/events/234416779/
Apache Spark (Miami) - Wednesday, October 26
http://www.meetup.com/Miami-Hadoop-User-Group/events/234992451/
Lambda-in-a-Box: Merging Apache Spark & HBase into an Open-Source Database (New York) - Thursday, October 27
http://www.meetup.com/mysqlnyc/events/233775657/
October Data Engineering Meetup (New York) - Thursday, October 27
http://www.meetup.com/NYC-Data-Engineering/events/234946410/
Toronto Apache Spark #14 (Toronto) - Wednesday, October 26
http://www.meetup.com/Toronto-Apache-Spark/events/234878620/
Introduction to MapR (Toronto) - Thursday, October 27
http://www.meetup.com/Toronto-MapR-User-Group/events/231648976/
Why SMACK for Fast Data (London) - Monday, October 24
http://www.meetup.com/skillsmatter/events/234588911/
Building Scalable Systems in a Changing Data Landscape (London) - Tuesday, October 25
http://www.meetup.com/data-science-lab/events/234754144/
Spark Structured Streaming in Practice (London) - Wednesday, October 26
http://www.meetup.com/hadoop-users-group-uk/events/234876912/
Season Premiere with Reynold Xin, Co-Founder & Chief Architect at Databricks (Barcelona) - Thursday, October 27
http://www.meetup.com/Spark-Barcelona/events/234463208/
Introduction to Kafka (Malaga) - Friday, October 28
http://www.meetup.com/Linux-Malaga/events/234826330/
Spark Pre-Summit Meetup (Brussels) - Tuesday, October 25
http://www.meetup.com/Spark-Belgium/events/234234256/
Meeting on Streamsets, Datameer and Kudu (Kontich) - Tuesday, October 25
http://www.meetup.com/Belgium-Cloudera-User-Group/events/234618841/
Spark & Machine Learning Meetup (Brussels) - Thursday, October 27
http://www.meetup.com/Data-Science-Community-Meetup/events/234173917/
Introduction to Spark & Use Cases (Hyderabad) - Monday, October 24
http://www.meetup.com/meetup-group-ytFpRTDs/events/234412261/
Rethink SQL for Big Data with Apache Drill (Barton) - Tuesday, October 25
http://www.meetup.com/Canberra-Big-Data-Converged-SQL-NoSQL-and-Real-Time/events/233463561/
Spark Meetup October (Sydney) - Wednesday, October 26
http://www.meetup.com/Sydney-Apache-Spark-User-Group/events/233723585/
Rethink SQL for Big Data with Apache Drill (Melbourne) - Thursday, October 27
http://www.meetup.com/Melbourne-Big-Data-Converged-SQL-NoSQL-and-Real-Time/events/233463459/
Big Data: Spark and TensorFlow (Tallinn) - Monday, October 24
http://www.meetup.com/Advanced-Java-Estonia/events/234612322/