15 October 2017
Technical highlights in this week's issue include a post on cross-data center replication with Apache Pulsar (incubating) and a new SQL stream processing tool from Uber called AthenaX. There are also a number of benchmarks in this week's issue and several conference-related announcements (videos posted from Big Data LA, CFP for Big Data Technology Warsaw, and a preview of Spark Summit EU).
Hortonworks uses 21,000 compute hours a day to run tests for Hadoop ecosystem projects. The majority of these are nightly automated tests, but there are four testing environments in total. This post describes more about the testing stages.
https://hortonworks.com/blog/automated-validation-apache-hadoop-ecosystem/
Fivetran, which builds a product for syncing data into a data warehouse, has written up a benchmark comparing Redshift, Snowflake, and BigQuery. It shows that for a client's typical use case (lots of tables), all the systems behave similarly. While one should always be careful about reading too much into benchmarks, in this case they also provide commentary on several previous benchmarks to explain why their analysis differs.
https://blog.fivetran.com/warehouse-benchmark-dce9f4c529c1
In another benchmark, the Cloudera blog considers the performance characteristics of the Azure Data Lake Store (ADLS). Ultimately, they find that ADLS is comparable to the Azure disk storage using the Teragen/Terasort/Teravalidate benchmark to measure throughput.
http://blog.cloudera.com/blog/2017/10/a-look-at-adls-performance-throughput-and-scalability/
This post describes how to perform cross-data center replication with Apache Pulsar (incubating). It covers the commands needed to set up replication, how to override it on a per-application basis, how to monitor it, how to limit replication bandwidth, and more.
https://streaml.io/blog/geo-replication-patterns-practices/
After hello world, the "twitter data" tutorial seems to be the next example used for most streaming systems. In this case, Confluent has a demo of analyzing tweets in Kafka using KSQL.
https://www.confluent.io/blog/using-ksql-to-analyse-query-and-transform-data-in-kafka
One final benchmark for this week—Databricks has compared Spark streaming to Flink and Kafka Streams.
Qubole has a post that shows how to use PyHive from a Jupyter notebook to run queries against a Hive or Presto cluster. While parts of the post are Qubole-specific, the instructions should work as long as HiveServer2 is enabled on your cluster.
https://www.qubole.com/blog/hive-presto-clusters-jupyter-aws-azure-oracle/
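As a rough illustration of the PyHive pattern (not code from the Qubole post), here is a minimal sketch of running a query through PyHive's DB-API interface. The helper names `connect_hive` and `run_query` and the table in the demo are hypothetical. The demo uses SQLite as a stand-in so it runs without a cluster, and the same `run_query` helper should work on a connection returned by `pyhive.hive.connect` or `pyhive.presto.connect`, assuming the server is reachable.

```python
import sqlite3  # stand-in DB-API driver, used only for the local demo


def connect_hive(host, port=10000, username=None):
    """Open a HiveServer2 connection with PyHive (assumes `pip install pyhive`)."""
    # Imported lazily so the local demo below runs without PyHive installed.
    from pyhive import hive
    return hive.connect(host=host, port=port, username=username)


def run_query(conn, sql):
    """Run a query on any DB-API 2.0 connection; return (column_names, rows)."""
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
        columns = [col[0] for col in cursor.description]
        rows = cursor.fetchall()
    finally:
        cursor.close()
    return columns, rows


if __name__ == "__main__":
    # Local demo: SQLite in place of a Hive cluster (hypothetical table).
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE plays (song TEXT, n INTEGER)")
    demo.executemany("INSERT INTO plays VALUES (?, ?)", [("a", 3), ("b", 5)])
    print(run_query(demo, "SELECT song, n FROM plays ORDER BY n DESC"))
```

Against a real cluster, the call would look something like `run_query(connect_hive("hive.example.internal"), "SHOW TABLES")`, where the hostname is a placeholder. In a notebook, the returned columns and rows can be passed to a DataFrame constructor for further analysis.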
This tutorial is a tour of both tools and machine learning models for ultimately predicting whether a song will make the Billboard top 10. On the tools front, RStudio, H2O.ai, and Amazon Athena are used for analysis with GLM, GBM, and deep learning models. The walkthrough includes a bunch of code and example output.
Kontena has a post on using Kafka with microservices. It provides a good overview of how Kafka's distributed log semantics and data retention policies can power event sourcing microservices.
http://blog.kontena.io/event-sourcing-microservices-with-kafka/
Videos from Big Data LA, which took place in August, have been posted online.
https://www.bigdatadayla.com/#slides
The Roaring Elephant podcast has an episode this week that recaps a number of sessions from the recent Dataworks Summit in Sydney. Topics covered include Hadoop 3.0, SparkR and Kerberos.
https://roaringelephant.org/2017/10/10/episode-56-dataworks-summit-sydney-recap-by-dave-part-1/
The Call for Papers for Big Data Technology Warsaw, which takes place in February, ends tomorrow. This post has more about what to expect from the conference.
http://getindata.com/2-3-reasons-become-speaker-big-data-technology-warsaw-summit-2018/
Spark Summit EU takes place in just over a week in Dublin. The Databricks blog has more details and a discount code if you haven't yet signed up.
A CVE for Apache NiFi was announced. Some versions prior to 1.4.0 (released earlier this month) are subject to an XML External Entity attack.
A CVE for Apache ZooKeeper was also announced. In this case, certain commands can cause a denial of service. More details about affected versions and mitigation are in the announcement.
Uber has open-sourced their stream processing framework AthenaX, which allows users to write stream jobs in SQL. The announcement describes the motivation and the high-level architecture of AthenaX, which is built on Apache Flink and Apache Kafka. It also uses Apache YARN for deployment. Uber has over 220 applications in multiple data centers running via AthenaX.
Apache Phoenix 4.12 was released. It includes resolution of over 100 issues, a new approximate count distinct function, a new table sampling feature, improvements to secondary indexing, and more.
https://blogs.apache.org/phoenix/entry/announcing-phoenix-4-12-released
Cloudera Director 2.6 was released. Among the changes, it includes new TLS capabilities, improvements related to SSH keys, and improvements to the Microsoft Azure plugin.
http://blog.cloudera.com/blog/2017/10/whats-new-in-cloudera-director-2-6/
Gluon is a new machine learning library from AWS and Microsoft. It's available today as part of Apache MXNet.
Curated by Datadog ( http://www.datadog.com )
Tensorflow on Apache Hadoop YARN (San Francisco) - Wednesday, October 18
https://www.meetup.com/SF-Big-Analytics/events/243896061/
Apache Ignite: Building Consistent, HA Distributed Systems & Memory-Centric SQL (San Ramon) - Thursday, October 19
https://www.meetup.com/datariders/events/243704206/
The Evolving Landscape of Data Engineering & How Systems Fail (San Francisco) - Thursday, October 19
https://www.meetup.com/Data-Engineering-Club/events/243137230/
Spark MLlib (Salt Lake City) - Monday, October 16
https://www.meetup.com/apache-spark-slc/events/243934568/
KSQL and Data Management with Kafka (Minneapolis) - Wednesday, October 18
https://www.meetup.com/TwinCities-Apache-Kafka/events/243858653/
Leveraging Messaging Platforms Such as Kafka for Real-Time Streaming Transaction (Atlanta) - Thursday, October 19
https://www.meetup.com/BigData-Atlanta/events/244083390/
Real-World Deployments with Apache Kafka (Raleigh) - Tuesday, October 17
https://www.meetup.com/Raleigh-Apache-Kafka-Meetup-by-Confluent/events/243894809/
Spring for Apache Kafka (Richmond) - Wednesday, October 18
https://www.meetup.com/Richmond-Java-Users-Group/events/243370292/
Exactly-Once Semantics in Apache Kafka (New York) - Monday, October 16
https://www.meetup.com/Apache-Kafka-NYC/events/243901137/
IRELAND
Let's Talk Kafka Streams (Dublin) - Thursday, October 19
https://www.meetup.com/hadoop-user-group-ireland/events/243387159/
Stream SQL and Real-Time Applications with Apache Flink (London) - Wednesday, October 18
https://www.meetup.com/Apache-Flink-London-Meetup/events/244056097/
Big Data Analytics for Small- and Medium-Size Enterprises (Oslo) - Tuesday, October 17
https://www.meetup.com/OsloBigDataDay/events/237398012/
Helsinki Apache Kafka Meetup October (Helsinki) - Wednesday, October 18
https://www.meetup.com/Helsinki-Apache-Kafka-Meetup/events/243673560/
Stream SQL and Real-Time Applications with Apache Flink (Neuilly-Sur-Seine) - Thursday, October 19
https://www.meetup.com/Paris-Fast-Data-Meetup/events/244055726/
Stream Processing and War Stories (Berlin) - Thursday, October 19
https://www.meetup.com/fast-data-berlin/events/243706029/
Apache Kafka Streams: Building Distributed, Fault-Tolerant Processing Apps (Tel Aviv-Yafo) - Wednesday, October 18
https://www.meetup.com/Apache-Kafka-TLV/events/243455001/