Hadoop Weekly Issue #224

16 July 2017

With summer taking off, the volume of content has slowed down a bit (please reach out if you do find something I should include!). This week's issue is again a short one. The highlights include new releases of Apache Spark, IBM BigSQL, and Cloudera Enterprise. In addition, there's a new implementation of Kafka Streams in Python, the schedule for Flink Forward is posted, and there are several technical posts covering Spark, Kafka, Kinesis, StreamSets, and HBase.

Technical

Although you should keep in mind that these benchmarks are from a vendor and don't necessarily extrapolate to other workloads, Databricks has shown that their platform is faster than vanilla Spark, Presto, and Impala.

https://databricks.com/blog/2017/07/12/benchmarking-big-data-sql-platforms-in-the-cloud.html

This post compares and contrasts Apache Kafka and Amazon Kinesis. Since Kinesis is a SaaS product, it compares favorably in terms of operational complexity. With that said, Kinesis has tighter limits on throughput (although its throughput is at least predictable). The two share similar sets of high-level APIs for moving data and doing analysis (e.g. Kafka Connect/Kinesis Firehose and Kafka Streams/Kinesis Analytics). The post also discusses architectural and pricing considerations.

http://www.jesse-anderson.com/2017/07/apache-kafka-and-amazon-kinesis/
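
As a rough illustration of why the Kinesis throughput limits mentioned above are predictable, here is a minimal sizing sketch. It is not taken from the linked post. It assumes the per-shard write limits that AWS documents for Kinesis streams (1,000 records per second and 1 MB per second, with 1 MB treated here as 1,000,000 bytes); the function name and parameters are hypothetical, and the current service quotas should be checked before relying on the numbers.

```python
import math


def shards_needed(records_per_sec, avg_record_bytes,
                  max_records_per_shard=1000,
                  max_bytes_per_shard=1_000_000):
    """Estimate how many shards a write workload needs.

    The two per-shard limits are assumptions based on the published
    Kinesis write quotas; override them if the quotas differ.
    """
    # Shards required to stay under the records-per-second limit.
    by_count = math.ceil(records_per_sec / max_records_per_shard)
    # Shards required to stay under the bytes-per-second limit.
    by_bytes = math.ceil(records_per_sec * avg_record_bytes / max_bytes_per_shard)
    # A stream always has at least one shard; the stricter limit wins.
    return max(1, by_count, by_bytes)


if __name__ == "__main__":
    # Many small records: the record-count limit dominates.
    print(shards_needed(5000, 500))
    # Fewer, larger records: the byte limit dominates.
    print(shards_needed(2000, 2000))
```

The estimate covers writes only and ignores read-side limits, uneven partition keys, and headroom for spikes, so treat it as a lower bound rather than a capacity plan.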

In order to improve their product and provide operational support, Qubole collects metrics and data from a number of data sources across its product suite of Hive, YARN, and Presto. Collected data include events and metrics from the optimizer, the execution engine, the storage layer, and the cluster management layer. They load all of this data into QDS for analysis using a combination of Sqoop, Kafka, and a custom-built metrics service. The post has a few more details about the data they collect and what kinds of things they do with it.

http://www.qubole.com/blog/building-qdsair-infrastructure/

The StreamSets blog has a post on deploying the StreamSets Data Collector (SDC) with Kubernetes. SDC can be deployed as a stateful application or it can rely on a cloud service (Dataflow Performance Manager) for storing state, so the example is a bit more interesting than some of the common Kubernetes tutorials out there.

https://streamsets.com/blog/scaling-streamsets-kubernetes/

This tutorial walks through loading the openFDA dataset into Spark, transforming it to Parquet files in S3, querying it with Amazon Athena, and using R and Python for additional analysis.

https://aws.amazon.com/blogs/big-data/analyze-openfda-data-in-r-with-amazon-s3-and-amazon-athena/

The Apache Software Foundation blog has a post describing several HBase application archetypes. It details four different use cases, including storage of documents, graphs, queues, and metrics. For each, there is an example or two of how the schema/structure can be defined as well as a reference out to an application that is designed to use HBase in this way.

https://blogs.apache.org/hbase/entry/hbase-application-archetypes-redux2-part

News

Two CVEs were announced for Apache Impala (incubating). The mitigation for both is to update to version 2.9.0.

https://mail-archives.apache.org/mod_mbox/www-announce/201707.mbox/%3CCA%2BLM4Mt5Nk_qJom7YhKaZrLQArqWWY%3DofUDK7h%2BMM7mmOVhp_w%40mail.gmail.com%3E
https://mail-archives.apache.org/mod_mbox/www-announce/201707.mbox/%3CCA%2BLM4Mur2%2Bmdv_YukGG0s5OWrj6xdZyLZbPRaB0g6Mm_WH%2BN9g%40mail.gmail.com%3E

The sessions for Flink Forward, which takes place in Berlin this September, have been posted.

https://berlin.flink-forward.org/sessions/

Software Engineering Daily has an interview with Confluent co-founder and CTO Neha Narkhede. If you aren't interested in the podcast, there's also a PDF transcript.

https://softwareengineeringdaily.com/2017/07/10/kafka-in-the-cloud-with-neha-narkhede/

Releases

Winton Kafka Streams is a Python implementation of the Apache Kafka Streams API. It's a new project that doesn't yet have a release, and it requires Python 3.6. There are a few examples in the repo that help demonstrate the API.

https://github.com/wintoncode/winton-kafka-streams

Apache Spark 2.2.0 was released. Among the highlights, PySpark can now be installed in a single command from PyPI. The release notes have many more details, including a new cost-based optimizer and the GA of Structured Streaming.

https://spark.apache.org/releases/spark-release-2-2-0.html

Cloudera Enterprise 5.12 has been released with updates to the core platform, the data science workbench, Kudu, and more.

http://blog.cloudera.com/blog/2017/07/cloudera-enterprise-5-12-is-now-available/

IBM has announced Big SQL 5.0, which is the first version to be tightly integrated with Hortonworks HDP.

https://developer.ibm.com/hadoop/2017/07/13/announcing-bigsql-5-0/

Hue 4.0 has been released with a new interface that revamps the editors and system browsers.

http://gethue.com/hue-4-and-its-new-interface-is-out/

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Self-Service Data Integration, Kodiak Data, and Kubernetes! (Palo Alto) - Wednesday, July 19
https://www.meetup.com/BigDataApps/events/239890782/

Minnesota

Sensors, Spark, and Kafka: Applied Machine Learning (Minnetonka) - Wednesday, July 19
https://www.meetup.com/TwinCities-Bigdata-Analytics/events/241333229/

Illinois

Data Streaming Panel Event (Chicago) - Thursday, July 20
https://www.meetup.com/TechinMotionChicago/events/240293492/

North Carolina

SQL on Hadoop and Modern Analytic Databases with Ian Cook (Chapel Hill) - Tuesday, July 18
https://www.meetup.com/Research-Triangle-Analysts/events/237582880/

BRAZIL

DATA Rave Party: Spark Edition (Sao Paulo) - Thursday, July 20
https://www.meetup.com/Cloudera-Brasil-Meetup/events/241374445/

UNITED KINGDOM

Applied Data Engineering #1 (London) - Wednesday, July 19
https://www.meetup.com/Applied-Data-Engineering-London/events/240932091/

SPAIN

The State of Spark and Hive in the Cloud, by Nico Poggi (Barcelona) - Thursday, July 20
https://www.meetup.com/Spark-Barcelona/events/241275453/

ROMANIA

July Meetup: Scala, Spark, Docker, Elasticsearch (Bucharest) - Thursday, July 20
https://www.meetup.com/Bucharest-Big-Data-Meetup/events/240990369/

ISRAEL

Big Data Processing on AWS (Tel Aviv-Yafo) - Tuesday, July 18
https://www.meetup.com/israelclouds/events/241304553/

Kafka Operations (Herzeliyya) - Wednesday, July 19
https://www.meetup.com/ops-for-developers/events/241304477/

If you didn't receive this email directly and you'd like to subscribe to weekly emails, please visit https://hadoopweekly.com

Data Eng Weekly