Data Eng Weekly


Data Eng Weekly Issue #277

12 August 2018

Quite a bit of variety in this week's issue, including Kafka on Kubernetes, Docker on YARN, speeding up data parsing by filtering raw data, Hadoop at Microsoft, and the NSA's LemonGraph open source project. Also, a couple of new books to check out and releases of Flink, Landoop Lenses, and a KSQL plugin for VSCode.

Sponsor

From the creators of Apache Arrow, Dremio is an open source Data-as-a-Service platform. Accelerate your queries (up to 1,000x!) and make data truly self-service for your BI and data science users.

Visit https://bit.ly/about-dremio to learn more, or download for free.

Technical

Hortonworks published a four-part series this week on Docker on YARN. It describes a financial derivatives pricing application built on a C++ library whose dependencies are easily fulfilled via a Docker image. The series walks through the architecture, deployment, and more.

https://hortonworks.com/blog/distributed-pricing-engine-using-dockerized-spark-yarn-w-hdp-3-0-part-1-4/

This post has a good collection of performance tips for working with Apache Spark, including some specialized advice about storing data in Amazon S3.

https://medium.com/@brajendragouda/5-key-factors-to-keep-in-mind-while-optimizing-apache-spark-in-aws-part-1-4b113425bdcf

A good tutorial for Apache NiFi, this post describes how to load data from Twitter, run it through a sentiment analyzer running as an HTTP server, and store the results in Apache Solr.

https://community.hortonworks.com/articles/208667/hdphdp-twitter-nlp-processing-framework.html

WalmartLabs writes about their BI pipeline that's built with Apache Kafka, Apache Cassandra, Apache Spark, Apache Parquet, Swift, and Tableau.

https://medium.com/walmartlabs/how-we-build-a-robust-analytics-platform-using-spark-kafka-and-cassandra-lambda-architecture-70c2d1bc8981

Cloudera's solution for Kafka schema management doesn't require a separate service, which makes its usage and configuration a bit different (and somewhat simpler) than other solutions. This post walks through how to set up a consumer.

https://blog.cloudera.com/blog/2018/08/robust-message-serialization-in-apache-kafka-using-apache-avro-part-3/

The Hadoop YARN capacity scheduler now features application lifetime SLAs. This post shows how to configure the settings, both as an admin and at job submission time, if you want an application or queue to have a maximum lifetime.

https://hortonworks.com/blog/enforcing-application-lifetime-slas-yarn/
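The queue-level settings live in capacity-scheduler.xml. A rough sketch of what they look like is below; the property names follow the Hadoop Capacity Scheduler documentation, but the queue name "dev" and the values are made-up examples, so check the post and the docs for your version before relying on them.

```xml
<!-- capacity-scheduler.xml (illustrative fragment; queue "dev" and the
     values are assumptions). Lifetimes are in seconds; -1 disables. -->
<property>
  <name>yarn.scheduler.capacity.root.dev.maximum-application-lifetime</name>
  <value>3600</value> <!-- hard cap: apps in this queue are killed after 1h -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.default-application-lifetime</name>
  <value>1800</value> <!-- used when the app doesn't set its own timeout -->
</property>
```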

BigQuery has a new clustering feature that is similar to partitioning in many data systems. With clustering enabled, queries that filter on the clustered columns can see significant speedups by scanning much less data. This post shows several examples of the speedups and cost savings from enabling clustering.

https://medium.com/@hoffa/bigquery-optimized-cluster-your-tables-65e2f684594b
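For a sense of what this looks like in practice, here's a rough sketch in BigQuery standard SQL. The dataset, table, and column names are invented for illustration; note that at the time of writing, clustering requires the table to also be partitioned.

```sql
-- Illustrative only; table and column names are made up.
CREATE TABLE mydataset.events_clustered
PARTITION BY DATE(event_ts)
CLUSTER BY user_id, event_type AS
SELECT * FROM mydataset.events;

-- A filter on the clustering column lets BigQuery skip unrelated
-- blocks instead of scanning the whole partition:
SELECT COUNT(*)
FROM mydataset.events_clustered
WHERE user_id = 'abc123';
```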

Confluent has published Helm Charts and a white paper for running the Confluent Platform on Kubernetes. The deployment makes use of StatefulSets and Persistent Volumes for Kafka and ZooKeeper, while using Deployments for stateless services. The Helm Charts are on GitHub, and the white paper is behind an email wall.

https://www.confluent.io/blog/getting-started-apache-kafka-kubernetes/
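To illustrate the pattern (this is a minimal sketch, not Confluent's actual chart; the image tag, storage size, and names are assumptions): a StatefulSet gives each broker a stable identity (kafka-0, kafka-1, ...) and a volumeClaimTemplate gives each one its own persistent volume, which is what stateful services like Kafka and ZooKeeper need.

```yaml
# Minimal sketch of the StatefulSet-plus-PersistentVolume pattern.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka-headless   # headless Service gives each pod a stable DNS name
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
      - name: kafka
        image: confluentinc/cp-kafka:5.0.0   # image/tag are an assumption
        ports:
        - containerPort: 9092
        volumeMounts:
        - name: data
          mountPath: /var/lib/kafka
  volumeClaimTemplates:          # one persistent volume per broker
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
```

Stateless pieces (REST Proxy, KSQL, Schema Registry) don't need stable storage or identity, which is why they can run as ordinary Deployments instead.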

Sparser is a new research project to speed up queries over large data sets by filtering raw bytes before parsing. As described in this post, it uses SIMD-based filters and an optimizer that selects and orders candidate filters to build an efficient pre-parsing filter. It works on both JSON and binary formats (like Parquet) to speed up end-to-end processing time in Spark by up to 9x. The code is on GitHub, and hopefully we'll see these techniques integrated into some common projects soon.

https://dawn.cs.stanford.edu/2018/08/07/sparser/
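The core idea is easy to demonstrate without any of Sparser's SIMD machinery. This toy Python sketch (not Sparser's code) runs a cheap raw-byte substring check before the expensive JSON parse; because the prefilter can produce false positives (the needle might appear in an unrelated field), records that pass it are still parsed and checked exactly.

```python
import json

def filter_then_parse(raw_lines, needle):
    """Toy illustration of raw filtering before parsing: reject most
    records with a cheap bytes-level check, and only pay the cost of
    json.loads for lines that might match."""
    for line in raw_lines:
        # Cheap prefilter: if the raw bytes can't contain a match,
        # skip the parser entirely.
        if needle not in line:
            continue
        # Exact check after parsing, since the substring test can
        # have false positives.
        record = json.loads(line)
        if record.get("user") == needle.decode():
            yield record

logs = [
    b'{"user": "alice", "action": "login"}',
    b'{"user": "bob", "action": "logout"}',
    b'{"user": "alice", "action": "query"}',
]
matches = list(filter_then_parse(logs, b"alice"))
```

Sparser's contribution is doing this kind of filtering with vectorized byte comparisons and choosing filter combinations with an optimizer, but the save-the-parser-for-last structure is the same.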

This interview with Honeycomb co-founder and CEO Charity Majors covers a lot of ground in observability, which is often considered the successor to monitoring and alerting. She talks about Facebook's internal debugging tool Scuba and the systems they've built at Honeycomb, and she offers a bunch of great advice on the state of the art and best practices in debuggability and observability.

https://read.acloud.guru/why-you-cant-effectively-debug-your-modern-systems-with-dashboards-57fe3ecd26bf

If you have a Spark cluster that's publicly accessible on the internet, there's a remote code execution exploit you should be aware of. It uses the Spark REST API to download and execute a rogue program, and Alibaba has detected it in the wild.

https://medium.com/@Alibaba_Cloud/alibaba-cloud-security-team-discovers-apache-spark-rest-api-remote-code-execution-rce-exploit-a5fdb8fbd173

Microsoft writes about their 50,000+ node Apache Hadoop YARN cluster, which is used by over 15,000 developers for applied research and data science tasks. At exabyte scale, it's the largest known YARN cluster in the world. The article describes several of the performance issues and optimizations that the team has implemented.

https://azure.microsoft.com/en-us/blog/how-microsoft-drives-exabyte-analytics-on-the-world-s-largest-yarn-cluster/

GitHub has open sourced the GitHub Load Balancer (GLB) Director, a key part of the GLB architecture. This post describes the technical details of how GLB operates. While it's not a typical data system, the post discusses a number of interesting distributed systems challenges and solutions: consistent hashing, routing, testing, failover, and more.

https://githubengineering.com/glb-director-open-source-load-balancer/
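On the hashing front, one scheme worth knowing in this space is rendezvous (highest-random-weight) hashing; GLB's actual implementation differs in detail, so treat this short Python sketch purely as an illustration of the idea. Every node scores the key and the highest score wins, which means removing one node only remaps the keys that node owned.

```python
import hashlib

def rendezvous_pick(key, nodes):
    """Pick the owning node for a key via rendezvous hashing:
    each node computes a score for the key, highest score wins."""
    def score(node):
        digest = hashlib.sha256(f"{node}:{key}".encode()).hexdigest()
        return int(digest, 16)
    return max(nodes, key=score)

nodes = ["proxy-a", "proxy-b", "proxy-c"]
owner = rendezvous_pick("203.0.113.7:443", nodes)

# The useful property for failover: dropping one node from the list
# leaves every key owned by a surviving node exactly where it was.
```

This is attractive for load balancers because a server failure disturbs only that server's connections instead of reshuffling the whole table.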

The AWS blog has an overview of using Apache Sqoop for a couple of use cases: loading data to Hive (via HDFS and S3) and to Redshift (via S3).

https://aws.amazon.com/blogs/big-data/migrate-rdbms-or-on-premise-data-to-emr-hive-s3-and-amazon-redshift-using-emr-sqoop/

LemonGraph is a recently open sourced graph (nodes/edges) database from the National Security Agency. This blog series tours the codebase, which is Python wrapping a good amount of native C code, looking at the storage component, query parsing, the execution engine, and more.

https://ayende.com/blog/posts/series/184066-C/reading-the-nsas-codebase

Timescale writes about their Time Series Benchmark Suite, which is a suite of Go programs to benchmark data loading and query execution for time series databases.

https://blog.timescale.com/time-series-database-benchmarks-timescaledb-influxdb-cassandra-mongodb-bc702b72927e

Sponsor

Unravel covers the basic concepts for caching data in Spark apps to speed up performance in this blog. Read the blog and request a trial to see how you can speed up your Spark apps.

http://bit.ly/cache-spark

News

Hortonworks and Google Cloud have announced a partnership to collaborate on HDP and Google Cloud integration, including the Google Cloud Storage connector for HDP and Hortonworks DataFlow.

https://hortonworks.com/blog/hortonworks-google-cloud-collaborate-expand-data-analytics-offerings/

Flink Forward Berlin is a few weeks away. data Artisans has a post highlighting some of the sessions.

https://data-artisans.com/blog/flink-forward-berlin-2018-preview

"Getting Started with Kudu" is a new book that covers the architecture, use cases, and design patterns for Apache Kudu. One of the co-authors provides a bit more detail about the book and has some pointers for getting in touch with the authors via the Kudu Slack channel.

https://www.phdata.io/getting-started-with-kudu/

With new elastic data warehouse technologies like Redshift and BigQuery, it's become pretty common to load raw data before doing any transformations. This post describes several reasons why this technique—ELT—has been replacing ETL.

https://medium.com/@mohammedhassan3390515/etl-vs-elt-d76ae24f6a66

"Kafka Streams in Action" is an upcoming book on building applications with the Kafka Streams API. The Confluent blog has the foreword, which is written by Confluent co-founder and CTO Neha Narkhede. The first chapter of the book is also available for free on the Manning website.

https://www.confluent.io/blog/kafka-streams-action

Jobs

Check out the jobs from the Data Eng Weekly board:

Submit a job at https://jobs.dataengweekly.com/ !

Releases

Apache Flink 1.6.0 was released. It has a number of features that improve stateful stream processing, such as support for state TTL, job submission over HTTP/REST, Apache Avro support for streaming SQL, UDF and batch query support for the SQL client, and a Kafka table sink. Flink also now has Jepsen testing for fault tolerance.

https://flink.apache.org/news/2018/08/09/release-1.6.0.html

Landoop has released Lenses 2.1, which includes a number of updates to the Lenses SQL streaming engine. The new version adds support for custom serialization formats, bundles support for XML and Protobuf, adds support for arrays, and includes several UI features to visualize streaming programs.

https://www.landoop.com/blog/2018/08/lenses-21-release

There's a new VSCode plugin for KSQL that provides syntax highlighting.

https://marketplace.visualstudio.com/items?itemName=rmoff.ksql

Sponsors

From the creators of Apache Arrow, Dremio is an open source Data-as-a-Service platform. Accelerate your queries (up to 1,000x!) and make data truly self-service for your BI and data science users.

Visit https://bit.ly/about-dremio to learn more, or download for free.

Unravel covers the basic concepts for caching data in Spark apps to speed up performance in this blog. Read the blog and request a trial to see how you can speed up your Spark apps.

http://bit.ly/cache-spark

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Kafka at PagerDuty + Schema Registry by Confluent (San Francisco) - Wednesday, August 15
https://www.meetup.com/KafkaBayArea/events/253463221/

Washington

Snowflake: Data Warehouse Built for the Cloud (Bellevue) - Wednesday, August 15
https://www.meetup.com/Big-Data-Bellevue-BDB/events/252367649/

Colorado

KSQL: Stream Processing Was Never This Easy (Boulder) - Thursday, August 16
https://www.meetup.com/Boulder-Denver-Big-Data/events/253048261/

New York

Guaranteed Fresh: Defining a Data Team’s Relationship with the Business (New York) - Tuesday, August 14
https://www.meetup.com/Analytics-Data-Science-by-Dataiku-NY/events/253062539/

Log-Based Architecture for Distributed Systems + Running Postgres at Scale (New York) - Wednesday, August 15
https://www.meetup.com/NYC-Data-Engineering/events/252743598/

CANADA

Holden Karau: ML Pipelines with Apache Spark & Apache Beam (Kanata) - Wednesday, August 15
https://www.meetup.com/Big-Data-Developers-in-Ottawa/events/253349273/

AUSTRALIA

Azure Databricks: An Introduction (Barton) - Wednesday, August 15
https://www.meetup.com/Canberra-Microsoft-Analytics-PowerBI-Group/events/253511900/