25 March 2018
This week's issue is quite large, with great technical content covering BigGraphite (a Cassandra backend for Graphite), Apache Spark Streaming, Postgres on Kubernetes, KSQL, and a DynamoDB/API Gateway solution for serving ML data. There are a bunch of releases, including Apache Trafodion, Apache Kudu, Apache Drill, and Apache HAWQ. There is hopefully something for everyone, but if you're looking for even more, then check out @DataEngWeekly on Twitter throughout the week.
Cut HBase storage needs by 50%; elevate job throughput by 260%. ScaleFlux Computational Storage cohesively optimizes your Hadoop HBase infrastructure. Significant performance leaps on Spark, RocksDB, and Aerospike as well. Now shipping - see more:
http://bit.ly/scaleflux-computational-storage-hbase-260percent-faster
Last week, I included the wrong link to the GO-JEK article—here's the right one. In addition to that post on their data infrastructure (built on Apache Kafka, Apache Flink, Kubernetes, Chef, and InfluxDB), there's another post from their blog. This second article has a great overview of some resilience strategies—timeouts, idempotency, and the circuit breaker pattern—in distributed systems.
https://blog.gojekengineering.com/data-infrastructure-at-go-jek-cd4dc8cbd929
https://blog.gojekengineering.com/resiliency-in-distributed-systems-efd30f74baf4
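For readers who haven't met the circuit breaker pattern before, here is a minimal Python sketch of the idea. It is not taken from the GO-JEK article: the class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all assumptions made for illustration. After a run of consecutive failures the breaker "opens" and rejects calls immediately, then lets a single trial call through once a cool-down has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    # Hypothetical helper for illustration; names and defaults are assumptions.
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast without touching the dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In practice the wrapped function would also carry its own timeout, so a hung dependency counts as a failure rather than blocking the caller.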
The OpenMessaging Project is a new initiative to 1) provide unifying APIs across messaging systems and 2) provide standard tools for benchmarking those systems. It is able to generate synthetic workloads that are tunable on several axes and comes with tools to provision Apache Kafka and Apache Pulsar on AWS.
https://streaml.io/blog/openmessaging-benchmarks/
Criteo has open sourced a new Apache Cassandra backend for Graphite called BigGraphite. With it, they store over 1 million metrics per second, and they take advantage of Cassandra's secondary indexes to support the different types of Graphite queries. This presentation goes into further detail on the new backend and also links out to the design document and code on GitHub.
https://docs.google.com/presentation/d/17opE2U2aIe1TFYJgr0Z8Dd2fJhy2x0YbReCVeCz6uhA/edit
Apache Spark 2.3.0 adds experimental support for continuous mode in stream processing. Compared to microbatch latency of 100s of milliseconds, continuous mode uses long-running processes that can take a record end to end in 10s of milliseconds.
The Pivotal blog has a post from Crunchy Data, makers of an open source tool for running Postgres on Kubernetes. The article provides a brief introduction to the tool, a tutorial for getting a DB up and running, and a demo of scaling up the number of read replicas.
https://content.pivotal.io/blog/postgresql-for-kubernetes-quickstart-for-pks
A tutorial on the Confluent Blog covers loading data into Kafka using Kafka Connect for MySQL (via Debezium for change data capture) and from CSV files, querying and joining those two streams, and writing the results out to Amazon S3 using the Kafka Connect S3 sink.
https://www.confluent.io/blog/ksql-in-action-enriching-csv-events-with-data-from-rdbms-into-AWS/
This article enumerates several of the mechanisms that can be used to access data in Apache Kudu (such as Impala and Spark), and it provides some performance comparisons to HDFS using the TPC-H synthetic dataset. While it's always important to test with your own dataset, the experiment shows that Impala performs about the same when querying Parquet data in HDFS and data in Kudu (with some exceptions). Kudu, additionally, has the benefit of efficiently supporting CRUD and random access via a primary key.
Great overview of Impala best practices—such as using Parquet, optimizing file size (in this case, they saw a 2,000x speedup by combining small files), configuring separate coordinators and executors (a new feature as of Impala 2.9), and computing statistics. The post also describes a strategy of automatically detecting and canceling long running queries.
https://medium.com/@adirmashiach/apache-impala-my-insights-and-best-practices-8e0507089935
Azure Event Hubs is a managed service for storing data for streaming applications (with similarities to Apache Kafka or Google Cloud Pub/Sub). This post describes using Event Hubs with Apache Spark (via Azure Databricks) to stream data from the Twitter API into Event Hubs and then to consume the tweet data from Event Hubs.
https://lenadroid.github.io/posts/connecting-spark-and-eventhubs.html
This post has five tips for working with data. If you haven't yet learned these the hard way, then they're good advice.
https://medium.com/@EasySize/5-crucial-tips-we-learned-watching-over-our-clients-data-226bf795fbdf
Rue La La is using AWS DynamoDB and API Gateway for serving data generated by their data science team via Spark jobs. This post provides an overview and links out to two other posts: the first on how to store data in DynamoDB to efficiently answer API queries (including over a 20B element sparse matrix), and the second on using AWS Batch and Lambda to load data from S3 to DynamoDB (there are some interesting details on long-running S3 reads and efficiently loading data).
https://medium.com/rue-la-la-tech/rich-data-science-apis-at-rue-la-la-3d44bd23b529
This article describes how one org does non-destructive schema changes using Rails' ActiveRecord migration utilities. They've codified a number of these best practices, to the point where exceptions are raised if you try to drop a table or column.
https://samsaffron.com/archive/2018/03/22/managing-db-schema-changes-without-downtime
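To make the "raise an exception on a destructive change" idea concrete, here is a small Python sketch of such a guard. It is an illustration only, not the article's Rails/ActiveRecord implementation: the function name, the exception class, and the `allow_destructive` flag are hypothetical, and the check is a plain regular expression over SQL text rather than a hook into a migration framework.

```python
import re


class DestructiveMigrationError(Exception):
    """Raised when a migration statement would drop a table or column."""


# Only the explicit DROP TABLE and ALTER TABLE ... DROP COLUMN forms are
# caught in this sketch; other destructive statements pass through.
_DESTRUCTIVE = re.compile(
    r"^\s*(DROP\s+TABLE\b|ALTER\s+TABLE\s+\S+\s+DROP\s+COLUMN\b)",
    re.IGNORECASE,
)


def check_migration(statements, allow_destructive=False):
    """Return the statements as a list, or raise on a table/column drop."""
    for sql in statements:
        if not allow_destructive and _DESTRUCTIVE.match(sql):
            raise DestructiveMigrationError(f"refusing to run: {sql.strip()}")
    return list(statements)
```

An explicit opt-out such as `allow_destructive=True` keeps deliberate drops possible while making them a conscious, reviewable step.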
The Data Engineering Podcast had its 23rd episode this week—it covers Elasticsearch and the Elastic stack.
https://www.dataengineeringpodcast.com/elastic-stack-with-philipp-krenn-episode-23/
Apache Airflow has gained so much popularity that there's now a podcast from the folks at Astronomer. This episode describes Airflow best practices.
https://soundcloud.com/the-airflow-podcast/best-practices
This post defines several job areas related to data: Data Analytics, Data Science, Machine Learning Engineering, Data Engineering, Backend Engineering, and DevOps. It also describes how they overlap with a 6-circle Venn diagram. Whether or not these are the exact definitions you use at your company, it's useful to have these clear definitions to help identify gaps or blind spots.
https://mindfulmachines.io/blog/2018/3/17/the-flavors-of-data-science-and-data-engineering
PhoenixCon, the conference on Apache Phoenix, has announced that they're holding the next instance in San Jose on June 18th. The CFP is open through April 16th.
https://phoenix.apache.org/phoenixcon-2018/
Astronomer recently announced Astronomer Cloud Edition—a managed Apache Airflow service.
https://www.astronomer.io/blog/announcing-apache-airflow-on-astronomer/
Spring Cloud Data Flow 1.4 is out. The new version is mostly a UI/UX refresh, which includes a new builder for configuring streaming applications. There are also some security feature enhancements in the release.
Kafka Streams Viz is a new tool for visualizing streaming topologies. It's a single page webapp that takes a plain-text definition of a streaming topology and outputs a user-friendly diagram.
https://zz85.github.io/kafka-streams-viz/
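As a rough illustration of what turning a plain-text topology into a diagram involves, here is a Python sketch that extracts edges from text shaped like the output of Kafka Streams' `Topology#describe()` and emits Graphviz DOT. This is not how Kafka Streams Viz is implemented, and the assumption that the input looks like `describe()` output is mine; the function names are hypothetical.

```python
import re

# Lines such as "Source: name (topics: [input])" start a new node.
_NODE = re.compile(r"^\s*(Source|Processor|Sink):\s+(\S+)")
# Lines such as "--> a, b" list the current node's downstream nodes.
_DOWNSTREAM = re.compile(r"^\s*-->\s+(.+)$")


def topology_edges(description):
    """Extract (from_node, to_node) pairs from the '-->' lines."""
    edges = []
    current = None
    for line in description.splitlines():
        node = _NODE.match(line)
        if node:
            current = node.group(2)
            continue
        arrow = _DOWNSTREAM.match(line)
        if arrow and current:
            for target in arrow.group(1).split(","):
                target = target.strip()
                # "none" marks a node with no successors.
                if target and target != "none":
                    edges.append((current, target))
    return edges


def to_dot(edges):
    """Render the edges as a Graphviz DOT digraph."""
    lines = ["digraph topology {"]
    lines += [f'  "{a}" -> "{b}";' for a, b in edges]
    lines.append("}")
    return "\n".join(lines)
```

The upstream `<--` lines are ignored here because each edge already appears once as a `-->` line on its parent node.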
The first of four SQL/data query engine Apache releases this week, Apache Trafodion 2.2.0 is also the inaugural release since the project graduated from the Apache Incubator. This is a bug fix release with a few enhancements.
https://trafodion.apache.org/release-notes-2-2-0.html
Apache Drill 1.13.0 is out with YARN support for Drill, support for JDK 8, improvements to security and memory usage, and more.
https://medium.com/@ApacheDrill/apache-drill-1-13-released-35a1663249ee
Apache HAWQ 2.3.0.0-incubating was announced. This is a bug fix and feature enhancement release that adds support for HDFS transparent data encryption and improved Ranger support (Kerberos authentication and RPS HA).
https://cwiki.apache.org/confluence/display/HAWQ/Apache+HAWQ+2.3.0.0-incubating+Release
Apache Kudu 1.7.0 was released. New features include support for the decimal data type, improvements to self-healing through re-replication, and a new READ_YOUR_WRITES scan mode. There are also a bunch of improvements to the Java client, to performance during inserts and updates, and more. See the release notes for full details.
https://kudu.apache.org/releases/1.7.0/docs/release_notes.html
The 0.3.0 release of Banzai Pipeline, the PaaS system for deploying Spark, Kafka, and others on Kubernetes, is out. This version adds support for Google Kubernetes Engine, strengthens security, and includes several bug fixes and improvements.
https://banzaicloud.com/blog/pipeline-third-release/
Databricks and Microsoft announced that Azure Databricks is now generally available.
Interested in sponsoring? See: https://www.dataengweekly.com/advertising.html for details and email info@dataengweekly.com
Curated by Datadog ( http://www.datadog.com )
WePay Engineering Meetup: Managing Data at Scale (Redwood City) - Wednesday, March 28
https://www.meetup.com/WePay-Engineering-Meetup/events/248205670/
Bay Area Apache Spark & Women in Big Data @ Databricks HQ (San Francisco) - Thursday, March 29
https://www.meetup.com/spark-users/events/248239522/
An Introduction & Dive into Apache Kafka and the Power of Messaging Systems (Phoenix) - Thursday, March 29
https://www.meetup.com/Phoenix-Java/events/248305322/
Stealing the Jewels from Python: Putting PySpark Code into Production (Chicago) - Monday, March 26
https://www.meetup.com/Chicago-Spark-Users/events/248836419/
Event Sourcing from Scratch with Apache Kafka and Spring (Pittsburgh) - Tuesday, March 27
https://www.meetup.com/The-Pittsburgh-Java-Meetup-Group/events/246494113/
March Presentation Night: Spark 2.3, Cloud, and Use Cases (Cambridge) - Wednesday, March 28
https://www.meetup.com/Boston-Apache-Spark-User-Group/events/248548389/
Reactive Applications with the SMACK Stack, Led by Duncan DeVore! (Kanata) - Thursday, March 29
https://www.meetup.com/Big-Data-Developers-in-Ottawa/events/248471095/
Mesosphere and Confluent at Our First Kafka Meetup (Manchester) - Wednesday, March 28
https://www.meetup.com/Manchester-Kafka/events/248451554/
Apache Kafka March Meetup (Paris) - Monday, March 26
https://www.meetup.com/Paris-Apache-Kafka-Meetup/events/248693926/
IoT Stream Processing and Analytics in Real-time (Frankfurt) - Wednesday, March 28
https://www.meetup.com/Future-of-Data-Frankfurt-am-Main-Germany/events/247939726/
Let's Talk about KSQL for Kafka (Zurich) - Tuesday, March 27
https://www.meetup.com/Zurich-Apache-Kafka-Meetup-by-Confluent/events/246915204/
Get Started with Data Integration Using Apache NiFi (Vienna) - Tuesday, March 27
https://www.meetup.com/Hadoop-User-Group-Vienna/events/248266228/
Big Data: How to Deal with Sensorics Data Easily (Prague) - Tuesday, March 27
https://www.meetup.com/Prague-Data-Management-Meetup/events/248157876/
Big Data on Kubernetes (Budapest) - Tuesday, March 27
https://www.meetup.com/Big-Data-Meetup-Budapest/events/248053604/