Data Eng Weekly


Data Eng Weekly Issue #273

15 July 2018

Lots of coverage of tools this week: Scio, make at ProPublica, PayPal's NameNode Analytics, MySQL on Kubernetes, Kinesis+Lambda, and data replication at Hotels.com. There are also a couple of great posts on building and running distributed systems, the CAP theorem, and debugging. There are quite a few releases, too, including Hortonworks Data Platform, the dA Platform, Apache Phoenix, and Hadoopi (for running Hadoop on Raspberry Pis).

Sponsor

Dremio is an open source Data-as-a-Service platform, based on SQL and Apache Arrow. Accelerate your queries up to 1000x. Self-service experience for BI and data science users. Download at https://bit.ly/2t96fa7, or visit http://dremio.com to learn more.

Technical

With the CAP theorem thrown around all the time in distributed systems, it's useful to understand it at least at a high level. This is a quick, easy-to-understand visual guide to the key terms and a proof of the theorem.

https://mwhittaker.github.io/blog/an_illustrated_proof_of_the_cap_theorem/

This post covers a lot of ground about building a distributed system (in this case for storing time series data). There are lots of components involved, from sharding to replication to query API to testing (unit through integration). It covers a lot of details on great libraries, lessons learned, and more.

https://fosdem.org/2018/schedule/event/datastore/attachments/slides/2618/export/events/attachments/datastore/slides/2618/designing_distributed_datastore_in_go_timbala.pdf

In this story of debugging a performance issue, there's a really good look into Apache Cassandra internals—tombstones, garbage collection, and the read path.

http://thelastpickle.com/blog/2018/07/05/undetectable-tombstones-in-apache-cassandra.html

Here's a good intro to the Scio project, which provides a Scala API on top of Apache Beam.

https://medium.com/@andrewnguonly/scio-a-nice-alternative-to-apache-beams-java-sdk-63ed10da716e

DataArtisans has a post covering the work on Apache Flink in June. It includes hardening of the Flink CEP library, support for a blob store backend to the Flink distributed cache, performance improvements from upgrading to Netty 4.1, and support for batch queries via the SQL client.

https://data-artisans.com/blog/apache-flink-master-branch-monthly-new-in-flink-in-june-2018

The eBay tech blog has a two part post on how they use event sourcing and CQRS for their continuous delivery pipeline. The first part gives a good overview of event sourcing, and the second dives into the technical implementation details for their system, which is written in Scala and uses MongoDB for storage.

https://www.ebayinc.com/stories/blogs/tech/event-sourcing-connecting-the-dots-for-a-better-future/
https://www.ebayinc.com/stories/blogs/tech/event-sourcing-in-action-with-ebays-continuous-delivery-team/
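
For readers new to event sourcing, here is a minimal Python sketch of the core idea: state is never stored directly, it is rebuilt by replaying an append-only log of events. This is not eBay's implementation (theirs is in Scala with MongoDB); the event kinds, the `PipelineState` fields, and the `EventLog` class are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    kind: str     # hypothetical kinds: build_started, build_succeeded, build_failed, deployed
    version: str  # the software version the event refers to


@dataclass(frozen=True)
class PipelineState:
    status: str = "idle"
    deployed_version: str = ""


def apply_event(state: PipelineState, event: Event) -> PipelineState:
    """Pure function: given the current state and one event, return the next state."""
    if event.kind == "build_started":
        return replace(state, status="building")
    if event.kind == "build_succeeded":
        return replace(state, status="ready")
    if event.kind == "build_failed":
        return replace(state, status="failed")
    if event.kind == "deployed":
        return PipelineState(status="idle", deployed_version=event.version)
    return state  # unknown event kinds are ignored


def replay(events: Iterable[Event]) -> PipelineState:
    """Read side: derive the current state by folding over the events in order."""
    state = PipelineState()
    for event in events:
        state = apply_event(state, event)
    return state


class EventLog:
    """Write side: an in-memory, append-only log standing in for a real event store."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def events(self) -> List[Event]:
        return list(self._events)  # return a copy so callers cannot rewrite history
```

Because `replay` accepts any prefix of the log, the same function also answers "what did the pipeline look like after the third event?", which is one of the main attractions of the pattern.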

Make can be used to model dependencies and execute a data flow graph. Steps are generally idempotent and retryable, so it's a good, lightweight alternative to a larger workflow engine. ProPublica writes about using make to download and ETL campaign finance data.

https://www.propublica.org/nerds/gnu-make-illinois-campaign-finance-data-david-eads-propublica-illinois
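
To make the "idempotent and retryable" point concrete, here is a rough Python sketch of the rule make applies: build a target's dependencies first, then run its recipe only if the target is missing or older than a dependency. It is not taken from the ProPublica post, and the file names and recipes (`uppercase`, `count_rows`) are hypothetical stand-ins for real download and ETL steps.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple

# A rule maps a target file to (dependencies, recipe), like a Makefile rule.
Recipe = Callable[[Path, List[Path]], None]
Rule = Tuple[List[Path], Recipe]


def is_stale(target: Path, deps: List[Path]) -> bool:
    """A target is stale if it is missing or older than any of its dependencies."""
    if not target.exists():
        return True
    built_at = target.stat().st_mtime_ns
    return any(dep.stat().st_mtime_ns > built_at for dep in deps)


def build(target: Path, rules: Dict[Path, Rule], ran: List[str]) -> None:
    """Build dependencies first, then run the recipe only when the target is stale."""
    if target not in rules:
        if not target.exists():
            raise FileNotFoundError(f"no rule to make {target}")
        return  # a plain source file: nothing to do
    deps, recipe = rules[target]
    for dep in deps:
        build(dep, rules, ran)
    if is_stale(target, deps):
        recipe(target, deps)
        ran.append(target.name)  # record which steps actually executed


def uppercase(target: Path, deps: List[Path]) -> None:
    """Hypothetical 'clean' step: normalize the raw file to upper case."""
    target.write_text(deps[0].read_text().upper())


def count_rows(target: Path, deps: List[Path]) -> None:
    """Hypothetical 'report' step: count the rows in the cleaned file."""
    rows = len(deps[0].read_text().splitlines())
    target.write_text(f"{rows} rows\n")
```

Calling `build` a second time with nothing changed runs no recipes, and deleting only the final output reruns only the last step. That is what makes a failed run cheap to retry.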

Segment has written a post with a lot of great reflections on microservices architecture, monorepos, and monoliths. There are a number of interesting lessons learned about the tradeoffs across each (e.g. fault tolerance, integration testing).

https://segment.com/blog/goodbye-microservices/

PayPal has been pushing the limits of the Hadoop NameNode, and one particular area of concern is their ability to analyze the FSImage to produce usage reports. They've implemented a new system, called NameNode Analytics, that works by tailing the EditLog. Latency goes down from hours to minutes, and they can produce much closer to real-time, more actionable analytics. NameNode Analytics is open source and available on GitHub.

https://www.paypal-engineering.com/2018/07/11/namenode-analytics/

The Hotels.com team has a great article about CircusTrain, their tool for replicating Hive data across on-prem and cloud storage. The post describes the architecture, the evolution of the service, and some of the optimizations they've added. Seems like a useful tool if you're shuffling bytes across data centers or cloud services.

https://medium.com/hotels-com-technology/replicating-big-datasets-in-the-cloud-c0db388f6ba2

This post has a good overview of using the Oracle MySQL operator to run a MySQL cluster on Kubernetes. It covers design goals and some example operations.

https://banzaicloud.com/blog/mysql-on-kubernetes/

Trivago is using Apache Kafka and Amazon Kinesis+Lambda to replicate data from an on-prem MySQL instance to AWS. They've written about their experience using the setup, which includes how to overcome some scaling challenges and implement error handling.

https://tech.trivago.com/2018/07/13/aws-kinesis-with-lambdas-lessons-learned/

Netflix has written about their experience using Memcached's new extstore to cache data on SSDs at significant cost savings.

https://medium.com/netflix-techblog/evolution-of-application-data-caching-from-ram-to-ssd-a33d6fa7a690

Sponsor

Unravel demoed a new, fully automated Spark optimization tool at Spark Summit in San Francisco. They showed how to speed up or improve reliability of any Spark application with a single click. See the demo video or download the slides here.

http://bit.ly/unravel-spark-optimization

News

The MemSQL blog has an interesting analysis of the evolution of distributed databases over the last 10+ years. It provides a good overview of how NoSQL databases were created and evolved, and how that need and the rethinking of assumptions also spurred innovation in the relational database ecosystem.

http://blog.memsql.com/nosql/

The Harvard Business Review has a good analysis of California's new data privacy law.

https://hbr.org/2018/07/what-you-need-to-know-about-californias-new-data-privacy-law

This analysis notes that Hadoop, especially the Hadoop Distributed File System, is losing some momentum. As many orgs move to the cloud, vendors are now supporting and emphasizing cloud blob storage.

https://siliconangle.com/blog/2018/07/09/hadoops-star-dims-era-cloud-object-data-storage-stream-computing/

Jobs

The Data Eng Weekly board has jobs from Etsy (Senior Data Engineer - New York), Netflix (Senior Data Engineer - Los Gatos, CA), Wooga (Data Engineer - Berlin), and Shopify (Software Eng Data Infrastructure - Ottawa/Waterloo/Montreal).

https://jobs.dataengweekly.com/

Releases

Hadoopi is a Hadoop distribution for the Raspberry Pi. It supports a number of components from the ecosystem (like HBase, Hive, and Spark), and a recent release added Prometheus and Grafana.

https://github.com/andyburgin/hadoopi/releases/tag/1.2

Apache Kylin 2.4.0 was released a couple of weeks back. The new version of the OLAP engine for big data adds several features, improvements, and bug fixes.

http://kylin.apache.org/docs/release_notes.html#v240---2018-06-23

Google Cloud Dataflow recently added stream processing support for Python, by way of the Apache Beam Python SDK. The announcement includes some example code, and the syntax is pretty concise.

https://cloud.google.com/blog/big-data/2018/06/dataflow-stream-procgessing-now-supports-python

DataArtisans has announced dA Platform 1.1, which includes Apache Flink 1.5.0, improved interoperability with Kubernetes, and more.

https://data-artisans.com/blog/announcing-da-platform-1-1-with-support-for-apache-flink-1-5-0-and-other-additions

You may remember the recent research paper in which the authors analyzed ORM usage to detect common API misuse. That team has now released an open source project that analyzes a Rails app and detects potential problems.

https://medium.com/@uwdb/introducing-powerstation-26c8b3a53191

Jib is a new tool from Google for building Docker images of Java applications directly from a Maven or Gradle build.

https://cloudplatform.googleblog.com/2018/07/introducing-jib-build-java-docker-images-better.html?m=1

Splice Machine is now available as a managed service on Microsoft Azure.

https://www.prnewswire.com/news-releases/splice-machines-data-platform-for-intelligent-applications-now-available-as-a-fully-managed-service-on-microsoft-azure-300678986.html

Apache Flink 1.5.1 is out - it's a maintenance release with over 60 bug fixes, improvements, and new features.

http://flink.apache.org/news/2018/07/12/release-1.5.1.html

Hortonworks Data Platform 3.0.0 is now generally available. It includes a number of new features, like Erasure Coding, NameNode Federation, support for Dockerized Spark jobs and Docker containers on YARN, and several improvements to Hive. The blog post covers the rest.

https://hortonworks.com/blog/announcing-general-availability-hortonworks-data-platform-3-0-0-ambari-2-7-0-smartsense-1-5-0/

Apache Phoenix 5.0.0 is out, with compatibility for Apache HBase 2.0 and upgraded Spark & Hive dependencies.

https://blogs.apache.org/phoenix/entry/apache-phoenix-releases-next-major

Events

Curated by Datadog ( http://www.datadog.com )

California

Uber Data Platform Night (San Francisco) - Tuesday, July 17
https://www.meetup.com/UberEvents/events/252239845/

Stream Processing with Apache Kafka & Apache Samza (Sunnyvale) - Thursday, July 19
https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/events/251481797/

Bay Area Apache Spark Meetup @ Databricks HQ (San Francisco) - Thursday, July 19
https://www.meetup.com/spark-users/events/251930598/

Texas

Building Data Pipelines with Hortonworks Data Flow/Apache NiFi (Addison) - Monday, July 16
https://www.meetup.com/DFW-Data-Science/events/252474656/

UNITED KINGDOM

Open Source & Large Scale Data Science (Bristol) - Monday, July 16
https://www.meetup.com/south-west-data/events/252224545/

SPAIN

All about Kafka: Origins, Ecosystem and Future Directions (Madrid) - Wednesday, July 18
https://www.meetup.com/apachekafkamadrid/events/251264232/

GERMANY

Data Science/Engineering @ Dalia (Berlin) - Monday, July 16
https://www.meetup.com/Data-Science-in-Action-Dalia-Research/events/252266415/

Microservices & Events: Architecture with Kafka & Atom (Oldenburg) - Wednesday, July 18
https://www.meetup.com/jugbremen/events/252240737/

POLAND

WHUG: Assisting Millions of Active Users in Real-Time with Apache Flink (Warszawa) - Wednesday, July 18
https://www.meetup.com/warsaw-hug/events/252459981/

HUNGARY

Intro to Apache Hive and HDFS Erasure Coding (Budapest) - Wednesday, July 18
https://www.meetup.com/Cloudera-Tech-Meetup/events/251989566/

ISRAEL

Big Data on AWS (Tel Aviv) - Monday, July 16
https://www.meetup.com/AWS-IL/events/251200777/

AUSTRALIA

Kafka Streams API and Kafka at Zendesk (Melbourne) - Monday, July 16
https://www.meetup.com/KafkaMelbourne/events/252500340/