Data Eng Weekly Issue #271

01 July 2018

Tons of great content this week, from the A/B testing platform at Walmart to Hulu's HDFS data center migration to choosing a SQL engine for data in S3. There's also a good post on the business value of SQL, a great collection of Airflow resources, and a bunch more.

Sponsor

SimpleDataLabs builds Prophecy - a Predictive Analytics Designer for Business Analysts, powered by our DeepWisdom engine. It'll put Predictive Analytics in every Business. We're looking for two Founding Engineers - System Architect to drive SaaS Application React/Scala/Spark/K8s/Cloud and ML Architect who can build MetaLearning in Tensorflow.

Contact Raj on LinkedIn http://bit.ly/raj-bains-linkedin or see http://bit.ly/simpledatalabs

Technical

In this second part of a series on implementing a stream processing system for a telco, the authors describe the design of the system (based on Apache Kafka and Apache Flink), how they test it with synthetic data, and how they monitor it using the ELK stack. To avoid frequent deploys, they implement a control stream in Kafka for adding new business rules to the system.

http://getindata.com/stream-analytics-platform-telco-part-2/
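
As a rough illustration of the control-stream idea in the post above, the sketch below keeps business rules in a table that is updated by control messages rather than by a redeploy. It is not the authors' implementation: a plain Python list stands in for the Kafka topics, and the rule format (`rule_id`, `field`, `threshold`) and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    # Hypothetical rule shape: alert when event[field] exceeds threshold.
    rule_id: str
    field: str
    threshold: float


def apply_control(rules, message):
    """Update the in-memory rule table from one control-stream message."""
    if message["op"] == "upsert":
        rules[message["rule_id"]] = Rule(
            message["rule_id"], message["field"], message["threshold"]
        )
    elif message["op"] == "delete":
        rules.pop(message["rule_id"], None)


def evaluate(rules, event):
    """Return the sorted ids of all rules whose threshold the event exceeds."""
    return sorted(
        r.rule_id for r in rules.values()
        if event.get(r.field, 0) > r.threshold
    )


def run(stream):
    """Process interleaved ("control", msg) and ("data", event) items in order."""
    rules, alerts = {}, []
    for kind, payload in stream:
        if kind == "control":
            apply_control(rules, payload)
        else:
            alerts.append(evaluate(rules, payload))
    return alerts


stream = [
    ("data", {"dropped_calls": 12}),
    ("control", {"op": "upsert", "rule_id": "r1",
                 "field": "dropped_calls", "threshold": 10}),
    ("data", {"dropped_calls": 12}),
    ("control", {"op": "delete", "rule_id": "r1"}),
    ("data", {"dropped_calls": 12}),
]
alerts = run(stream)
```

The same data event produces an alert only while rule `r1` is active, which is the point of the pattern: behavior changes with the control messages, not with the deployed code.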

Amazon CTO Werner Vogels has a compelling argument that there has been a proliferation of databases to handle all the different types of data and use cases. The post also describes the types of databases and which services at AWS support each type.

https://www.allthingsdistributed.com/2018/06/purpose-built-databases-in-aws.html

This post walks through building a streaming application with KSQL on syslog data in Apache Kafka joined with data from MongoDB. From there, Slack is used for alerts, and Elasticsearch is employed for visualization.

https://www.confluent.io/blog/real-time-syslog-processing-apache-kafka-ksql-enriching-events-with-external-data/

After analyzing several Rails programs to detect inefficiencies in using the ORM, the authors of this paper built a static code analyzer to detect common problems. Simple changes can make big differences—lots of interesting takeaways if you're using an ORM.

https://blog.acolyer.org/2018/06/28/how-_not_-to-structure-your-database-backed-web-applications-a-study-of-performance-bugs-in-the-wild/

Starburst, who offer a custom build of Presto and enterprise services, have benchmarked their offering on AWS. The results show big performance bumps, which suggests that if you're using Presto on EMR, then you might want to try it out with your own data.

https://www.starburstdata.com/technical-blog/starburst-presto-on-aws-18x-faster-than-emr/

Schibsted has written about their experience in choosing a SQL engine for their data in S3. They evaluated Amazon Athena, Amazon Redshift Spectrum, Presto, Hive, and more. Ultimately, they landed on Athena (despite some drawbacks mentioned in the post).

http://bytes.schibsted.com/bigdata-sql-query-engine-benchmark/

A look at the differences in Apache Spark APIs across Scala, Java, Python, R, and SQL. There's a comparison of syntax and performance when using a couple of different user-defined functions. As you might expect, there's a pretty big performance hit for using a non-JVM language.

https://mindfulmachines.io/blog/2018/6/apache-spark-scala-vs-java-v-python-vs-r-vs-sql26

An interesting look at some modern hardware systems (like fast GPUs, FPGAs) from a data perspective. Most are still in early days when it comes to integration with data systems, although the post describes what kinds of applications each might fit well.

https://lemire.me/blog/2018/06/26/data-processing-on-modern-hardware/

Data services tend to use large numbers of relatively expensive machines. There are lots of different places to try to optimize performance and cost, e.g., at the application level and at the infrastructure/sizing level. Both Stripe and Zymergen wrote this week about how they optimize their AWS reserved instances to keep costs down. There are several good tips, especially given how overwhelming all of the AWS instance options can be.

https://stripe.com/blog/aws-reserved-instances
https://medium.com/@ZymergenTechBlog/composing-an-aws-reservation-portfolio-c657c304e2d0

Walmart has a post on their A/B testing platform, which is built with Apache Kafka, Spark Streaming, and Spark (batch) to implement the lambda architecture. It shares how they implement sessionization of user data in Spark Streaming, some of their debugging techniques, their monitoring strategy, and more.

https://medium.com/walmartlabs/how-we-built-a-data-pipeline-with-lambda-architecture-using-spark-spark-streaming-9d3b4b4555d3
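
For readers unfamiliar with sessionization, here is a minimal sketch of the general technique in plain Python rather than Spark Streaming. It is not Walmart's implementation: the 30-minute inactivity gap and the `(user_id, timestamp)` event shape are assumptions made for illustration.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout, not from the post


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp) events into per-user sessions.

    A new session starts whenever the time since the user's previous
    event exceeds `gap` seconds. Returns {user_id: [[ts, ...], ...]}.
    """
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = {}
    for user_id, timestamps in by_user.items():
        timestamps.sort()  # events may arrive out of order
        user_sessions = [[timestamps[0]]]
        for ts in timestamps[1:]:
            if ts - user_sessions[-1][-1] > gap:
                user_sessions.append([ts])    # gap exceeded: open a new session
            else:
                user_sessions[-1].append(ts)  # still within the current session
        sessions[user_id] = user_sessions
    return sessions


events = [("u1", 0), ("u2", 100), ("u1", 600), ("u1", 4000), ("u2", 200)]
result = sessionize(events)
```

In a streaming setting the per-user state would have to be kept across micro-batches and expired after the gap elapses; this batch version only shows the grouping rule itself.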

This post describes the Spark architecture, including how it is designed to scale out. It also covers several drawbacks and weaknesses in Spark, many stemming from a lack of maturity. While some tech companies won't mind those drawbacks, there are definitely some things to consider.

https://thenewstack.io/the-good-bad-and-ugly-apache-spark-for-data-science-work/

The above-mentioned challenges include tuning the memory usage of Spark executors. This post describes an interesting solution to that problem: potentially overcommitting system memory while dynamically pausing executors (snapshotting their state to disk) to reclaim memory for other tasks. This avoids thrashing and paging, leading to some good performance improvements.

https://medium.com/@Petuum/a-solution-to-the-memory-limit-challenge-in-big-data-machine-learning-49783a72088b

Hulu recently migrated their Hadoop clusters across data centers. In the second post about this migration, they talk about their strategy for migrating smaller clusters of 100-200 nodes running both HDP and CDH. Their data migration solution leverages HDFS inotify events to track changes for an incremental synchronization.

https://medium.com/hulu-tech-blog/migrating-hulus-hadoop-clusters-to-a-new-data-center-part-two-creating-a-mirrored-hadoop-9b251ca469c2
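
To make the event-driven incremental synchronization above more concrete, the sketch below replays a log of change events onto a mirror and records a checkpoint for the next run. It is only an illustration, not Hulu's tooling: dicts stand in for the two file systems, the event names borrow from HDFS inotify event types (CLOSE, RENAME, UNLINK), and the event dict layout and the `apply_events` function are hypothetical.

```python
def apply_events(source, mirror, events, last_txid):
    """Replay change events newer than last_txid onto the mirror.

    `source` and `mirror` are dicts of path -> contents standing in for
    two file systems. Returns the highest transaction id applied, to be
    stored as the checkpoint for the next incremental run.
    """
    for event in sorted(events, key=lambda e: e["txid"]):
        if event["txid"] <= last_txid:
            continue  # already applied in an earlier run
        kind = event["type"]
        if kind == "CLOSE":
            # File finished writing: copy its current contents, if it still exists.
            if event["path"] in source:
                mirror[event["path"]] = source[event["path"]]
        elif kind == "RENAME":
            if event["src"] in mirror:
                mirror[event["dst"]] = mirror.pop(event["src"])
        elif kind == "UNLINK":
            mirror.pop(event["path"], None)
        last_txid = event["txid"]
    return last_txid


# State of the source after a batch of changes, and a mirror that is behind.
source = {"/data/b": "v2", "/data/c": "new"}
mirror = {"/data/a": "v1", "/data/old": "x"}
events = [
    {"txid": 1, "type": "CLOSE", "path": "/data/a"},  # applied previously
    {"txid": 2, "type": "RENAME", "src": "/data/a", "dst": "/data/b"},
    {"txid": 3, "type": "CLOSE", "path": "/data/b"},
    {"txid": 4, "type": "UNLINK", "path": "/data/old"},
    {"txid": 5, "type": "CLOSE", "path": "/data/c"},
]
checkpoint = apply_events(source, mirror, events, last_txid=1)
```

Only the changed paths are touched, so each run costs in proportion to the number of events since the last checkpoint rather than to the size of the cluster.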

This post describes the journey of one project from an Apache Hive cluster to the Snowflake data warehouse. It includes plenty of technical details about loading and converting the ORC file data as it's loaded into Snowflake.

https://medium.com/hashmapinc/snowflakes-cloud-data-warehouse-what-i-learned-and-why-i-m-rethinking-the-data-warehouse-75a5daad271c

SQL support seems to eventually make its way into every big data tool, and for good reason, as this post shows with several great examples. Filtering, counting, and personalization can get your product a long way before you have to do anything more complicated.

https://cyberomin.github.io/startup/2018/07/01/sql-ml-ai.html

Jobs

Etsy is hiring Senior Data Engineers in New York.

https://jobs.dataengweekly.com/jobs/0aa00df2-679e-4b35-b646-77ccbafd3451

Job postings on the Data Eng Weekly job board are now just $99. Submit a job to reach your peers looking for something new!

https://jobs.dataengweekly.com/submit/job

News

Databricks has posted the videos from the recent Spark + AI Summit. If the 204 videos are a bit too many to go through, you can filter them by topics, including Data Engineering.

https://databricks.com/sparkaisummit/sessions?eventName=Summit%202018

Astronomer has compiled a great list of Apache Airflow resources.

https://www.astronomer.io/guides/external-airflow-resources/

Hortonworks has a recap from the Women in Big Data Panel that took place at DataWorks Summit. There's also information about how to become a member, partner, or sponsor of the Women in Big Data organization.

https://hortonworks.com/blog/women-big-data-panel-dws18-san-jose/

The Big Data Beard podcast has lots of great interviews from the big data ecosystem. In this week's episode, they speak about the Dataiku platform for data science.

https://bigdatabeard.com/bdb-podcast-ep-32-dataiku-with-dr-ken-stanford/

Releases

Databricks has announced integration with RStudio, supporting both SparkR and sparklyr, for their service.

https://databricks.com/blog/2018/06/27/rstudio-integration.html

Version 0.2.0 of the Apache NiFi Registry was released. There are several major improvements and new features in the release, including updates to the REST API, integration with git for snapshotting, and several improvements to deployment and configuration.

https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-NiFiRegistry0.2.0

MapR announced a number of new features, including S3 compatibility, data tiering, and erasure coding.

https://www.datanami.com/2018/06/26/mapr-makes-platform-more-cloud-like/

Apache NiFi 1.7.0 was also released, with new and improved processors for XML, Hive 3.0 & more, Java 9 support, and UI/UX improvements.

https://lists.apache.org/thread.html/f959d1aa30a568a78b7c939c740a74ab9b40289ddfb5de48616ea542@%3Cannounce.apache.org%3E

Apache Kylin, the OLAP engine for Apache Hadoop, released version 2.4.0. It adds a Spark/Kafka integration, Kafka and Hive table joins, and more.

https://lists.apache.org/thread.html/689671d75f9dfc156f0c47709584b44e0ab391b01a502054c47fd90c@%3Cannounce.apache.org%3E

Events

Curated by Datadog ( http://www.datadog.com )

UNITED KINGDOM

Edinburgh Devops Meetup (Edinburgh) - Thursday, July 5
https://www.meetup.com/Edinburgh-DevOps-Meetup/events/248991957/

FRANCE

July Meetup: Kafka @ JCDecaux (Paris) - Monday, July 2
https://www.meetup.com/Paris-Apache-Kafka-Meetup/events/251949176/

BELGIUM

Streaming Architectures (Brussels) - Wednesday, July 4
https://www.meetup.com/bigdatabe/events/251875687/

ITALY

From Legacy Offloading to Event-Driven Microservices: The Journey of Data (Milano) - Wednesday, July 4
https://www.meetup.com/Milano-Kafka-meetup/events/251372462/