Data Eng Weekly


Data Eng Weekly Issue #284

07 October 2018

Cloudera and Hortonworks made big news this week by announcing that they intend to merge. It will be interesting to keep an eye on what happens in the coming months as that deal closes. Elastic had their IPO this week, Salesforce unveiled a new open source project to mirror data between Kafka clusters, and Wallaroo announced that they've re-licensed their project using the Apache 2.0 license. The technical posts include several tutorials and write-ups sharing experience with various tools.

Sponsor

Built by narwhals, just for you – Dremio simplifies data engineering and data analytics with the power of Apache Arrow. Connect almost any data source. Accelerate queries up to 1,000x. Let your BI and data science users curate their own data with our nautically-themed user interface. Open source.

Visit https://bit.ly/about-dremio to learn more, or download for free.

Technical

Amazon has a tutorial for setting up Kubeflow, the framework for running TensorFlow on Kubernetes. They include details on training a model using a Jupyter notebook and serving the model using Seldon Core (a tool for deploying machine learning models on Kubernetes).

https://aws.amazon.com/blogs/opensource/kubeflow-amazon-eks/

Landoop has a good tutorial on using their Lenses SQL engine for Apache Kafka to analyze DNS traffic. They describe how to build an application to detect data exfiltration and other malicious activity.

https://www.landoop.com/blog/2018/10/lenses-sql-for-your-intrustion-detection-system/

Martin Kleppmann, author of Designing Data-Intensive Applications, gave a talk on Conflict-free Replicated Data Types (CRDTs) at QCon London. It's a great presentation on distributed system topics and how CRDTs can solve some important problems. The video and the slides, which provide a number of great visual examples, are posted online.

https://www.infoq.com/presentations/crdt-distributed-consistency
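For readers new to CRDTs, here is a minimal sketch of one of the simplest, a grow-only counter (G-Counter). It is not taken from the talk; the class name GCounter and the replica names "a" and "b" are hypothetical, chosen only for illustration. Each replica increments its own slot, and replicas converge by merging with an element-wise maximum, which is commutative, associative, and idempotent.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merged by element-wise max."""

    replica_id: str
    counts: Dict[str, int] = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        # A replica only ever updates its own slot.
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        # The counter's value is the sum over all replica slots.
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Taking the max per slot makes merges safe to repeat and reorder.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)


# Two replicas accept writes independently, then exchange state.
a = GCounter("a")
b = GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
a.merge(b)  # merging the same state again changes nothing
```

After the exchange both replicas hold the same slots and report the same total, regardless of the order in which the merges happened.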

Confluent's second post on troubleshooting KSQL has an overview of how to debug running queries via the CLI (using EXPLAIN and SHOW QUERIES) and the Confluent Control Center, as well as how to analyze metrics exposed via JMX.

https://www.confluent.io/blog/troubleshooting-ksql-part-2

Autotrader has a great post on how they've productionized their Spark ML workloads. They use Apache Airflow to run the Spark ML training job, convert the Spark ML models to MLeap for serving requests, run a CI job to build and test a Docker container with their model and MLeap, and deploy using Kubernetes. The post has lots of details on these phases and on their strategy for the discovery phase of ML, which is done via Databricks notebooks.

https://engineering.autotrader.co.uk/2018/10/03/productionizing-days-to-sell.html

This is a great overview of several distributed database concepts, and of how production systems have evolved to meet scalability and reliability demands over the past decade. The inline illustrations are quite good, too.

https://www.cockroachlabs.com/blog/brief-history-high-availability/

Several useful tips for building PySpark jobs—how to organize your code, building out unit tests, logging, and more.

https://medium.com/@lubna_22592/building-production-pyspark-jobs-5480d03fd71e

End-to-end example of generating (using Keras), building (using Flask and Docker), and deploying (with Kubernetes) a deep learning model.

https://medium.com/analytics-vidhya/deploy-your-first-deep-learning-model-on-kubernetes-with-python-keras-flask-and-docker-575dc07d9e76

Annalect writes about their experience building a data warehouse on Amazon S3 with Redshift Spectrum. In addition to a glimpse at their architecture, they share several best practices (like using short-lived clusters for jobs) for working with Spectrum.

https://aws.amazon.com/blogs/big-data/how-annalect-built-an-event-log-data-analytics-solution-using-amazon-redshift/

The dataArtisans blog has a good collection of things to consider (like network capacity, the number of records, and the size per record) when sizing an Apache Flink cluster. Much of it probably applies to other frameworks as well.

https://data-artisans.com/blog/6-things-to-consider-when-defining-your-apache-flink-cluster-size
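As a rough illustration of how record count, record size, and network capacity interact, here is a back-of-the-envelope sketch. All of the numbers and names below are hypothetical assumptions, not figures from the post, and the calculation covers only ingest traffic. Shuffles, checkpoints, and output add to the load in a real job.

```python
def ingest_mb_per_second(records_per_second: int, bytes_per_record: int) -> float:
    """Raw ingest throughput in MB/s (1 MB = 1,000,000 bytes)."""
    return records_per_second * bytes_per_record / 1_000_000


# Hypothetical workload and cluster, chosen only for illustration.
RECORDS_PER_SECOND = 1_000_000
BYTES_PER_RECORD = 2_000
MACHINES = 5
LINK_MB_PER_SECOND = 1_250  # a 10 Gbit/s link carries at most about 1,250 MB/s

total_mb = ingest_mb_per_second(RECORDS_PER_SECOND, BYTES_PER_RECORD)
per_machine_mb = total_mb / MACHINES
utilization = per_machine_mb / LINK_MB_PER_SECOND

print(f"total ingest:        {total_mb:.0f} MB/s")
print(f"ingest per machine:  {per_machine_mb:.0f} MB/s")
print(f"link utilization:    {utilization:.0%} (ingest only)")
```

Under these assumptions ingest alone uses about a third of each machine's link, which leaves limited headroom once shuffle and checkpoint traffic are added.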

Sponsor

"We are no tables, but you might join us." If you find this as funny as we do, you might be perfect for Runtastic's data engineering team, building algorithms for our suite of fitness apps.

Apply here: http://bit.ly/runtastic-data-engineer

News

And to the biggest news of the week: Cloudera and Hortonworks have announced that they intend to merge early next year. The deal is an "all-stock merger of equals" in which Cloudera CEO Tom Reilly will be the CEO and Hortonworks COO Scott Davidson will be COO.

https://www.geekwire.com/2018/big-data-stalwarts-cloudera-hortonworks-merge-cloudera-ceo-tom-reilly-will-run-new-company/

Datanami has a great overview of reactions to the merger, which will probably produce some winners and losers among Apache open source projects. A lot of industry veterans weighed in on Twitter, and this article has a good summary of their sentiments.

https://www.datanami.com/2018/10/04/reaction-to-hortonworks-cloudera-mega-merger/

The Cloudera and Hortonworks merger brings together the two public Hadoop companies. Other than SaaS Hadoop vendors, MapR is the other big enterprise company that has its own distribution. The reaction from their CEO highlights how the MapR approach differs, since it is built on their own core technology.

https://mapr.com/blog/in-a-consolidating-market-mapr-delivers-today/

Wallaroo has announced that their flagship open source product is now fully Apache 2.0 licensed. Their stream processing framework is an interesting alternative (with first-class Python support) to others that are mostly built on the JVM.

https://blog.wallaroolabs.com/2018/10/wallaroo-goes-full-apache-2.0/

Elastic had their IPO this week, and they raised around $250 million.

https://www.datanami.com/2018/10/05/elastic-ipo-expected-to-raise-250m/

Jobs

Have you checked out the Data Eng Weekly job board yet? https://jobs.dataengweekly.com/

Post a job for $99. https://jobs.dataengweekly.com/

Releases

Wallaroo 0.5.3 was released. It includes a preview of the Python Connector API and a new snapshotting implementation based on the Chandy-Lamport algorithm.

https://github.com/WallarooLabs/wallaroo/releases/tag/0.5.3
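For context on the Chandy-Lamport algorithm mentioned above, here is a toy, single-threaded sketch of the idea. It is not Wallaroo's implementation. The Process class, the MARKER value, and the two-process money-transfer scenario are all hypothetical and exist only for illustration. A process records its local state when it starts a snapshot or sees its first marker, forwards markers on every outgoing channel, and records messages that arrive on each incoming channel until that channel's marker shows up.

```python
from collections import deque

MARKER = "MARKER"  # hypothetical control message that delimits the snapshot


class Process:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.peers = {}          # peer name -> Process (outgoing channels)
        self.inbox = {}          # sender name -> FIFO channel into this process
        self.snapshot = None     # recorded local state
        self.recording = {}      # sender name -> messages seen while recording
        self.channel_state = {}  # sender name -> recorded in-flight messages

    def connect(self, other):
        # Create a one-way FIFO channel from self to other.
        self.peers[other.name] = other
        other.inbox[self.name] = deque()

    def send(self, to, msg):
        self.peers[to].inbox[self.name].append(msg)

    def transfer(self, to, amount):
        self.balance -= amount
        self.send(to, amount)

    def start_snapshot(self):
        # Record local state, start recording every incoming channel,
        # and send a marker on every outgoing channel.
        self.snapshot = self.balance
        self.recording = {sender: [] for sender in self.inbox}
        for peer in self.peers:
            self.send(peer, MARKER)

    def receive(self, sender):
        msg = self.inbox[sender].popleft()
        if msg == MARKER:
            if self.snapshot is None:
                self.start_snapshot()  # first marker triggers the local snapshot
            # The marker closes this channel's recording.
            self.channel_state[sender] = self.recording.pop(sender)
        else:
            self.balance += msg
            if sender in self.recording:
                self.recording[sender].append(msg)  # in flight during snapshot


p, q = Process("p", 100), Process("q", 50)
p.connect(q)
q.connect(p)

q.transfer("p", 10)  # 10 is in flight from q to p
p.start_snapshot()   # p records 100 and sends a marker to q
p.receive("q")       # p gets the 10 while recording channel q -> p
q.receive("p")       # q sees the marker, records 40, sends its own marker
p.receive("q")       # p sees q's marker and closes channel q -> p

recorded_total = p.snapshot + q.snapshot + sum(
    sum(msgs) for proc in (p, q) for msgs in proc.channel_state.values()
)
```

The recorded process states plus the recorded in-flight message add up to the 150 units that existed before the snapshot began, even though no process ever paused.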

StreamSets Data Collector 3.5.0 and Control Hub 3.4.0 were released. The new version of Data Collector has improvements to microservice pipelines and data governance, as well as to support for delimited and Excel data.

https://streamsets.com/blog/streamsets-announces-control-hub-version-3-4-0-and-streamsets-data-collector-version-3-5-0/

MapR 6.1 is out with a new "secure by default" configuration, improved streaming security, improvements to the MapR filesystem, support for idempotent producers in the MapR event store, and much more.

https://mapr.com/blog/mapr-6-1-release-with-mep-6-0-is-now-generally-available/
https://mapr.com/docs/61/ReleaseNotes/whatsnew.html

Starburst Presto Enterprise 208e has a new integration with Apache Ranger and Apache Sentry for Role-Based Access Control. This blog post describes how to set up and use RBAC using the CLI.

https://www.starburstdata.com/technical-blog/presto-security-apache-ranger/

Salesforce has open sourced Mirus, their tool for mirroring data between Apache Kafka clusters. Mirus addresses a few limitations of MirrorMaker, such as scaling to multiple clusters, dynamic configuration, and fault tolerance. The post has some details about the architecture and implementation. It's been used in production at Salesforce for 6 months.

https://engineering.salesforce.com/open-sourcing-mirus-3ec2c8a38537

Events

Curated by Datadog ( http://www.datadog.com )

California

Bay Area Flink Meetup (Santa Clara) - Thursday, October 11
https://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/255047007/

Texas

Making Data Great Again (Plano) - Tuesday, October 9
https://www.meetup.com/North-Texas-DAMA-Meetup/events/254364348/

Virginia

Finding Cost Efficiencies with Spark & AWS (Tysons) - Thursday, October 11
https://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/253545616/

District of Columbia

Big Data Beyond Hadoop (College Park) - Wednesday, October 10
https://www.meetup.com/Big-Data-In-Action/events/255193940/

UNITED KINGDOM

Streaming, Databases, & Distributed Systems: Bridging the Divide (Bristol) - Tuesday, October 9
https://www.meetup.com/BigDataBristol/events/254440209/

FRANCE

LinuxKit + Kafka (Nantes) - Tuesday, October 9
https://www.meetup.com/Nantes-Java-User-Group/events/255067803/

GERMANY

Gwen Shapira and Dominik Benz Talk Kafka (München) - Monday, October 8
https://www.meetup.com/Apache-Kafka-Germany-Munich/events/254705011/

Munich Datageeks (Unterföhring) - Thursday, October 11
https://www.meetup.com/Munich-Datageeks/events/255051436/

RUSSIA

KafkaStreams & KSQL (Saint Petersburg) - Saturday, October 13
https://www.meetup.com/St-Petersburg-Kafka/events/255101428/

AUSTRALIA

Data Engineering Meetup (Sydney) - Wednesday, October 10
https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/255260041/