Data Eng Weekly Issue #270

24 June 2018

Lots of variety in this week's issue. Topics include Data Reliability Engineering at Criteo, the history of the Apache Arrow project, stream processing with both Wallaroo and St8Flo (a JavaScript framework), a Parquet backend for SQLite, and a few posts on working with large-scale relational databases.

Sponsor

SimpleDataLabs builds Prophecy - a Predictive Analytics Designer for Business Analysts, powered by our DeepWisdom engine. It'll put Predictive Analytics in every Business. We're looking for two Founding Engineers - System Architect to drive SaaS Application React/Scala/Spark/K8s/Cloud and ML Architect who can build MetaLearning in Tensorflow.

Contact Raj on LinkedIn http://bit.ly/raj-bains-linkedin or see http://bit.ly/simpledatalabs

Technical

Heap have built a product that captures and analyzes lots of data about how users interact with a website. This post describing their software architecture has some great advice about technical decision making and recommendations for technologies (e.g. to adopt Kafka early).

https://stackshare.io/heap/how-heap-built-an-analytics-platform-that-auto-tracks-every-user-event

The Citus blog has an overview of a neat trick for incrementally building rollup tables by keeping track of a high-watermark of autogenerated ids.

https://www.citusdata.com/blog/2018/06/14/scalable-incremental-data-aggregation/

This tutorial walks through how to run data Artisans' dA Platform on Google Kubernetes Engine. In addition to the basics of running an application on Kubernetes, it covers using Google Cloud Storage for checkpoints and building/using a custom Docker image.

https://data-artisans.com/blog/getting-started-with-da-platform-on-google-kubernetes-engine

This post walks through a solution to a common problem—getting data from an external service (e.g. Google Analytics) into your data platform. It uses the StreamSets data collector with the HTTP Client origin.

https://streamsets.com/blog/extract-data-google-analytics-streamsets-data-collector/

Great collection of best practices for building or refining your data ingestion system.

http://www.adaltas.com/en/2018/06/18/data-lake-ingestion-best-practices/

GitHub has migrated from a MySQL high availability strategy based on DNS and virtual IPs to one built on Raft, Consul, and HAProxy. They use orchestrator (a system they built internally) for failure detection and initiating MySQL failover. With this solution, they have very small amounts of downtime (<30s) during a failover for their multi-datacenter MySQL deployment.

https://githubengineering.com/mysql-high-availability-at-github/

This post describes how windowing works in the Wallaroo stream processing engine by way of an example that computes trending topics from a stream of tweets.

https://blog.wallaroolabs.com/2018/06/stream-processing-trending-hashtags-and-wallaroo/
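To make the windowing idea concrete, here is a small plain-Python sketch. It does not use the Wallaroo API, and it assumes fixed, non-overlapping (tumbling) windows keyed by event timestamp, which may differ from the windowing the post describes. The function name, the hashtag regex, and the input shape of (timestamp, text) pairs are all illustrative assumptions.

```python
import re
from collections import Counter, defaultdict

# Assumed hashtag pattern for illustration: '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def trending(tweets, window_seconds=60, top_n=3):
    """Count hashtags per tumbling window.

    tweets: iterable of (timestamp_in_seconds, text) pairs.
    Returns {window_start: [(hashtag, count), ...]} with the top_n tags per window.
    """
    windows = defaultdict(Counter)
    for timestamp, text in tweets:
        # Every timestamp maps to exactly one window, identified by its start time.
        window_start = int(timestamp // window_seconds) * window_seconds
        windows[window_start].update(tag.lower() for tag in HASHTAG.findall(text))
    return {
        start: counts.most_common(top_n)
        for start, counts in sorted(windows.items())
    }
```

A stream processor does the same bookkeeping incrementally and emits each window's result once the window closes, instead of holding every window in memory as this batch version does.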

The Dremio blog has a look at the architecture behind the Gandiva initiative, which aims to bring speedups to Apache Arrow through LLVM code generation. The post discusses optimizations like vectorization and pipelining. Early work is showing some impressive speedups over the JVM JIT.

https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/

PgBouncer fronts PostgreSQL to handle thousands of connections with fewer resources than having each client connect to PostgreSQL directly. This post, the second in a series, describes how to use PgBouncer in a multi-tenant environment in which multiple types of services with different service-level objectives are connecting to the database.

https://medium.com/futuretech-industries/postgres-raffle-10k-connections-35-mo-part-two-b4c2e0c86e37

Schibsted has a multi-tenant Presto platform for querying data in S3. This post describes a neat solution to authorization built atop AWS IAM, their usage of the AWS Glue metastore, how they monitor with Datadog, and their CI & deployment infrastructure built on Docker, Travis, Spinnaker, and FPM.

https://medium.com/@FranziCros/accessing-s3-data-through-sql-with-presto-ddb6d4fbb99c

Historically, there haven't been great tools for ad hoc queries of data stored in Avro, ORC, and Parquet files. A new option is a Parquet backend for SQLite, which is both quite helpful for ad hoc introspection and highly performant. This opens up the possibility of an online service consuming Parquet files (via SQLite) to power API endpoints. This post introduces the backend and compares performance to other data formats in SQLite and PostgreSQL.

https://cldellow.com/2018/06/22/sqlite-parquet-vtable.html

The St8Flo framework provides a high-level JavaScript API for implementing distributed data processing. This post serves as a good introduction to the API and basic concepts.

https://medium.com/@dhavalwathare/8-quick-concepts-to-building-data-pipes-and-distributed-computing-systems-with-st8flo-b6fac5fde9fa

Bunsen is a Spark connector for FHIR, which is a health care data interchange specification. This post shows how to use it to build an analytics dashboard by querying data in Apache Cassandra.

https://medium.com/@prkpbandara/gsoc-librehealth-fhir-analytics-using-spark-sql-9019dcb41593

Jobs

Trovit is hiring Big Data Engineers in Barcelona.

https://jobs.dataengweekly.com/jobs/7f3d72a3-d0e6-4b1e-8c3a-30a74af6d886

Job postings on the Data Eng Weekly job board are now just $99. Submit a job to reach your peers looking for something new!

https://jobs.dataengweekly.com/submit/job

News

Datanami has coverage of some of the big announcements from Hortonworks and its partners at this past week's DataWorks Summit. Among them are new cloud offerings (including Hortonworks DataFlow in AWS and Microsoft Azure) and a preview of HDP based on Apache Hadoop 3.

https://www.datanami.com/2018/06/18/hortonworks-looks-to-expand-hybrid-cloud-footprints/

The Criteo Labs blog has a great post describing the history of their big data systems and data team, including scaling problems and the principles they've embraced to solve technical challenges. It also introduces the notion of a Data Reliability Engineer, which is a hybrid data engineer and SRE. At Criteo, the team responsible for data tools and keeping systems functioning at scale falls under the SRE organization.

http://labs.criteo.com/2018/06/from-just-a-bunch-of-engineers-to-data-reliability-engineering/

Dremio has the story of Apache Arrow, which has quickly become an important component in data infrastructure. In addition to history (like the original team that conceived of it), they cover recent developments such as support for GPUs and the Arrow Flight Protocol (which aims to replace ODBC/JDBC for in-memory analytics).

https://www.dremio.com/origin-history-of-apache-arrow/

The agenda for the Spark + AI Summit Europe, which takes place in October in London, has been announced. It includes over 100 sessions across 11 tracks.

https://databricks.com/blog/2018/06/21/spark-ai-summit-europe-agenda-announced.html

Confluent Hub is a new "App Store for Kafka" that makes it easy to install Kafka Connect components.

https://www.confluent.io/blog/introducing-confluent-hub/

Releases

Apache HBase 2.0.1 is out. It includes 70+ bug fixes, improvements, and new features.

https://lists.apache.org/thread.html/572f56a3f539d7a054784fe5665c0f73b84f17b3c74438aecd02d61f@%3Cannounce.apache.org%3E

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Streaming Meetup: Streaming SQL, Kinesis at Lyft, and Apache Calcite (San Francisco) - Wednesday, June 27
https://www.meetup.com/SF-Big-Analytics/events/251066711/

Washington

Flink at Lyft + Graph Processing with Flink (Seattle) - Thursday, June 28
https://www.meetup.com/seattle-apache-flink/events/250529020/

Colorado

Data Engineering in the Cloud Era with Ibotta, Qubole, and Snowflake (Denver) - Tuesday, June 26
https://www.meetup.com/Denver-Data-Engineering/events/251441413/

Georgia

C# Stream Processing with Apache Storm (Alpharetta) - Monday, June 25
https://www.meetup.com/Atlanta-Net-User-Group/events/244527107/

Virginia

Streaming Data Pipelines & Data Science in Healthcare (Tysons) - Wednesday, June 27
https://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/251093508/

New Jersey

Predictive Maintenance with IoT, Apache NiFi, MiNiFi, and Blockchain (Hamilton) - Thursday, June 28
https://www.meetup.com/futureofdata-princeton/events/249163765/

New York

Workshop: Event Streams Using Apache Kafka (New York) - Monday, June 25
https://www.meetup.com/ibmcodenyc/events/251679428/

Massachusetts

Keeping Production Sane: Let's Talk about Monitoring and Streaming (Boston) - Wednesday, June 27
https://www.meetup.com/Boston-ELK-Stack/events/251277457/

CHILE

Future of Data Santiago: Episode 1 (Santiago) - Tuesday, June 26
https://www.meetup.com/futureofdata-santiago/events/251644687/

UNITED KINGDOM

Streaming ETL with Kafka & Drones with APIs for Developers (London) - Monday, June 25
https://www.meetup.com/Oracle-Developer-Meetup-London/events/249256400/

FRANCE

Kafka Meetup (Ennevelin) - Tuesday, June 26
https://www.meetup.com/ChtiJUG/events/251468696/

GERMANY

Data Engineering for Artificial Intelligence (Berlin) - Tuesday, June 26
https://www.meetup.com/Zalando-Tech-Events-Berlin/events/251202283/

CZECH REPUBLIC

Stream Processing and Real-Time Data Pipelines (Prague) - Thursday, June 28
https://www.meetup.com/CS-HUG/events/251514033/

ISRAEL

Apache Kafka Streams Workshop, Part 1 (Tel Aviv) - Wednesday, June 27
https://www.meetup.com/ApacheKafkaTLV/events/249089274/

SINGAPORE

Big Data 101: Introduction to Hadoop File System as Storage for Big Data (Singapore) - Thursday, June 28
https://www.meetup.com/BigDataX/events/251229580/

PHILIPPINES

Manila Big Data Tech Meetup #3: HBase & Amazon Kinesis (Taguig) - Wednesday, June 27
https://www.meetup.com/Manila-BIG-DATA-Group/events/250933139/