Data Eng Weekly Issue #282

23 September 2018

A few tutorials, some pretty interesting distributed systems content, and a couple of compelling applications of machine learning to data cleansing and database systems. There's also a preview of Postgres 11, a list of talks from Strata to check out, and a handful of releases.

Sponsor

Built by narwhals, just for you – Dremio simplifies data engineering and data analytics with the power of Apache Arrow. Connect almost any data source. Accelerate queries up to 1,000x. Let your BI and data science users curate their own data with our nautically-themed user interface. Open source.

Visit https://bit.ly/about-dremio to learn more, or download for free.

Technical

This multi-chapter tutorial walks through building a distributed system on top of riak_core, the framework that forms the foundation of Riak. Each chapter has lots of sample code with descriptions, and it's all open source on GitHub.

https://marianoguerra.github.io/riak-core-tutorial/

An example of a real-time streaming pipeline with StreamSets. Data is consumed via HTTP, translated from one JSON structure to another, written to Kafka, and ultimately lands in MapD.

https://www.jowanza.com/blog/2018/9/8/real-time-station-tracking-ford-gobike-and-mapd
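As a rough illustration of the "translate from one JSON structure to another" step, here is a minimal Python sketch. It is not the StreamSets pipeline from the post; the nested input layout and every field name (stations, location, status, bikes_available) are hypothetical and chosen only to show the flattening idea.

```python
import json


def to_flat_record(station: dict) -> dict:
    """Flatten one nested station document into a single-level record.

    All field names here are hypothetical, not the feed's real schema.
    """
    location = station.get("location", {})
    status = station.get("status", {})
    return {
        "station_id": station["id"],
        "name": station.get("name"),
        "lat": location.get("lat"),
        "lon": location.get("lon"),
        "bikes_available": status.get("bikes_available", 0),
    }


def translate(payload: str) -> list:
    """Parse a JSON payload and return one serialized flat record per station."""
    doc = json.loads(payload)
    return [
        json.dumps(to_flat_record(station), sort_keys=True)
        for station in doc.get("stations", [])
    ]
```

Each returned string is a self-contained JSON message, which is the shape you would typically want before handing records to a producer for a topic-based system such as Kafka.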

This post previews the forthcoming Postgres 11 and argues that it now has many features that were previously differentiators of proprietary databases. These include improved partitioning, parallelization (especially of b-tree index builds and hash joins), and JIT compilation of queries.

https://lwn.net/Articles/764515/

Fivetran has updated their benchmark comparison of Redshift, Snowflake, Azure, Presto and BigQuery. There are a lot of assumptions baked into the benchmark, but the results show that "all warehouses had excellent execution speed, suitable for ad-hoc interactive querying." The authors have proposed five key features that help differentiate the databases, and they've included that feature matrix in the report. There's also a comparison of their benchmark to previous results—lots to consider!

https://fivetran.com/blog/warehouse-benchmark

The spark-metrics project provides a mechanism to send Apache Spark metrics to Prometheus. It recently added features for standardizing metrics names across jobs and attaching context as labels.

https://banzaicloud.com/blog/spark-prometheus-sink-labels/
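To make the idea of standardized names plus labels concrete, here is a small Python sketch. It is not the spark-metrics implementation; it assumes, for illustration only, that a raw metric name is dotted and starts with an application id and an instance (such as "driver"), and it moves those two parts into labels using the Prometheus text exposition format. The function name and the label names app_id and instance are made up for this example.

```python
import re


def to_prometheus_line(raw_name: str, value: float) -> str:
    """Render one metric as a Prometheus text-format line.

    Assumes (hypothetically) names shaped like '<app-id>.<instance>.<rest...>'.
    The job-specific prefix becomes labels, so the metric name stays the same
    across jobs.
    """
    parts = raw_name.split(".")
    if len(parts) < 3:
        raise ValueError(f"expected at least three dotted parts: {raw_name!r}")
    app_id, instance, *rest = parts
    # Prometheus metric names may not contain dots or dashes.
    metric = re.sub(r"[^a-zA-Z0-9_]", "_", "_".join(rest))
    labels = {"app_id": app_id, "instance": instance}
    # Note: label values are not escaped here; real values containing
    # quotes or backslashes would need escaping.
    label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{metric}{{{label_text}}} {value}"
```

With the prefix moved into labels, a single query can aggregate the same metric across every job instead of matching a different name for each application id.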

This article walks through an example use case, with good visuals, of using a distributed log/event hub architecture to replace ETL between systems and improve the timeliness of data.

https://www.confluent.io/blog/changing-face-etl

With just a small amount of code, you can send data to Wallaroo from Python and spawn a number of worker tasks to crunch on that data. This post provides an example using Pandas.

https://blog.wallaroolabs.com/2018/09/make-python-pandas-go-fast/

While this is more of a Linux systems article, it's about high-performance tuning, which a lot of readers are probably doing. It turns out that in certain Xen virtualization setups, the hypervisor can add a lot of overhead to clock_gettime syscalls. This post has a bunch more details on the problem and how to diagnose it.

https://heapanalytics.com/blog/engineering/clocksource-aws-ec2-vdso
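For a quick first look at whether a host might be affected, a minimal Python sketch along these lines can help. It is not the diagnostic method from the post: it simply reads the kernel's current clocksource from sysfs (a Linux-only path) and times repeated clock_gettime calls. The function names are made up for this example, and the measured cost includes Python interpreter overhead, so the number is only useful for comparing hosts against each other.

```python
import time
from pathlib import Path

# Linux-only sysfs file that names the clocksource the kernel is using.
CLOCKSOURCE_PATH = Path(
    "/sys/devices/system/clocksource/clocksource0/current_clocksource"
)


def current_clocksource():
    """Return the active clocksource name, or None if it cannot be read."""
    try:
        return CLOCKSOURCE_PATH.read_text().strip()
    except OSError:
        return None


def mean_call_cost_ns(calls: int = 200_000) -> float:
    """Average wall time per clock_gettime call, in nanoseconds.

    Includes interpreter loop overhead, so treat it as a relative measure.
    """
    start = time.perf_counter_ns()
    for _ in range(calls):
        time.clock_gettime(time.CLOCK_MONOTONIC)
    elapsed = time.perf_counter_ns() - start
    return elapsed / calls


if __name__ == "__main__":
    print("clocksource:", current_clocksource())
    print("mean cost per call (ns):", round(mean_call_cost_ns(), 1))
```

Running the same script on two instance types and comparing the per-call cost gives a cheap signal before reaching for tools like strace or perf.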

Usually data engineering enables machine learning, but this post has an example that turns the tables—extracting features from data files to build a classifier for detecting column types.

https://medium.com/liveramp-engineering/using-machine-learning-to-auto-detect-column-types-in-customer-files-80413c976a1e
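To give a feel for what "extracting features from data files" can mean for a column-type classifier, here is a minimal Python sketch. These are not the features from the post; the feature names, the two regular expressions, and the function names are assumptions picked for illustration. The resulting dictionary is the kind of per-column feature vector one could feed to any off-the-shelf classifier.

```python
import re

# Deliberately simple, hypothetical patterns; real files need broader ones.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_number(text: str) -> bool:
    """True if the text parses as a float (this also accepts 'nan' and 'inf')."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def column_features(values) -> dict:
    """Summarize one column of raw string values as numeric features."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    n = len(cleaned)
    if n == 0:
        return {
            "frac_numeric": 0.0,
            "frac_date": 0.0,
            "frac_email": 0.0,
            "mean_length": 0.0,
            "frac_distinct": 0.0,
        }
    return {
        "frac_numeric": sum(is_number(v) for v in cleaned) / n,
        "frac_date": sum(bool(DATE_RE.match(v)) for v in cleaned) / n,
        "frac_email": sum(bool(EMAIL_RE.match(v)) for v in cleaned) / n,
        "mean_length": sum(len(v) for v in cleaned) / n,
        "frac_distinct": len(set(cleaned)) / n,
    }
```

Fractions rather than raw counts keep the features comparable between a 100-row file and a 10-million-row file, which matters when one model has to handle files of very different sizes.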

A good overview of the tradeoffs in distributed databases, how Spanner guarantees consistency, and why some Spanner-derivative databases aren't able to provide the same guarantees.

http://dbmsmusings.blogspot.com/2018/09/newsql-database-systems-are-failing-to.html

This paper describes an AI approach to database query optimization. Rather than relying on heuristics for pruning the query optimization search space, the authors apply a Reinforcement Learning algorithm.

https://databeta.wordpress.com/2018/09/20/the-crossroads-of-ai-and-database-algorithms-query-optimization/

These slides cover lots of great distributed systems concepts, such as the 8 fallacies of distributed systems. There's a fantastic description of the CAP theorem and lots of practical advice about building distributed systems.

https://drive.google.com/file/d/15nxAaVXZwNFnJNVvgtKonNbzxNgTUCxP/view

Traveloka has designed an API on top of their BigQuery-backed analytics infrastructure. The API abstracts the underlying query and storage layer so that they can enforce access control, standardize access, audit queries, and more.

https://medium.com/traveloka-engineering/data-lake-api-on-microservice-architecture-using-bigquery-10d6e9c5ca8f

Lots of examples that demonstrate the intricacies of how Spark serializes functions and classes, and why this doesn't always work.

https://medium.com/onzo-tech/serialization-challenges-with-spark-and-scala-part-2-now-for-something-really-challenging-bd0f391bd142

Sponsor

Mount Sinai School of Medicine is hiring Data Engineers; come work on cool research and important applied problems in NYC's largest healthcare system!

https://careers.mountsinai.org/jobs/2311556 to apply

News

The Call for Papers for Big Data Technology Warsaw 2019, which takes place in February, is open through October 15th.

http://getindata.com/become-speaker-big-data-technology-warsaw-2019/

A list of several recommended talks from the recent Strata Conf.

https://cloudlock.engineering/strata-conference-nyc-2018-c0a9164aa10a

data Artisans announced that Flink Forward is going to Beijing in December, San Francisco in April, and Berlin in September.

https://data-artisans.com/blog/data-artisans-announces-flink-forward-conference-expansion-to-china

Jobs

Have you checked out the Data Eng Weekly job board yet? https://jobs.dataengweekly.com/. Jobs:

Linux Big Data Engineer, G-Research, London: https://jobs.dataengweekly.com/jobs/cc513d48-56d0-4818-8364-84b1319a9411
Data Engineer, AginicX, Sydney: https://jobs.dataengweekly.com/jobs/07f44617-4048-4236-beb7-9b7ae47fb849

Post a job for $99. https://jobs.dataengweekly.com/

Releases

Apache Flink 1.5.4 and 1.6.1 were released. Both are bug fix releases, with the former resolving HA and timeout issues. The latter has a bunch of fixes and improvements, including to the Kinesis connector, to resuming from a checkpoint, and for a memory leak.

https://flink.apache.org/news/2018/09/20/release-1.5.4.html
https://flink.apache.org/news/2018/09/20/release-1.6.1.html

Couchbase 6.0 was released with a new Analytics Service, a distributed data store that supports efficient querying of JSON data. It's built with Apache AsterixDB and uses SQL++, a superset of SQL, for querying JSON data.

https://www.datanami.com/2018/09/20/couchbase-to-deliver-parallel-json-analytics-without-the-etl/

Version 1.1.0 of Apache Atlas, the data governance and metadata framework, was released this week with an updated authorization model, support for AWS data types, and more.

http://atlas.apache.org/1.1.0/WhatsNew-1.1.html
https://lists.apache.org/thread.html/f4511017d62a5932bc8e0967547a4527408b4634ad1dd4dc0244c018@%3Cannounce.apache.org%3E

Version 2.5.0 of Apache Kylin, the OLAP engine, was released with support for Hadoop 3.0 & HBase 2.0, MySQL for metadata storage, and much more.

https://lists.apache.org/thread.html/02551488382fb3f0a3717ce94d548d9af0f59e2dadd0d96c447b0843@%3Cannounce.apache.org%3E

The Apache Pulsar distributed pub-sub messaging system announced version 2.1.1-incubating with fixes to the 2.1.0 release.

https://lists.apache.org/thread.html/322e9c5ed39169ef0f3257d194180011dc6000959d82f5786845be96@%3Cannounce.apache.org%3E
