23 September 2018
A few tutorials, some pretty interesting distributed systems content, and a couple of compelling applications of machine learning to data cleansing and database systems. There's also a preview of Postgres 11, a list of talks from Strata to check out, and a handful of releases.
Built by narwhals, just for you – Dremio simplifies data engineering and data analytics with the power of Apache Arrow. Connect almost any data source. Accelerate queries up to 1,000x. Let your BI and data science users curate their own data with our nautically-themed user interface. Open source.
Visit https://bit.ly/about-dremio to learn more, or download for free.
This multi-chapter tutorial walks through building a distributed system atop the riak_core framework, which is the foundation of Riak. Each chapter has lots of sample code with descriptions, and it's all open source on GitHub.
https://marianoguerra.github.io/riak-core-tutorial/
An example of a real-time streaming pipeline with StreamSets. Data is consumed via HTTP, translated from one JSON structure to another, written to Kafka, and ultimately lands in MapD.
https://www.jowanza.com/blog/2018/9/8/real-time-station-tracking-ford-gobike-and-mapd
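The translation step in a pipeline like this can be sketched in plain Python. The field names below are hypothetical and not taken from the post, and the HTTP fetch and Kafka write are left out. This only illustrates reshaping one JSON structure into another.

```python
import json


def reshape_station(raw: dict) -> dict:
    """Flatten a nested station record into the shape a sink table might expect.

    All field names here are hypothetical, chosen only for illustration.
    """
    return {
        "station_id": raw["id"],
        "name": raw["info"]["name"],
        "lat": raw["info"]["location"]["lat"],
        "lon": raw["info"]["location"]["lon"],
        # Default to 0 when the status block carries no bike count.
        "bikes_available": raw["status"].get("bikes", 0),
    }


def translate(payload: str) -> str:
    """Parse a JSON array of nested records and emit one flat JSON object per line."""
    records = json.loads(payload)
    return "\n".join(json.dumps(reshape_station(r), sort_keys=True) for r in records)
```

Emitting one JSON object per line keeps each record independent, which suits a message-per-record sink such as a Kafka topic.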
This post previews the forthcoming Postgres 11 and argues that it now has many features that were previously differentiators of proprietary databases. These include improved partitioning, parallelization (especially of b-tree index builds and hash joins), and JIT compilation of queries.
https://lwn.net/Articles/764515/
Fivetran has updated their benchmark comparison of Redshift, Snowflake, Azure, Presto and BigQuery. There are a lot of assumptions baked into the benchmark, but the results show that "all warehouses had excellent execution speed, suitable for ad-hoc interactive querying." The authors have proposed five key features that help differentiate the databases, and they've included that feature matrix in the report. There's also a comparison of their benchmark to previous results—lots to consider!
https://fivetran.com/blog/warehouse-benchmark
The spark-metrics project provides a mechanism to send Apache Spark metrics to Prometheus. It recently added features for standardizing metric names across jobs and attaching context as labels.
https://banzaicloud.com/blog/spark-prometheus-sink-labels/
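To see why standardized names plus labels help, here is a minimal Python sketch of the general idea. It is not the project's implementation. It assumes raw metric names of the form `<app id>.<instance>.<rest>`, and the label names `app_id` and `instance` are hypothetical.

```python
import re
from typing import Dict, Tuple


def to_prometheus(raw_name: str) -> Tuple[str, Dict[str, str]]:
    """Split '<app_id>.<instance>.<rest>' into a stable metric name plus labels.

    Moving the per-job parts of the name into labels means the same metric
    from different jobs shares one name and can be compared in a single query.
    """
    app_id, instance, rest = raw_name.split(".", 2)
    # Prometheus metric names may only contain letters, digits, '_' and ':'.
    metric = re.sub(r"[^a-zA-Z0-9_:]", "_", rest)
    return metric, {"app_id": app_id, "instance": instance}
```

For example, under these assumptions `to_prometheus("app-20180923-0001.driver.jvm.heap.used")` yields the name `jvm_heap_used` with the application ID and `driver` as labels.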
This article describes an example use case, with good visuals, for using a distributed log/event hub architecture to replace ETL between systems and improve the timeliness of data.
https://www.confluent.io/blog/changing-face-etl
With just a small amount of code, you can send data to Wallaroo from Python and spawn a number of worker tasks to crunch on that data. This post provides an example using Pandas.
https://blog.wallaroolabs.com/2018/09/make-python-pandas-go-fast/
While this is more of a Linux systems article, it's about high-performance tuning, which a lot of readers are probably doing. It turns out that in certain Xen virtualization setups, the hypervisor can add a lot of overhead to clock_gettime syscalls. This post has a bunch more details on the problem and how to diagnose it.
https://heapanalytics.com/blog/engineering/clocksource-aws-ec2-vdso
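A rough first check on a Linux host is to look at the active clocksource and time a batch of clock reads. The Python sketch below is not from the post. It reads the standard sysfs clocksource file and measures calls made through the interpreter, so the number includes Python overhead and is only useful for comparing hosts against each other.

```python
import time
from pathlib import Path
from typing import Optional

# Standard Linux sysfs location for the active clocksource.
CLOCKSOURCE_PATH = Path(
    "/sys/devices/system/clocksource/clocksource0/current_clocksource"
)


def current_clocksource() -> Optional[str]:
    """Return the active clocksource (e.g. "tsc" or "xen"), or None if unavailable."""
    try:
        return CLOCKSOURCE_PATH.read_text().strip()
    except OSError:
        return None


def avg_clock_call_ns(calls: int = 200_000) -> float:
    """Average wall time per time.clock_gettime call, in nanoseconds.

    Includes interpreter overhead, so treat the result as a relative
    measure between hosts rather than an absolute syscall cost.
    """
    start = time.perf_counter_ns()
    for _ in range(calls):
        time.clock_gettime(time.CLOCK_MONOTONIC)
    return (time.perf_counter_ns() - start) / calls


if __name__ == "__main__":
    print("clocksource:", current_clocksource())
    print(f"avg per call: {avg_clock_call_ns():.0f} ns")
```

If two otherwise similar hosts report very different per-call times, the linked post's diagnosis steps are the place to go next.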
Usually data engineering enables machine learning, but this post has an example that turns the tables—extracting features from data files to build a classifier for detecting column types.
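As a rough illustration of the kind of features such a classifier might consume, here is a Python sketch that summarizes a column of raw strings as a few numeric signals. The feature set is an assumption made for illustration and is not the one used in the post.

```python
from datetime import datetime
from typing import Dict, List


def _is_number(value: str) -> bool:
    """True if the string parses as a float (this also accepts 'nan' and 'inf')."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def _is_iso_date(value: str) -> bool:
    """True if the string is a date in YYYY-MM-DD form."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def column_features(values: List[str]) -> Dict[str, float]:
    """Summarize a column of raw strings as numeric features for a classifier."""
    non_empty = [v.strip() for v in values if v.strip()]
    n = len(non_empty) or 1  # avoid division by zero on empty columns
    return {
        "frac_numeric": sum(_is_number(v) for v in non_empty) / n,
        "frac_iso_date": sum(_is_iso_date(v) for v in non_empty) / n,
        "frac_unique": len(set(non_empty)) / n,
        "mean_length": sum(len(v) for v in non_empty) / n,
        "frac_empty": 1 - len(non_empty) / max(len(values), 1),
    }
```

Feature dictionaries like this, computed per column and paired with known column types, could then be fed to any off-the-shelf classifier.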
A good overview of the tradeoffs in distributed databases, how Spanner guarantees consistency, and why some Spanner-derivative databases aren't able to provide the same guarantees.
http://dbmsmusings.blogspot.com/2018/09/newsql-database-systems-are-failing-to.html
This paper describes an AI approach to database query optimization. Rather than relying on heuristics for pruning the query optimization search space, the authors apply a Reinforcement Learning algorithm.
These slides cover lots of great distributed systems concepts, such as the 8 fallacies of distributed systems. There's a fantastic description of the CAP theorem and lots of practical advice about building distributed systems.
https://drive.google.com/file/d/15nxAaVXZwNFnJNVvgtKonNbzxNgTUCxP/view
Traveloka has designed an API atop their BigQuery-backed analytics infrastructure. The API abstracts the underlying query and storage layer so that they can enforce access control, standardize access, audit queries, and more.
Lots of examples that demonstrate the intricacies of how Spark serializes functions and classes, and why this doesn't always work.
Mount Sinai School of Medicine is hiring Data Engineers; come work on cool research and important applied problems in NYC's largest healthcare system!
https://careers.mountsinai.org/jobs/2311556 to apply
The Call for Papers for Big Data Technology Warsaw 2019, which takes place in February, is open through October 15th.
http://getindata.com/become-speaker-big-data-technology-warsaw-2019/
A list of several recommended talks from the recent Strata Conf.
https://cloudlock.engineering/strata-conference-nyc-2018-c0a9164aa10a
data Artisans announced that Flink Forward is going to Beijing in December, San Francisco in April, and Berlin in September.
https://data-artisans.com/blog/data-artisans-announces-flink-forward-conference-expansion-to-china
Have you checked out the Data Eng Weekly job board yet? https://jobs.dataengweekly.com/. Jobs:
Post a job for $99. https://jobs.dataengweekly.com/
Apache Flink 1.5.4 and 1.6.1 were released. Both are bug fix releases, with the former resolving HA and timeout issues. The latter has a bunch of fixes and improvements, including to the Kinesis connector, to resuming from a checkpoint, and for a memory leak.
https://flink.apache.org/news/2018/09/20/release-1.5.4.html
https://flink.apache.org/news/2018/09/20/release-1.6.1.html
Couchbase 6.0 was released with a new Analytics Service, a distributed data store that supports efficient querying of JSON data. It's built with Apache AsterixDB and uses SQL++, a superset of SQL, as its query language.
https://www.datanami.com/2018/09/20/couchbase-to-deliver-parallel-json-analytics-without-the-etl/
Version 1.1.0 of Apache Atlas, the data governance and metadata framework, was released this week with an updated authorization model, support for AWS data types, and more.
http://atlas.apache.org/1.1.0/WhatsNew-1.1.html
https://lists.apache.org/thread.html/f4511017d62a5932bc8e0967547a4527408b4634ad1dd4dc0244c018@%3Cannounce.apache.org%3E
Version 2.5.0 of Apache Kylin, the OLAP engine, was released with support for Hadoop 3.0 & HBase 2.0, MySQL for metadata storage, and much more.
The Apache Pulsar distributed pub-sub messaging system announced version 2.1.1-incubating with fixes to the 2.1.0 release.
Curated by Datadog ( http://www.datadog.com )
Kafka & Elasticsearch: Introduction, Best Practices & User Stories (San Francisco) - Monday, September 24
https://www.meetup.com/KafkaBayArea/events/254248245/
Airflow Meetup @ Google (Sunnyvale) - Monday, September 24
https://www.meetup.com/Bay-Area-Apache-Airflow-Incubating-Meetup/events/253105418/
Hadoop Contributors Meetup (Sunnyvale) - Tuesday, September 25
https://www.meetup.com/Hadoop-Contributors/events/254012512/
Vespa: Open Source Big Data Serving Engine + Lambda Architecture in Practice (San Francisco) - Wednesday, September 26
https://www.meetup.com/SF-Big-Analytics/events/254461052/
Stream Ingestion, Processing & Analytics + In-Memory in the Cloud! (Menlo Park) - Wednesday, September 26
https://www.meetup.com/Bay-Area-In-Memory-Computing/events/254629901/
Messaging + Stream Processing Systems: Kafka, Cassandra, Spark, Nats (San Jose) - Thursday, September 27
https://www.meetup.com/Women-Who-Go-South-Bay/events/252413917/
Easy Hadoop + Presto + Spark in Azure: This Is Qubole (Irving) - Thursday, September 27
https://www.meetup.com/Data-AI-Microsoft/events/254771837/
Data Engineering at Scale with Azure Databricks (Tampa) - Thursday, September 27
https://www.meetup.com/Tampa-Bay-BI-Data-Analytics/events/250884906/
5 Steps to Build Streaming Systems with Confluent's Neha Narkhede (Tysons) - Tuesday, September 25
https://www.meetup.com/Apache-Kafka-DC/events/254479484/
Apache Hive 3: A New Horizon (New York) - Wednesday, September 26
https://www.meetup.com/futureofdata-newyork/events/254153447/
Welcome HDP 3.0 (Boston) - Wednesday, September 26
https://www.meetup.com/futureofdata-boston/events/253213786/
KSQL and Demystifying Kafka (Barcelona) - Wednesday, September 26
https://www.meetup.com/Barcelona-Kafka-Meetup/events/254252957/
Moving Away from Legacy Using Kafka Streams (Barcelona) - Thursday, September 27
https://www.meetup.com/privaliatech/events/254786911/
Kafka by Robin Moffatt of Confluent + Big Industries (Kontich) - Tuesday, September 25
https://www.meetup.com/Brussels-Apache-Kafka-Meetup-by-Confluent/events/253855467/
Deepdive into Spark SQL (Amsterdam) - Thursday, September 27
https://www.meetup.com/Amsterdam-Spark/events/254846744/
Flink with Amazon + Monitoring Flink with Prometheus (Munich) - Wednesday, September 26
https://www.meetup.com/Hadoop-User-Group-Munich/events/252393503/
Tim Berglund Talks KSQL (Sydney) - Wednesday, September 26
https://www.meetup.com/apache-kafka-sydney/events/254705244/
Kafka 101: Kafka in a Microservices Architecture (Sydney) - Thursday, September 27
https://www.meetup.com/ExpertTalks-Sydney/events/254680260/
Tim Berglund Talks KSQL (Auckland) - Thursday, September 27
https://www.meetup.com/Auckland-Kafka/events/254425084/