Data Eng Weekly


Hadoop Weekly Issue #223

09 July 2017

A short but sweet issue this week with a slew of posts on streaming systems—encryption in Amazon Kinesis, scaling up/down in Apache Flink, and loading data from Apache Kafka to Google BigQuery. There are interesting posts on Microsoft's new storage infrastructure, distributed consensus & exactly once, and performance of graph processing systems. Finally, Apache Hadoop 3.0.0-alpha4 was also released this week, and there's a great post on some of the new shell script features.

Technical

The Morning Paper has an overview of the SIGMOD publication on the Microsoft Azure Data Lake Store, Microsoft's next-generation data storage system. Clusters run up to 50,000 servers, and the system is built from a number of services, including naming, extent management, and secret management. The post notes that encryption adds very little overhead, and that there is a "Small Append Service" to improve performance for use cases like the HBase Write-Ahead Log.

https://blog.acolyer.org/2017/07/04/azure-data-lake-store-a-hyperscale-distributed-file-service-for-big-data-analytics/

This post describes how Amazon Kinesis implements server-side encryption, using Amazon Key Management Service to generate and store encryption keys. The post walks through the encryption and decryption process, and it notes that enabling encryption adds only a small amount of overhead, on the order of 0.2 ms per record in their tests.

https://aws.amazon.com/blogs/big-data/under-the-hood-of-server-side-encryption-for-amazon-kinesis-streams/

This post describes how Apache Flink uses its checkpointing functionality to rescale a job (e.g. to increase or decrease parallelism). It goes into detail on how Flink efficiently reassigns operator state and keyed state from the checkpoint (which is loaded from persistent storage like HDFS).

http://flink.apache.org/features/2017/07/04/flink-rescalable-state.html
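To make the keyed-state part of the Flink post concrete, here is a minimal Python sketch of one way to redistribute keyed state when parallelism changes: keys are hashed into a fixed number of groups, and each subtask owns a contiguous range of groups. The names (MAX_PARALLELISM, key_group, subtask_for_group, assign) and the CRC32 hash are assumptions made for illustration, not Flink's actual code or API.

```python
import zlib

# Assumed fixed upper bound on parallelism; the number of groups never changes.
MAX_PARALLELISM = 128


def key_group(key: str) -> int:
    """Map a key to a group. CRC32 is a stand-in hash for this sketch."""
    return zlib.crc32(key.encode("utf-8")) % MAX_PARALLELISM


def subtask_for_group(group: int, parallelism: int) -> int:
    """Give each subtask a contiguous range of groups."""
    return group * parallelism // MAX_PARALLELISM


def assign(state: dict, parallelism: int) -> list:
    """Split checkpointed keyed state across `parallelism` subtasks."""
    subtasks = [dict() for _ in range(parallelism)]
    for key, value in state.items():
        index = subtask_for_group(key_group(key), parallelism)
        subtasks[index][key] = value
    return subtasks


# Hypothetical checkpointed state: 1,000 per-user counters.
checkpoint = {f"user-{i}": i for i in range(1000)}
before = assign(checkpoint, 2)  # state layout at parallelism 2
after = assign(checkpoint, 3)   # same checkpoint restored at parallelism 3
```

Because a key's group never changes, a restoring subtask only needs to read the range of groups it now owns rather than scan every key in the checkpoint.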

The MyHeritage Engineering blog has a post on loading data from Apache Kafka into Google BigQuery. They examine several different options including batch (via Secor) and streaming (via Apache Kafka Streams, Apache Kafka Connect, and Apache Beam).

https://medium.com/myheritage-engineering/kafka-to-bigquery-load-a-guide-for-streaming-billions-of-daily-events-cbbf31f4b737

The new Apache Kafka exactly-once features have caused quite a bit of discussion, which has led far past the actual Kafka feature set. Along those lines, this post looks at how exactly-once relates to consensus and the FLP impossibility result. The author argues that they're not the same, and describes how idempotence and transactions help achieve "effective only once" rather than guaranteeing that something happens once and only once.

https://fpj.me/2017/07/04/no-consensus-in-exactly-once/
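As a rough illustration of the idempotence idea, here is a minimal Python sketch in which a message may be delivered more than once but its effect is applied only once. The IdempotentSink class, the per-producer sequence numbers, and the assumption of in-order delivery per producer are all hypothetical choices for this example; this is not Kafka's implementation.

```python
class IdempotentSink:
    """Applies each (producer_id, seq) at most once, even if redelivered."""

    def __init__(self):
        self.last_seq = {}  # producer_id -> highest sequence number applied
        self.total = 0      # the state that messages update

    def apply(self, producer_id: str, seq: int, amount: int) -> bool:
        # A sequence number at or below the last applied one is a retry.
        if seq <= self.last_seq.get(producer_id, -1):
            return False  # duplicate delivery: ignore it
        self.last_seq[producer_id] = seq
        self.total += amount
        return True


sink = IdempotentSink()
sink.apply("producer-a", 0, 10)
sink.apply("producer-a", 0, 10)  # retry after a lost acknowledgement
sink.apply("producer-a", 1, 5)
```

The retried message is delivered twice but counted once, which is the difference between delivering exactly once and having an effect exactly once.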

Another post on The Morning Paper covers a performance analysis of graph database systems. There are surprising results—mostly that for smaller OLTP-like applications (i.e. a small number of graph edge hops), a traditional SQL database is a good fit. As the post concludes, there seems to be quite a bit of room for reliability and scalability improvements.

https://blog.acolyer.org/2017/07/07/do-we-need-specialized-graph-databases-benchmarking-real-time-social-networking-applications/

Apache Hadoop 3.0.0-alpha4 was released this week. One of the new features is more powerful support for user configuration via shell variables in the Hadoop shell scripts. This post describes the new features of user restrictions and user switching as well as the return of the start-all script.

https://effectivemachines.com/2017/07/08/powerful-_users-in-apache-hadoop-3-0-0-alpha4/

Releases

Apache Phoenix 4.11 was released this week. It adds support for HBase 1.3.1+, includes performance improvements, and much more.

https://lists.apache.org/thread.html/de6b3a8f06c9ce153bf753e02ef5bd9674116352154ebab484e15edf@%3Cannounce.apache.org%3E

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

Washington

Full-Stack Spark + Openshift (Seattle) - Tuesday, July 11
https://www.meetup.com/Seattle-Full-Stack/events/240473408/

Full-Stack Analytics App Dev Using Spark and Kafka (Seattle) - Wednesday, July 12
https://www.meetup.com/Metis-Seattle-Data-Science/events/240790353/

Snowflake Makes Apache Spark Faster + Agile Data Science 2.0 (Bellevue) - Thursday, July 13
https://www.meetup.com/Seattle-Spark-Meetup/events/234726935/

Ohio

Cleveland Big Data and Hadoop User Group (Mayfield Village) - Monday, July 10
https://www.meetup.com/Cleveland-Hadoop/events/239698667/

North Carolina

Introduction to Spark for Data Engineers, Data Scientists, and Developers (Raleigh) - Wednesday, July 12
https://www.meetup.com/Big-Data-Developers-in-Raleigh/events/241093719/

CANADA

Apache NiFi: Enterprise Data Flow Management and FBP with Joe Witt (Toronto) - Tuesday, July 11
https://www.meetup.com/Toronto-GTA-Flow-Based-Programming-Meetup/events/240482555/

SPAIN

Big Data Workshop: Data Ingestion in Hadoop (Madrid) - Thursday, July 13
https://www.meetup.com/Big-Data-Madrid/events/241272104/

If you didn't receive this email directly and you'd like to subscribe to weekly emails, please visit https://hadoopweekly.com