I spent a lot of time this week reading about how different engineering teams have scaled their Hadoop infrastructure. I also spent time looking at consistent hashing strategies, how GitHub manages DNS and routing, and how individual personalities affect team culture and performance. Read on for all the details.
One of GitHub’s infrastructure engineers describes how they use geographic DNS routing and network peering to shape their traffic. Through these efforts they were able to direct 60-70% of GitHub traffic through optimal routes.
This article in the Harvard Business Review discusses how personalities affect team performance. Yes, the skill of each person is important, but other factors, such as whether you are results-focused or relationship-focused, matter too. The tl;dr: a mix of personality traits tends to yield better team performance and interpersonal dynamics.
Uber’s engineering team talks about how they’ve addressed critical bottlenecks in scaling their Hadoop/HDFS infrastructure. The post covers splitting Hadoop/HDFS into multiple physical clusters using namespaces and ViewFs, deploying HDFS upgrades through a better deployment framework, and tuning NameNode garbage collection.
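NameNode GC tuning generally happens through JVM options in hadoop-env.sh. As an illustrative sketch only (the specific flags, heap sizes, and thresholds here are my assumptions, not Uber’s actual settings), a CMS-based configuration might look like:

```shell
# hadoop-env.sh -- hypothetical NameNode GC settings (values are illustrative)
export HADOOP_NAMENODE_OPTS="-Xms64g -Xmx64g \
  -XX:+UseConcMarkSweepGC \
  -XX:+UseParNewGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly \
  ${HADOOP_NAMENODE_OPTS}"
```

Starting CMS early (the occupancy fraction) trades some throughput for shorter, more predictable pauses, which matters on a NameNode holding a huge heap of filesystem metadata.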
After reading how Uber addressed scalability challenges with ViewFs, I wanted to better understand HDFS Federation. HDFS Federation addresses a limitation in earlier Hadoop architectures where a single NameNode manages the entire namespace. With federation you can run multiple NameNodes/namespaces, which can bring better performance and process isolation, especially in deployments with a lot of small files.
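To make that concrete, here’s a minimal ViewFs mount-table sketch for core-site.xml (the cluster name clusterX and the NameNode hosts nn1/nn2 are hypothetical). Clients see a single viewfs:// namespace while /user and /tmp are served by different NameNodes:

```xml
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://clusterX</value>
</property>
<property>
  <name>fs.viewfs.mounttable.clusterX.link./user</name>
  <value>hdfs://nn1/user</value>
</property>
<property>
  <name>fs.viewfs.mounttable.clusterX.link./tmp</name>
  <value>hdfs://nn2/tmp</value>
</property>
```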
This article on InfoQ provides some background on the General Data Protection Regulation (GDPR) and Amazon’s readiness for the legislation, which goes into effect next month.
This post discusses the results of testing Hive-MR3 and Hive-on-Tez using the TPC-DS benchmark on two different clusters. In this lab environment Hive-MR3 produced faster query execution times, though it’s not a straightforward comparison since Hive-MR3 shares the same runtime environment as Hive-on-Tez.
This article discusses various solutions to a common hashing problem. Given a key/value store, you want to distribute the keys evenly across all servers so you can easily find them again, but you can’t store a global directory/lookup table to tell you where things are. Algorithms discussed include Random Trees, Ketama (libketama), the approach in Amazon’s Dynamo k/v store, Jump Hash, Multi-probe Consistent Hashing, and more.
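Jump Hash is compact enough to sketch in full. This is a straightforward Python port of the algorithm from the Lamping/Veach paper: it maps a 64-bit key to one of `num_buckets` buckets, and when the bucket count grows from n to n+1, each key either stays put or moves to the new bucket n.

```python
def jump_hash(key: int, num_buckets: int) -> int:
    """Jump Consistent Hash: maps key to a bucket in [0, num_buckets)."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step (mask keeps Python ints at 64 bits)
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b
```

The key property: growing from 10 to 11 buckets moves only ~1/11 of the keys, all of them into the new bucket, with no per-node state to store.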
The Teads engineering team talks about their move from a Lambda Architecture to a multi-cloud approach using AWS (Kafka) and Google Cloud Platform (BigQuery, DataFlow). Much of the article covers their approach to scaling BigQuery through different ETL practices and data rollups.
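As a toy illustration of the rollup idea (my own sketch, not Teads’ actual pipeline), pre-aggregating raw events into hourly totals shrinks the number of rows an analytics query has to scan:

```python
from collections import defaultdict
from datetime import datetime

def rollup_hourly(events):
    """Aggregate raw (timestamp, count) events into hourly totals."""
    totals = defaultdict(int)
    for ts, count in events:
        # truncate the timestamp to the hour it falls in
        hour = ts.replace(minute=0, second=0, microsecond=0)
        totals[hour] += count
    return dict(totals)
```

Queries against the rolled-up table then touch one row per hour instead of one row per event.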
The Uber engineering team discusses their Presto architecture and how they created an open source Parquet reader that uses memory and CPU more efficiently. Several of their performance and efficiency strategies are discussed, including skipping unnecessary data reads, reading columns vs. rows, and lazy reads. If you use Presto - or Presto with Hadoop - it’s worth checking this out.
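To sketch the lazy-reads idea in miniature (a toy model, not Uber’s actual reader): read only the predicate column up front, and materialize the remaining columns only for rows that match.

```python
def lazy_scan(columns, predicate_col, predicate):
    """columns: dict mapping column name -> list of values (columnar layout).
    Only predicate_col is read eagerly; other columns are touched
    only for rows that pass the predicate."""
    mask = [predicate(v) for v in columns[predicate_col]]
    if not any(mask):
        return []  # nothing matched: skip reading the other columns entirely
    return [
        {name: vals[i] for name, vals in columns.items()}
        for i, keep in enumerate(mask) if keep
    ]
```

With selective predicates, whole row groups are skipped without ever decoding the non-predicate columns, which is where the memory and CPU savings come from.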