Tiered Locality is a feature led by my colleague Andrew Audibert at Alluxio. This article dives into the details of how tiered locality helps provide optimized performance and lower costs. 1. Data Locality and Tiered Locality corresponds to the effort required for an application to read the data it needs. The closer or more localized the data is, the faster the application can retrieve it. As a configuration optimization, it provides significant value in big data workloads on distributed systems and enables higher performance without needing to scale more resources. In most cases, data locality is about colocating the data and application on the same node. However, in the cloud, where there are multiple networking layers, there are more tiers of locality. Data locality Tiered locality allows a user to configure data placement policies to accommodate the cluster's network topology in order to achieve performance and cost optimizations. 2. Tiered Locality in Alluxio OSS Tiered locality uses awareness of network topology and configurable policies to manage data placement for performance and cost optimizations. This feature is particularly useful with cloud deployments across multiple availability zones. It can also be useful for cost savings in environments where cross-zone or cross-location traffic is more expensive than intra-zone data traffic. Here is a simple scenario where Alluxio can use network topology information to bias towards more local reads and writes with Alluxio workers in two different AWS Availability Zones. Using this setup with EC2 instances, the application demonstrates different read performance depending on which worker the data is read from. m5.xlarge Unsurprisingly, performance is fastest when reading from the local Alluxio worker and slows when read from a non-local worker. The performance difference between worker 2 and worker 3 is due to the difference in bandwidth between Availability Zones (AZs). Worker 2 is in the same AZ as the application with about 10 gigabits per second of bandwidth. Reading from worker 3 is slower because the bandwidth across AZs is only about 5 gigabits per second. Without tiered locality, the application is just as likely to read from either worker 2 or 3. Configuring tiered locality gives a preference for worker 2 for faster performance. The situation is similar for writing data. When applications write data through Alluxio, there is a preference for more-local workers. 3. Configuration To enable tiered locality and the associated performance benefit, every actor (clients and workers) must be configured to know its . Tiered identity is a mapping from locality tier (e.g. Availability Zone) to the value for that tier (e.g. us-east-1a). For the above cluster setup example, the tiered identities would be: tiered identity Configure with alluxio-site.properties The most straight forward way to configure tiered locality is to use Please refer to this : alluxio-site.properties configuration setting page Properties for Application, Worker 1, and Worker 2 alluxio-site.properties alluxio.locality.az= alluxio.locality.order= # custom locality hierarchy "us-east-1a" "node,az" Properties for Worker 3 and Worker 4 alluxio-site.properties alluxio.locality.az= alluxio.locality.order= # custom locality hierarchy "us-east-1b" "node,az" We set to introduce the locality tier and show its order in the locality hierarchy. By default the locality tiers are . Note that we don't need to explicitly configure identity because it is determined automatically via localhost lookup. alluxio.locality.order az node,rack node Configure with alluxio-locality.sh When the cluster is set up automatically or there are many workers, it can be convenient to set the locality information via script instead of using a static value in . If a script exists at , it will be executed to determine tiered identity. alluxio-site.properties ${ALLUXIO_HOME}/conf/alluxio-locality.sh alluxio-locality.sh #!/bin/bash echo "az=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)" The same script can be set on every node to look up and report the availability zone. The script name can be configured by setting the property key. alluxio.locality.script 4. Custom Tiers The example shown here uses the tiers and , but the tier configuration is fully customizable. Readers can use whatever tiers make sense for their deployment, e.g. , , , etc. as long as each region is contained within the next in , since locality decisions prefer to match in the earliest tier possible. node az rack zone region alluxio.locality.order For more information and docs on Alluxio Tiered Locality, Alluxio community users can get more information . It is also worth mentioning that Alluxio Enterprise Edition adds an additional feature: strict locality tiers. Users can define a tier to be “strict” meaning that no traffic is allowed between actors that don’t match in the tier. For more information on Cluster Partitioning with Strict Locality, Alluxio Enterprise Edition users can refer to the . here docs here 5. Conclusion In big data workloads on distributed systems, data locality helps engineers to get around the limitations due to network traffic. In the cloud, when clusters have non-uniform networking capabilities, tiered locality adds significant value in improving performance as well as saving costs. With its ability to intelligently aggregate and manage data in different environments, tiered locality is one of many capabilities that Alluxio can perform as a virtual data layer. For more information and discussion, I encourage users to dive deeper into . Alluxio open source community This article was previously published on Alluxio’s engineering blog and Dzone .