Multi-Region R2 Bucket System

May 26, 2024

I want to build a system for downloading large files cheaply from anywhere in the world. Cloudflare R2 is the best option I've found, but there's a problem: each bucket stores its files in a single region. If the bucket is in EEUR (Eastern Europe), for example, a download from Singapore has to go all the way to Vienna, with obvious speed problems. The solution is to upload the same file to buckets in different regions around the world to reduce download times.

Cloudflare provides 5 regions:

[
  { "name": "EU East - Vienna", "shortName": "EEUR" },
  { "name": "EU West - Dublin", "shortName": "WEUR" },
  { "name": "US East - Washington D.C.", "shortName": "ENAM" },
  { "name": "US West - Los Angeles", "shortName": "WNAM" },
  { "name": "Asia Pacific - Singapore", "shortName": "APAC" }
]

With a copy in every region, each user can download a large file from a nearby bucket at the best available speed.

How do I manage the Download System?

First, let's go over how the code works to create a download using Cloudflare Workers.

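const object = await bucket.get(objectName, {
  range: request.headers,
  onlyIf: request.headers,
});

// bucket.get() resolves to null when the key does not exist.
if (object === null) {
  return new Response("Object Not Found", { status: 404 });
}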

const headers = new Headers();
object.writeHttpMetadata(headers);
headers.set("etag", object.httpEtag);

if (object.range) {
  headers.set(
    "content-range",
    `bytes ${object.range.offset}-${object.range.end ?? object.size - 1}/${
      object.size
    }`
  );
}

const status = object.body
  ? request.headers.get("range") !== null
    ? 206
    : 200
  : 304;

return new Response(object.body, {
  headers,
  status,
});

However, this code has a major problem: the file we want to download is taken from a previously selected bucket without considering where the request is coming from in order to optimize download times.

The solution was to implement a system to download from the nearest bucket. This is calculated using the distance between geographic coordinates that are provided to us by Cloudflare.

Obviously, this system is not precise because the coordinates are calculated by Cloudflare based on the IP address and therefore do not provide the exact location. Additionally, we do not know the exact location of the bucket but only the region in which it is located. However, this system provides us with a simple way to find the nearest bucket.
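As a rough sketch of where the user's coordinates can come from: Workers expose the IP-based geolocation on request.cf, where latitude and longitude arrive as strings. The helper name getUserLocation is hypothetical, and the { latitude, longitude } shape is an assumption chosen to match what geolib accepts.

```javascript
// Hypothetical helper: builds a { latitude, longitude } object from the
// IP-based geolocation that Cloudflare attaches to the request as request.cf.
// Returns null when the coordinates are missing or not numeric.
const getUserLocation = (request) => {
  const cf = request.cf;
  if (!cf || cf.latitude == null || cf.longitude == null) {
    return null;
  }
  const latitude = Number.parseFloat(cf.latitude);
  const longitude = Number.parseFloat(cf.longitude);
  if (Number.isNaN(latitude) || Number.isNaN(longitude)) {
    return null;
  }
  return { latitude, longitude };
};
```

Returning null instead of throwing lets the caller fall back to a default bucket when the location is unavailable.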

The following two functions work together to find the position nearest to the userLocation.

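import { getDistance } from "geolib";

// Calculates the distance in meters between the user's location and a given position.
const calculateDistance = (userLocation, position) => {
  if (!userLocation || !position) {
    // Infinity (rather than null) keeps a missing position from ever winning the comparison below.
    return Infinity;
  }
  return getDistance(userLocation, position);
};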

// Finds the nearest position (R2 bucket) to the user's location.
const findNearestPosition = (userLocation, positions) => {
  if (!userLocation || positions.length === 0) {
    return null;
  }

  const distances = positions.map((position) => ({
    ...position,
    distance: calculateDistance(userLocation, position),
  }));

  const nearestPosition = distances.reduce((minPosition, current) =>
    current.distance < minPosition.distance ? current : minPosition
  );
  return nearestPosition.env;
};

The first function calculates the actual distance between the userLocation and each bucket's position using the geolib library (check here for more info about this package).

The second function, which is the more important one, iterates through the positions of the various buckets in different regions and returns the closest bucket.
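To make the selection concrete, here is a dependency-free sketch. It swaps geolib's getDistance for a plain haversine formula, and both the coordinates and the binding names (BUCKET_EEUR and so on) are assumptions for illustration: they are approximate city centers, not the real locations of the buckets.

```javascript
// Assumed positions: approximate city coordinates for each region, plus a
// hypothetical name for the R2 binding that serves that region.
const positions = [
  { shortName: "EEUR", env: "BUCKET_EEUR", latitude: 48.2082, longitude: 16.3738 },
  { shortName: "WEUR", env: "BUCKET_WEUR", latitude: 53.3498, longitude: -6.2603 },
  { shortName: "ENAM", env: "BUCKET_ENAM", latitude: 38.9072, longitude: -77.0369 },
  { shortName: "WNAM", env: "BUCKET_WNAM", latitude: 34.0522, longitude: -118.2437 },
  { shortName: "APAC", env: "BUCKET_APAC", latitude: 1.3521, longitude: 103.8198 },
];

// Great-circle distance in meters (haversine formula), used here as a
// stand-in for geolib's getDistance.
const haversineDistance = (a, b) => {
  const toRadians = (degrees) => (degrees * Math.PI) / 180;
  const earthRadius = 6371000; // mean Earth radius in meters
  const dLat = toRadians(b.latitude - a.latitude);
  const dLon = toRadians(b.longitude - a.longitude);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(a.latitude)) *
      Math.cos(toRadians(b.latitude)) *
      Math.sin(dLon / 2) ** 2;
  return 2 * earthRadius * Math.asin(Math.sqrt(h));
};

// Returns the binding name of the position closest to the user, or null.
const nearestBinding = (userLocation, candidates) => {
  if (!userLocation || candidates.length === 0) {
    return null;
  }
  let best = candidates[0];
  let bestDistance = haversineDistance(userLocation, best);
  for (const candidate of candidates.slice(1)) {
    const distance = haversineDistance(userLocation, candidate);
    if (distance < bestDistance) {
      best = candidate;
      bestDistance = distance;
    }
  }
  return best.env;
};

// A user geolocated in Tokyo, and one geolocated in Paris.
const tokyo = { latitude: 35.6762, longitude: 139.6503 };
const paris = { latitude: 48.8566, longitude: 2.3522 };
console.log(nearestBinding(tokyo, positions), nearestBinding(paris, positions));
```

With these assumed coordinates, the Tokyo user is routed to the Singapore bucket and the Paris user to the Dublin one.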

How do I manage Authentication?

The system described so far has another problem: anyone can download the files whenever and however they want. Let's look at an example of authentication, so that we control when and how our files are downloaded.

const token = request.headers.get("Authorization");
if (token === null) {
  return new Response("Missing Authorization header", { status: 401 });
}

const tokenParts = token.split(" ");
if (tokenParts.length !== 2 || tokenParts[0] !== "Bearer") {
  return new Response("Invalid Authorization header", { status: 401 });
}

// Reject the request unless the token is in the "tokens" set in Upstash Redis.
if (!(await redis.sismember("tokens", tokenParts[1]))) {
  return new Response("Invalid Token", { status: 401 });
}

For this example, I have decided to use Upstash Redis because it is easy to integrate with Cloudflare, but you can use your preferred database. We check whether the request has an Authorization header and whether the token it carries is present in our database.
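The same check can be sketched as two small helpers, which makes the logic easier to test in isolation. Both names are hypothetical, and store is assumed to be any client whose sismember(key, member) resolves to 1 or 0, as the Upstash Redis client does.

```javascript
// Hypothetical helper: extracts the token from an "Authorization: Bearer <token>"
// header value. Returns null when the header is missing or malformed.
const parseBearerToken = (headerValue) => {
  if (typeof headerValue !== "string") {
    return null;
  }
  const parts = headerValue.trim().split(/\s+/);
  if (parts.length !== 2 || parts[0] !== "Bearer") {
    return null;
  }
  return parts[1];
};

// Hypothetical helper: resolves to true only when the token is in the "tokens" set.
const isAuthorized = async (store, headerValue) => {
  const token = parseBearerToken(headerValue);
  if (token === null) {
    return false;
  }
  return (await store.sismember("tokens", token)) === 1;
};
```

In the Worker, a false result from isAuthorized would translate into the 401 responses shown above.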

How do I manage the Upload System?

On the Worker side, uploads are a bit more complicated, but the important thing to understand is that the file is uploaded in multipart mode: it is split into parts of a chosen size, and each part is sent in its own request. This way, we never send a file larger than 50 megabytes to the Worker in a single request, which is not a good idea and which Cloudflare's request size limits don't allow anyway.
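To illustrate the splitting step, here is a sketch that computes the byte range of each part. planParts and its parameters are hypothetical names, and the 5 MiB minimum part size is an assumption based on R2's multipart limits as I understand them, so check the current limits before relying on it.

```javascript
const MIN_PART_SIZE = 5 * 1024 * 1024; // assumed minimum part size (5 MiB)

// Hypothetical helper: returns the byte range of every part of a multipart
// upload. Every part has partSize bytes except the last, which may be smaller.
const planParts = (totalSize, partSize) => {
  if (!Number.isInteger(totalSize) || totalSize <= 0) {
    throw new RangeError("totalSize must be a positive integer");
  }
  if (!Number.isInteger(partSize) || partSize < MIN_PART_SIZE) {
    throw new RangeError("partSize must be an integer of at least 5 MiB");
  }
  const parts = [];
  let partNumber = 1; // part numbers start at 1
  for (let start = 0; start < totalSize; start += partSize) {
    const end = Math.min(start + partSize, totalSize); // exclusive end offset
    parts.push({ partNumber, start, end });
    partNumber += 1;
  }
  return parts;
};

// A 25 MiB file split into 10 MiB parts gives two full parts and a 5 MiB tail.
const MiB = 1024 * 1024;
console.log(planParts(25 * MiB, 10 * MiB));
```

Each { start, end } pair can then be used to slice the file and send that slice as one part.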

The important part of the upload Worker is knowing which bucket to upload the file to. Each upload request targets a single bucket, so writing to multiple buckets is handled "client-side" by a script that uploads to each bucket one at a time.

const serverName = request.headers.get("X-Bucket-Name");
if (serverName === null) {
  return new Response("Missing server name", {
    status: 400,
  });
}

// Look up the requested bucket in the list of known positions.
const server = positions.find((position) => position.shortName === serverName);
if (server === undefined) {
  return new Response(`Unknown server ${serverName}`, {
    status: 400,
  });
}

In this code, we read the bucket name chosen by the upload request and verify that it exists in the list of buckets shown at the beginning of this article.

Now, "client-side," we just need to send an upload request along with the server name. For this task, I wrote a Python script that efficiently handles the upload of a file to multiple buckets. For more information, check here.
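The idea behind the client can be sketched in JavaScript as well. This is not the Python script itself: the token and the uploadFile callback are hypothetical, and only the X-Bucket-Name header comes from the Worker code above.

```javascript
const REGIONS = ["EEUR", "WEUR", "ENAM", "WNAM", "APAC"];

// Builds the headers for an upload aimed at one bucket. Only X-Bucket-Name
// comes from the Worker shown earlier; the bearer token is an assumption.
const buildUploadHeaders = (shortName, token) => ({
  Authorization: `Bearer ${token}`,
  "X-Bucket-Name": shortName,
});

// Uploads the same file to every region, one bucket at a time.
// uploadFile(headers) is a hypothetical callback that performs the actual
// (multipart) upload and resolves when it is done.
const uploadToAllRegions = async (uploadFile, token, regions = REGIONS) => {
  const results = [];
  for (const shortName of regions) {
    try {
      await uploadFile(buildUploadHeaders(shortName, token));
      results.push({ shortName, ok: true });
    } catch (error) {
      // Keep going so one failing region does not block the others.
      results.push({ shortName, ok: false, error: String(error) });
    }
  }
  return results;
};
```

Collecting a result per region, instead of stopping at the first error, makes it easy to retry only the buckets that failed.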

What is the cost of all this?

The cost of this project comes almost entirely from Cloudflare, which is free for our usage as long as we stay within the free-tier limits: 100k requests per day for Cloudflare Workers and, for R2, 1 million Class A (write) operations and 10 million Class B (read) operations per month. Keep in mind that the R2 limits are counted per account, so spreading downloads across multiple buckets does not raise them.

The only real cost of this project is storage: the first 10 GB per month are free, and beyond that it costs $0.015 per GB-month. This is quite affordable compared to many other options, at about $15 per month for a terabyte of space. There's a significant difference compared to AWS S3, where the same usage could cost up to $500 per month once data transfer out is included, since R2 does not charge for egress. For more information, check the official Cloudflare page here.
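As a quick check of the arithmetic, here is a sketch that uses only the two figures quoted above (10 GB free, then $0.015 per GB-month). The function name and the copies parameter are additions for illustration; copies reflects the assumption that a file replicated to several buckets is stored, and billed, once per bucket.

```javascript
const FREE_STORAGE_GB = 10; // free allowance per month
const PRICE_PER_GB_MONTH = 0.015; // USD, from the figures above

// Estimated monthly storage cost in USD, rounded to cents. `copies` is the
// number of buckets holding the same data.
const monthlyStorageCost = (sizeGB, copies = 1) => {
  const billableGB = Math.max(0, sizeGB * copies - FREE_STORAGE_GB);
  return Math.round(billableGB * PRICE_PER_GB_MONTH * 100) / 100;
};

console.log(monthlyStorageCost(1000)); // one copy of 1 TB
console.log(monthlyStorageCost(1000, 5)); // the same terabyte in all 5 regions
```

One terabyte in a single bucket comes to just under $15, while keeping it in all five regions comes to roughly five times that.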

I believe this system I've created is the most convenient, fast, and cost-effective solution I could find. There might be a better solution for the "mess" I had to create for this project, but this one works well for our needs.

For more information on how to use this project, consult my GitHub repository at this link.