Friday, January 11, 2008

Making S3 Folders In Ruby

Amazon's S3 storage service is a really cool way to store and serve up massive quantities of data. Interestingly, it is more like a database than a file system. Despite the convenience of referring to an object like "/path/to/thing.jpg", S3 does not have any separate objects for "/path" or "/path/to". In fact, neither "/path" nor "/path/to" even exists.

This is why, when you store an S3 object, the name it is given is called a "key" instead of a "filename". It functions like a database key, returning a file. However, as "ultra hipster Web 2.0" as this may be, it is still convenient to browse a large collection of files using a familiar directory/sub-directory layout.

In fact, all of the S3 utilities that I use allow you to create a folder inside of a bucket, and sub-folders inside of that etc. But I did not see any programmatic facility for creating a folder independent of any object inside of S3. If you have been paying attention, you remember that in S3 there is no such thing as a directory! So how do these nice GUI browsers do it?

I did some Googling, but had no luck beyond claims that "it couldn't be done". Since I had two different programs that already did it, I knew THAT wasn't the case. It took a little old-fashioned investigation... I looked directly at the output from a query to a bucket that already had the little folders in it.

Lo and behold, it turns out they cheat by creating a specially named object for each "directory". For a directory named "/path", you would create an object with the key "path_$folder$", and for a directory named "/path/to", you create an object with the key "path/to_$folder$". Then to get a directory listing for "/path" you just do a query on S3 for all objects whose key starts with "/path". Ignore any objects whose key ends with "_$folder$" and there you have it: S3 folders.
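
The convention can be tried out in plain Ruby without talking to S3 at all. This is only a sketch: the helper names folder_marker_keys and list_folder are made up for illustration and are not part of any S3 library, and the "query" is simulated over an in-memory array of keys. The listing prefix is given a trailing slash so that a sibling key such as "pathfinder.txt" is not picked up by accident.

```ruby
FOLDER_SUFFIX = '_$folder$'

# Marker keys needed for every enclosing "folder" of an object key.
def folder_marker_keys(key)
  parts = key.split('/').reject { |part| part.empty? }
  parts.pop # the last segment is the object itself, not a folder
  parts.each_index.map { |i| parts[0..i].join('/') + FOLDER_SUFFIX }
end

# Simulated listing: keys under the folder, with the marker objects filtered out.
def list_folder(all_keys, folder)
  prefix = folder.sub(%r{\A/}, '').chomp('/') + '/'
  all_keys.select { |k| k.start_with?(prefix) && !k.end_with?(FOLDER_SUFFIX) }
end

keys = ['path_$folder$', 'path/to_$folder$', 'path/to/thing.jpg', 'pathfinder.txt']
p folder_marker_keys('/path/to/thing.jpg')
p list_folder(keys, '/path')
```

Against a real bucket the select would be replaced by a prefix query, but the filtering on the marker suffix stays the same.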

I decided that it would be nice if the aws/s3 gem would support this foldering the same way that copying a file within a file system does: if the enclosing directories do not exist, they are created before the file copy.

Thanks to the beauty of modern dynamic languages, I was easily able to put together a little monkeypatch for the aws/s3 gem's S3Object class that handles this.

# This is an extension to S3Object that supports the emerging 'standard' for virtual folders on S3.
# For example:
#   S3Object.store('/folder/to/greeting.txt', 'hello world!', 'ron', :use_virtual_directories => true)
# This will create an object in S3 that mimics a folder, as far as the S3 GUI browsers like
# the S3 Firefox Extension or Bucket Explorer are concerned.
require 'aws/s3'

module AWS
  module S3
    class S3Object
      class << self

        alias :original_store :store

        # Store the object, creating the folder markers first when :use_virtual_directories is set.
        def store(key, data, bucket = nil, options = {})
          store_folders(key, bucket, options) if options[:use_virtual_directories]
          original_store(key, data, bucket, options)
        end

        # Create a marker for each enclosing folder of the key, outermost first.
        def store_folders(key, bucket = nil, options = {})
          folders = key.split("/").reject { |part| part.empty? } # skip the empty segment before a leading "/"
          folders.pop # the last segment is the object itself, not a folder
          current_folder = "/"
          folders.each do |folder|
            current_folder += folder
            store_folder(current_folder, bucket, options)
            current_folder += "/"
          end
        end

        def store_folder(key, bucket = nil, options = {})
          original_store(key + "_$folder$", "", bucket, options) # store the magic entry that emulates a folder
        end

      end
    end
  end
end
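
To see which keys a store call with :use_virtual_directories would write, without touching S3, the same alias-and-wrap pattern can be run against a stand-in class. FakeS3Object below is hypothetical: it is not part of the aws/s3 gem and only records the keys it is given. The folder loop here assumes that empty path segments and the object name itself should not get markers.

```ruby
# Hypothetical stand-in for AWS::S3::S3Object: records keys instead of calling S3.
class FakeS3Object
  STORED = []

  def self.store(key, data, bucket = nil, options = {})
    STORED << key
  end
end

# Reopen the class and wrap store, keeping the original reachable under an alias.
class FakeS3Object
  class << self
    alias_method :original_store, :store

    def store(key, data, bucket = nil, options = {})
      store_folders(key, bucket, options) if options[:use_virtual_directories]
      original_store(key, data, bucket, options)
    end

    def store_folders(key, bucket = nil, options = {})
      folders = key.split('/').reject { |part| part.empty? }
      folders.pop # drop the object name; only its parents are folders
      current_folder = '/'
      folders.each do |folder|
        current_folder += folder
        original_store(current_folder + '_$folder$', '', bucket, options)
        current_folder += '/'
      end
    end
  end
end

FakeS3Object.store('/folder/to/greeting.txt', 'hello world!', 'ron', :use_virtual_directories => true)
puts FakeS3Object::STORED
```

The recorded keys are the two folder markers followed by the object itself, in the order they would be sent.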

Sure, you can have the best of both worlds: massive virtual storage, and a convenient directory-like structure. And now you can have it with your favorite Ruby S3 library.

Happy storage!


chrtest said...

>Then to get a directory listing for
>"/path" you just do a query on S3 for
>all object whose key starts with
Actually you need to ask for keys starting with "/path/", i.e. including a trailing slash. Otherwise you'll also get e.g. "/pathfinder".

That is why I don't understand this convention. I would have preferred to call the dummy file "path/$folder$". In this way it would work much better with the prefix and delimiter parameters. Don't you agree?

Ron Evans said...

Yes, the convention for the dummy "folder" is rather odd. Not to mention not really documented anywhere, which certainly doesn't help!

justinjas said...

I've been looking at this convention for a program I am writing using S3. I can say the reason they don't use the path/$folder$ convention you mentioned is because if you use the delimiter functionality in amazon then you won't know if the key is a folder without running another query on every key you get back.

Meaning if you do a query on /root/ using / as the delimiter then you'll get back File1, File2, Folder1, Folder2 but you won't know which ones are files and which are folders without doing 4 more queries. With this path_$folder$ way you'll get back File1, File2, Folder1_$folder$, Folder2_$folder$ so it's easy to identify the folders without more hits to S3.

vinay said...

Hi Ron Evans,
I think your monkey patch is the thing I was looking for to integrate S3. But I am in a fix here. Could you be patient enough to guide me as to where I need to apply the monkey patch? I think copying and pasting it into my gem would not be advisable, as the patch would be lost when I update the gem.

It would be nice if you could elaborate on where the code is to be used, for beginners like me. Moreover, I would also appreciate it if you could post an example of retrieving a file from an S3 virtual folder.


Pavan said...

If you are using Rails, then including the monkey patch in one of the files in the library (lib) directory would be enough.

Bill said...

Way cool! Thanks for creating this!