6 June 2008
Idle cores to the left of me, race conditions to the right
2 June 2008
Object-Oriented Reengineering Patterns
1 June 2008
Distributed File Systems
One of the things I have been working on is the basic problem of how to get 50 boxes (the current number), each with 2TB of disk, to work together in a somewhat unified way. CPU, in our case, is a solved problem: we can use all those CPUs. What's not solved is the disks.
In some senses this is just like an office. Everyone has a PC on their desk with a load of unused disk space inside, and yet on the shared file servers one is short of disk. Collectively one has terabytes of free disk, but no one can really use it.
What would be quite interesting would be to design a system where nodes can be added and removed and the available disk storage grows or shrinks accordingly. The other cool part of this is that the I/O is distributed across N nodes and N disks rather than being concentrated on one (or a few) network interfaces. The idea also comes from Andrew Hume's talk at the last Lisa conference, No Terabyte Left Behind.
In other words, instead of one big filesystem on some sort of RAID on a SAN (expensive) or a NAS (less expensive), or 10s or 100s of shares on different systems managed separately (a royal pain), files would be distributed across multiple nodes, with parts of each file on different nodes. A cloud of storage, as it were.
I *think* you could do this as follows:
- You have a list of nodes which are part of the storage cluster. This list is append-only. If a node breaks you do have to replace it, but not too quickly.
- You define a replication factor, say 3 or so. This means that as long as all 3 nodes holding a given block don't fail, all of your data can be read back.
- Files are stored in blocks of some size, say 8k.
- Each read of an 8k block hashes the path to the file, the name of the file, and the block number into the list of nodes to find a node to read from. The file which will be read on that node is named by the hash, split in some way into a directory structure to keep the number of files per directory down. You take the node you hashed to and the next replication-factor nodes, and choose one of them. Each block has a checksum/hash/whatever to make sure that what you read back is what you wrote before. If they don't compare, you read another of the replicas and replace the bad copy. As part of this, the metadata stored in the directory entry has to include the size of the hash table when the file was written; otherwise the hash won't work right as the hash table grows.
- As part of the file name, hidden by default, we'll have a version number. This means you keep old versions of a file. It also makes writing a new version a bit less critical, since the old version will still exist.
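The block-placement scheme above can be sketched in Python. Note the assumptions: the post doesn't fix a hash function or an on-disk naming scheme, so SHA-256 and the two-level directory split below are illustrative choices, and `candidate_nodes`/`block_path` are hypothetical names.

```python
import hashlib

def candidate_nodes(nodes, path, block_no, table_size, replication=3):
    """Return the replica candidates for one block of one file.

    table_size is the length of the node list at the time the file was
    written (kept in the file's metadata), so that appending new nodes
    later doesn't change where old blocks hash to.
    """
    key = f"{path}:{block_no}".encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    first = h % table_size
    # the node hashed to, plus the next replication-1 nodes, wrapping around
    return [nodes[(first + i) % table_size] for i in range(replication)]

def block_path(path, block_no):
    """On-node file name for a block: the hash, split into
    subdirectories to keep the number of files per directory down."""
    digest = hashlib.sha256(f"{path}:{block_no}".encode()).hexdigest()
    return f"{digest[:2]}/{digest[2:4]}/{digest[4:]}"
```

The same (path, block number, table size) always hashes to the same candidates, which is what makes lookups possible without any central index.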
Reading files is now split across the different nodes. If a node doesn't respond, you choose one of the others within the replication factor. If none respond, the block is lost. If a node doesn't have a block, or if the checksum is wrong, it tries to fetch it from a node which should have it, since it has the hash anyway. This allows replacing a failed system with a new node which, over time, will be populated with the blocks that are read.
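The read-with-fallback behaviour might look like this sketch. `fetch(node)` is a hypothetical stand-in for the real network call, and SHA-256 stands in for whatever checksum is chosen:

```python
import hashlib

def read_block(candidates, fetch, expected_checksum):
    """Try each replica in turn and verify what comes back.

    fetch(node) returns the block's bytes, or None if the node is
    down or doesn't have the block.
    """
    for node in candidates:
        data = fetch(node)
        if data is None:
            continue  # node down or block missing: try the next replica
        if hashlib.sha256(data).hexdigest() == expected_checksum:
            return data
        # checksum mismatch: this copy is bad, fall through to next replica
    return None  # no replica returned a good copy: the block is lost
```

A dict of node names to stored bytes is enough to exercise it; in the real system the repair step (re-fetching a bad or missing copy from a healthy replica) would hang off the mismatch branch.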
Writing is a bit more involved. You hash just as for reading, and write to the node the hash points to. That node is responsible for writing to the other replication-factor nodes, and the write is successful only if all of them respond. Since you only write new files, though, a write failure is less critical: no existing file is corrupted. As you add nodes to the end of the hash table, only new files are stored on them.
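The write rule, sketched: the hashed-to node fans the block out and reports success only if every replica acknowledges. `store(node, data)` is a hypothetical stand-in for the real RPC:

```python
def write_block(replicas, store, data):
    """Write the block to every replica node.

    store(node, data) returns True on success.  The write succeeds
    only if all replicas do; since files are only ever written as new
    versions, a failed write leaves no existing file corrupted.
    """
    results = [store(node, data) for node in replicas]  # attempt all replicas
    return all(results)
```

This is why writing needs basically all the replica nodes up, where reading only needs one of them.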
With the above design, reading should be spread evenly over all the nodes. Writing is a bit slower, since basically all the replica nodes have to be up.
This is not meant for a work area which changes often, but for files which change less often. I.e., you save your Word doc to your local node and then copy it into the cloud, as it were. It also assumes that the folks you're writing to are friendly, i.e., part of your company. If you wanted this to work with random nodes across the internet, then you'd have to add encryption, etc., but that's not the problem I'm trying to solve.
Normal file system tasks which become hard are things such as:
- Free space. This is hard to calculate. As a first approximation you can sum the free space on all the nodes and divide by the replication factor, but that's a bit optimistic because of the hashing.
- Freeing unused blocks. Since you don't overwrite files but write new files, you want old files to go away at some point. Deleting is a bit like writing all the blocks of the file (without the actual write). If you didn't mind space leakage, you could relax the requirement that all the nodes have to be up, as for writing. You'd have to go through and purge old versions. And yes, I did work on VMS.
- You could probably write some sort of mark-and-sweep garbage collector to really free up space, but boy would that be slow.
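The free-space first approximation is a one-liner; the per-node byte counts are assumed to come from each node reporting its own local free space:

```python
def estimated_free_space(free_bytes_per_node, replication=3):
    """Optimistic estimate of usable space in the cluster.

    Every byte written costs `replication` bytes of raw disk, so
    divide the summed free space by that factor.  Hashing won't fill
    the nodes perfectly evenly, so the real figure is somewhat lower.
    """
    return sum(free_bytes_per_node) // replication
```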
Could the above work for Twitter-like systems? Probably. They would seem to be read bound, not write bound. If necessary the writes could be queued, but the reads would be spread across many, many systems and many, many HDUs.
Last modified: Thu Apr 5 13:42:07 MET 2007