pckswarms.ch - interesting uncommon events

6 June 2008

Concurrent systems

Idle cores to the left of me, race conditions to the right

2 June 2008


Object-Oriented Reengineering Patterns

1 June 2008

Distributed File Systems

One of the things I have been working on is the basic problem of how to get 50 boxes (the current number), each with 2TB of disk, to work together in a somewhat unified way. CPU, in our case, is a solved problem: we can use all those CPUs. What's not solved is the disks.

In some senses this is just like an office. Everyone has a PC on their desk with a load of unused disk space inside, and yet the shared file servers are short of disk. Collectively there are terabytes of free disk, but no one can really use them.

What would be quite interesting would be to design a system where nodes could be added and removed, and the available disk storage grows or shrinks accordingly. The other cool part of this is that the I/O is distributed across N nodes and N disks rather than being concentrated on one (or a few) network interfaces. This also comes from an idea in Andrew Hume's talk at the last LISA conference, No Terabyte Left Behind.

In other words, instead of N big filesystems on some sort of RAID on a SAN (expensive) or a NAS (less expensive), or 10s or 100s of shares on different systems managed separately (a royal pain), files would be distributed across multiple nodes, with parts of each file on different nodes. A cloud of storage, as it were.

I *think* you could do this as follows:
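The basic placement idea described below can be sketched in a few lines: split each file into fixed-size blocks, hash each block, and let the hash pick a slot in the node table, with the next few nodes holding the copies. The block size, node names, and replication factor here are my assumptions, not anything from a real system:

```python
import hashlib

# Hypothetical parameters: block size, node names, and replication
# factor are assumptions for the sketch.
BLOCK_SIZE = 64 * 1024                         # split files into fixed-size blocks
NODES = [f"node{i:02d}" for i in range(50)]    # the 50 boxes mentioned above
REPLICATION = 3                                # copies kept of each block

def blocks(data: bytes):
    """Yield (block_hash, block_bytes) pairs for a file's contents."""
    for off in range(0, len(data), BLOCK_SIZE):
        chunk = data[off:off + BLOCK_SIZE]
        yield hashlib.sha1(chunk).hexdigest(), chunk

def replica_nodes(block_hash: str):
    """Map a block hash to its primary node plus replicas.

    The hash picks a slot in the node table; the next REPLICATION - 1
    nodes in the table hold the extra copies.
    """
    slot = int(block_hash, 16) % len(NODES)
    return [NODES[(slot + i) % len(NODES)] for i in range(REPLICATION)]
```

Because the block hash alone determines where a block lives, any node can locate any block without asking a central metadata server.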

Reading files is now split across the different nodes. If a node doesn't respond, you choose one of the others in the replication set. If none respond, the block is lost. If a node doesn't have a block, or the checksum is wrong, it fetches the block, since it has the hash anyway, from a node which should have it. This allows replacing a failed system with a new node and, over time, the new node will be populated with the blocks which are read.
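The read path above can be sketched as: try each replica in turn, verify each answer against the block's hash, and fall back to the next replica on timeout or checksum mismatch. `fetch` is a hypothetical transport call, and the node table and replication factor are assumptions:

```python
import hashlib

NODES = [f"node{i:02d}" for i in range(50)]    # assumed node table
REPLICATION = 3                                # assumed replication factor

def replicas(block_hash: str):
    """The hash picks a slot; the next REPLICATION nodes hold copies."""
    slot = int(block_hash, 16) % len(NODES)
    return [NODES[(slot + i) % len(NODES)] for i in range(REPLICATION)]

def read_block(block_hash: str, fetch):
    """Read one block, trying each replica in turn.

    `fetch(node, block_hash)` is a hypothetical transport function that
    returns the block's bytes, or None if the node doesn't respond.
    """
    for node in replicas(block_hash):
        data = fetch(node, block_hash)
        if data is None:
            continue                                 # node down: try the next replica
        if hashlib.sha1(data).hexdigest() == block_hash:
            return data                              # checksum matches: good copy
        # bad checksum: in the full design the node would re-fetch
        # a clean copy from a peer here; we just try the next replica
    return None                                      # no replica had a good copy: lost
```

Self-verification falls out of content addressing: since the block's name is its hash, a reader can always tell a corrupt copy from a good one.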

Writing is a bit more involved. You hash just as for reading, and write to the node which the hash points to. That node is responsible for writing to the other replication-factor nodes, and the write is successful only if all of them respond. Since you only ever write new files, though, a write failure is less critical: an existing file is never corrupted. As you add nodes to the end of the hash table, only new files are stored on them.
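The write path can be sketched the same way: hash the block, send it to every node in its replica set, and report success only if all of them acknowledge. `store_on` is a hypothetical call, and the node table and replication factor are again assumptions:

```python
import hashlib

NODES = [f"node{i:02d}" for i in range(50)]    # assumed node table
REPLICATION = 3                                # assumed replication factor

def replicas(block_hash: str):
    """The hash picks a slot; the next REPLICATION nodes hold copies."""
    slot = int(block_hash, 16) % len(NODES)
    return [NODES[(slot + i) % len(NODES)] for i in range(REPLICATION)]

def write_block(data: bytes, store_on):
    """Write one block to its primary node and all replicas.

    `store_on(node, block_hash, data)` is a hypothetical call that
    returns True when that node has stored the block. The write
    succeeds only if every replica acks; since blocks are immutable
    (only new files are written), a failed write never corrupts
    existing data.
    """
    block_hash = hashlib.sha1(data).hexdigest()
    if all(store_on(node, block_hash, data) for node in replicas(block_hash)):
        return block_hash                # caller records the hash in the file's block list
    return None                          # at least one replica was down: report failure
```

Requiring every replica to ack is what makes writes slower than reads here; it trades write availability for the guarantee that a successful write is fully replicated.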

With the above design, reads should be spread evenly over all the nodes. Writes are a bit slower since, basically, all the replica nodes have to be up.

This is not meant for a work area which changes often, but for files which change less often. I.e., you save your Word doc to your local node and then copy it into the cloud, as it were. It also assumes that the nodes you're writing to are friendly, i.e., part of your company. If you wanted this to work across random nodes on the internet, then you'd have to add encryption, etc., but that's not the problem I'm trying to solve.

Normal file system tasks which become hard are things such as:

Could the above work for Twitter-like systems? Probably. They would seem to be read-bound, not write-bound. If necessary the writes could be queued, but the reads would be spread across many, many systems and many, many HDUs.
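Queuing the writes could be as simple as buffering blocks and flushing them when the replicas are reachable, so the read path never waits on a slow write. A toy sketch, where `store` stands in for a hypothetical "write to all replicas" call that returns True on success:

```python
from collections import deque

class QueuedWriter:
    """Buffer writes and flush them later, keeping the read path hot.

    `store(data)` is a hypothetical function that returns True once
    all replicas have acked the block; failed blocks are re-queued
    and retried on the next flush.
    """

    def __init__(self, store):
        self.store = store
        self.pending = deque()

    def write(self, data: bytes):
        self.pending.append(data)        # accept immediately, persist later

    def flush(self):
        failed = deque()
        while self.pending:
            data = self.pending.popleft()
            if not self.store(data):
                failed.append(data)      # a replica was down: retry next flush
        self.pending = failed
```

This works here precisely because writes create only new immutable blocks, so delaying them never leaves a reader looking at a half-updated file.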

Bruce O'Neel

Last modified: Thu Apr 5 13:42:07 MET 2007