One directory, 500,000 files
·——
····
··
—
·····
···——
——···
I have a thing for the filesystem: I really want to use it instead of a relational database. Part of this is wanting to achieve UNIX zen, wanting all my data in the same space, not in some cyber-cyberspace like mysql> or whit537=#. Part of it is wanting to install and admin less software. Part of it is wanting to avoid regexp dispatch hacks: URLs are hierarchical unique ids, the filesystem is hierarchical, what's the problem?
One problem is that the filesystem isn't ACID. Answer: Subversion. Another is that it can't be queried. Working on it.
The question nagging me now is scalability. I know you're not supposed to think about that until it happens, but you're also not supposed to use the filesystem for "real" data storage in the first place. So out of curiosity as much as fear: let's say you use the filesystem for storage, and that as part of your app you have a users/ directory, which contains one file per user. Now let's say you balloon to 500,000 users. How screwed are you?
To find out, I'll start by creating a directory with 500,000 empty files in it. Here we go ...
$ mkdir files
$ cat - > test.py
# create 500,000 empty files named 0 through 499999
for i in range(500000):
    open('files/%d' % i, 'w')
$ python test.py
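For anyone who wants to repeat the experiment and get numbers out of it, here is a minimal sketch of the same loop with timing added. It is not the script I ran: the function name create_empty_files and the idea of returning wall and CPU time are assumptions made for illustration, and it targets Python 3.

```python
import os
import tempfile
import time


def create_empty_files(directory, count):
    """Create `count` empty files named 0..count-1 in `directory`.

    Hypothetical helper: returns (wall_seconds, cpu_seconds) spent
    in this process while creating the files.
    """
    os.makedirs(directory, exist_ok=True)
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    for i in range(count):
        # Open for writing and close right away; the file stays empty.
        with open(os.path.join(directory, str(i)), "w"):
            pass
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


if __name__ == "__main__":
    # Small count in a throwaway directory; raise it to approach 500,000.
    with tempfile.TemporaryDirectory() as tmp:
        wall, cpu = create_empty_files(os.path.join(tmp, "files"), 1000)
        print("wall %.3fs, cpu %.3fs" % (wall, cpu))
```

Running it at a few sizes (say 1,000, 10,000, 100,000) should give a rough idea of whether creation time grows linearly with the number of files on a given filesystem.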
This has been running for 15+ CPU minutes so far. I have about 20 million inodes available on my system, so that won't be the problem. It will get more interesting when I try to load these into Subversion, and then again when I try to query them: ls(1) just isn't going to cut it.
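Two of the points above can be checked from Python: how many inodes are free, and how many entries a big directory holds without leaning on ls(1). This is a rough sketch, assuming Python 3.6+ on a Unix system; free_inodes and count_entries are made-up names, not anything from my setup.

```python
import os


def free_inodes(path):
    """Inodes available to unprivileged users on the filesystem holding `path`.

    Unix only; some filesystems report 0 here because they do not
    use a fixed inode table.
    """
    return os.statvfs(path).f_favail


def count_entries(directory):
    """Count directory entries lazily, without building or sorting a full list."""
    with os.scandir(directory) as entries:
        return sum(1 for _ in entries)


if __name__ == "__main__":
    print("free inodes:", free_inodes("."))
    print("entries here:", count_entries("."))
```

os.scandir yields entries one at a time as the operating system returns them, so nothing has to be held in memory or sorted before the first result comes back. Whether that is fast enough at 500,000 entries is exactly the kind of thing that would need measuring.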
Stay tuned ...
Update: One repo, 27,923 nodes
Feed back to Chad Whitacre.