123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242 |
- The LogFS Flash Filesystem
- ==========================
- Specification
- =============
- Superblocks
- -----------
- Two superblocks exist at the beginning and end of the filesystem.
- Each superblock is 256 Bytes large, with another 3840 Bytes reserved
- for future purposes, making a total of 4096 Bytes.
- Superblock locations may differ for MTD and block devices. On MTD the
- first non-bad block contains a superblock in the first 4096 Bytes and
- the last non-bad block contains a superblock in the last 4096 Bytes.
- On block devices, the first 4096 Bytes of the device contain the first
- superblock and the last aligned 4096 Byte-block contains the second
- superblock.
- For the most part, the superblocks can be considered read-only. They
- are written only to correct errors detected within the superblocks,
- move the journal and change the filesystem parameters through tunefs.
- As a result, the superblock does not contain any fields that require
- constant updates, like the amount of free space, etc.
- Segments
- --------
- The space in the device is split up into equal-sized segments.
- Segments are the primary write unit of LogFS. Within each segments,
- writes happen from front (low addresses) to back (high addresses. If
- only a partial segment has been written, the segment number, the
- current position within and optionally a write buffer are stored in
- the journal.
- Segments are erased as a whole. Therefore Garbage Collection may be
- required to completely free a segment before doing so.
- Journal
- --------
- The journal contains all global information about the filesystem that
- is subject to frequent change. At mount time, it has to be scanned
- for the most recent commit entry, which contains a list of pointers to
- all currently valid entries.
- Object Store
- ------------
- All space except for the superblocks and journal is part of the object
- store. Each segment contains a segment header and a number of
- objects, each consisting of the object header and the payload.
- Objects are either inodes, directory entries (dentries), file data
- blocks or indirect blocks.
- Levels
- ------
- Garbage collection (GC) may fail if all data is written
- indiscriminately. One requirement of GC is that data is separated
- roughly according to the distance between the tree root and the data.
- Effectively that means all file data is on level 0, indirect blocks
- are on levels 1, 2, 3 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks,
- respectively. Inode file data is on level 6 for the inodes and 7-11
- for indirect blocks.
- Each segment contains objects of a single level only. As a result,
- each level requires its own separate segment to be open for writing.
- Inode File
- ----------
- All inodes are stored in a special file, the inode file. Single
- exception is the inode file's inode (master inode) which for obvious
- reasons is stored in the journal instead. Instead of data blocks, the
- leaf nodes of the inode files are inodes.
- Aliases
- -------
- Writes in LogFS are done by means of a wandering tree. A naïve
- implementation would require that for each write or a block, all
- parent blocks are written as well, since the block pointers have
- changed. Such an implementation would not be very efficient.
- In LogFS, the block pointer changes are cached in the journal by means
- of alias entries. Each alias consists of its logical address - inode
- number, block index, level and child number (index into block) - and
- the changed data. Any 8-byte word can be changes in this manner.
- Currently aliases are used for block pointers, file size, file used
- bytes and the height of an inodes indirect tree.
- Segment Aliases
- ---------------
- Related to regular aliases, these are used to handle bad blocks.
- Initially, bad blocks are handled by moving the affected segment
- content to a spare segment and noting this move in the journal with a
- segment alias, a simple (to, from) tupel. GC will later empty this
- segment and the alias can be removed again. This is used on MTD only.
- Vim
- ---
- By cleverly predicting the life time of data, it is possible to
- separate long-living data from short-living data and thereby reduce
- the GC overhead later. Each type of distinc life expectency (vim) can
- have a separate segment open for writing. Each (level, vim) tupel can
- be open just once. If an open segment with unknown vim is encountered
- at mount time, it is closed and ignored henceforth.
- Indirect Tree
- -------------
- Inodes in LogFS are similar to FFS-style filesystems with direct and
- indirect block pointers. One difference is that LogFS uses a single
- indirect pointer that can be either a 1x, 2x, etc. indirect pointer.
- A height field in the inode defines the height of the indirect tree
- and thereby the indirection of the pointer.
- Another difference is the addressing of indirect blocks. In LogFS,
- the first 16 pointers in the first indirect block are left empty,
- corresponding to the 16 direct pointers in the inode. In ext2 (maybe
- others as well) the first pointer in the first indirect block
- corresponds to logical block 12, skipping the 12 direct pointers.
- So where ext2 is using arithmetic to better utilize space, LogFS keeps
- arithmetic simple and uses compression to save space.
- Compression
- -----------
- Both file data and metadata can be compressed. Compression for file
- data can be enabled with chattr +c and disabled with chattr -c. Doing
- so has no effect on existing data, but new data will be stored
- accordingly. New inodes will inherit the compression flag of the
- parent directory.
- Metadata is always compressed. However, the space accounting ignores
- this and charges for the uncompressed size. Failing to do so could
- result in GC failures when, after moving some data, indirect blocks
- compress worse than previously. Even on a 100% full medium, GC may
- not consume any extra space, so the compression gains are lost space
- to the user.
- However, they are not lost space to the filesystem internals. By
- cheating the user for those bytes, the filesystem gained some slack
- space and GC will run less often and faster.
- Garbage Collection and Wear Leveling
- ------------------------------------
- Garbage collection is invoked whenever the number of free segments
- falls below a threshold. The best (known) candidate is picked based
- on the least amount of valid data contained in the segment. All
- remaining valid data is copied elsewhere, thereby invalidating it.
- The GC code also checks for aliases and writes then back if their
- number gets too large.
- Wear leveling is done by occasionally picking a suboptimal segment for
- garbage collection. If a stale segments erase count is significantly
- lower than the active segments' erase counts, it will be picked. Wear
- leveling is rate limited, so it will never monopolize the device for
- more than one segment worth at a time.
- Values for "occasionally", "significantly lower" are compile time
- constants.
- Hashed directories
- ------------------
- To satisfy efficient lookup(), directory entries are hashed and
- located based on the hash. In order to both support large directories
- and not be overly inefficient for small directories, several hash
- tables of increasing size are used. For each table, the hash value
- modulo the table size gives the table index.
- Tables sizes are chosen to limit the number of indirect blocks with a
- fully populated table to 0, 1, 2 or 3 respectively. So the first
- table contains 16 entries, the second 512-16, etc.
- The last table is special in several ways. First its size depends on
- the effective 32bit limit on telldir/seekdir cookies. Since logfs
- uses the upper half of the address space for indirect blocks, the size
- is limited to 2^31. Secondly the table contains hash buckets with 16
- entries each.
- Using single-entry buckets would result in birthday "attacks". At
- just 2^16 used entries, hash collisions would be likely (P >= 0.5).
- My math skills are insufficient to do the combinatorics for the 17x
- collisions necessary to overflow a bucket, but testing showed that in
- 10,000 runs the lowest directory fill before a bucket overflow was
- 188,057,130 entries with an average of 315,149,915 entries. So for
- directory sizes of up to a million, bucket overflows should be
- virtually impossible under normal circumstances.
- With carefully chosen filenames, it is obviously possible to cause an
- overflow with just 21 entries (4 higher tables + 16 entries + 1). So
- there may be a security concern if a malicious user has write access
- to a directory.
- Open For Discussion
- ===================
- Device Address Space
- --------------------
- A device address space is used for caching. Both block devices and
- MTD provide functions to either read a single page or write a segment.
- Partial segments may be written for data integrity, but where possible
- complete segments are written for performance on simple block device
- flash media.
- Meta Inodes
- -----------
- Inodes are stored in the inode file, which is just a regular file for
- most purposes. At umount time, however, the inode file needs to
- remain open until all dirty inodes are written. So
- generic_shutdown_super() may not close this inode, but shouldn't
- complain about remaining inodes due to the inode file either. Same
- goes for mapping inode of the device address space.
- Currently logfs uses a hack that essentially copies part of fs/inode.c
- code over. A general solution would be preferred.
- Indirect block mapping
- ----------------------
- With compression, the block device (or mapping inode) cannot be used
- to cache indirect blocks. Some other place is required. Currently
- logfs uses the top half of each inode's address space. The low 8TB
- (on 32bit) are filled with file data, the high 8TB are used for
- indirect blocks.
- One problem is that 16TB files created on 64bit systems actually have
- data in the top 8TB. But files >16TB would cause problems anyway, so
- only the limit has changed.
|