123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384 |
- The Second Extended Filesystem
- ==============================
- ext2 was originally released in January 1993. Written by R\'emy Card,
- Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the
- Extended Filesystem. It is currently still (April 2001) the predominant
- filesystem in use by Linux. There are also implementations available
- for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS.
- Options
- =======
- Most defaults are determined by the filesystem superblock, and can be
- set using tune2fs(8). Kernel-determined defaults are indicated by (*).
- bsddf (*) Makes `df' act like BSD.
- minixdf Makes `df' act like Minix.
- check=none, nocheck (*) Don't do extra checking of bitmaps on mount
- (check=normal and check=strict options removed)
- debug Extra debugging information is sent to the
- kernel syslog. Useful for developers.
- errors=continue Keep going on a filesystem error.
- errors=remount-ro Remount the filesystem read-only on an error.
- errors=panic Panic and halt the machine if an error occurs.
- grpid, bsdgroups Give objects the same group ID as their parent.
- nogrpid, sysvgroups New objects have the group ID of their creator.
- nouid32 Use 16-bit UIDs and GIDs.
- oldalloc Enable the old block allocator. Orlov should
- have better performance, we'd like to get some
- feedback if it's the contrary for you.
- orlov (*) Use the Orlov block allocator.
- (See http://lwn.net/Articles/14633/ and
- http://lwn.net/Articles/14446/.)
- resuid=n The user ID which may use the reserved blocks.
- resgid=n The group ID which may use the reserved blocks.
- sb=n Use alternate superblock at this location.
- user_xattr Enable "user." POSIX Extended Attributes
- (requires CONFIG_EXT2_FS_XATTR).
- See also http://acl.bestbits.at
- nouser_xattr Don't support "user." extended attributes.
- acl Enable POSIX Access Control Lists support
- (requires CONFIG_EXT2_FS_POSIX_ACL).
- See also http://acl.bestbits.at
- noacl Don't support POSIX ACLs.
- nobh Do not attach buffer_heads to file pagecache.
- xip Use execute in place (no caching) if possible
- grpquota,noquota,quota,usrquota Quota options are silently ignored by ext2.
- Specification
- =============
- ext2 shares many properties with traditional Unix filesystems. It has
- the concepts of blocks, inodes and directories. It has space in the
- specification for Access Control Lists (ACLs), fragments, undeletion and
- compression though these are not yet implemented (some are available as
- separate patches). There is also a versioning mechanism to allow new
- features (such as journalling) to be added in a maximally compatible
- manner.
- Blocks
- ------
- The space in the device or file is split up into blocks. These are
- a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems),
- which is decided when the filesystem is created. Smaller blocks mean
- less wasted space per file, but require slightly more accounting overhead,
- and also impose other limits on the size of files and the filesystem.
- Block Groups
- ------------
- Blocks are clustered into block groups in order to reduce fragmentation
- and minimise the amount of head seeking when reading a large amount
- of consecutive data. Information about each block group is kept in a
- descriptor table stored in the block(s) immediately after the superblock.
- Two blocks near the start of each group are reserved for the block usage
- bitmap and the inode usage bitmap which show which blocks and inodes
- are in use. Since each bitmap is limited to a single block, this means
- that the maximum size of a block group is 8 times the size of a block.
- The block(s) following the bitmaps in each block group are designated
- as the inode table for that block group and the remainder are the data
- blocks. The block allocation algorithm attempts to allocate data blocks
- in the same block group as the inode which contains them.
- The Superblock
- --------------
- The superblock contains all the information about the configuration of
- the filing system. The primary copy of the superblock is stored at an
- offset of 1024 bytes from the start of the device, and it is essential
- to mounting the filesystem. Since it is so important, backup copies of
- the superblock are stored in block groups throughout the filesystem.
- The first version of ext2 (revision 0) stores a copy at the start of
- every block group, along with backups of the group descriptor block(s).
- Because this can consume a considerable amount of space for large
- filesystems, later revisions can optionally reduce the number of backup
- copies by only putting backups in specific groups (this is the sparse
- superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7.
- The information in the superblock contains fields such as the total
- number of inodes and blocks in the filesystem and how many are free,
- how many inodes and blocks are in each block group, when the filesystem
- was mounted (and if it was cleanly unmounted), when it was modified,
- what version of the filesystem it is (see the Revisions section below)
- and which OS created it.
- If the filesystem is revision 1 or higher, then there are extra fields,
- such as a volume name, a unique identification number, the inode size,
- and space for optional filesystem features to store configuration info.
- All fields in the superblock (as in all other ext2 structures) are stored
- on the disc in little endian format, so a filesystem is portable between
- machines without having to know what machine it was created on.
- Inodes
- ------
- The inode (index node) is a fundamental concept in the ext2 filesystem.
- Each object in the filesystem is represented by an inode. The inode
- structure contains pointers to the filesystem blocks which contain the
- data held in the object and all of the metadata about an object except
- its name. The metadata about an object includes the permissions, owner,
- group, flags, size, number of blocks used, access time, change time,
- modification time, deletion time, number of links, fragments, version
- (for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs).
- There are some reserved fields which are currently unused in the inode
- structure and several which are overloaded. One field is reserved for the
- directory ACL if the inode is a directory and alternately for the top 32
- bits of the file size if the inode is a regular file (allowing file sizes
- larger than 2GB). The translator field is unused under Linux, but is used
- by the HURD to reference the inode of a program which will be used to
- interpret this object. Most of the remaining reserved fields have been
- used up for both Linux and the HURD for larger owner and group fields,
- The HURD also has a larger mode field so it uses another of the remaining
- fields to store the extra more bits.
- There are pointers to the first 12 blocks which contain the file's data
- in the inode. There is a pointer to an indirect block (which contains
- pointers to the next set of blocks), a pointer to a doubly-indirect
- block (which contains pointers to indirect blocks) and a pointer to a
- trebly-indirect block (which contains pointers to doubly-indirect blocks).
- The flags field contains some ext2-specific flags which aren't catered
- for by the standard chmod flags. These flags can be listed with lsattr
- and changed with the chattr command, and allow specific filesystem
- behaviour on a per-file basis. There are flags for secure deletion,
- undeletable, compression, synchronous updates, immutability, append-only,
- dumpable, no-atime, indexed directories, and data-journaling. Not all
- of these are supported yet.
- Directories
- -----------
- A directory is a filesystem object and has an inode just like a file.
- It is a specially formatted file containing records which associate
- each name with an inode number. Later revisions of the filesystem also
- encode the type of the object (file, directory, symlink, device, fifo,
- socket) to avoid the need to check the inode itself for this information
- (support for taking advantage of this feature does not yet exist in
- Glibc 2.2).
- The inode allocation code tries to assign inodes which are in the same
- block group as the directory in which they are first created.
- The current implementation of ext2 uses a singly-linked list to store
- the filenames in the directory; a pending enhancement uses hashing of the
- filenames to allow lookup without the need to scan the entire directory.
- The current implementation never removes empty directory blocks once they
- have been allocated to hold more files.
- Special files
- -------------
- Symbolic links are also filesystem objects with inodes. They deserve
- special mention because the data for them is stored within the inode
- itself if the symlink is less than 60 bytes long. It uses the fields
- which would normally be used to store the pointers to data blocks.
- This is a worthwhile optimisation as it we avoid allocating a full
- block for the symlink, and most symlinks are less than 60 characters long.
- Character and block special devices never have data blocks assigned to
- them. Instead, their device number is stored in the inode, again reusing
- the fields which would be used to point to the data blocks.
- Reserved Space
- --------------
- In ext2, there is a mechanism for reserving a certain number of blocks
- for a particular user (normally the super-user). This is intended to
- allow for the system to continue functioning even if non-privileged users
- fill up all the space available to them (this is independent of filesystem
- quotas). It also keeps the filesystem from filling up entirely which
- helps combat fragmentation.
- Filesystem check
- ----------------
- At boot time, most systems run a consistency check (e2fsck) on their
- filesystems. The superblock of the ext2 filesystem contains several
- fields which indicate whether fsck should actually run (since checking
- the filesystem at boot can take a long time if it is large). fsck will
- run if the filesystem was not cleanly unmounted, if the maximum mount
- count has been exceeded or if the maximum time between checks has been
- exceeded.
- Feature Compatibility
- ---------------------
- The compatibility feature mechanism used in ext2 is sophisticated.
- It safely allows features to be added to the filesystem, without
- unnecessarily sacrificing compatibility with older versions of the
- filesystem code. The feature compatibility mechanism is not supported by
- the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in
- revision 1. There are three 32-bit fields, one for compatible features
- (COMPAT), one for read-only compatible (RO_COMPAT) features and one for
- incompatible (INCOMPAT) features.
- These feature flags have specific meanings for the kernel as follows:
- A COMPAT flag indicates that a feature is present in the filesystem,
- but the on-disk format is 100% compatible with older on-disk formats, so
- a kernel which didn't know anything about this feature could read/write
- the filesystem without any chance of corrupting the filesystem (or even
- making it inconsistent). This is essentially just a flag which says
- "this filesystem has a (hidden) feature" that the kernel or e2fsck may
- want to be aware of (more on e2fsck and feature flags later). The ext3
- HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply
- a regular file with data blocks in it so the kernel does not need to
- take any special notice of it if it doesn't understand ext3 journaling.
- An RO_COMPAT flag indicates that the on-disk format is 100% compatible
- with older on-disk formats for reading (i.e. the feature does not change
- the visible on-disk format). However, an old kernel writing to such a
- filesystem would/could corrupt the filesystem, so this is prevented. The
- most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because
- sparse groups allow file data blocks where superblock/group descriptor
- backups used to live, and ext2_free_blocks() refuses to free these blocks,
- which would leading to inconsistent bitmaps. An old kernel would also
- get an error if it tried to free a series of blocks which crossed a group
- boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem.
- An INCOMPAT flag indicates the on-disk format has changed in some
- way that makes it unreadable by older kernels, or would otherwise
- cause a problem if an old kernel tried to mount it. FILETYPE is an
- INCOMPAT flag because older kernels would think a filename was longer
- than 256 characters, which would lead to corrupt directory listings.
- The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel
- doesn't understand compression, you would just get garbage back from
- read() instead of it automatically decompressing your data. The ext3
- RECOVER flag is needed to prevent a kernel which does not understand the
- ext3 journal from mounting the filesystem without replaying the journal.
- For e2fsck, it needs to be more strict with the handling of these
- flags than the kernel. If it doesn't understand ANY of the COMPAT,
- RO_COMPAT, or INCOMPAT flags it will refuse to check the filesystem,
- because it has no way of verifying whether a given feature is valid
- or not. Allowing e2fsck to succeed on a filesystem with an unknown
- feature is a false sense of security for the user. Refusing to check
- a filesystem with unknown features is a good incentive for the user to
- update to the latest e2fsck. This also means that anyone adding feature
- flags to ext2 also needs to update e2fsck to verify these features.
- Metadata
- --------
- It is frequently claimed that the ext2 implementation of writing
- asynchronous metadata is faster than the ffs synchronous metadata
- scheme but less reliable. Both methods are equally resolvable by their
- respective fsck programs.
- If you're exceptionally paranoid, there are 3 ways of making metadata
- writes synchronous on ext2:
- per-file if you have the program source: use the O_SYNC flag to open()
- per-file if you don't have the source: use "chattr +S" on the file
- per-filesystem: add the "sync" option to mount (or in /etc/fstab)
- the first and last are not ext2 specific but do force the metadata to
- be written synchronously. See also Journaling below.
- Limitations
- -----------
- There are various limits imposed by the on-disk layout of ext2. Other
- limits are imposed by the current implementation of the kernel code.
- Many of the limits are determined at the time the filesystem is first
- created, and depend upon the block size chosen. The ratio of inodes to
- data blocks is fixed at filesystem creation time, so the only way to
- increase the number of inodes is to increase the size of the filesystem.
- No tools currently exist which can change the ratio of inodes to blocks.
- Most of these limits could be overcome with slight changes in the on-disk
- format and using a compatibility flag to signal the format change (at
- the expense of some compatibility).
- Filesystem block size: 1kB 2kB 4kB 8kB
- File size limit: 16GB 256GB 2048GB 2048GB
- Filesystem size limit: 2047GB 8192GB 16384GB 32768GB
- There is a 2.4 kernel limit of 2048GB for a single block device, so no
- filesystem larger than that can be created at this time. There is also
- an upper limit on the block size imposed by the page size of the kernel,
- so 8kB blocks are only allowed on Alpha systems (and other architectures
- which support larger pages).
- There is an upper limit of 32000 subdirectories in a single directory.
- There is a "soft" upper limit of about 10-15k files in a single directory
- with the current linear linked-list directory implementation. This limit
- stems from performance problems when creating and deleting (and also
- finding) files in such large directories. Using a hashed directory index
- (under development) allows 100k-1M+ files in a single directory without
- performance problems (although RAM size becomes an issue at this point).
- The (meaningless) absolute upper limit of files in a single directory
- (imposed by the file size, the realistic limit is obviously much less)
- is over 130 trillion files. It would be higher except there are not
- enough 4-character names to make up unique directory entries, so they
- have to be 8 character filenames, even then we are fairly close to
- running out of unique filenames.
- Journaling
- ----------
- A journaling extension to the ext2 code has been developed by Stephen
- Tweedie. It avoids the risks of metadata corruption and the need to
- wait for e2fsck to complete after a crash, without requiring a change
- to the on-disk ext2 layout. In a nutshell, the journal is a regular
- file which stores whole metadata (and optionally data) blocks that have
- been modified, prior to writing them into the filesystem. This means
- it is possible to add a journal to an existing ext2 filesystem without
- the need for data conversion.
- When changes to the filesystem (e.g. a file is renamed) they are stored in
- a transaction in the journal and can either be complete or incomplete at
- the time of a crash. If a transaction is complete at the time of a crash
- (or in the normal case where the system does not crash), then any blocks
- in that transaction are guaranteed to represent a valid filesystem state,
- and are copied into the filesystem. If a transaction is incomplete at
- the time of the crash, then there is no guarantee of consistency for
- the blocks in that transaction so they are discarded (which means any
- filesystem changes they represent are also lost).
- Check Documentation/filesystems/ext3.txt if you want to read more about
- ext3 and journaling.
- References
- ==========
- The kernel source file:/usr/src/linux/fs/ext2/
- e2fsprogs (e2fsck) http://e2fsprogs.sourceforge.net/
- Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html
- Journaling (ext3) ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/
- Filesystem Resizing http://ext2resize.sourceforge.net/
- Compression (*) http://e2compr.sourceforge.net/
- Implementations for:
- Windows 95/98/NT/2000 http://www.chrysocome.net/explore2fs
- Windows 95 (*) http://www.yipton.net/content.html#FSDEXT2
- DOS client (*) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
- OS/2 (+) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/
- RISC OS client http://www.esw-heim.tu-clausthal.de/~marco/smorbrod/IscaFS/
- (*) no longer actively developed/supported (as of Apr 2001)
- (+) no longer actively developed/supported (as of Mar 2009)
|