
ZZZip Archive Format Specification

  • File extension: .zzz
  • Mime type: application/x-zzz

[[TOC]]

If the file starts with the magic bytes of gzip, bzip2, xz, lz4, zstd etc., then it is stream compressed (like a compressed tar or cpio archive), and the file names and attributes are compressed as well. Otherwise, if it starts with the ZZZip header magic, it is compressed per file (like zip). The former offers a better compression ratio, the latter offers easier parsing and handling.
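The distinction above can be sketched as a prefix check. This is only an illustration: the function name `classify_archive` and the table `STREAM_MAGICS` are hypothetical, and the list covers just the compressors named above, using their commonly documented magic bytes.

```python
# Well-known magic prefixes of the stream compressors named in the text.
STREAM_MAGICS = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd\x37\x7a\x58\x5a\x00": "xz",
    b"\x04\x22\x4d\x18": "lz4",   # LZ4 frame format
    b"\x28\xb5\x2f\xfd": "zstd",
}
ZZZIP_MAGIC = b"ZZz\x1a"  # 'Z', 'Z', 'z', ASCII 26


def classify_archive(prefix: bytes) -> str:
    """Return 'zzzip', the name of a stream compressor, or 'unknown'."""
    if prefix.startswith(ZZZIP_MAGIC):
        return "zzzip"  # compressed per file
    for magic, name in STREAM_MAGICS.items():
        if prefix.startswith(magic):
            return name  # whole archive is stream compressed
    return "unknown"
```

A reader would decompress the stream first when a compressor name is returned, then run the same check again on the decompressed data.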

The format was designed so that it can be used in streaming, that is, compressed and decompressed without the need for seeking (unlike zip). But if the archive is seekable, the last 48 bytes contain overall information about it. It was also designed to be a full archive format (capable of backing up and restoring system-specific attributes), but also to be interoperable (capable of sharing files between different operating systems reliably).

All integers are in little-endian format. The CRC uses the same ANSI CRC (CRC-32) method as the gzip format (also used by EFI GPT, PNG etc.), and not the Castagnoli CRC (CRC-32C, used by iSCSI and SCTP). All strings are UTF-8 encoded and zero terminated. In the archive, all directory separators are converted to a slash '/', regardless of the native separator of the OS.
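As an illustration of these conventions, Python's `zlib.crc32` implements the gzip CRC, and `struct` with the `<` prefix produces little-endian integers. The helper name `crc32_le` is hypothetical.

```python
import struct
import zlib


def crc32_le(data: bytes) -> bytes:
    """gzip-style CRC-32 of data, serialized as a 4-byte little-endian integer."""
    crc = zlib.crc32(data) & 0xFFFFFFFF
    return struct.pack("<I", crc)


# The standard CRC-32 check value for the ASCII string "123456789".
print(hex(zlib.crc32(b"123456789")))  # 0xcbf43926
```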

The format was designed to be future-proof: file sizes are stored in 127 bits (although the current SDK only implements 63 bits), and the format is extensible without changing the structure of the archive, so it remains backwards compatible.

The basic structure of the format is as follows:

| Block | Description |
|---|---|
| Encryption | Optional, may not exist |
| Entity | At least one entity block |
| Entity | More entity blocks may follow |
| End | Terminating block (fixed size) |

An optional encryption block comes first, then one entity block for each compressed file (these can be compressed individually by separate threads and then concatenated), then a terminating end block.

Encryption Block

This block is optional. If it exists, there can be only one, and it must be the first block. It is differentiated from normal data blocks by its file name length being 0, so that the Block Header magic can still be used to identify the file format. The magic consists of two uppercase 'Z' letters, one lowercase 'z' letter, and an ASCII 26 (EOF).

| Offset | Length | Description |
|---|---|---|
| 0 | 4 | magic, 'ZZz\032' 0x1A7A5A5A |
| 4 | 2 | size of the header (h) |
| 6 | 2 | must be zero |
| 8 | 4 | CRC of the password |
| 12 | 1 | content padding in power of 2 |
| 13 | h-13 | list of encryption filters |

Regardless of the encryption used, the CRC of the password is always stored for quick universal verification.

Padding is stored as a power of two: 0 - byte, 1 - word (2 bytes), 2 - dword (4 bytes), 3 - qword (8 bytes), 4 - 16 bytes, 5 - 32 bytes, etc. It is the largest of the applied filters' paddings.

| EncType | Description | Padding |
|---|---|---|
| 0 | AES-256-CBC | 4 (16 bytes) |
| 1 | SHA-256-XOR | 0 (no alignment) |
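A minimal sketch of reading this block follows. It makes two assumptions that the text does not state outright: the password CRC is computed over the raw UTF-8 password bytes without a terminator, and each entry in the filter list is a single EncType byte (inferred from the minimum block size of 14). The function name is hypothetical.

```python
import struct
import zlib

MAGIC = b"ZZz\x1a"


def parse_encryption_block(buf: bytes, password: bytes):
    """Return (padding in bytes, list of EncType values) or raise ValueError."""
    magic, size, zero, pw_crc, pad_pow = struct.unpack_from("<4sHHIB", buf, 0)
    if magic != MAGIC or zero != 0:
        raise ValueError("not an encryption block")
    if size < 14 or size > len(buf):
        raise ValueError("bad header size")
    # Assumption: CRC is taken over the password bytes as given.
    if zlib.crc32(password) & 0xFFFFFFFF != pw_crc:
        raise ValueError("wrong password")
    filters = list(buf[13:size])  # assumption: one byte per filter
    return 1 << pad_pow, filters
```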

With AES, first the SHA-256 of the password is calculated. The 16 bytes of the result starting at the offset (first byte of the hash modulo 16) are used as the AES IV, then SHA-256 is applied again to the hash to get the AES key. It uses 16-byte padding.
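The key and IV derivation described above needs only a hash function, so it can be sketched with the standard library. The function name `derive_aes_params` is hypothetical, and the password is assumed to be passed as UTF-8 bytes.

```python
import hashlib


def derive_aes_params(password: bytes):
    """Return (key, iv) for AES-256-CBC as described in the text."""
    h = hashlib.sha256(password).digest()   # 32-byte hash of the password
    offset = h[0] % 16                      # 0..15, so offset + 16 <= 32
    iv = h[offset:offset + 16]              # 16-byte IV taken from the hash
    key = hashlib.sha256(h).digest()        # 32-byte key: hash of the hash
    return key, iv
```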

With SHA, a tricky XOR method is used. First, B is calculated by summing the characters of the password modulo 256. Then the SHA-256 of the password is calculated, each byte of the hash is XOR'd with B, and SHA-256 is applied again to the result. This gives the IV mask. As it is a symmetric cipher, encoding and decoding use the same method: each byte is first XOR'd with a running counter that starts at B and overflows at 256. Then another running counter, starting at 0 and overflowing at 32, selects a byte from the mask, and the data byte is XOR'd with that too. When this second counter reaches 32, it is reset and the mask is replaced by the SHA-256 of the mask with each byte XOR'd with B. This way the first counter gives a running XOR mask that repeats after every 256 bytes, while the second counter uses a hash that is recalculated after every 32 bytes. This is fast and strong enough by itself for most applications; however, its purpose is to be used in conjunction with AES.
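The following is one possible reading of the method above, not a reference implementation. It assumes that "characters" means the UTF-8 bytes of the password, that the mask is refreshed as SHA-256 of (mask XOR B), mirroring the initial step, and that both counters advance after each byte. The names `sha256_xor` and `_remask` are hypothetical.

```python
import hashlib


def _remask(data: bytes, b: int) -> bytes:
    # XOR every byte with B, then hash (assumed order of operations).
    return hashlib.sha256(bytes(x ^ b for x in data)).digest()


def sha256_xor(data: bytes, password: bytes) -> bytes:
    """Encrypt or decrypt data; the same call does both."""
    b = sum(password) % 256
    mask = _remask(hashlib.sha256(password).digest(), b)
    out = bytearray(len(data))
    c1, c2 = b, 0                      # first counter starts at B, second at 0
    for i, byte in enumerate(data):
        out[i] = byte ^ c1 ^ mask[c2]
        c1 = (c1 + 1) % 256            # overflows at 256
        c2 += 1
        if c2 == 32:                   # mask exhausted: reset and rehash
            c2 = 0
            mask = _remask(mask, b)
    return bytes(out)
```

Because every step is an XOR with a keystream that depends only on the password and the position, applying the function twice with the same password returns the original data.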

When multiple filters are specified, they are stored in extraction order, i.e. in reverse creation order. The minimum size of the block is 14, meaning at least one filter must be specified.

(Note: the same cipher can be used multiple times to strengthen encryption, or different ciphers can be mixed. Do not use SHA-256-XOR twice in a row: it is its own inverse, so the second pass decrypts the first. An AES + SHA + AES combination is paranoid enough to store state secrets and nuclear missile launch codes.)

Entity Block

Each entity (file, directory etc.) starts with a fixed size block header, followed by varying length fields and the content's data, ending in two checksum values.

| Offset | Length | Description |
|---|---|---|
| 0 | 4 | magic, 'ZZz\032' 0x1A7A5A5A |
| 4 | 2 | size of the header (h) |
| 6 | 2 | must be non-zero, file name length (n) |
| 8 | 2 | file modification year |
| 10 | 1 | file modification month (1 - 12) |
| 11 | 1 | file modification day (1 - 31) |
| 12 | 1 | file modification hour (0 - 23) |
| 13 | 1 | file modification minute (0 - 59) |
| 14 | 1 | file modification second (0 - 60) |
| 15 | 1 | block type (file format, numfilter c) |
| 16 | 16 | uncompressed size |
| 32 | 16 | content size (s) |
| 48 | 2*c | compression filters |
| 48+2*c | h-2*c-48 | extra or OS-specific fields |
| h | n | file name (UTF-8, dir separator '/') |
| h+n | s | file content |
| h+n+s | x | optional padding (only when encrypted) |
| h+n+s+x | 4 | CRC of the uncompressed file |
| h+n+s+x+4 | 4 | CRC of the block, bytes 0 to h+n+s+x+3 (everything before this field) |

The highest bit of the file name length field gives the 17th bit of the header size. This means the header can be up to 128k long and the file name up to 32767 bytes (which is a lot more than most operating systems support: for example, MAX_PATH under Windows is 260, and under UNIX the PATH_MAX limit is typically 4096).

When encryption is used, the extra fields are padded with a special padding field, and bytes (48+2*c) to (h) are encrypted, as is the file content field. The size of the padding is defined by the encryption methods used, which are stored in the encryption block (if it exists).
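The bit sharing between the two 16-bit fields can be illustrated by decoding the fixed 48-byte part of the header. This is a sketch under the layout in the table above; the function name and the dictionary keys are hypothetical.

```python
import struct

MAGIC = b"ZZz\x1a"
FIXED = "<4sHHH5BB16s16s"  # 48 bytes: magic, h, n, year, 5 time bytes, type, 2 sizes


def parse_entity_header(buf: bytes) -> dict:
    (magic, size_lo, name_field, year, month, day, hour, minute, second,
     block_type, usize, csize) = struct.unpack_from(FIXED, buf, 0)
    if magic != MAGIC:
        raise ValueError("bad block magic")
    return {
        # Top bit of the name field is bit 16 (the 17th bit) of the header size.
        "header_size": size_lo | ((name_field >> 15) << 16),
        "name_length": name_field & 0x7FFF,   # 0 would mean an encryption block
        "mtime": (year, month, day, hour, minute, second),
        "block_type": block_type,
        "uncompressed_size": int.from_bytes(usize, "little"),
        "content_size": int.from_bytes(csize, "little"),
    }
```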

File Modification Date and Time

This is a packed, OS-independent representation of the last file modification time. It must be in UTC (GMT + 0), with no time zone or daylight saving offset applied. For separate creation, modification and access times with nanosecond precision, use the POSIX Timestamps extra field.

Block Type

The higher tetrad (block type & 0xF0) encodes the file format.

| Block Type | Description |
|---|---|
| 0x0? | regular file |
| 0x10 | hard link |
| 0x20 | symbolic link |
| 0x30 | union |
| 0x40 | character device |
| 0x50 | block device |
| 0x60 | directory |
| 0x70 | named FIFO |
| 0x80 - 0xF0 | reserved |

Depending on the file format, the file content differs. For file formats 0x10 and above, the number of compression filters and the uncompressed size must be 0.

The lower tetrad (block type & 0x0F) encodes the number of compression filters used on data.

Uncompressed Size

This is the uncompressed size of the file. It's 0 for non-regular files. It's stored in 127 bits, but the current implementation only uses 63 bits.

For text files this might be bigger than the actual size on UNIX, because all newline characters are counted as 2 bytes.

Content Size

This is the size of the file content as stored in the archive (minus the optional padding). For regular files, it is the compressed data size. It's stored in 127 bits, but the current implementation only uses 63 bits.

Compression Filters

The number of compression filters in the block type byte encodes how many filters are applied. 0 means no compression, the file is stored as-is (uncompressed size = content size). This is the only option when overall stream compression is used on the entire archive. Otherwise that many words follow, describing the filters in decompression order, or in other words in reverse compression order. The first byte of the word is the filter, the second byte is the compression level (specific to the filter; it is only informative, as most compression filters encode the level in their own headers and that is what really counts). When encryption is used, it is always applied to the final compressed data, meaning that when extracting files, a decompressor must iterate over the encryption filters first, then over the compression filters.

| Filter | Level | Description |
|---|---|---|
| 0 | 0 | ASCII newline conversion |
| 1 | 0-2 | executable bcj (level=arch) |
| 2 | 1-9 | zlib deflate |
| 3 | 1-9 | bzip2 |
| 4 | 0-9 | lzma |
| 5 | 0-9 | xz (lzma2) |
| 6 | 7-8 | PPMd |
| 7 | 1-22 | zstd |
| 8 - 31 | x | reserved |
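Splitting the block type byte and reading the filter words could look like the sketch below. It only decodes the list and does not call any decompressor; the names `FILTER_NAMES` and `decode_filters` are hypothetical.

```python
FILTER_NAMES = {0: "newline", 1: "bcj", 2: "deflate", 3: "bzip2",
                4: "lzma", 5: "xz", 6: "ppmd", 7: "zstd"}


def decode_filters(header: bytes):
    """Return (file format, [(filter name, level), ...]) in decompression order."""
    block_type = header[15]
    file_format = block_type & 0xF0    # higher tetrad
    count = block_type & 0x0F          # lower tetrad: number of filter words
    words = header[48:48 + 2 * count]
    if len(words) != 2 * count:
        raise ValueError("header too short for its filter list")
    filters = [(FILTER_NAMES.get(words[i], "reserved"), words[i + 1])
               for i in range(0, 2 * count, 2)]
    return file_format, filters
```

An extractor would undo the encryption filters first, then apply the returned list from first to last, since it is already stored in decompression order.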

Extra Fields

All extra fields have a common, 4 bytes long header:

| Offset | Length | Description |
|---|---|---|
| 0 | 1 | magic, field type |
| 1 | 1 | magic, field format revision (0 for now) |
| 2 | 2 | size of field (h, including header) |
| 4 | h-4 | field specific data |

With OS-Specific fields (field type >= 0x10) the field format revision's most significant bit is set when saved on a big-endian machine (that's because OS-Specific field data is not converted to little-endian, unlike all the other integers in the archive).

With encryption there might be a padding field as the last field, and the entire extra fields block is encrypted.
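Since every field carries its own size, the extra fields area can be walked without knowing the individual field types. A minimal sketch, assuming the area has already been decrypted and that `extra` holds exactly bytes (48+2*c) to (h) of the header; the generator name is hypothetical.

```python
import struct


def iter_extra_fields(extra: bytes):
    """Yield (field type, revision, saved on big-endian, payload) tuples."""
    pos = 0
    while pos + 4 <= len(extra):
        ftype, rev, size = struct.unpack_from("<BBH", extra, pos)
        if size < 4 or pos + size > len(extra):
            raise ValueError("corrupt extra field")
        # The endianness flag only applies to OS-Specific fields (type >= 0x10).
        big_endian = ftype >= 0x10 and bool(rev & 0x80)
        yield ftype, rev & 0x7F, big_endian, extra[pos + 4:pos + size]
        pos += size
```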

Padding

| Offset | Length | Description |
|---|---|---|
| 0 | 2 | magic, 0x0000 |
| 2 | 2 | size of field (h, including header) |
| 4 | h-4 | must be zero |

Comment

| Offset | Length | Description |
|---|---|---|
| 0 | 2 | magic, 0x0001 |
| 2 | 2 | size of field (h, including header) |
| 4 | h-4 | zero terminated UTF-8 text |

Mime Type

| Offset | Length | Description |
|---|---|---|
| 0 | 2 | magic, 0x0002 |
| 2 | 2 | size of field (h, including header) |
| 4 | h-4 | mime type (like 'text/html') |

Mime type might contain optional character set information of the content, like "text/html;charset=latin-1", however converting text into UTF-8 is strongly recommended instead.

File Icon

| Offset | Length | Description |
|---|---|---|
| 0 | 2 | magic, 0x0003 |
| 2 | 2 | size of field (h, including header) |
| 4 | h-4 | compressed icon (preferably PNG or SVG) |

The icon can be in any format as long as the format can be determined by magic bytes, but for interoperability reasons PNG (bitmap) or SVG (vector) is strongly recommended. Keep the icon's size small, around 32k at most.

Meta Info (XAttr)

| Offset | Length | Description |
|---|---|---|
| 0 | 2 | magic, 0x0004 |
| 2 | 2 | size of field (h, including header) |
| 4 | h-4 | key value pairs |

Each pair is stored with the following record format:

| Offset | Length | Description |
|---|---|---|
| 0 | 2 | size of the attribute (s) |
| 2 | n | zero terminated UTF-8 attribute name |
| 2+n | s | attribute data (could be binary) |

This record is repeated as many times as the number of meta (XAttr) keys on the file.
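Reading the repeated records might look like the sketch below. It assumes that n includes the terminating zero of the name and that s counts only the attribute data, which is how the offsets in the record table read. The function name `parse_xattrs` is hypothetical.

```python
import struct


def parse_xattrs(payload: bytes) -> dict:
    """Parse the field-specific data (without the 4-byte field header)."""
    attrs = {}
    pos = 0
    while pos < len(payload):
        (size,) = struct.unpack_from("<H", payload, pos)
        end_name = payload.index(b"\x00", pos + 2)     # raises ValueError if missing
        name = payload[pos + 2:end_name].decode("utf-8")
        start = end_name + 1
        if start + size > len(payload):
            raise ValueError("attribute data runs past the field")
        attrs[name] = payload[start:start + size]      # may be binary
        pos = start + size
    return attrs
```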

POSIX Timestamps

| Offset | Length | Description |
|---|---|---|
| 0 | 2 | magic, 0x0005 |
| 2 | 2 | size of field (12, 20, 28 or 36) |
| 4 | 8 | last modify time in nanosec, UNIX UTC Epoch |
| 12 | 8 | last access time in nanosec (optional) |
| 20 | 8 | last stat change time in nanosec (optional) |
| 28 | 8 | file creation time in nanosec (optional) |

These should be used on every OS: even on non-UNIX systems, the native time representation (like MSDateTime) can be converted into POSIX timestamps. Regardless, other OS-Specific fields might also contain an OS-specific representation of times.

Note that POSIX timestamps wrap around after 2554-07-21. However, by combining them with the year from the OS-independent modification time (and interpreting the other timestamps' years relative to the modification timestamp's year), they can represent dates up to 65535-12-31.
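The wrap-around date follows from treating the 8-byte value as an unsigned nanosecond count, which can be checked with a short calculation. The sketch also packs the smallest form of the field (modification time only); the helper names are hypothetical.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def pack_mtime_field(mtime_ns: int) -> bytes:
    """POSIX Timestamps field holding only the last modify time (size 12)."""
    return struct.pack("<HHQ", 0x0005, 12, mtime_ns)


def wrap_date():
    """Date on which an unsigned 64-bit nanosecond counter overflows."""
    return (EPOCH + timedelta(seconds=2**64 // 10**9)).date()


print(wrap_date())  # 2554-07-21
```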

POSIX Access Rights

| Offset | Length | Description |
|---|---|---|
| 0 | 2 | magic, 0x0006 |
| 2 | 2 | size of field (h, including header) |
| 4 | 4 | st_mode |
| 8 | 8 | st_uid |
| 16 | 8 | st_gid |
| 24 | h-24 | user name UTF-8, zero, group name, zero |

The user name and group name both are converted to UTF-8, terminated by a zero, and concatenated.

POSIX Access Control List

| Offset | Length | Description |
|---|---|---|
| 0 | 2 | magic, 0x0007 |
| 2 | 2 | size of field (h, including header) |
| 4 | h-4 | UTF-8 ACL data |

Stores POSIX compatible access control lists as defined by NFSv4 ACL. Each ACL is stored in a newline terminated string, with the following format: "type:flags:principal:permissions\n".

Text-based Access Control List

| Offset | Length | Description |
|---|---|---|
| 0 | 2 | magic, 0x0008 |
| 2 | 2 | size of field (h, including header) |
| 4 | h-4 | ACL data |

Used to store OS/2 ACLs, but non-NFS POSIX ACLs (mostly POSIX 1003.1e draft standard 17) can be stored with this extra field as well.

UUID Access Control List

| Offset | Length | Description |
|---|---|---|
| 0 | 2 | magic, 0x0009 |
| 2 | 2 | size of field (h, including header) |
| 4 | 16 | owner UUID and rights |
| 20 | h-20 | access UUID and rights |

Stores access control lists. Each access control entry is a UUID whose last byte (offset 15 of the 16) stores the right bits:

| Bit | Description |
|---|---|
| 0 | read |
| 1 | write |
| 2 | execute |
| 3 | append |
| 4 | delete |
| 5 | group ace |
| 6 | setuid |
| 7 | setgid, inherit ACL |
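Decoding one 16-byte access control entry could be sketched as follows. The function name is hypothetical, and the identifying part is assumed to be simply the first 15 bytes of the entry.

```python
RIGHTS = ["read", "write", "execute", "append", "delete",
          "group ace", "setuid", "setgid, inherit ACL"]


def decode_ace(entry: bytes):
    """Return (first 15 bytes of the UUID, list of right names)."""
    if len(entry) != 16:
        raise ValueError("an access control entry is 16 bytes")
    bits = entry[15]  # last byte carries the right bits
    return entry[:15], [name for i, name in enumerate(RIGHTS) if bits & (1 << i)]
```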

OS-Specific Extra

Magic 0x000A to 0x000F are reserved. For a full list of other fields, see "Appendix: List of OS-Specific Extra Fields" below.

File Name

Encoded in UTF-8. The directory separator is '/', regardless of the OS. The name must not start with '/' and must be terminated by a zero. If the block's file format is a directory, then the file name must end in '/', otherwise it must not. It cannot be empty, meaning the file name length field in the block's header must be at least 2 (one character plus the terminating zero).
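These rules translate into a small validation routine. This is a sketch only; the function name `validate_name` is hypothetical, and `raw` is assumed to be the n bytes of the file name field including the terminating zero.

```python
def validate_name(raw: bytes, is_directory: bool) -> str:
    """Check the file name rules and return the name without its terminator."""
    if len(raw) < 2 or raw[-1] != 0 or 0 in raw[:-1]:
        raise ValueError("name must be non-empty and zero terminated")
    name = raw[:-1].decode("utf-8")  # UnicodeDecodeError is a ValueError too
    if name.startswith("/"):
        raise ValueError("name must not start with '/'")
    if name.endswith("/") != is_directory:
        raise ValueError("only directory names end in '/'")
    return name
```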

File Content

Depends on file format field. For text files OS-Specific newline (CR or CR LF) must be converted to ASCII 10 (LF, '\n'). With binary no conversion is used. If compression filters are used, this field contains the compressed data.

For hard links, symbolic links and unions the content is a zero terminated UTF-8 path, the target.

For character and block devices, the content is 8 bytes long:

| Offset | Length | Description |
|---|---|---|
| 0 | 4 | device major |
| 4 | 4 | device minor |

For directories and named FIFOs the content is empty.

When encryption is used and the padding field isn't 0, then there are additional bytes to the content. For example if padding is 4 in encryption block (means 16 bytes) and content size is 119, then 128 bytes will be stored in the archive.
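The stored size in the example above is the content size rounded up to the padding boundary. A minimal sketch, assuming that content which is already aligned receives no extra padding; the function name is hypothetical.

```python
def padded_size(content_size: int, pad_pow: int) -> int:
    """Round content_size up to a multiple of 2**pad_pow."""
    align = 1 << pad_pow
    return (content_size + align - 1) & ~(align - 1)


print(padded_size(119, 4))  # 128
```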

CRCs

The Entity Block ends with two ANSI CRC values. The first one is the CRC of the uncompressed data if the entity is a regular file (text or binary); for all other types it is the CRC of the content. The second CRC is calculated from the beginning of the block to its end, including the uncompressed CRC but not including the second CRC value itself.
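Producing the two trailing values for a regular file can be sketched as below. Here `block_body` is assumed to hold everything from the block magic up to the end of the content and padding, and the function name is hypothetical.

```python
import struct
import zlib


def entity_trailer(block_body: bytes, uncompressed: bytes) -> bytes:
    """Return the 8 trailing bytes: data CRC followed by block CRC."""
    first = struct.pack("<I", zlib.crc32(uncompressed) & 0xFFFFFFFF)
    # The block CRC covers the body and the first CRC, but not itself.
    second = struct.pack("<I", zlib.crc32(block_body + first) & 0xFFFFFFFF)
    return first + second
```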

End Block

A ZZZip archive is always terminated with these 48 bytes:

| Offset | Length | Description |
|---|---|---|
| 0 | 4 | magic, 'ZEnd' |
| 4 | 2 | size of the header (48) |
| 6 | 2 | mask of file types used |
| 8 | 2 | archive creation year |
| 10 | 1 | archive creation month (1 - 12) |
| 11 | 1 | archive creation day (1 - 31) |
| 12 | 1 | archive creation hour (0 - 23) |
| 13 | 1 | archive creation minute (0 - 59) |
| 14 | 1 | archive creation second (0 - 60) |
| 15 | 1 | file format version |
| 16 | 16 | sum of uncompressed size fields |
| 32 | 8 | total number of blocks |
| 40 | 4 | mask of compression filters used |
| 44 | 4 | CRC of the archive from 0 - this byte |

Mask of File Type Used

This is a bitmask representing all file formats in the archive, (1 << ((block type) >> 4)). For example, if the archive contains at least one character device (0x40), then bit 4 of the mask (value 0x0010) will be set.

Archive Creation Date and Time

This is a packed, OS-independent representation of the creation time. It must be in UTC (GMT + 0), with no time zone or daylight saving offset applied.

File Format Version

Currently 0. This only refers to the file format structures; adding new encryption methods, compression filters or field types should not bump the file format version.

Mask of Compression Filters Used

This is a bitmask representing all compression filters in the archive, (1 << filter). For example, if the archive uses zstd (filter 7) on at least one file, then bit 7 of the mask (value 0x80) will be set.

CRC

This is the checksum of the entire archive, from the beginning up to but not including offset 44 of the End Block (the CRC field itself).
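When the archive is seekable, the last 48 bytes can be read and decoded on their own. The sketch below assumes the 'ZEnd' magic is stored as those four ASCII bytes in file order; the function name and dictionary keys are hypothetical.

```python
import struct

END_FORMAT = "<4sHHH5BB16sQII"  # 48 bytes, little-endian, no alignment


def parse_end_block(tail: bytes) -> dict:
    if len(tail) != 48:
        raise ValueError("the End Block is exactly 48 bytes")
    (magic, size, type_mask, year, month, day, hour, minute, second,
     version, total, blocks, filter_mask, crc) = struct.unpack(END_FORMAT, tail)
    if magic != b"ZEnd" or size != 48:
        raise ValueError("not an End Block")
    return {
        # Bit i of the type mask stands for file format i << 4.
        "file_formats": [i << 4 for i in range(16) if type_mask >> i & 1],
        # Bit i of the filter mask stands for compression filter i.
        "filters": [i for i in range(32) if filter_mask >> i & 1],
        "created": (year, month, day, hour, minute, second),
        "version": version,
        "total_uncompressed": int.from_bytes(total, "little"),
        "blocks": blocks,
        "crc": crc,
    }
```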

Appendix: List of OS-Specific Extra Fields

Extra fields with magic 0x0010 to 0x00FF are OS-specific fields. If they were saved on a big-endian machine, field revision byte will have the most significant bit set, so 0x8010 to 0x80FF. They are expected to be stored in key-value pairs, however an OS is allowed to use its own format.

10 OS/2

OS/2 specific data (magic = 0x0010) is a FEA2LIST structure.

11 OpenVMS

OpenVMS specific data (magic = 0x0011) is a standard key-value pair list of attribute records:

| Offset | Length | Description |
|---|---|---|
| 0 | 2 | tag, as in ATR$C_XXXX and ATR$S_XXXX |
| 2 | 2 | size of the attribute (s, including header) |
| 4 | s-4 | attribute data |

12 MacOS

MacOS specific data (magic = 0x0012) is in RFC 1740 AppleDouble format. Consider using the meta info field instead.

13 Windows NT

Windows specific data (magic = 0x0013) is used to store the ACL: a binary ACL structure followed by DWORD-aligned ACE entries. Currently all other WinNT attributes can be described with common extra fields.

14 TargetFour

TBD.

15 OS/390

TBD.

16 AS/400

TBD.

17 Amiga / MorphOS

TBD.

18 BeOS / Haiku

TBD.

Other

Other OS-specific data is yet to be defined, preferably using a tag-based attribute list similar to OpenVMS', with the same tags as in zip if possible. It is preferable to avoid OS-Specific extra fields; attributes should be stored in common extra fields instead whenever possible.

Third-party developers are welcome to contribute to this spec and add more operating systems and their attributes. OS type magic IDs will be assigned in PR filing order.