File format specification¶

Reference origin and license¶

This fileformat.md is the reference description from the writemdict project, copyright zhansliu, distributed under the MIT License. The license terms are attached as MIT.txt.

Introduction¶

This is a description of version 2.0 of the MDX and MDD file format, used by the MDict dictionary software. The software is not open-source, nor is the file format openly specified, so the following description is based on reverse-engineering, and is likely incomplete and inaccurate in its details.

Most of the information comes from https://bitbucket.org/xwang/mdict-analysis. While xwang mostly focuses on being able to read this unknown format, I have added details that are necessary to also write MDX files.

Concepts¶

MDX and MDD files are both designed to store an associative array of pairs (keyword, record).

For MDX files, the information stored is typically a dictionary. The keyword and record are both (Unicode) strings, with the keyword being the headword for the dictionary entry, and the record giving a description of that word. An example of an MDX entry could be:

keyword: "reverse engineering"
record: "noun: a process of analyzing and studying an object or device, in order to understand its inner workings"

MDD files are instead designed to store binary data. Typically, the keyword is a file path, and the record is the contents of that file. As an example, we may have:

keyword: "\image.png"
record: 0x89 0x50 0x4e 0x47 0x0d 0x0a 0x1a 0x0a...

Typically, an MDD file is associated with an MDX file of the same name (but with extension .mdx instead of .mdd), and contains resources to be included in the text of the MDX entries. For example, an entry of the MDX file might contain the HTML code <img src="/images/image.png" />, in which case the MDict software will look for the entry "\images\image.png" in the MDD file.

File structure¶

The basic file structure is as follows:

MDX File
`header_sect`	Header section. See "Header Section" below.
`keyword_sect`	Keyword section. See "Keyword Section" below.
`record_sect`	Record section. See "Record Section" below.

Header Section¶

`header_sect`	Length
`length`	4 bytes	Length of `header_str`, in bytes. Big-endian.
`header_str`	varying	An XML string, encoded in UTF-16LE. See below for details.
`checksum`	4 bytes	ADLER32 checksum of `header_str`, stored little-endian.
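
As an illustration of the table above, here is a minimal Python sketch that writes and reads a header section. The function names `build_header_sect` and `parse_header_sect` are made up for this example and do not come from any MDict tool.

```python
import struct
import zlib

def build_header_sect(header_str: str) -> bytes:
    """Serialize header_str using the layout in the table above."""
    raw = header_str.encode("utf-16-le")
    length = struct.pack(">I", len(raw))                          # big-endian
    checksum = struct.pack("<I", zlib.adler32(raw) & 0xFFFFFFFF)  # little-endian
    return length + raw + checksum

def parse_header_sect(data: bytes) -> tuple[str, int]:
    """Return (header_str, number of bytes consumed) or raise ValueError."""
    (length,) = struct.unpack_from(">I", data, 0)
    raw = data[4:4 + length]
    (stored,) = struct.unpack_from("<I", data, 4 + length)
    if stored != zlib.adler32(raw) & 0xFFFFFFFF:
        raise ValueError("header checksum mismatch")
    return raw.decode("utf-16-le"), 4 + length + 4

blob = build_header_sect('<Dictionary Title="My dictionary"/>')
header_str, consumed = parse_header_sect(blob + b"rest of file")
```

The returned `consumed` value is the position at which the keyword section begins.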

The header_str consists of a single XML tag, Dictionary, with various attributes. For MDX files, it looks like this (newlines added for clarity):

<Dictionary 
GeneratedByEngineVersion="2.0" 
RequiredEngineVersion="2.0" 
Encrypted="2" 
Encoding="UTF-8"
Format="Html"
CreationDate="2015-01-01"
Compact="No"
Compat="No"
KeyCaseSensitive="No"
Description="This is a <i>test dictionary</i>."
Title="My dictionary"
DataSourceFormat="106"
StyleSheet=""
RegisterBy="EMail"
RegCode="0102030405060708090A0B0C0D0E0F10"/>

For MDD files, we have instead:

<Library_Data 
GeneratedByEngineVersion="2.0" 
RequiredEngineVersion="2.0" 
Encrypted="2" 
Format=""
CreationDate="2015-01-01"
Compact="No"
Compat="No"
KeyCaseSensitive="No"
Description="This is a <i>test dictionary</i>."
Title="My dictionary"
DataSourceFormat="106"
StyleSheet=""
RegisterBy="EMail"
RegCode="0102030405060708090A0B0C0D0E0F10"/>

The meanings of the attributes are explained below:

Attribute	Description
`GeneratedByEngineVersion`	The version of the file format. This document describes version 2.0. Apart from this, version 1.2 is also possible.
`RequiredEngineVersion`	Presumably the lowest format version compatible with this version.
`Encrypted`	An integer between 0 and 3 (inclusive). If the lower bit is set, indicates that the first part of the keyword section is encrypted, as described in the section Keyword header encryption. If the upper bit is set, indicates that the keyword index is encrypted, using the scheme described in Keyword index encryption.
`Encoding`	Only used for MDX files. The encoding used for text in the document. Possible values are "UTF-8", "UTF-16" (uses little-endian encoding), "GBK", and "Big5". For MDD files, the encoding used for the keywords (file paths) is always UTF-16, and the records consist of binary data.
`Format`	The format of the dictionary entry texts. Possible values include "Html" and "Text". For MDD files, this must be empty.
`CreationDate`	The date the dictionary was created.
`Compact`	If this is "Yes", indicates that the dictionary entries are in an MDict-specific compact format, where certain strings are replaced according to the scheme specified in `StyleSheet`. See the documentation for the official MdxBuilder client for details.
`Compat`	Appears to be a typo for `Compact`, which certain versions of the official MDict client look for instead of `Compact`.
`KeyCaseSensitive`	Indicates to the dictionary reader whether keys should be treated as case-sensitive ("Yes") or case-insensitive ("No").
`Description`	A description of the dictionary, which appears as the ":about" page in the official MDict client.
`Title`	The title of the dictionary.
`DataSourceFormat`	Unknown.
`StyleSheet`	Used in conjunction with the `Compact` option. See the documentation for the official MdxBuilder client for details.
`RegisterBy`	Either "EMail" or "DeviceID". Only used if the lower bit of `Encrypted` is set. Indicates which piece of user-identifying data is used to encrypt the encryption key. See the section Keyword header encryption for details.
`RegCode`	When keyword header encryption is used (see Keyword header encryption), this is one way to deliver the encrypted key. In this case, this is a string consisting of 32 hexadecimal digits.
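
The attributes can be pulled out of header_str with a few lines of Python. This is only a sketch: it assumes `Encrypted` holds an integer as described above, and it uses a regular expression rather than an XML parser because `Description` may contain raw markup such as `<i>`. The helper names are hypothetical.

```python
import re

def parse_attributes(header_str: str) -> dict[str, str]:
    """Collect name="value" pairs from the single header tag."""
    return dict(re.findall(r'(\w+)="(.*?)"', header_str, re.DOTALL))

def encryption_flags(attrs: dict[str, str]) -> tuple[bool, bool]:
    """Return (keyword_header_encrypted, keyword_index_encrypted)."""
    value = int(attrs.get("Encrypted", "0"))   # assumes an integer 0..3
    return bool(value & 1), bool(value & 2)

sample = ('<Dictionary Encrypted="2" Encoding="UTF-8" '
          'Description="A <i>test</i>." Title="My dictionary"/>')
attrs = parse_attributes(sample)
flags = encryption_flags(attrs)
```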

Keyword Section¶

The keyword section contains all the keywords in the dictionary, divided into blocks, as well as information about the sizes of these blocks.

`keyword_sect`	Length
`num_blocks`	8 bytes	Number of items in key_blocks. Big-endian. Possibly encrypted, see below.
`num_entries`	8 bytes	Total number of keywords. Big-endian. Possibly encrypted, see below.
`key_index_decomp_len`	8 bytes	Number of bytes in decompressed version of `key_index`. Big-endian. Possibly encrypted, see below.
`key_index_comp_len`	8 bytes	Number of bytes in compressed version of `key_index` (including the `comp_type` and `checksum` parts). Big-endian. Possibly encrypted, see below.
`key_blocks_len`	8 bytes	Total number of bytes taken up by key_blocks. Big-endian. Possibly encrypted, see below.
`checksum`	4 bytes	ADLER32 checksum of the preceding 40 bytes. If those are encrypted, it is the checksum of the decrypted version. Big-endian.
`key_index`	varying	The keyword index, compressed and possibly encrypted. See below.
`key_blocks[0]`	varying	A compressed block containing keywords. See below.
...	...	...
`key_blocks[num_blocks-1]`	varying	...
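
The fixed-size start of the keyword section could be read as in the sketch below. It assumes the 40 bytes have already been decrypted if necessary. `parse_keyword_header` and `FIELDS` are names invented for this example.

```python
import struct
import zlib

FIELDS = ("num_blocks", "num_entries", "key_index_decomp_len",
          "key_index_comp_len", "key_blocks_len")

def parse_keyword_header(block: bytes) -> dict[str, int]:
    """Parse the (already decrypted) first 44 bytes of keyword_sect."""
    values = struct.unpack_from(">5Q", block, 0)       # five big-endian 8-byte integers
    (stored,) = struct.unpack_from(">I", block, 40)    # big-endian checksum
    if stored != zlib.adler32(block[:40]) & 0xFFFFFFFF:
        raise ValueError("keyword header checksum mismatch")
    return dict(zip(FIELDS, values))

body = struct.pack(">5Q", 1, 3, 60, 48, 120)
block = body + struct.pack(">I", zlib.adler32(body) & 0xFFFFFFFF)
info = parse_keyword_header(block)
# key_index starts right after these 44 bytes; key_blocks follow it.
key_blocks_start = 44 + info["key_index_comp_len"]
```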

Keyword header encryption:¶

If the parameter Encrypted in the header has the lowest bit set (i.e. Encrypted & 1 is nonzero), then the 40-byte block from num_blocks to key_blocks_len is encrypted. The encryption used is Salsa20/8 (Salsa20 with 8 rounds instead of 20). In pseudo-Python:

def encrypt(message, key):
    salsa20_8_init(key_length=128,          # 128 bits
                   iv_length=64,            # 64 bits
                   iv=b"\0\0\0\0\0\0\0\0")  # 64 bits of zeros
    return salsa20_8_encrypt(message, key)

encrypted_block = encrypt(unencrypted_block, key=ripemd128(encryption_key))

Here, encryption_key is the dictionary password specified on creation of the dictionary.

This encryption_key is not distributed directly. Instead it is further encrypted, using a piece of data, user_id, that is specific to the user or the client machine, according to the following scheme:

reg_code = encrypt(ripemd128(encryption_key), ripemd128(user_id))

The string user_id can be either an email address ("example@example.com") that the user enters into his/her MDict client, or a device ID ("12345678-90AB-CDEF-0123-4567890A") which the MDict client obtains in different ways depending on the platform. The choice of which one to use depends on the attribute RegisterBy in the file header. (See Header section.) In either case, user_id is an ASCII-encoded string. On certain platforms, the official MDict client seems to default to the DeviceID being the empty string.

The 128-bit reg_code is then distributed to the user. This can be done in two ways:

If the MDX file is called dictionary.mdx, the dictionary reader should look for a file called dictionary.key in the same directory, which contains reg_code as a 32-digit hexadecimal string.
Otherwise, reg_code can be included in the header of the MDX file, as the attribute RegCode.
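
A reader might locate reg_code roughly as follows. This is a sketch only: the function name `find_reg_code` is hypothetical, and giving the .key file precedence over the `RegCode` attribute is an assumption based on the order of the two options above.

```python
from pathlib import Path

def find_reg_code(mdx_path, header_attrs):
    """Return the 16-byte reg_code, or None if neither source provides one."""
    key_file = Path(mdx_path).with_suffix(".key")   # dictionary.mdx -> dictionary.key
    if key_file.is_file():
        hex_str = key_file.read_text(encoding="ascii").strip()
    else:
        hex_str = header_attrs.get("RegCode", "")
    return bytes.fromhex(hex_str) if hex_str else None

# No .key file exists at this path, so the header attribute is used.
code = find_reg_code("no_such_dir/dictionary.mdx",
                     {"RegCode": "00112233445566778899AABBCCDDEEFF"})
```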

Keyword index¶

The keyword index lists some basic data about the key blocks. It is compressed (see "Compression"), and possibly encrypted (see "Keyword index encryption"). After decompression and decryption, it looks like this:

`decompress(key_index)`	Length
`num_entries[0]`	8 bytes	Number of keywords in the first keyword block.
`first_size[0]`	2 bytes	Length of `first_word[0]`, not including trailing null character. In number of "basic units" for the encoding, so e.g. bytes for UTF-8, and 2-byte units for UTF-16.
`first_word[0]`	varying	The first keyword (alphabetically) in the `key_blocks[0]` keyword block. Encoding given by `Encoding` attribute in the header.
`last_size[0]`	2 bytes	Length of `last_word[0]`, not including trailing null character. In number of "basic units" for the encoding, so e.g. bytes for UTF-8, and 2-byte units for UTF-16.
`last_word[0]`	varying	The last keyword (alphabetically) in the `key_blocks[0]` keyword block. Encoding given by `Encoding` attribute in the header.
`comp_size[0]`	8 bytes	Compressed size of `key_blocks[0]`.
`decomp_size[0]`	8 bytes	Decompressed size of `key_blocks[0]`.
`num_entries[1]`	8 bytes	...
...	...	...
`decomp_size[num_blocks-1]`	8 bytes	...
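
The decompressed keyword index can be walked as in this sketch. It assumes the integers are big-endian like the rest of the format (the table does not say so explicitly) and that each word is followed by a null terminator of one "basic unit". `parse_key_index` and `pack_word` are hypothetical helpers.

```python
import struct

def parse_key_index(data: bytes, encoding: str = "UTF-8"):
    """Yield one dict per keyword block from the decompressed key_index."""
    unit = 2 if encoding.upper() in ("UTF-16", "UTF-16LE") else 1
    codec = "utf-16-le" if unit == 2 else encoding
    pos = 0

    def read_word():
        nonlocal pos
        (size,) = struct.unpack_from(">H", data, pos)
        pos += 2
        word = data[pos:pos + size * unit].decode(codec)
        pos += (size + 1) * unit      # skip the text and its null terminator
        return word

    while pos < len(data):
        (num_entries,) = struct.unpack_from(">Q", data, pos)
        pos += 8
        first = read_word()
        last = read_word()
        comp_size, decomp_size = struct.unpack_from(">QQ", data, pos)
        pos += 16
        yield {"num_entries": num_entries, "first_word": first, "last_word": last,
               "comp_size": comp_size, "decomp_size": decomp_size}

def pack_word(word: str) -> bytes:
    """Build a size-prefixed, null-terminated UTF-8 word for the demo."""
    raw = word.encode("utf-8")
    return struct.pack(">H", len(raw)) + raw + b"\0"

sample = (struct.pack(">Q", 2) + pack_word("apple") + pack_word("zebra")
          + struct.pack(">QQ", 40, 64))
entries = list(parse_key_index(sample))
```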

Keyword index encryption:¶

If the parameter Encrypted in the header has its second-lowest bit set (i.e. Encrypted & 2 is nonzero), then the keyword index is further encrypted. In this case, the comp_type and checksum fields are left unchanged (refer to the section Compression), and the following C function is used to encrypt the compressed_data part, after compression.

#include <stddef.h>

#define SWAPNIBBLE(byte) ((unsigned char)(((byte) >> 4) | ((byte) << 4)))

/* Encrypts buf in place; each output byte depends on the previous output byte. */
void encrypt(unsigned char* buf, size_t buflen, const unsigned char* key, size_t keylen) {
	unsigned char previous = 0x36;
	for (size_t i = 0; i < buflen; i++) {
		buf[i] = SWAPNIBBLE(buf[i] ^ ((unsigned char)i) ^ key[i % keylen] ^ previous);
		previous = buf[i];
	}
}

The encryption key used is ripemd128(checksum + "\x95\x36\x00\x00"), where checksum is the 4-byte checksum field of the compressed keyword index and + denotes string concatenation.
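
For readers working in Python, the C routine can be ported directly, together with its inverse. This is a sketch: the names `fast_encrypt` and `fast_decrypt` are chosen for this example, and the key below is a placeholder because RIPEMD-128 is not available in the Python standard library.

```python
def fast_encrypt(data: bytes, key: bytes) -> bytes:
    """Python port of the C encrypt() routine above."""
    out = bytearray(data)
    previous = 0x36
    for i in range(len(out)):
        t = out[i] ^ (i & 0xFF) ^ key[i % len(key)] ^ previous
        out[i] = ((t >> 4) | (t << 4)) & 0xFF   # swap the two nibbles
        previous = out[i]
    return bytes(out)

def fast_decrypt(data: bytes, key: bytes) -> bytes:
    """Inverse of fast_encrypt."""
    out = bytearray(data)
    previous = 0x36
    for i in range(len(out)):
        t = ((out[i] >> 4) | (out[i] << 4)) & 0xFF
        t ^= (i & 0xFF) ^ key[i % len(key)] ^ previous
        previous = out[i]        # the ciphertext byte feeds the next step
        out[i] = t
    return bytes(out)

key = bytes(range(16))           # placeholder for the 16-byte RIPEMD-128 digest
plain = b"compressed key index bytes"
cipher = fast_encrypt(plain, key)
```

Because each step chains on the previous ciphertext byte, decryption has to save the incoming byte before overwriting it.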

Keyword blocks¶

Each keyword block is compressed (see "Compression"). After decompressing, they look like this:

`decompress(key_blocks[0])`	Length
`offset[0]`	8 bytes	Offset where the record corresponding to `key[0]` can be found, see below. Big-endian.
`key[0]`	varying	The first keyword in the dictionary, null-terminated and encoded using `Encoding`.
`offset[1]`	8 bytes	...
`key[1]`	varying	...
...	...	...

The offset should be interpreted as follows: Decompress all record blocks, and concatenate them together, and let records denote the resulting array of bytes. The record corresponding to key[i] then starts at records[offset[i]].
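
The following sketch parses one decompressed keyword block and uses the offsets to slice out a record. The description above only says where a record starts; taking the next keyword's offset as its end is an assumption, and it requires `entries` to hold the keywords of all blocks in file order. Both function names are hypothetical.

```python
import struct

def parse_key_block(data: bytes, encoding: str = "UTF-8"):
    """Return [(offset, keyword), ...] from one decompressed keyword block."""
    wide = encoding.upper() in ("UTF-16", "UTF-16LE")
    codec, null = ("utf-16-le", b"\0\0") if wide else (encoding, b"\0")
    entries, pos = [], 0
    while pos < len(data):
        (offset,) = struct.unpack_from(">Q", data, pos)
        pos += 8
        end = pos
        # Step in whole code units so a UTF-16 terminator is never matched misaligned.
        while end < len(data) and data[end:end + len(null)] != null:
            end += len(null)
        entries.append((offset, data[pos:end].decode(codec)))
        pos = end + len(null)
    return entries

def lookup(entries, records: bytes, i: int) -> bytes:
    """Slice record i out of the concatenated, decompressed record blocks."""
    start = entries[i][0]
    end = entries[i + 1][0] if i + 1 < len(entries) else len(records)
    return records[start:end]

records = b"first\0second\0"
block = struct.pack(">Q", 0) + b"a\0" + struct.pack(">Q", 6) + b"b\0"
entries = parse_key_block(block)
```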

Record section¶

The record section looks like this:

`record_sect`	Length
`num_blocks`	8 bytes	Number of items in `record_blocks`. Does not need to equal the number of keyword blocks. Big-endian.
`num_entries`	8 bytes	Total number of records in dictionary. Should be equal to `keyword_sect.num_entries`. Big-endian.
`index_len`	8 bytes	Total size of the `comp_size[i]` and `decomp_size[i]` variables, in bytes. In other words, should equal 16 times `num_blocks`. Big-endian.
`blocks_len`	8 bytes	Total size of the `rec_block[i]` sections, in bytes. Big-endian.
`comp_size[0]`	8 bytes	Length of `rec_block[0]`, in bytes. Big-endian.
`decomp_size[0]`	8 bytes	Decompressed size of `rec_block[0]`, in bytes. Big-endian.
`comp_size[1]`	8 bytes	Length of `rec_block[1]`, in bytes. Big-endian.
...	...	...
`decomp_size[num_blocks-1]`	8 bytes	...
`rec_block[0]`	varying	A compressed block containing records. See below.
...	...	...
`rec_block[num_blocks-1]`	varying	...
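
As a sketch of how the table above might be consumed, the function below reads the record section's fixed fields and size index, then computes where each rec_block starts. The name `parse_record_index` is hypothetical, and the two consistency checks simply restate the relationships given in the table.

```python
import struct

def parse_record_index(sect: bytes):
    """Return (num_entries, [(start, comp_size, decomp_size), ...]).

    `start` is the byte position of each rec_block within record_sect.
    """
    num_blocks, num_entries, index_len, blocks_len = struct.unpack_from(">4Q", sect, 0)
    if index_len != 16 * num_blocks:
        raise ValueError("index_len does not match num_blocks")
    blocks, start = [], 32 + index_len
    for i in range(num_blocks):
        comp_size, decomp_size = struct.unpack_from(">QQ", sect, 32 + 16 * i)
        blocks.append((start, comp_size, decomp_size))
        start += comp_size
    if start - (32 + index_len) != blocks_len:
        raise ValueError("blocks_len does not match the sum of comp_size")
    return num_entries, blocks

# Two blocks of 10 and 20 compressed bytes; block contents are dummy zeros.
sect = (struct.pack(">4Q", 2, 5, 32, 30)
        + struct.pack(">4Q", 10, 20, 20, 50)
        + bytes(30))
num_entries, blocks = parse_record_index(sect)
```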

Record block¶

Each record block is compressed (see "Compression"). After decompressing, they look like this:

`decompress(rec_block[0])`	Length
`record[0]`	varying	The first record. If in an MDX file, this is null-terminated and encoded using `Encoding`.
`record[1]`	varying	...
...	...	...

Compression:¶

Various data blocks are compressed using the same scheme. They all look like this:

`compress(data)`	Length
`comp_type`	4 bytes	Compression type. See below.
`checksum`	4 bytes	ADLER32 checksum of the uncompressed data. Big-endian.
`compressed_data`	varying	Compressed version of `data`.

The compression type is indicated by comp_type. There are three options:

If comp_type is '\x00\x00\x00\x00', then no compression is applied at all, and compressed_data is equal to data.
If comp_type is '\x01\x00\x00\x00', LZO compression is used.
If comp_type is '\x02\x00\x00\x00', zlib compression is used. It so happens that the zlib compression format appends an ADLER32 checksum, so in this case, checksum will be equal to the last four bytes of compressed_data.
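
The scheme can be sketched in Python using the standard zlib module. LZO is not in the standard library, so this example only raises an error for it. The names `compress_block` and `decompress_block` are made up for illustration.

```python
import struct
import zlib

def compress_block(data: bytes, use_zlib: bool = True) -> bytes:
    """Wrap data as comp_type + checksum + compressed_data."""
    comp_type = b"\x02\x00\x00\x00" if use_zlib else b"\x00\x00\x00\x00"
    checksum = struct.pack(">I", zlib.adler32(data) & 0xFFFFFFFF)
    payload = zlib.compress(data) if use_zlib else data
    return comp_type + checksum + payload

def decompress_block(block: bytes) -> bytes:
    """Undo compress_block and verify the ADLER32 checksum."""
    comp_type, payload = block[:4], block[8:]
    (stored,) = struct.unpack_from(">I", block, 4)
    if comp_type == b"\x00\x00\x00\x00":
        data = payload
    elif comp_type == b"\x02\x00\x00\x00":
        data = zlib.decompress(payload)
    elif comp_type == b"\x01\x00\x00\x00":
        raise NotImplementedError("LZO needs a third-party library")
    else:
        raise ValueError("unknown comp_type")
    if stored != zlib.adler32(data) & 0xFFFFFFFF:
        raise ValueError("block checksum mismatch")
    return data

original = b"hello world" * 10
block = compress_block(original)
```

With zlib, `block[4:8]` and `block[-4:]` hold the same four bytes, matching the observation above about the checksum.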