The file command
The file command (which is usually installed by default) determines the type of a given file.
Interestingly, the result is almost always correct, which is impressive given the number of different file types (there are currently 2280 registered MIME types).
By comparison, the Windows file type detection technique is very naive: it is mostly based on the file extension, so renaming a file is enough to break it.
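To illustrate why extension-based detection is fragile, here is a minimal sketch using Python's standard mimetypes module, which also looks only at the file name. The helper name guess_by_extension and the file names are made up for the example; this is not how Windows itself is implemented.

```python
import mimetypes

def guess_by_extension(filename: str):
    """Guess a MIME type from the file name alone (hypothetical helper)."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime

# The content is never read: renaming the file changes the answer.
print(guess_by_extension("holiday.png"))
print(guess_by_extension("holiday.pdf"))
```

The two calls could refer to the very same bytes on disk, yet they yield two different types, which is exactly the weakness described above.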
The complete manual of the file command is available in its man page: man file.
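As a rough sketch, the file command can also be called from a script. The example below assumes a file binary that supports the --brief and --mime-type options (as the common Unix implementation does); the helper name detect_mime_type is hypothetical.

```python
import os
import shutil
import subprocess

def detect_mime_type(path: str):
    """Return the MIME type reported by `file`, or None if it cannot be determined."""
    # Bail out if the `file` binary is not installed or the path is not a regular file.
    if shutil.which("file") is None or not os.path.isfile(path):
        return None
    result = subprocess.run(
        ["file", "--brief", "--mime-type", path],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()
```

Calling detect_mime_type on a plain text file would typically return a value such as text/plain, regardless of the file's extension.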
The file command uses multiple techniques to determine the file type.
The idea of the magic bytes is to provide the type information (and sometimes the version) in the first few bytes of the file. It is usually used for binary files, but can also be used for text files (since characters are also bytes).
For example, the WebAssembly binary format (wasm) has the following definitions:
magic ::= 0x00 0x61 0x73 0x6D
version ::= 0x01 0x00 0x00 0x00
module ::= magic version ...
Here, module is the binary file.
As you can see, the first 4 bytes represent the file type (\0asm for short) and the next 4 bytes the version (0x1 here).
Each file type has to be registered and has a unique sequence of magic bytes to avoid any conflicts.
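The magic bytes check for WebAssembly can be sketched in a few lines. This is only an illustration, not the code of the file command; the function name parse_wasm_header is made up, and reading the version as a little-endian 32-bit integer is an assumption consistent with the byte order shown above.

```python
import struct

WASM_MAGIC = b"\x00asm"  # 0x00 0x61 0x73 0x6D

def parse_wasm_header(data: bytes):
    """Return the wasm version if data starts with a wasm header, else None."""
    if len(data) < 8 or data[:4] != WASM_MAGIC:
        return None
    # Assumption: the 4 version bytes form a little-endian unsigned integer.
    (version,) = struct.unpack("<I", data[4:8])
    return version

header = bytes([0x00, 0x61, 0x73, 0x6D, 0x01, 0x00, 0x00, 0x00])
print(parse_wasm_header(header))                 # 1
print(parse_wasm_header(b"\x89PNG\r\n\x1a\n"))   # None
```

The second call uses the 8-byte PNG signature, which does not start with \0asm and is therefore rejected.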
Most text files have no such fixed header, so for them the file command uses a collection of regular expressions instead.
For example, the following regular expressions are used to detect an HTML file:
\<head\>
\<title\>
\<a\ href=
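A minimal sketch of this kind of pattern matching is shown below. The patterns are the three listed above, written without the backslashes because < and the space need no escaping in Python's re module. Matching case-insensitively and only scanning the first 4096 characters are assumptions made for the example, and looks_like_html is a hypothetical name.

```python
import re

# The three patterns listed above, compiled case-insensitively (assumption).
HTML_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"<head>", r"<title>", r"<a href=")
]

def looks_like_html(text: str, limit: int = 4096) -> bool:
    """Return True if any HTML pattern appears near the start of the text."""
    head = text[:limit]
    return any(pattern.search(head) for pattern in HTML_PATTERNS)

print(looks_like_html("<html><head><title>Hi</title></head></html>"))  # True
print(looks_like_html("Just a plain sentence."))                       # False
```

A real rule set is more careful than this: any text that merely quotes one of these tags would be reported as HTML by this sketch.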
In the case of shell scripts, the interpreter is usually indicated at the top of the file (for example #!/usr/bin/python). This line is also used for file detection.
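Reading the interpreter line can be sketched as follows. The function name shebang_interpreter is hypothetical, and the special handling of env (as in #!/usr/bin/env python3) is an extra assumption that is not described above.

```python
def shebang_interpreter(data: bytes):
    """Return the interpreter named on the first line, or None if there is none."""
    if not data.startswith(b"#!"):
        return None
    # Keep only the first line, without the leading "#!".
    first_line = data[2:].split(b"\n", 1)[0].decode("utf-8", errors="replace")
    parts = first_line.split()
    if not parts:
        return None
    # Assumption: with ".../env name", the real interpreter is the first argument.
    if parts[0].endswith("/env") and len(parts) > 1:
        return parts[1]
    return parts[0]

print(shebang_interpreter(b"#!/usr/bin/python\nprint('hi')\n"))  # /usr/bin/python
print(shebang_interpreter(b"print('hi')\n"))                     # None
```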
You can find the sources of the file command here and the detection rules here.