Image credit: pixabay (https://pixabay.com/)

Testing Magika

Google recently announced the release of Magika, an “AI-powered file-type identification system”. I tested this on a corpus of nearly 125k files to see how it fared.

Why?

File type detection is useful in a number of places, such as:

  • Anti-spam - detecting unwanted attachments, for example those with executable content.
  • Website uploads - allowing certain types of content whilst avoiding others.
  • Malware reverse engineering - determining what actions to run on a particular file.

Some of these use cases are speed-sensitive. For example, anti-spam measures typically need to run within the SMTP session (i.e. within seconds) and users want timely feedback when uploading documents to a website. Others are more forgiving and can potentially run in a slower pipeline.

Tools for file type identification have been around for a long time, for example TRiD and file. I was specifically interested in Magika as a new approach to a problem that has various challenges. It also has a Python API, making it easy to build into other tools.
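
Basic usage of the Python API looks something like this (a minimal sketch following the project README for the version tested below; the path is a placeholder):

from pathlib import Path
from magika import Magika

m = Magika()                                  # loads the bundled model
result = m.identify_path(Path("sample.pdf"))  # placeholder path
print(result.output.ct_label, result.output.score)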

The test bench

All tests below were run inside a virtual machine on commodity hardware. The VM had access to 2 cores of an Intel CPU [1] and 2GB RAM. The files were stored on a consumer solid-state drive.

This environment was chosen purposefully. It would be cheap and easy to spin up a powerful cloud compute instance with 16 cores and thousands of IOPS [2]. However, accurate and fast detection of file types is useful in lots of places, including on laptops and in containerised workflows.

All tests were repeated a small number of times, to minimise the effects of caching or other system load. No significant deviation from the figures below was observed, but the results are not presented as scientific.

The corpus

The files chosen for testing are attachments collected over the past few years from a spam sinkhole. The data totalled 6.1GB in 124,791 files. File sizes are dominated by two buckets, 16KB-32KB and 64KB-128KB, as seen below.

All files are named with their SHA256 hash - none have file extensions or any other metadata that might be useful for type detection.
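
The size histogram can be reproduced with a few lines of Python, bucketing each file into a power-of-two range (a sketch, run from the corpus directory):

import os
from collections import Counter
from math import floor, log2

buckets = Counter()
for root, _dirs, names in os.walk("."):
    for name in names:
        size = os.path.getsize(os.path.join(root, name))
        if size == 0:
            buckets["empty"] += 1
        else:
            low = 2 ** floor(log2(size))
            buckets[f"{low}-{low * 2} bytes"] += 1

for label, count in buckets.most_common():
    print(f"{count:7d}  {label}")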

Using file

Before testing Magika it’s helpful to have a baseline. The virtual machine used has file version 5.44 [3].

Running file takes just over a minute across the data [4]:

$ time find . -type f | file --extension -f - > ~/magic-results.txt
15.56s user 7.31s system 29% cpu 1:16.85 total

As the corpus comes from spam email, it’s no surprise that ~98% of the files are PDFs according to file (122,768 in total).

Excluding the PDF files, the suggestion from file for the remainder is:

Values marked ??? are outliers that I chose not to correct when compiling the statistics. The majority of these are ZIP files which file correctly identified but for which it did not suggest an extension. This may or may not be relevant to a specific use case for file type detection.
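
Compiling statistics from this output only takes a few lines, for example (assuming file's usual "path: result" line layout):

from collections import Counter

counts = Counter()
with open("magic-results.txt") as fh:
    for line in fh:
        # Each line is "path: extension(s)"; ??? indicates no suggestion.
        _path, _sep, extensions = line.rstrip("\n").rpartition(": ")
        counts[extensions] += 1

for extensions, count in counts.most_common():
    print(f"{count:7d}  {extensions}")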

Using Magika

For the tests I used Magika version 0.5.1 with the model standard_v1 (the latest at the time of writing). The model has labels for 115 content types, plus an extra labelled unknown. In contrast, the version of file used for testing has a magic database with approximately 3,549 patterns across 2,738 file types [5].

Running the magika executable takes 22 minutes:

$ magika -r . --json > ~/magika-results.json
2203.23s user 20.81s system 166% cpu 22:19.53 total

Some of the speed difference can likely be explained by Magika being implemented in Python rather than C. Many of the underlying libraries, such as numpy, are C code, but the transitions between Python and C still act as a bottleneck. Even so, being ~17x slower than file is a significant difference.

An example of the JSON output from Magika is shown below.

{
  "path": "<filename>",
  "dl": {
    "ct_label": "outlook",
    "score": 0.4435848295688629,
    "group": "application",
    "mime_type": "application/vnd.ms-outlook",
    "magic": "CDFV2 Microsoft Outlook Message",
    "description": "MS Outlook Message"
  },
  "output": {
    "ct_label": "unknown",
    "score": 0.4435848295688629,
    "group": "unknown",
    "mime_type": "application/octet-stream",
    "magic": "data",
    "description": "Unknown binary data"
  }
}

Native JSON output is very convenient and easy to parse, including at the command line with tools like jq.

In the example above the file was detected by the model as an email message in Outlook format. However, the confidence score of 44.36% was below the reporting threshold, so the match was not reported in the output block. The --prediction-mode option can be used to change this behaviour.
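
The raw scores are all in the JSON, so it is easy to apply a different threshold after the fact. A sketch, assuming the --json output is an array of records like the one above:

import json

with open("magika-results.json") as fh:
    results = json.load(fh)   # assumes --json produces a list of records

# Files where the model made a guess but the score fell below the default
# reporting threshold, so the final output is "unknown".
below_threshold = [
    r for r in results
    if r["output"]["ct_label"] == "unknown" and r["dl"]["ct_label"] != "unknown"
]

for r in below_threshold:
    print(f"{r['path']}: model suggested {r['dl']['ct_label']} at {r['dl']['score']:.2%}")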

Ignoring PDF files, Magika provided a suggestion for all but 11 files. One of these was an empty file which was accidentally included in the corpus and discovered during testing [6].

A chart of the confidence scores is shown below. It was generated from the 2,022 files which were not detected as PDF. The values are scaled to percentages and exclude files where the score was 1 (i.e. 100%).

The confidence scores are weighted toward the higher end, but as noted below there are occasions when Magika is confidently wrong.
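
The same JSON can be bucketed directly to reproduce the distribution (a sketch; the bucket width is my choice):

import json
from collections import Counter

with open("magika-results.json") as fh:
    results = json.load(fh)

histogram = Counter()
for r in results:
    score = r["output"]["score"]
    if r["output"]["ct_label"] == "pdf" or score >= 1.0:
        continue  # the chart excludes PDFs and files scored at exactly 100%
    histogram[int(score * 10) * 10] += 1  # 10 percentage point buckets

for bucket in sorted(histogram):
    print(f"{bucket:3d}-{bucket + 10}%: {histogram[bucket]}")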

Unclassified files

The files not classified by Magika were:

  • 7 ACE archives which are detected by file (see note below on this format).
  • 1 file in BlakHole archive data format which is detected by file.
  • 2 PDFs where the %PDF- header does not appear at the start of the file. Neither is detected by file.

Drilling into the differences

In many cases I found that file agreed with Magika. In some cases there were discrepancies due to the challenge of distinguishing between similar files, for example text vs. HTML vs. JavaScript.

There were cases where one tool performed better than the other; a few are highlighted below.

RTF

Magika does a good job detecting RTF documents. For example, 9fa6b1.. is correctly detected as RTF by Magika but only as data by file. This file starts:

00000000  7b 5c 72 74 26 a7 5b 3f 39 39 23 3f 2e 3e 2f 27  |{\rt&§[?99#?.>/'|
00000010  34 35 23 3f 2a 60 3f 5f 25 b0 26 3d 34 7c 25 5e  |45#?*`?_%°&=4|%^|

The lack of detection by file is likely because the file has a malformed header (the standard requires {\rtf) and also includes non-ASCII characters. Despite this, Microsoft parsers will happily open the file, a fact which is well known to threat actors and red teams alike.
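
The difference is easy to demonstrate with two prefix checks (the path is a placeholder for the sample above):

with open("sample", "rb") as fh:     # placeholder path for the file shown above
    header = fh.read(5)

print(header.startswith(b"{\\rtf"))  # False - the malformed sample lacks the full header
print(header.startswith(b"{\\rt"))   # True - yet Microsoft parsers still open it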

Magika classified a total of 65 RTF documents; file spotted 40 of them.

JAR

Three files were correctly classified as Java archives by Magika, but file detected only one of them. The archive listing for all three clearly shows the Java content, for example:

  Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2022-02-17 19:58:33 ....A       940179       634855  qexqagy/resources/adihtgswsp
2022-02-17 19:58:33 ....A           58           58  META-INF/MANIFEST.MF
2022-02-17 19:58:33 ....A         1855         1001  qexqagy/Mcfwroxzpjb.class
------------------- ----- ------------ ------------  ------------------------

JAR files use the ZIP format, so it is perhaps surprising that Magika performed better when file has the opportunity for a crude string match against MANIFEST.MF or .class.
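
For comparison, that kind of crude check is easy to reproduce with Python's zipfile module (the path is a placeholder):

import zipfile

def looks_like_jar(path):
    """Crude check: does the ZIP contain a Java manifest or compiled classes?"""
    try:
        names = zipfile.ZipFile(path).namelist()
    except zipfile.BadZipFile:
        return False
    return any(n == "META-INF/MANIFEST.MF" or n.endswith(".class") for n in names)

print(looks_like_jar("sample"))  # placeholder path; True for the archive listed above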

The statistical model used by Magika has somehow learned this distinction, with scores of 0.6801, 0.9999 and 0.7058. It is possible that other ZIP metadata, such as version numbers or compression levels, has been learned by the model.

ACE archives

File e55e29.. was detected as ACE archive data version 20, from Win/32 by file and Macromedia Flash data by Magika (score 0.6460). In this case a quick look at the first few bytes shows that file is correct.

00000000  54 6c 31 00 00 00 90 2a 2a 41 43 45 2a 2a 14 14  |Tl1....**ACE**..|
00000010  02 00 26 6e 43 55 a1 39 97 6d 00 00 00 00 16 2a  |..&nCU¡9.m.....*|
00000020  55 4e 52 45 47 49 53 54 45 52 45 44 20 56 45 52  |UNREGISTERED VER|
00000030  53 49 4f 4e 2a 39 07 31 00 01 01 80 30 c8 08 00  |SION*9.1....0È..|
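
The **ACE** marker visible at offset 7 is simple to check for directly (a sketch; the path is a placeholder):

# The "**ACE**" marker sits at offset 7 of the header, as in the dump above.
with open("sample", "rb") as fh:   # placeholder path for the archive above
    header = fh.read(14)

print(header[7:14] == b"**ACE**")  # True for the file shown above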

Another ACE archive was incorrectly classified by Magika as an ISO with a score of 0.6889. Magika is confidently wrong for these files. The inclusion of ACE archives in the next round of training data could improve this.

Specific software is required to open ACE files - my favourite archiver 7-Zip does not support them. Therefore the utility of this file format as a spam / malware delivery mechanism is questionable.

HTML redirects

File 204b22.. was detected as ASCII text, with no line terminators by file and as MS Windows Internet shortcut by Magika.

<META HTTP-EQUIV=Refresh CONTENT="0; URL=http://tiny.cc/<redacted>"></a>

This is a trickier one - the file isn’t a fully formed HTML webpage. Neither classification is technically correct, but neither is egregiously wrong either.

Testing startup time

The examples above execute the tool once to process a batch of data. Some use cases might execute the tool once per file. A simple script was written to simulate this by running the tools 1,000 times with a fixed list of files.
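
The scripts themselves are trivial; a rough Python equivalent of the Magika loop, invoking the tool once per file, might look like this (the file list is a placeholder):

import subprocess
import time

# Placeholder: a fixed list of test files, one path per line.
files = open("file-list.txt").read().split()

start = time.monotonic()
for path in files:
    subprocess.run(["magika", "--json", path], capture_output=True, check=True)

elapsed = time.monotonic() - start
print(f"{elapsed:.1f}s total, {elapsed / len(files) * 1000:.0f}ms per invocation")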

$ time ~/magika-1000.sh
231.92s user 88.24s system 138% cpu 3:51.90 total

$ time ~/file-1000.sh
1.07s user 0.47s system 92% cpu 1.659 total

The table below compares the mean time per file when running the tool once in batch mode (125k files) with running the tool repeatedly in a loop (1k files).

Tool     Batch mode   1,000 files
file     3ms          1.6ms
magika   249ms        291ms

These differences are small, and imprecise timing is likely a contributing factor. For both tools they show that startup time is not a huge penalty. Crucially, Magika does not seem to be significantly slower when the model must be loaded on each invocation.

Summary

Pros

  • A Python library is easily installed and can be integrated with existing code.
  • Magika natively produces JSON output, which is easy to parse.
  • A confidence score is provided, which can be useful for certain use cases.
  • Better detection of certain malformed files based on real-world data (e.g. RTF).

Cons

  • The default model is trained on common files and misses some lesser-used formats (e.g. obscure archives).
  • magika is slower on a large corpus. The difference is potentially acceptable per file, but adds up quickly with more data.
  • magika does not distinguish between different versions of file formats [7].

Conclusion

In summary, Magika is an excellent tool that is worth considering for file type classification. It is slower than file, but a classification time of ~300ms per file on average hardware is likely acceptable in many cases, even for interactive protocols such as SMTP.

It is worth building into processing tools as an alternative classifier, if only to say that your tool is powered by machine learning 😂.


  1. Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz.

  2. At the time of writing a c6gd.4xlarge machine is 73 cents per hour from AWS.

  3. Specifically file version 5.44-3 amd64 from Debian.

  4. The find command takes about 350ms and was removed for brevity.

  5. Rough figures from parsing the output of file --list.

  6. Magika correctly reports the empty file as inode/x-empty.

  7. The magic database used by file distinguishes between versions of many file formats, e.g. python 1.3 byte-compiled is different to python 1.4 byte-compiled.
