Testing Magika
Google recently announced the release of Magika, an “AI-powered file-type identification system”. I tested this on a corpus of nearly 125k files to see how it fared.
Why?
File type detection is useful in a number of places, such as:
- Anti-spam - detecting unwanted attachments, for example those with executable content.
- Website uploads - allowing certain types of content whilst avoiding others.
- Malware reverse engineering - determining what actions to run on a particular file.
Some of these use cases are speed sensitive. For example, anti-spam measures typically need to run within the SMTP session (i.e. within seconds), and users want timely feedback when uploading documents to a website. Others are more forgiving and can potentially run in a slower pipeline.
Tools for file type identification have been around for a long time, for example TrID and file. I was specifically interested in Magika as a new approach to a problem with various challenges. It also has a Python API, making it easy to build into other tools.
The test bench
All tests below were run inside a virtual machine on commodity hardware. The VM had access to 2 cores of an Intel CPU [1] and 2GB RAM. The files were stored on a consumer solid-state drive.
This environment was chosen purposefully. It would be cheap and easy to spin up a powerful cloud compute instance with 16 cores and thousands of IOPS [2]. However, accurate and fast detection of file types is useful in lots of places, including on laptops and in containerised workflows.
All tests were repeated a small number of times, to minimise the effects of caching or other system load. No significant deviation from the figures below was observed, but the results are not presented as scientific.
The corpus
The files chosen for testing are attachments collected over the past few years from a spam sinkhole. The data totalled 6.1GB in 124,791 files. File sizes are dominated by two buckets at 16KB-32KB and 64KB-128KB, as seen below.
All files are named with their SHA256 hash - none have file extensions or any other metadata that might be useful for type detection.
Using file
Before testing Magika it’s helpful to have a baseline. The virtual machine used has file version 5.44 [3].

Running file takes just over a minute across the data [4]:
$ time find . -type f | file --extension -f - > ~/magic-results.txt
15.56s user 7.31s system 29% cpu 1:16.85 total
As the corpus comes from spam email it’s no surprise that ~98% are PDF files according to file (122,768 total).

Excluding the PDF files, the suggestion from file for the remainder is:
Values marked ??? are outliers that I chose not to correct when compiling the statistics. The majority of these are ZIP files which file correctly identified; however, file did not suggest an extension. This may or may not be relevant to a specific use-case for file type detection.
Using Magika
For the tests I used Magika version 0.5.1 with the model standard_v1 (the latest at the time of writing). The model has labels for 115 content types, plus an extra labelled unknown. In contrast, the version of file used for testing has a magic database with approximately 3,549 patterns across 2,738 file types [5].
Running the magika executable takes 22 minutes:
$ magika -r . --json > ~/magika-results.json
2203.23s user 20.81s system 166% cpu 22:19.53 total
Some of the speed difference can likely be explained by Magika being implemented in Python rather than C. Many of the underlying libraries, such as numpy, are C code, but crossing the Python/C boundary is a bottleneck. Still, Magika is ~17x slower than file - a significant difference.
An example of the JSON output from Magika is shown below.
{
  "path": "<filename>",
  "dl": {
    "ct_label": "outlook",
    "score": 0.4435848295688629,
    "group": "application",
    "mime_type": "application/vnd.ms-outlook",
    "magic": "CDFV2 Microsoft Outlook Message",
    "description": "MS Outlook Message"
  },
  "output": {
    "ct_label": "unknown",
    "score": 0.4435848295688629,
    "group": "unknown",
    "mime_type": "application/octet-stream",
    "magic": "data",
    "description": "Unknown binary data"
  }
}
Native JSON output is very convenient and easy to parse, including at the command line with tools like jq.
In the example above the file was detected by the model as an email message in Outlook format. However, the confidence score of 44.36% was not high enough and the match was not reported. The --prediction-mode option can be used to change this.
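To make the behaviour concrete, the sketch below parses a record of the JSON shape shown above and applies a score threshold. The `best_guess` helper and the 0.5 cut-off are illustrative assumptions of mine, not Magika’s actual internal threshold or API.

```python
import json

# One record shaped like the example output above.
record = json.loads("""
{
  "path": "<filename>",
  "dl": {"ct_label": "outlook", "score": 0.4435848295688629,
         "group": "application", "mime_type": "application/vnd.ms-outlook",
         "magic": "CDFV2 Microsoft Outlook Message",
         "description": "MS Outlook Message"},
  "output": {"ct_label": "unknown", "score": 0.4435848295688629,
             "group": "unknown", "mime_type": "application/octet-stream",
             "magic": "data", "description": "Unknown binary data"}
}
""")

def best_guess(record, threshold=0.5):
    """Return the model's label if its score clears the threshold,
    otherwise fall back to the reported output label."""
    dl = record["dl"]
    if dl["score"] >= threshold:
        return dl["ct_label"]
    return record["output"]["ct_label"]

print(best_guess(record))                 # score 0.44 is below the threshold
print(best_guess(record, threshold=0.4))  # accept lower-confidence guesses
```

Lowering the threshold trades precision for coverage, which is essentially what --prediction-mode exposes as a switch.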
Ignoring PDF files, Magika provided a suggestion for all except 11 files. One of these was an empty file which was accidentally included in the corpus and discovered during testing [6].
A chart of the confidence scores is shown below. This chart was generated from the 2,022 files which were not detected as PDF. The values are scaled to percentages and exclude files where the score was 1 (i.e. 100%).

The scores are weighted toward the higher end, but as noted below there are occasions when Magika is confidently wrong.
Unclassified files
The files not classified by Magika were:
- 7 ACE archives, which are detected by file (see note below on this format).
- 1 file in BlakHole archive data format, which is detected by file.
- 2 PDFs where the %PDF- header does not appear at the start of the file. Neither are detected by file.
Drilling into the differences
In many cases I found that file agreed with Magika. In some cases there were discrepancies due to the challenge of distinguishing between similar files, for example text vs. HTML vs. JavaScript. There were cases where one tool performed better than the other; a few are highlighted below.
RTF
Magika does a good job detecting RTF documents. For example 9fa6b1.. is correctly detected as RTF by Magika but only as data by file. This file starts:
00000000 7b 5c 72 74 26 a7 5b 3f 39 39 23 3f 2e 3e 2f 27 |{\rt&§[?99#?.>/'|
00000010 34 35 23 3f 2a 60 3f 5f 25 b0 26 3d 34 7c 25 5e |45#?*`?_%°&=4|%^|
The lack of detection by file is likely because the file has a malformed header (the standard requires {\rtf) and also includes non-ASCII characters. Despite this, Microsoft parsers will happily open the file, a fact which is well known to threat actors and red teams alike.
Magika classified a total of 65 RTF documents; file spotted 40 of them.
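The sample’s weakness to strict signature matching can be shown directly. The bytes below come from the hex dump above, and `looks_like_rtf` is a hypothetical helper of mine approximating a signature-based check, not file’s actual logic.

```python
# First bytes of the malformed sample shown above: "{\rt&" rather than "{\rtf".
malformed = bytes.fromhex("7b5c727426a7")

def looks_like_rtf(data: bytes) -> bool:
    # Strict signature check: the standard requires files to begin with "{\rtf".
    return data.startswith(b"{\\rtf")

print(looks_like_rtf(b"{\\rtf1\\ansi"))  # True - a well-formed header
print(looks_like_rtf(malformed))         # False - yet Word will still open it
```

A model trained on real-world samples can learn the fuzzier structure of such files, which is plausibly why Magika catches them.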
JAR
Three files were correctly classified as Java archives by Magika, but file detected only one of them. The file listing from all three clearly shows the Java content.
Date Time Attr Size Compressed Name
------------------- ----- ------------ ------------ ------------------------
2022-02-17 19:58:33 ....A 940179 634855 qexqagy/resources/adihtgswsp
2022-02-17 19:58:33 ....A 58 58 META-INF/MANIFEST.MF
2022-02-17 19:58:33 ....A 1855 1001 qexqagy/Mcfwroxzpjb.class
------------------- ----- ------------ ------------ ------------------------
JAR files use the ZIP format, so it is perhaps surprising that Magika performed better when file has the opportunity for a crude string match against MANIFEST.MF or .class.
The statistical model used by Magika has somehow learned the difference, with scores of 0.6801, 0.9999 and 0.7058. It is possible that other ZIP metadata, such as version numbers or compression levels, has been picked up by the model.
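For comparison, the kind of crude name-based check mentioned above is easy to write. The archive below mirrors the listing shown earlier but with dummy contents; `is_probably_jar` is my own illustrative helper, not how either tool actually works.

```python
import io
import zipfile

# Build an in-memory ZIP mirroring the listing above (contents are dummies).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("qexqagy/resources/adihtgswsp", b"\x00" * 16)
    z.writestr("META-INF/MANIFEST.MF", b"Manifest-Version: 1.0\n")
    z.writestr("qexqagy/Mcfwroxzpjb.class", b"\xca\xfe\xba\xbe")

def is_probably_jar(data: bytes) -> bool:
    """Crude check: a valid ZIP containing a manifest or .class entries."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as z:
            names = z.namelist()
    except zipfile.BadZipFile:
        return False
    return "META-INF/MANIFEST.MF" in names or any(
        n.endswith(".class") for n in names
    )

print(is_probably_jar(buf.getvalue()))  # True
print(is_probably_jar(b"not a zip"))    # False
```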
ACE archives
File e55e29.. was detected as ACE archive data version 20, from Win/32 by file and as Macromedia Flash data by Magika (score 0.6460). In this case a quick look at the first few bytes shows that file is correct.
00000000 54 6c 31 00 00 00 90 2a 2a 41 43 45 2a 2a 14 14 |Tl1....**ACE**..|
00000010 02 00 26 6e 43 55 a1 39 97 6d 00 00 00 00 16 2a |..&nCU¡9.m.....*|
00000020 55 4e 52 45 47 49 53 54 45 52 45 44 20 56 45 52 |UNREGISTERED VER|
00000030 53 49 4f 4e 2a 39 07 31 00 01 01 80 30 c8 08 00 |SION*9.1....0È..|
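The giveaway is the ACE header’s magic string, **ACE**, at offset 7, which is visible in the dump above. A minimal signature check along the lines of what file does might look like this (the `is_ace` helper is my own sketch):

```python
# The first 16 bytes of the sample, copied from the hex dump above.
header = bytes.fromhex("546c31000000902a2a4143452a2a1414")

def is_ace(data: bytes) -> bool:
    # The ACE archive header carries the magic string "**ACE**" at offset 7.
    return data[7:14] == b"**ACE**"

print(is_ace(header))  # True
```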
Another ACE archive was incorrectly classified by Magika as an ISO with a score of 0.6889. Magika is confidently wrong for these files. The inclusion of ACE archives in the next round of training data could improve this.
Specific software is required to open ACE files - my favourite archiver 7-Zip does not support them. Therefore the utility of this file format as a spam / malware delivery mechanism is questionable.
Links, or not
File 204b22.. was detected as ASCII text, with no line terminators by file and as MS Windows Internet shortcut by Magika.
<META HTTP-EQUIV=Refresh CONTENT="0; URL=http://tiny.cc/<redacted>"></a>
This is a trickier one - the file isn’t a fully formed HTML webpage. Neither classification is technically correct, but neither is egregiously wrong either.
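Functionally the file is a redirect, whatever label it is given. A few lines of Python can pull out the target; `meta_refresh_target` is a hypothetical helper for illustration, and the &lt;redacted&gt; placeholder from the sample is kept as-is.

```python
import re

# The meta-refresh stub shown above, with the URL redacted as in the original.
sample = '<META HTTP-EQUIV=Refresh CONTENT="0; URL=http://tiny.cc/<redacted>"></a>'

def meta_refresh_target(data):
    """Pull the redirect target out of a meta-refresh stub, if present."""
    m = re.search(r'URL=([^"]+)', data, re.IGNORECASE)
    return m.group(1) if m else None

print(meta_refresh_target(sample))  # http://tiny.cc/<redacted>
```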
Testing startup time
The examples above execute the tool once to process a batch of data. Some use cases might execute the tool once per file. A simple script was written to simulate this by running the tools 1,000 times with a fixed list of files.
$ time ~/magika-1000.sh
231.92s user 88.24s system 138% cpu 3:51.90 total
$ time ~/file-1000.sh
1.07s user 0.47s system 92% cpu 1.659 total
The table below compares the mean time per file when running the tool once in batch mode (125k files) with running the tool repeatedly in a loop (1k files).
Tool | Batch mode | 1,000 files
---|---|---
file | 3ms | 1.6ms
magika | 249ms | 291ms
These differences are small, and the imprecise timing is likely a contributing factor. For both tools, the figures show that startup time is not a huge penalty. Crucially, Magika does not seem to be significantly slower due to loading the model into memory.
Summary
Pros
- A Python library is available, which is easy to install and integrate with existing code.
- Magika natively produces JSON output, which is easy to parse.
- A confidence score is provided, which can be useful for certain use-cases.
- Better detection of certain malformed files based on real world data (e.g. RTF).
Cons
- The default model is trained on common files and misses some lesser used formats (e.g. obscure archives).
- magika is slower on a large corpus. The difference is potentially acceptable per file, but adds up quickly with more data.
- magika does not distinguish between different versions of file formats [7].
Conclusion
In summary, Magika is an excellent tool that is worth considering for file type classification. It is slower than file, but a classification time of ~300ms on average hardware is likely acceptable in many cases, even for interactive protocols such as SMTP.
It is worth building into processing tools as an alternative classifier, if only to say that your tool is powered by machine learning 😂.
1. Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz.
2. At the time of writing a c6gd.4xlarge machine is 73 cents per hour from AWS.
3. Specifically file version 5.44-3 amd64 from Debian.
4. The find command takes about 350ms and was removed for brevity.
5. Rough figures from parsing the output of file --list.
6. Magika correctly reports the empty file as inode/x-empty.
7. The magic database used by file distinguishes between versions of many file formats, e.g. python 1.3 byte-compiled is different to python 1.4 byte-compiled.