Determining if a File is a PDF in Rust

In my work generating BI (Business Intelligence) reports as PDFs with Apache FOP (the Java XSL-FO to PDF engine), I often need to verify that an output file is actually a PDF before further processing. A quick way to do this is to inspect the file’s “magic bytes”: the fixed signature at the start of the file. Magic bytes (also called file signatures or magic numbers) are short byte sequences placed at the beginning of a file to identify its format. Because a signature is defined by the file format itself and chosen to be distinctive, checking it is a fast way to recognize a PDF (or most other file types) regardless of the file’s extension. It does not prove that the rest of the file is well-formed, but it catches common failures such as an empty file or non-PDF output written under a .pdf name.

Checking the PDF Magic Number in Rust (with CLI)

To implement this check in Rust, we can open the file, read its first few bytes, and compare them to the PDF signature. Let’s make this a command-line tool so you can run it as check-pdf file_to_test.pdf.

The steps are:

  1. Accept the file path as a command-line argument.
  2. Open the file using std::fs::File::open.
  3. Read the first N bytes into a buffer (for PDF, N = 5, since %PDF- is 5 bytes).
  4. Compare the buffer to the byte string b"%PDF-".
  5. Print the result and handle errors gracefully.

If the buffer matches the signature, we treat the file as a PDF. Otherwise it is not a PDF (or it is too short to contain the signature). This approach is efficient because it reads only five bytes instead of the whole file.

To get started, create a new Rust binary project called check-pdf:

cargo new --bin check-pdf

This will create a new directory check-pdf with a src folder inside. Replace the contents of src/main.rs with the following code:

// main.rs (place this in check-pdf/src/main.rs)
use std::env;
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Checks if the given file is a PDF by inspecting its first bytes.
fn is_pdf(path: &Path) -> io::Result<bool> {
    let mut file = File::open(path)?;
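    let mut header = [0u8; 5]; // Buffer for "%PDF-"

    // `read` returns the number of bytes actually read, which is fewer than 5
    // for a file shorter than the signature.
    let bytes_read = file.read(&mut header)?;

    // Both conditions must hold: a full 5-byte read and a matching signature.
    Ok(bytes_read == header.len() && &header == b"%PDF-")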
}

fn main() {
    let args: Vec<String> = env::args().collect();

    if args.len() != 2 {
        eprintln!("Usage: {} <file>", args[0]);

        std::process::exit(1);
    }
    
    let path = Path::new(&args[1]);

    match is_pdf(path) {
        Ok(true) => println!("{} is a PDF file", path.display()),
        Ok(false) => println!("{} is NOT a PDF file", path.display()),
        Err(e) => {
            eprintln!("Error checking file '{}': {}", path.display(), e);

            std::process::exit(2);
        }
    }
}

You can now build your project with cargo build --release and run it as ./target/release/check-pdf file_to_test.pdf.

Understanding Magic Bytes

A file signature (magic number) is data used to identify or verify the content of a file. Most common formats put a signature at the very start of the file. For example, PNG images begin with the bytes 89 50 4E 47 0D 0A 1A 0A, GIF images start with 47 49 46 38 (ASCII GIF8), and PDFs start with the ASCII string %PDF-. These signatures are not chosen at random: most embed a readable ASCII tag (PNG, GIF8, %PDF-) that is unlikely to appear at the start of a file in another format. In practice this means that if a file’s first bytes match the PDF signature, the file is very likely a PDF, although the signature alone says nothing about whether the rest of the file is intact.
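The signatures listed above can be checked against an in-memory buffer with a few prefix comparisons. The following is a minimal sketch rather than a complete sniffer; the FileKind enum and the sniff function are hypothetical names introduced only for illustration.

```rust
/// File kinds this sketch recognizes (hypothetical enum, for illustration only).
#[derive(Debug, PartialEq)]
enum FileKind {
    Png,
    Gif,
    Pdf,
    Unknown,
}

/// Classifies a buffer by comparing its first bytes to known signatures.
/// `sniff` is a hypothetical helper, not part of any library.
fn sniff(bytes: &[u8]) -> FileKind {
    const PNG: &[u8] = &[0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A];
    const GIF: &[u8] = b"GIF8";
    const PDF: &[u8] = b"%PDF-";

    // `starts_with` returns false when the buffer is shorter than the prefix,
    // so no separate length check is needed.
    if bytes.starts_with(PNG) {
        FileKind::Png
    } else if bytes.starts_with(GIF) {
        FileKind::Gif
    } else if bytes.starts_with(PDF) {
        FileKind::Pdf
    } else {
        FileKind::Unknown
    }
}
```

Because the function takes a byte slice rather than a path, it can be exercised without touching the filesystem, for example with sniff(b"%PDF-1.7").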

PDF files start with the literal characters %PDF- (hex 25 50 44 46 2D), followed by the format version, for example %PDF-1.7. If we see exactly that sequence at the start, the file is almost certainly a PDF: conforming PDF generators, including Apache FOP, write it as the first bytes of their output.
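The strict check treats any bytes before %PDF- as a failure. Some PDF readers are reportedly more forgiving and accept a header that appears a little later in the file; Acrobat, for instance, has been documented as searching the first 1024 bytes. If you need to accept such files, a lenient variant might look like the sketch below. The 1024-byte window and the name find_pdf_header are assumptions of this example, not something Apache FOP output requires.

```rust
/// How many leading bytes to search. This window size is an assumption of the
/// sketch, modeled on the leniency some PDF readers are reported to have.
const SEARCH_WINDOW: usize = 1024;

const PDF_MAGIC: &[u8] = b"%PDF-";

/// Returns the offset of the first `%PDF-` that lies entirely within the
/// first `SEARCH_WINDOW` bytes of `bytes`, or `None` if there is none.
/// `find_pdf_header` is a hypothetical helper name.
fn find_pdf_header(bytes: &[u8]) -> Option<usize> {
    // Limit the search to the window (or the whole buffer if it is shorter).
    let end = bytes.len().min(SEARCH_WINDOW);

    // `windows` yields nothing when the slice is shorter than the signature.
    bytes[..end]
        .windows(PDF_MAGIC.len())
        .position(|window| window == PDF_MAGIC)
}
```

An offset of 0 corresponds to the strict check; any other offset means the file has leading bytes that a stricter consumer may reject.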

How the Code Works

  • The program expects a single argument: the file path.
  • We call File::open(path)? to open the file (propagating any I/O error).
  • We create a fixed-size buffer header of 5 bytes ([0u8; 5]) since the PDF signature %PDF- is 5 bytes long.
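  • Using file.read(&mut header)?, we read up to 5 bytes from the file. The call returns how many bytes were actually read, which is fewer than 5 when the file is shorter than the signature; a single read call is also not strictly guaranteed to fill the buffer.
  • Finally we check two things: that we read exactly 5 bytes, and that those bytes match the literal b"%PDF-". The expression &header == b"%PDF-" compares the buffer to the PDF magic bytes.
  • Errors are printed to stderr and the process exits with a nonzero code: 1 for incorrect usage, 2 if the file could not be opened or read.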

If both conditions in the final comparison hold, is_pdf returns Ok(true). Otherwise it returns Ok(false), or an Err if opening or reading the file failed.
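Since a single read call is allowed to return fewer bytes than requested, a slightly more defensive variant can use read_exact, which keeps reading until the buffer is full and reports ErrorKind::UnexpectedEof if the input ends first. The sketch below is one possible shape, not the only one; has_pdf_signature is a hypothetical name, and making it generic over Read is a choice made here so it can be tried on in-memory data.

```rust
use std::io::{self, ErrorKind, Read};

/// Returns Ok(true) if the first five bytes from `reader` are `%PDF-`.
/// `has_pdf_signature` is a hypothetical name used for this sketch.
fn has_pdf_signature<R: Read>(reader: &mut R) -> io::Result<bool> {
    let mut header = [0u8; 5];

    match reader.read_exact(&mut header) {
        // The buffer was filled completely, so compare it to the signature.
        Ok(()) => Ok(&header == b"%PDF-"),
        // The input ended before 5 bytes were available: too short to be
        // a PDF, which is a negative result rather than an I/O failure.
        Err(e) if e.kind() == ErrorKind::UnexpectedEof => Ok(false),
        // Any other error is a real I/O problem and is passed to the caller.
        Err(e) => Err(e),
    }
}
```

With a file, this could be called as has_pdf_signature(&mut File::open(path)?), and with test data as has_pdf_signature(&mut &b"%PDF-1.7"[..]).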

Conclusion

By checking magic bytes, we can quickly determine a file’s likely format in Rust. The PDF header %PDF- is defined by the specification and appears at the very start of files written by conforming generators. Reading just the first few bytes and comparing them to this signature is a fast check that doesn’t rely on file extensions, though it does not validate the rest of the file. In summary, using Rust’s std::fs::File and Read to read the first 5 bytes and comparing them to b"%PDF-" is a concise, idiomatic way to check that a file is a PDF.
