Corrupted Files: Safe Ways To Reproduce Errors For QA Teams

Intro
Corrupted files are a goldmine for testing robustness: they surface brittle parsers, shaky error handling, and risky assumptions. This guide shows safe, repeatable ways to create controlled corruption so your apps fail gracefully instead of crashing. You’ll get copy-paste commands for macOS/Linux and Windows, plus automation and CI tips. Always work on disposable, test-only files and keep backups and hashes—never touch production data.

Prerequisites

Test-only files & backups
- Work in a scratch directory; copy your original to sample.test.ext and record its SHA256 hash before you change anything.

Tools
- macOS/Linux: xxd (vim-common 8.2+), hexdump (util-linux/bsdmainutils 2.37+), truncate (GNU coreutils 9.0+; on macOS it may need to be installed separately), zip (3.0+), plus the standard dd, head, tail, file, and sha256sum/shasum utilities
- Windows: PowerShell 7+ (Get-FileHash and Set-Content are built in; fsutil ships with Windows), or Git Bash for the Unix tools

Method 1 — Header Byte Flip (fast)

Concept
Many formats rely on magic bytes at the start (e.g., PDF → %PDF, PNG → 89 50 4E 47). Changing one byte in the header often forces deterministic “invalid file format” errors.

macOS/Linux (bash, copy-paste):

# 0) Setup (duplicate original)
cp sample.png sample_corrupt.png

# 1) Show first 16 bytes (for context)
xxd -g 1 -l 16 sample_corrupt.png

# 2) Flip/overwrite a single header byte at offset 0x01 (second byte) to 0x00
# (adjust offset as needed)
printf '\x00' | dd of=sample_corrupt.png bs=1 seek=$((0x01)) count=1 conv=notrunc

# 3) Validate: force open error / identify type / checksum mismatch
file sample.png
file sample_corrupt.png

sha256sum sample.png sample_corrupt.png
# On macOS: shasum -a 256 sample.png sample_corrupt.png

Windows (PowerShell):

# 0) Setup
Copy-Item sample.png sample_corrupt.png -Force

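# 1) Inspect first 16 bytes
# Resolve to a full path: .NET file APIs use the process working directory,
# which can differ from the current PowerShell location.
$path = (Resolve-Path "sample_corrupt.png").Path
$bytes = [IO.File]::ReadAllBytes($path)
($bytes[0..15] | ForEach-Object { '{0:X2}' -f $_ }) -join ' '

# 2) Overwrite byte at offset 0x01 with 0x00
$bytes[0x01] = 0x00
[IO.File]::WriteAllBytes($path, $bytes)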

# 3) Validate
Get-FileHash sample.png -Algorithm SHA256
Get-FileHash sample_corrupt.png -Algorithm SHA256

Validation Ideas
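Beyond comparing hashes, a header flip can be validated by confirming that the signature no longer matches and that only the intended offset changed. Here is a minimal Python sketch, assuming a PNG input; the helper names (has_png_signature, diff_offsets, check_header_flip) are illustrative and not part of any library.

```python
from pathlib import Path

# The 8-byte PNG signature: 89 50 4E 47 0D 0A 1A 0A
PNG_SIGNATURE = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])


def has_png_signature(data: bytes) -> bool:
    """Return True if data starts with the 8-byte PNG signature."""
    return data[:8] == PNG_SIGNATURE


def diff_offsets(original: bytes, mutated: bytes) -> list[int]:
    """Return the offsets at which two equal-length byte strings differ."""
    if len(original) != len(mutated):
        raise ValueError("lengths differ; use this only for in-place overwrites")
    return [i for i, (a, b) in enumerate(zip(original, mutated)) if a != b]


def check_header_flip(original_path: str, mutated_path: str) -> list[int]:
    """Hypothetical helper: confirm the flip broke the signature, then list changed offsets."""
    original = Path(original_path).read_bytes()
    mutated = Path(mutated_path).read_bytes()
    assert has_png_signature(original), "original is not a PNG"
    assert not has_png_signature(mutated), "signature is still intact"
    return diff_offsets(original, mutated)
```

For the Method 1 example, check_header_flip("sample.png", "sample_corrupt.png") should report offset 1 as the only difference; anything else suggests the mutation touched more than intended.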

Method 2 — Insert/Delete Bytes at Offset

Altering structure mid-file catches boundary checks, offset tables, and chunk parsers.
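
To see why a mid-file insert is so disruptive, consider a toy length-prefixed chunk format. The format and the function names below are assumptions made up for illustration, not a real file format or library. Once two bytes are inserted inside the first payload, every later length field sits at the wrong offset and a careful parser should reject the file.

```python
import struct


def build_chunks(payloads: list[bytes]) -> bytes:
    """Serialize payloads as [4-byte big-endian length][payload] records."""
    return b"".join(struct.pack(">I", len(p)) + p for p in payloads)


def parse_chunks(data: bytes) -> list[bytes]:
    """Parse the toy format; raise ValueError when a length overruns the data."""
    chunks, pos = [], 0
    while pos < len(data):
        if pos + 4 > len(data):
            raise ValueError(f"truncated length field at offset {pos}")
        (length,) = struct.unpack_from(">I", data, pos)
        pos += 4
        if pos + length > len(data):
            raise ValueError(f"chunk at offset {pos - 4} overruns the file")
        chunks.append(data[pos:pos + length])
        pos += length
    return chunks


clean = build_chunks([b"alpha", b"beta", b"gamma"])
# Insert two 0xFF bytes inside the first payload; later length fields shift by two.
mutated = clean[:6] + b"\xff\xff" + clean[6:]

try:
    parse_chunks(mutated)
except ValueError as exc:
    print(f"parser rejected mutated data: {exc}")
```

A parser without the bounds checks would read past the end of the buffer or return garbage chunks, which is exactly the class of bug this method is meant to expose.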

Steps (macOS/Linux):

# 0) Copy source
cp sample.bin sample_insert.bin
cp sample.bin sample_delete.bin

# 1) INSERT bytes at offset (e.g., insert 4 bytes 0xFF at offset 0x100)
OFFSET=$((0x100))
LEN=4
head -c $OFFSET sample_insert.bin > prefix.tmp
head -c $LEN < /dev/zero | tr '\0' '\377' > insert.tmp    # 0xFF x LEN
tail -c +$((OFFSET+1)) sample_insert.bin > suffix.tmp
cat prefix.tmp insert.tmp suffix.tmp > tmp && mv tmp sample_insert.bin
rm -f prefix.tmp insert.tmp suffix.tmp

# 2) DELETE bytes at offset (e.g., remove 8 bytes at 0x200)
OFFSET=$((0x200))
LEN=8
head -c $OFFSET sample_delete.bin > prefix.tmp
tail -c +$((OFFSET+LEN+1)) sample_delete.bin > suffix.tmp
cat prefix.tmp suffix.tmp > tmp && mv tmp sample_delete.bin
rm -f prefix.tmp suffix.tmp

# 3) Verify SHA256 before/after
sha256sum sample.bin sample_insert.bin sample_delete.bin
# macOS: shasum -a 256 ...

Steps (Windows PowerShell):

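# 0) Load (use full paths: .NET file APIs do not follow the PowerShell location)
$src = Join-Path $PWD.ProviderPath "sample.bin"
$insertOut = Join-Path $PWD.ProviderPath "sample_insert.bin"
$deleteOut = Join-Path $PWD.ProviderPath "sample_delete.bin"
Copy-Item $src $insertOut -Force
Copy-Item $src $deleteOut -Force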

# 1) INSERT 4 bytes 0xFF at offset 0x100
$bytes = [IO.File]::ReadAllBytes($insertOut)
$offset = 0x100
$inject = [byte[]](0xFF,0xFF,0xFF,0xFF)
$new = New-Object byte[] ($bytes.Length + $inject.Length)
[Array]::Copy($bytes, 0, $new, 0, $offset)
[Array]::Copy($inject, 0, $new, $offset, $inject.Length)
[Array]::Copy($bytes, $offset, $new, $offset + $inject.Length, $bytes.Length - $offset)
[IO.File]::WriteAllBytes($insertOut, $new)

# 2) DELETE 8 bytes at offset 0x200
$bytes = [IO.File]::ReadAllBytes($deleteOut)
$offset = 0x200; $len = 8
$new = New-Object byte[] ($bytes.Length - $len)
[Array]::Copy($bytes, 0, $new, 0, $offset)
[Array]::Copy($bytes, $offset + $len, $new, $offset, $bytes.Length - ($offset + $len))
[IO.File]::WriteAllBytes($deleteOut, $new)

# 3) Verify SHA256
Get-FileHash $src,$insertOut,$deleteOut -Algorithm SHA256

Summary of the mutations above:

Input	Operation	Offset (hex)	Length (bytes)
`sample.bin`	insert	`0x100`	`4`
`sample.bin`	delete	`0x200`	`8`

Always capture hashes before and after to document the mutation and for reproducibility.
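
One way to capture this is a small record per mutation that stores both hashes alongside the offset and length. The sketch below is only a suggestion; the field names and the mutation_record helper are hypothetical, so adapt them to whatever your test harness already logs.

```python
import hashlib


def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in 64 KiB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def mutation_record(original: str, mutated: str, mode: str, offset: int, length: int) -> dict:
    """Build a record of one mutation; the field names are illustrative."""
    return {
        "original": original,
        "mutated": mutated,
        "mode": mode,
        "offset_hex": hex(offset),
        "length": length,
        "sha256_before": sha256_of(original),
        "sha256_after": sha256_of(mutated),
    }
```

Calling mutation_record("sample.bin", "sample_insert.bin", "insert", 0x100, 4) and passing the result to json.dumps gives a line you can commit next to the corrupt sample.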

Method 3 — Archive/Container Corruption

Corrupting metadata in archives (e.g., ZIP) helps test error paths like “CRC mismatch” and “truncated central directory.”

Truncate / Break EOCD (ZIP):

# macOS/Linux
cp logs.zip logs_corrupt.zip
# The end-of-central-directory (EOCD) record is at least 22 bytes long and
# starts with its signature, so cut 22 bytes to remove it entirely
# (assumes the archive has no trailing comment).
truncate -s -22 logs_corrupt.zip
# Typical unzip errors (wording varies by tool and version):
# - "End-of-central-directory signature not found"
# - "zipfile is empty" / "premature EOF"

# Windows PowerShell
Copy-Item logs.zip logs_corrupt.zip -Force
# Shorten the file by 22 bytes via .NET (full path, see Method 1)
$path = (Resolve-Path "logs_corrupt.zip").Path
$bytes = [IO.File]::ReadAllBytes($path)
[IO.File]::WriteAllBytes($path, $bytes[0..($bytes.Length-23)])
# Expect similar errors from Windows unzip tools

Corrupt entry data to trigger a CRC mismatch:

# Overwrite one byte inside the first entry's compressed data
cp data.zip data_crc.zip
# Rough offset: assumes the archive is larger than 1 KB and that offset 1024
# falls inside compressed data. If that byte is already 0x00, write another value.
printf '\x00' | dd of=data_crc.zip bs=1 seek=$((1024)) count=1 conv=notrunc
# Typical unzip errors:
# - "bad CRC  dddd (should be eeee)"
# - Extraction may fail or produce a corrupted file

(For precise entry offsets, parse with a ZIP inspector or zipinfo -v, then mutate within a file’s compressed data to trigger CRC mismatches.)
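
If you prefer to compute the offset rather than guess it, Python's standard zipfile module exposes each entry's local header position. The sketch below reads the fixed 30-byte local file header to find where the entry data starts, then inverts one byte there. It assumes the first listed entry has non-empty data; the two helper names are illustrative.

```python
import struct
import zipfile


def first_entry_data_offset(zip_path: str) -> int:
    """Return the file offset where the first listed entry's data begins."""
    with zipfile.ZipFile(zip_path) as zf:
        info = zf.infolist()[0]
    with open(zip_path, "rb") as f:
        f.seek(info.header_offset)
        header = f.read(30)  # fixed-size part of the local file header
    if header[:4] != b"PK\x03\x04":
        raise ValueError("local file header signature not found")
    # File name length and extra field length sit at bytes 26..29 (little-endian).
    name_len, extra_len = struct.unpack_from("<HH", header, 26)
    return info.header_offset + 30 + name_len + extra_len


def flip_byte(path: str, offset: int) -> None:
    """Invert one byte in place (XOR 0xFF) so the value always changes."""
    with open(path, "r+b") as f:
        f.seek(offset)
        original = f.read(1)
        if not original:
            raise ValueError("offset is past the end of the file")
        f.seek(offset)
        f.write(bytes([original[0] ^ 0xFF]))
```

After flip_byte("data_crc.zip", first_entry_data_offset("data_crc.zip")), zipfile.ZipFile("data_crc.zip").testzip() should name the damaged entry when it is stored uncompressed. For deflate-compressed entries the decompressor may fail before the CRC check is reached, so treat either outcome as a detected corruption.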

Automation Snippet

Python (single-file CLI) — supports --input --mode --offset --bytes --out and JSON summary.

#!/usr/bin/env python3
# corruptor.py: apply one controlled mutation to a copy of a test file
import argparse, json, sys

def read_bytes(path):
    with open(path, 'rb') as f: return bytearray(f.read())

def write_bytes(path, data):
    with open(path, 'wb') as f: f.write(data)

def parse_hex(text):
    # Accepts "FF", "00FF", "00 FF" or "00,FF"; raises ValueError on malformed hex
    return bytes.fromhex(text.replace(",", "").replace(" ", ""))

def parse_count(text):
    # Byte count for delete/truncate; defaults to 1 and must be positive
    n = int(text, 0) if text else 1
    if n <= 0: raise ValueError("byte count must be positive")
    return n

def main():
    p = argparse.ArgumentParser()
    p.add_argument("--input", required=True)
    p.add_argument("--out", required=True)
    p.add_argument("--mode", choices=["flip","insert","delete","truncate"], required=True)
    p.add_argument("--offset", type=lambda x:int(x,0), default=0, help="byte offset (0x.. ok)")
    p.add_argument("--bytes", default="", help="flip/insert: hex bytes such as FF or '00 FF'; delete/truncate: byte count")
    args = p.parse_args()

    try:
        dst = read_bytes(args.input)
        offset = args.offset
        if offset < 0: raise ValueError("offset must be non-negative")
        summary = {"input": args.input, "out": args.out, "mode": args.mode, "offset": offset}

        if args.mode == "flip":
            # Overwrite with the provided byte(s), or invert a single byte if none are given.
            # The range check keeps an overwrite from silently growing the file.
            b = parse_hex(args.bytes)
            if offset + max(len(b), 1) > len(dst): raise ValueError("offset out of range")
            if b:
                dst[offset:offset+len(b)] = b
                summary["bytes_written"] = b.hex().upper()
            else:
                dst[offset] ^= 0xFF
                summary["bytes_written"] = "XOR FF (1 byte)"

        elif args.mode == "insert":
            b = parse_hex(args.bytes)
            if not b: raise ValueError("insert requires --bytes")
            if offset > len(dst): raise ValueError("offset out of range")
            dst[offset:offset] = b
            summary["bytes_inserted"] = len(b)

        elif args.mode == "delete":
            n = parse_count(args.bytes)
            if offset + n > len(dst): raise ValueError("delete range exceeds file size")
            del dst[offset:offset+n]
            summary["bytes_deleted"] = n

        elif args.mode == "truncate":
            n = parse_count(args.bytes)
            if n >= len(dst): raise ValueError("truncate too large")
            del dst[len(dst)-n:]   # explicit index: dst[:-n] would empty the file when n == 0
            summary["bytes_truncated"] = n

        write_bytes(args.out, dst)
        print(json.dumps({"ok": True, **summary}, indent=2))
        return 0
    except Exception as e:
        print(json.dumps({"ok": False, "error": str(e)}), file=sys.stderr)
        return 3

if __name__ == "__main__":
    sys.exit(main())

Exit Codes

0 = success
2 = invalid command-line arguments (reported by argparse)
3 = I/O or validation error (bad offset/length, unreadable input file)

Examples

python3 corruptor.py --input sample.png --out sample_bad.png --mode flip --offset 0x01 --bytes FF
python3 corruptor.py --input sample.bin --out sample_ins.bin --mode insert --offset 0x200 --bytes "00 FF 00 FF"
python3 corruptor.py --input logs.zip  --out logs_trunc.zip --mode truncate --bytes 10

JSON output is suitable for parsing in CI logs.

CI/Test Oracles

Expected Signals

Parser returns known error message (e.g., “invalid signature”, “bad CRC”, “unexpected EOF”).
Exit codes from your app: non-zero for failure, specific codes for specific faults.
Optional: stderr patterns + “no crash” assertion.
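
These signals can be combined into a single oracle. The following Python sketch is one possible shape, not a definitive harness: check_graceful_failure is a hypothetical helper, and the negative-return-code check for signals applies to POSIX systems only.

```python
import re
import subprocess


def check_graceful_failure(cmd: list[str], pattern: str, timeout: float = 30.0) -> None:
    """Assert that cmd fails cleanly: non-zero exit, no signal, known error text.

    subprocess.run raises TimeoutExpired if the process hangs past `timeout`.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    # On POSIX, a negative return code means the process was killed by a signal.
    assert result.returncode >= 0, f"crashed with signal {-result.returncode}"
    assert result.returncode != 0, "unexpected success on corrupt input"
    assert re.search(pattern, result.stderr, re.IGNORECASE), (
        f"stderr did not match {pattern!r}: {result.stderr!r}"
    )
```

A call such as check_graceful_failure(["./pdf-checker", "sample_bad.pdf"], r"invalid|signature|header") then covers the exit code, the stderr pattern, and the "no crash" assertion in one step.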

Sample CI step (Bash)

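set -euo pipefail
python3 corruptor.py --input sample.pdf --out sample_bad.pdf --mode flip --offset 0x00 --bytes 00 > result.json

# Run SUT and capture its exit code without tripping `set -e`
rc=0
./pdf-checker sample_bad.pdf 2>err.txt || rc=$?
if [ "$rc" -eq 0 ]; then
  echo "Unexpected success"; exit 1
fi

# Exit codes above 128 usually mean the process was killed by a signal (crash)
if [ "$rc" -gt 128 ]; then
  echo "SUT crashed (exit code $rc)"; exit 1
fi

# Require a recognizable error message on stderr
if ! grep -Eiq "invalid|signature|header|corrupt|crc|eof" err.txt; then
  echo "Expected error pattern not found in stderr"; exit 1
fi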

Sample CI step (PowerShell)

python.exe corruptor.py --input sample.zip --out sample_trunc.zip --mode truncate --bytes 16 | Out-Null
if ($LASTEXITCODE -ne 0) { throw "corruptor.py failed" }
$proc = Start-Process .\zip-checker.exe -ArgumentList "sample_trunc.zip" -PassThru -Wait -NoNewWindow
if ($proc.ExitCode -eq 0) { throw "Expected failure on corrupt zip" }
# Assumes zip-checker.exe writes its errors to zip-checker.log
if (-not (Select-String -Pattern "bad CRC|EOF|central" -Path .\zip-checker.log -Quiet)) {
  throw "Expected error pattern not found in zip-checker.log"
}

Troubleshooting

File still opens fine? Target stronger invariants:
- Flip the magic bytes (first 2–8 bytes), or
- Insert/delete inside critical headers (index tables, chunk lengths), or
- Corrupt size/length fields (e.g., set to 0xFFFFFFFF).
App crashes instead of failing gracefully? That’s a bug candidate: collect the sample, logs, and stack traces, then restore from your clean backup and shrink the mutation to the smallest change that still reproduces the crash.
“Permission denied” or locked files: Close the target app, copy to a new filename, or work in a temp directory.
Binary vs text: For text formats (JSON, XML), breaking a single character in the prolog or a closing brace can be enough; for binary formats, aim at headers and length fields.
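
For the text-format case, here is a small Python sketch that removes the final closing brace from a JSON document and confirms that the standard json parser rejects it. The drop_last_closing_brace helper is made up for this example.

```python
import json


def drop_last_closing_brace(text: str) -> str:
    """Remove the final '}' from a JSON document to simulate a truncated file."""
    index = text.rindex("}")
    return text[:index] + text[index + 1:]


clean = json.dumps({"name": "sample", "size": 3})
broken = drop_last_closing_brace(clean)

try:
    json.loads(broken)
except json.JSONDecodeError as exc:
    # The parser reports a message and position that a test can log or assert on.
    print(f"rejected as expected: {exc.msg} at char {exc.pos}")
else:
    raise AssertionError("corrupt JSON was accepted")
```

The same pattern works for XML or other text formats: make the smallest edit that violates the grammar, then assert that your parser raises its documented error type.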

Risks & Safe Practices

Never corrupt real or production data. Use disposable copies only.
Keep original SHA256 hashes and record mutation offsets/lengths for reproducibility.
Version your corrupt samples along with test cases; name files clearly (e.g., sample_pdf_bad_magic_0x00.pdf).
Document: format, tool used, exact command, expected error message, and SUT version.

FAQ

Q1: Will antivirus flag my corrupted files?
Sometimes. Corrupted archives or executables can trip heuristic flags. Store samples in a quarantined QA directory and consider excluding that path from real-time scanning in test environments.

Q2: How do cloud previews behave with corrupted files?
Many cloud viewers fail fast with a message such as “Cannot preview this file.” That’s useful: assert that your system surfaces a clear, user-friendly message instead of hanging or looping.

Q3: How large should a mutation be?
Minimal is best. Start with a 1-byte flip or a small (4–16 bytes) insert/delete. Larger corruptions can mask the precise failure path and reduce diagnostic value.

Q4: Can I guarantee a specific error message?
Not always. Different libraries produce different wording. Target well-known invariants (magic bytes, checksums, EOCD) to maximize consistency, and assert on patterns rather than exact text.