/LikelyMalware

Reverse Engineering ML Models: What Weights Actually Tell You

Applying traditional reverse engineering thinking to neural network weights — what you can learn, what remains opaque, and what tools exist.

February 10, 2026 · 2 min read

Reverse engineering a binary means recovering structure and intent from machine code. Reversing a neural network means something different — but the spirit is the same.

What's In a Weight File?

A .safetensors or .gguf file is a serialized tensor store. At the binary level it is little more than a header, an index mapping tensor names to dtypes, shapes, and offsets, and a blob of raw floating point data.

from safetensors.torch import load_file

# Map of tensor name -> torch.Tensor, loaded straight from the file
tensors = load_file("model.safetensors")
for name, tensor in tensors.items():
    print(f"{name}: {tensor.shape} | dtype={tensor.dtype}")

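If you only want the metadata, you don't even need to load the tensors. The safetensors format starts with an 8-byte little-endian length, followed by that many bytes of JSON describing every tensor's name, dtype, shape, and data offsets. A minimal sketch using only the standard library:

import json
import struct

def read_safetensors_header(path):
    with open(path, "rb") as f:
        # First 8 bytes: little-endian u64 length of the JSON header
        header_len = struct.unpack("<Q", f.read(8))[0]
        return json.loads(f.read(header_len))

for name, meta in read_safetensors_header("model.safetensors").items():
    if name != "__metadata__":  # optional free-form metadata block
        print(f"{name}: {meta['shape']} | dtype={meta['dtype']}")
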
The shapes tell you the architecture. The values tell you... something.

What You Can Actually Learn

Architecture recovery is straightforward. Tensor shapes map directly to:

  • Embedding dimension
  • Number of attention heads
  • MLP expansion ratio
  • Number of layers

From these alone you can often identify the model family.

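A minimal sketch of shape-based recovery, assuming Llama-style tensor names (q_proj, gate_proj, model.layers.N.*); other families need different name patterns:

import re

def infer_architecture(tensors):
    # Works on the name -> tensor dict loaded above
    q = tensors["model.layers.0.self_attn.q_proj.weight"]
    gate = tensors["model.layers.0.mlp.gate_proj.weight"]
    hidden_dim = q.shape[1]                 # embedding dimension
    mlp_ratio = gate.shape[0] / hidden_dim  # MLP expansion ratio
    num_layers = 1 + max(
        int(m.group(1))
        for name in tensors
        if (m := re.match(r"model\.layers\.(\d+)\.", name))
    )
    return {"hidden_dim": hidden_dim, "mlp_ratio": mlp_ratio, "num_layers": num_layers}

Head count is the one ambiguous shape: q_proj only gives you heads × head_dim, so you usually guess head_dim from the suspected family.
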
Fine-tune detection is harder. The per-layer distance between a base model's weights and those of a fine-tuned variant can reveal:

  • Which layers were frozen
  • Whether PEFT (LoRA, etc.) was applied
  • A rough sense of how extensive the fine-tuning was

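A minimal sketch of the comparison, assuming both checkpoints are loaded as name -> tensor dicts (the exact thresholds are judgment calls):

import torch

def compare_checkpoints(base, tuned):
    # Per-tensor L2 distance between base and fine-tuned weights
    deltas = {
        name: torch.norm(tuned[name].float() - base[name].float()).item()
        for name in base
        if name in tuned and base[name].shape == tuned[name].shape
    }
    # Tensors that did not move at all were almost certainly frozen
    frozen = [name for name, d in deltas.items() if d == 0.0]
    # Adapter tensors left in the checkpoint are a strong LoRA signal
    lora_tensors = [name for name in tuned if "lora_A" in name or "lora_B" in name]
    return deltas, frozen, lora_tensors
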
Backdoor detection is an active research area. One family of approaches is trigger inversion, in the style of Neural Cleanse:

import torch

# Neural Cleanse-style trigger inversion sketch.
# apply_trigger() (hypothetical) overlays the candidate trigger on a clean
# input; the model is assumed to return a flat logit vector.
def find_trigger(model, clean_input, target_class, input_shape, steps=1000):
    trigger = torch.zeros(input_shape, requires_grad=True)
    optimizer = torch.optim.Adam([trigger], lr=0.01)
    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(apply_trigger(clean_input, trigger))
        # Maximize the target class while keeping the trigger sparse
        loss = -logits[target_class] + 0.01 * trigger.abs().sum()
        loss.backward()
        optimizer.step()
    return trigger.detach()

What Stays Opaque

The hard part: what the model learned from its training data. Weights don't store facts explicitly. They store a compressed, lossy, non-human-readable encoding of statistical patterns. You can probe, but you can't read.

This is why model interpretability is genuinely hard — and why "just look at the weights" isn't an answer to alignment concerns.


More on mechanistic interpretability next time.