Reverse Engineering ML Models: What Weights Actually Tell You
Applying traditional reverse engineering thinking to neural network weights — what you can learn, what remains opaque, and what tools exist.
Reverse engineering a binary means recovering structure and intent from machine code. Reversing a neural network means something different — but the spirit is the same.
What's In a Weight File?
A .safetensors or .gguf file is a serialized tensor store. At the binary level it holds three things:
a header, a map from tensor names to dtypes, shapes, and byte offsets, and a blob of raw floating point (or, for quantized GGUF files, packed integer) data.
```python
from safetensors.torch import load_file

tensors = load_file("model.safetensors")
for name, tensor in tensors.items():
    print(f"{name}: {tensor.shape} | dtype={tensor.dtype}")
```

The shape tells you architecture. The values tell you... something.
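The header claim above can be checked without any ML library. A minimal sketch that reads a safetensors header directly with the standard library (the demo file here is hand-built, not a real checkpoint):

```python
import json, os, struct, tempfile

def read_safetensors_header(path):
    # safetensors layout: a u64 little-endian header length, then a JSON
    # header mapping tensor names to dtype/shape/byte offsets, then raw data.
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))

# Build a minimal fake file: one 2x2 float32 tensor of zeros (16 bytes).
header = json.dumps(
    {"w": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]}}
).encode()
with tempfile.NamedTemporaryFile(delete=False, suffix=".safetensors") as f:
    f.write(struct.pack("<Q", len(header)) + header + b"\x00" * 16)
    path = f.name

hdr = read_safetensors_header(path)
print(hdr)
os.remove(path)
```

Everything load_file gives you is derived from that JSON header plus the trailing blob.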
What You Can Actually Learn
Architecture recovery is straightforward. Tensor shapes map directly to:
- Embedding dimension
- Number of attention heads
- MLP expansion ratio
- Number of layers
From these alone you can often identify the model family.
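A sketch of that recovery step, using a hypothetical Llama-style name/shape listing (real checkpoints vary by family, so the key names here are assumptions):

```python
import re

# Hypothetical tensor-name -> shape listing, Llama-style naming (assumed).
shapes = {
    "model.embed_tokens.weight": (32000, 4096),
    "model.layers.0.self_attn.q_proj.weight": (4096, 4096),
    "model.layers.0.mlp.up_proj.weight": (11008, 4096),
    "model.layers.31.mlp.up_proj.weight": (11008, 4096),
}

def infer_architecture(shapes):
    vocab, d_model = shapes["model.embed_tokens.weight"]
    mlp_hidden, _ = shapes["model.layers.0.mlp.up_proj.weight"]
    # Layer count = highest layer index seen, plus one.
    n_layers = 1 + max(
        int(m.group(1))
        for name in shapes
        if (m := re.search(r"layers\.(\d+)\.", name))
    )
    return {
        "vocab_size": vocab,
        "hidden_dim": d_model,
        "mlp_ratio": round(mlp_hidden / d_model, 2),
        "num_layers": n_layers,
    }

arch = infer_architecture(shapes)
print(arch)
```

Matching the recovered numbers against published configs is usually enough to name the family.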
Fine-tune detection is harder. Comparing a fine-tuned variant's weights against its base model can reveal:
- Which layers were frozen (their tensors are bit-identical to the base)
- Whether PEFT (LoRA, etc.) was applied (the weight delta is low-rank)
- Roughly how heavy the update was (magnitude tracks training extent, though it only loosely constrains how much data was used)
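The frozen-layer case is the easiest to operationalize. A sketch of per-layer drift, with toy NumPy arrays standing in for real tensors (the freezing pattern is fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "base model": four 8x8 weight matrices.
base = {f"layers.{i}.weight": rng.normal(size=(8, 8)) for i in range(4)}
# Toy "fine-tune": layers 0-1 frozen, layers 2-3 perturbed.
tuned = {
    name: w + (0.0 if i < 2 else 0.1) * rng.normal(size=w.shape)
    for i, (name, w) in enumerate(base.items())
}

def layer_drift(base, tuned):
    """Relative Frobenius distance per tensor; exactly 0 means frozen."""
    return {
        name: float(np.linalg.norm(tuned[name] - w) / np.linalg.norm(w))
        for name, w in base.items()
    }

drifts = layer_drift(base, tuned)
for name, d in drifts.items():
    print(f"{name}: drift={d:.4f}")
```

For LoRA detection you would instead look at the rank of the delta matrix (e.g. via its singular values) rather than its norm.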
Backdoor detection is an active research area. Some approaches:
```python
import torch

def apply_trigger(x, trigger):
    # Minimal additive trigger; real attacks use masked patches.
    return x + trigger

# Neural Cleanse-style trigger inversion sketch
def find_trigger(model, clean_input, input_shape, target_class, steps=1000):
    trigger = torch.zeros(input_shape, requires_grad=True)
    optimizer = torch.optim.Adam([trigger], lr=0.01)
    for _ in range(steps):
        optimizer.zero_grad()
        # Push triggered inputs toward target_class; if a tiny trigger
        # reliably flips predictions, that is evidence of a backdoor.
        loss = -model(apply_trigger(clean_input, trigger))[..., target_class].mean()
        loss.backward()
        optimizer.step()
    return trigger
```

What Stays Opaque
The hard part: what the model learned from its training data. Weights don't store facts explicitly. They store a compressed, lossy, non-human-readable encoding of statistical patterns. You can probe, but you can't read.
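A trivial illustration of the point: a whole weight matrix collapses to clean summary statistics, yet nothing in it reads out as a fact (the matrix here is random, standing in for a trained tensor):

```python
import numpy as np

# Random stand-in for a trained weight matrix (an assumption for the demo).
rng = np.random.default_rng(42)
w = rng.normal(0.0, 0.02, size=(1024, 1024)).astype(np.float32)

# You can compute any statistic you like...
stats = {"mean": float(w.mean()), "std": float(w.std())}
print(stats)
# ...but no inspection of the raw values recovers "what the model knows".
```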
This is why model interpretability is genuinely hard — and why "just look at the weights" isn't an answer to alignment concerns.
More on mechanistic interpretability next time.