MCMC Substitution Cipher Solver: C++ High-Performance Edition
This project is a C++ implementation of a substitution cipher solver using a Metropolis–Hastings MCMC algorithm enhanced with Simulated Annealing. It models deciphering as a search over permutation keys, using a training corpus (e.g., War and Peace, Shakespeare) to estimate unigram and bigram statistics that define a log-likelihood for candidate decodings.
From Python to C++: Why it’s Faster
- Raw speed: compiled C++ with low-level control and -O3 optimizations.
- True parallelism: runs multiple independent MCMC chains concurrently (std::thread).
- Fine-grained optimization: careful data structures and precomputation.
Core Improvements
- Optimized log-likelihood via precomputed bigram transition counts. The secret message is scanned once to build a counts matrix, reducing each MH step from O(n) over text length to O(1) using the counts.
- Simulated Annealing: a temperature parameter starts high and cools down, improving exploration early and convergence later.
- Precomputed logarithms of frequency/transition matrices to avoid repeated std::log calls in the hot loop.
Parallelism
- Multiple MCMC chains launched via std::thread, with starting points seeded differently.
- std::mutex protects updates to the global best solution across threads.
- Chain count auto-scales to std::thread::hardware_concurrency().
Tooling & Usability
- CMake build with -O3; consistent style via clang-format.
- Two executables:
- scramble_text: encodes plaintext with a random key and saves the key.
- run_deciphering: decodes with flags like -i (training), -d (secret), -iters, -print_every, -o (output file).
Performance
- ~95× faster than the Python baseline when accounting for total iterations (C++ with 8 threads in tests).
- Higher quality results from precomputed counts + annealing + throughput.
Notes
- The C++ version handles some inputs that the Python baseline rejected due to unsupported characters.