Custom C-functions Executed in a Single CPU Cycle (For Use in Embedded Systems)
Are they really executed in a single CPU cycle? I doubt it: many x86 instructions take more than one cycle on their own, and a C function normally compiles to several instructions.
I also wonder how well this approach would work for an algorithm where the data access pattern matters as much as raw CPU speed.
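To make the access-pattern point concrete, here is a minimal sketch in C. It is not from the linked article; the function names `sum_row_major` and `sum_col_major` are made up for illustration. Both functions perform exactly the same additions over the same flat array and differ only in the order they touch memory.

```c
#include <stddef.h>

/* Hypothetical example: sums a rows x cols matrix stored row by row in a
 * flat array, visiting elements in the order they sit in memory. */
long sum_row_major(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r * cols + c];   /* stride: 1 element */
    return total;
}

/* Same additions, but the loops are swapped, so consecutive accesses
 * are cols elements apart instead of adjacent. */
long sum_col_major(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += m[r * cols + c];   /* stride: cols elements */
    return total;
}
```

The instruction count is essentially the same in both versions, yet on hardware with a data cache the strided version will typically miss more often once the matrix outgrows the cache. How large the gap is depends on the target, the compiler, and the matrix size, so I would measure it rather than assume. Making the arithmetic itself cheaper does nothing for that part of the cost.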