I will have to dig up my 6502 documentation, but, IIRC, by the time the processor executes a NOP (or CLI, INX, etc.), it has already fetched the next instruction, so if that is another NOP, it will complete in one cycle instead of two, unless you cross a page boundary, which incurs a one-cycle penalty.
Since I never wrote timing-critical code for the 6502 (apart from "make it as fast as possible") I cannot recall many specifics. Since you did, you certainly have a better understanding of how it worked.
I am restoring a 65C02-based //e clone, so I may be able to measure instruction timings properly, but I won't hold my breath.
Yes, you are right, the instruction timings were very exact as far as I remember. The only case where the cycle count could vary was a branch, depending on whether it was taken or not.
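To make the branch case concrete, here is a rough sketch in Python. It assumes the figures from the published NMOS 6502 timing tables: 2 cycles for a branch not taken, 3 if taken, and 4 if taken to a different page. The function name and arguments are my own invention for illustration, not from any tool or library.

```python
def branch_cycles(pc, offset, taken):
    """Cycle count for a 6502 relative branch (BNE, BEQ, ...).

    pc     -- address of the branch opcode (hypothetical parameter name)
    offset -- signed 8-bit displacement, -128..127
    taken  -- whether the branch condition was true
    Assumes the documented NMOS 6502 timings.
    """
    if not taken:
        return 2  # branch falls through: base cost only

    # The displacement is relative to the instruction after the 2-byte branch.
    next_pc = (pc + 2) & 0xFFFF
    target = (next_pc + offset) & 0xFFFF

    # One extra cycle when the high byte of the address changes.
    crossed_page = (next_pc & 0xFF00) != (target & 0xFF00)
    return 4 if crossed_page else 3
```

So a loop that ends in a taken BNE costs 3 cycles for the branch on every pass except the last, where it falls through in 2, provided the loop does not straddle a page.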