Some other architectures like PDP-11 and 680x0 had a dedicated "clear register" instruction.
It could have been added to x86, even as a group of single-byte opcodes with the register encoded in three bits (as with PUSH, POP, and INC/DEC outside of long mode). But the XOR idiom was already established on the 8080 by that point.
Interesting, since the general culture at IBM seems to have preferred SUB over XOR -- their earlier business-oriented machines didn't even have an XOR instruction, and even on later ones the use of SUB has persisted, including in the IBM PC and AT BIOS.
(There was another, now deleted, comment somewhere in this thread that mentioned IBM's preference for SUB. Source of that statement was Claude, but it seems very likely to be correct. The BIOS code I've checked myself, lots of 'SUB AX,AX', no XOR)
You may not be looking for the right thing. On the aforementioned CSP, the instruction that performed XOR was called "XR" and not "XOR". My source is firsthand knowledge; I was a CE and performed service calls on the System/34, System/36, 370, and 390.
In any case, I am describing equipment built mostly in late 60s through the late 70s at IBM Rochester and Poughkeepsie. The IBM PC was developed by an entirely different team at IBM Boca Raton, and IBM didn't design its CPU.
I don't doubt that this specific processor special-cased XOR (regardless of how it was called in the assembly language)!
Merely pointing out that where both operations were available, there seems to have been a preference to use SUB instead, with some continuity from early business-oriented mainframes, to the 360, to the PC.
Another thing I should point out is that the CSP instruction set was not documented to the customer. The CSP software was called "Microcode" and the customer was not told about the CSP's design or how it worked. The documented instruction set for the System/34 and System/36 is that of the Main Storage Processor or MSP, which was an evolution of the IBM System/3.
Comparing for equality can use either SUB or XOR: both set the zero flag if (and only if) the two values are equal. That's why JE/JNE (jump if equal/not equal) is an alias for JZ/JNZ (jump if zero/not zero).
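In Python terms (a sketch, with 16-bit registers modeled by masking), both idioms compute the same predicate:

```python
# SUB and XOR both yield zero exactly when the operands are equal,
# so the Z flag after either instruction answers "were they equal?".
MASK = 0xFFFF  # model a 16-bit register


def equal_via_sub(a, b):
    return ((a - b) & MASK) == 0


def equal_via_xor(a, b):
    return ((a ^ b) & MASK) == 0


assert equal_via_sub(0x1234, 0x1234) and equal_via_xor(0x1234, 0x1234)
assert not equal_via_sub(0x1234, 0x4321) and not equal_via_xor(0x1234, 0x4321)
```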
There's also the TEST instruction, which does a logical AND but without storing the result (like CMP does for SUB). This can be used to test specific bits.
Testing a single register for zero can be done in several ways, in addition to CMP with 0:
TEST AX,AX
AND AX,AX
OR AX,AX
INC AX followed by DEC AX (or the other way around)
The 8080/Z80 didn't have TEST, but the other three were all in common use. Particularly INC/DEC, since it worked with all registers instead of just the accumulator.
Also any arithmetic operation sets those flags, so you may not even need an explicit test. MOV doesn't set flags however, at least on x86 -- it does on some other architectures.
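The zero-test idioms above can be modeled the same way (a sketch; each expression is zero exactly when the original value was, which is what the Z flag reports after the corresponding instruction):

```python
MASK = 0xFFFF  # model a 16-bit register


def zero_via_and(x):        # TEST AX,AX or AND AX,AX
    return (x & x) == 0


def zero_via_or(x):         # OR AX,AX
    return (x | x) == 0


def zero_via_incdec(x):     # INC AX then DEC AX: the DEC result is x again
    return ((((x + 1) & MASK) - 1) & MASK) == 0


assert zero_via_and(0) and zero_via_or(0) and zero_via_incdec(0)
assert not (zero_via_and(42) or zero_via_or(42) or zero_via_incdec(42))
```

The INC/DEC round trip also works at the wraparound boundary: 0xFFFF increments to 0 and decrements back to 0xFFFF, which is nonzero, so the final Z flag is still correct.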
Again a very interesting look at how this chip works internally!
I've decoded the entry point PLA of the 80286 (not the actual microcode though). It also has separate entries for real and protected mode, but only for segment loads from a general purpose register, HLT, and for those opcodes that aren't allowed in real mode like ARPL.
Loading a segment register from memory on the 286 uses the same microcode in both modes, as does everything else that would certainly have to act differently, like jump/call far. That was a bit surprising, since it would have to decide at run time which mode it's in. Is this the same on the 386?
Tested on my 286 machine what happens when opcodes are decoded while in real mode but executed after PE is set: Segment load from memory works (using protected mode semantics), whereas the load from register only changes the visible selector and nothing else. The base in the descriptor cache keeps whatever was set there before -- I assume on the 386, SBRM would update the base the same way it does in real mode in that situation, because it's also used for V86 mode there. Illegal-in-real-mode instructions trap, but do so correctly using the protected mode IDT.
Also seems like executing three pre-decoded instructions without a jump after setting PE causes a triple fault for some reason.
Nice findings. For segment loads from memory, the entry point is actually shared between real and protected mode on the 386. The microcode branches later based on PE and does the extra descriptor work only in protected mode. So maybe it's done similarly on the 286.
The decode vs. execution behavior is more interesting. From both Intel docs and my own core, PE is effectively checked in both stages independently, but decode happens ahead of execution (prefetch queue). So if an instruction is decoded in real mode, it’ll still follow the real-mode path even if PE is set before it executes.
That’s exactly why Intel requires a jump right after setting PE — it flushes the prefetch queue and forces re-decode in protected mode. As the 80386 System Software Writer’s Guide (Ch. 6.1) puts it: "Instructions in the queue were fetched and decoded while the processor was in real mode; executing them after switching to protected mode can be erroneous."
> Also seems like executing three pre-decoded instructions without a jump after setting PE causes a triple fault for some reason.
It's been a while, but I recall Intel documenting that a jump was required almost immediately after setting PE. Probably because documenting "you must soon jump" was easy. Vs. handling the complexities of decoded-real/executed-PE - and documenting how that worked - would have been a giant PITA.
The two-instruction grace period was to let you load a couple segment or descriptor table registers or something, which were kinda needed for the jump. And that triple fault - if you failed to jump in time - sounds right in line with Intel's "when in doubt, fault or halt" philosophy for the 286.
Well, Intel documented that the very first instruction after enabling protected mode had to be an "intra-segment" (not inter-segment) jump, to flush the prefetch queue. At least that was what it said in the 286 and 386 documents I read. You were supposed to set up everything else needed before that, do this near jump, and then jump to the new protected mode code segment.
Some later documentation contradicted this, saying that instead this first jump had to be to the protected mode segment.
From the patent (US4442484), it is apparent that the processor decodes opcodes into a microcode entry point before they are executed, and the PE bit is one of the inputs for the entry point PLA. So that would be the obvious reason for flushing the prefetch queue - but it turns out that at least on the 80286, most instructions go to the same entry point regardless of the mode they are decoded in. So they should work the same without flushing the queue.
And yet for some reason, what I've seen in my experiments is that the system would reset if there were three instructions following the "LMSW" without a jump. Even something harmless like "NOP" or "MOV AX,AX", that couldn't be different between real and protected mode. Maybe there is some clock phase where the PE bit changing during the decoding of an instruction leads to an invalid entry point, that either causes a triple fault or resets the processor?
I disassemble and read a lot of vintage BIOSes for fun. Recently I looked at something more recent: an Atom N270 945GSE Mini-ITX industrial board from 2010, with a Phoenix BIOS.
Yes, the far jump was never necessary on any processor, only a convention. You can stay in the same segment as in real mode and it will continue to work. But some kind of control transfer to flush the queue must be done shortly after the LMSW / MOV CR0, or things may break in ways that I'm not entirely clear on.
My test code looked like this:
mov ax,1 ;new MSW
mov bx,TestSel ;pointer to selector value into BX
mov dx,[bx] ;and load into DX
mov cl,31 ;shift count for delay
cli ;disable interrupts
lgdt [Gdtr]
lidt [Idtr]
jmp enter_pm ;flush queue now
align 2
enter_pm: ;go!
rol cl,cl ;delay while following instructions decode
lmsw ax ;set PE bit
mov es,[bx] ;should load selector 0x0010 into ES
mov ds,dx ;should set DS base to 0x00100 [NOPE]
str ax ;should trap because not allowed in real mode
ud2 ;trap anyway in case it didn't
On the 286, this always caused the processor to reset. Replacing one of the two segment load instructions with a same-length "mov ax,ax" didn't change that, but removing one of them did.
In that case the "str ax" acted as the control transfer that flushed the queue (it was still decoded in real mode, so it went to the "invalid opcode" entry point). No clue as to what exactly happens to cause the reset when three instructions are run from the queue, some timing issue related to when the PE bit actually changes vs. what the decoder is doing at this point?
Guess: Intel changed the spec. There are quite a few generations between a 286 and a P4, and new BIOS code doesn't need to run on discontinued CPU types. And new execution contexts like https://en.wikipedia.org/wiki/System_Management_Mode might benefit from minimizing the setup needed to run in protected mode.
For this use case, there would also have to be an extension to the SSH protocol to send such out-of-band information. Maybe this already exists and isn't used?
The broader problem with terminal control sequences didn't exist on Windows (until very recently at least), or before that DOS and OS/2. You had API calls to position the cursor, set color/background, etc. Or just write directly to a buffer of 80x25 characters+attribute bytes.
But Unix is what "serious" machines used, a long time ago, so it has become the religion to insist that The Unix Way(TM) is superior in all things...
> For this use case, there would also have to be an extension to the SSH protocol to send such out-of-band information. Maybe this already exists and isn't used?
I don’t think one already exists, but it would be straightforward to create one. SSH protocol extensions are named by strings of the form NAME@DNSDOMAIN, so anyone can create one; registration is not required.
The hardest part would be getting the patches accepted by the SSH client/server developers. But that’s likely easier than getting the feature past the Linux kernel developers.
The Unix way died with Plan9/9front: there are no teletypes, period. Just windows with shells running inside, like any other program. You can run a browser under window(1) instead of rc(1), which is the shell.
> All of those have control data in the same stream under the hood.
Not true. For most binary protocols, you have something like <Header> <Length of payload> <Payload>. On magnetic media, sector headers used a special pattern that couldn't be produced by regular data [1] -- and I'm sure SSDs don't interpret file contents as control information either!
There may be some broken protocols, but in most cases this kind of problem only happens when all the data is a stream of text that is simply concatenated together.
The header and length of the payload are control data. It's still being concatenated even if it's binary. A common way to screw that one up is to measure the "length of payload" in two different ways, for example by using the return value of strlen or strnlen when setting the length of the payload but the return value of read(2) or std::string size() when sending/writing it or vice versa. If the data unexpectedly contains an interior NULL, or was expected to be NULL terminated and isn't, strnlen will return a different value than the amount of data read into the send buffer. Then the receiver may interpret user data after the interior NULL as the next header or, when they're reversed, interpret the next header as user data from the first message and user data from the next message as the next header.
Another fun one there is that if you copy data containing an interior NULL to a buffer using snprintf and only check the return value for errors but not an unexpectedly short length, it may have copied less data into the buffer than you expect. At which point sending the entire buffer will be sending uninitialized memory.
Likewise if the user data in a specific context is required to be a specific length, so you hard-code the "length of payload" for those messages without checking that the user data is actually the required length.
This is why it needs to be programmatic. You don't declare a struct with header fields and a payload length and then leave it for the user to fill them in, you make the same function copy N bytes of data into the payload buffer and increment the payload length field by N, and then make the payload buffer and length field both modifiable only via that function, and have the send/write function use the payload length from the header instead of taking it as an argument. Or take the length argument but then error out without writing the data if it doesn't match the one in the header.
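A minimal sketch of that discipline (the class and method names are made up for illustration): the only way payload bytes enter the message is through `append`, and the length prefix is derived from the payload itself rather than passed in, so the header can never disagree with the data actually copied.

```python
MSG_MAX = 256


class Msg:
    def __init__(self):
        self.payload = bytearray()

    def append(self, data: bytes) -> bool:
        if len(self.payload) + len(data) > MSG_MAX:
            return False          # refuse rather than truncate silently
        self.payload += data      # byte copy, no strlen anywhere:
        return True               # interior NULs are preserved

    def frame(self) -> bytes:
        # length prefix computed from the payload at send time,
        # never taken as a separate argument that could drift
        return len(self.payload).to_bytes(2, "big") + bytes(self.payload)


m = Msg()
assert m.append(b"ab\x00cd")                 # interior NUL kept
assert m.frame() == b"\x00\x05ab\x00cd"      # header length matches payload
```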
> It's user data in JSON in an HTTP stream in a TLS record in a TCP stream in an IP packet in an ethernet frame. Then it goes into a SQL query which goes into a B-tree node which goes into a filesystem extent which goes into a RAID stripe which goes into a logical block mapped to a physical block etc. All of those have control data in the same stream under the hood.
It's true that a lot of code out there has bugs with escape sequences or field lengths, and some protocols may be designed so badly that it may be impossible to avoid such bugs. But what you are suggesting is greatly exaggerated, especially when we get to the lower layers. There is almost certainly no way that writing a "magic" byte sequence to a file will cause the storage device to misinterpret it as control data and change the mapping of logical to physical blocks. They've figured out how to separate this information reliably back when we were using floppy disks.
That the bits which control the block mapping are stored on the same device as a record in an SQL database doesn't mean that both are "the same stream".
> There is almost certainly no way that writing a "magic" byte sequence to a file will cause the storage device to misinterpret it as control data and change the mapping of logical to physical blocks.
Which is also what happens if you use parameterized SQL queries -- unless one of the lower layers has a bug, like Heartbleed.
There also have been several disk firmware bugs over the years in various models where writing a specific data pattern results in corruption because the drive interprets it as an internal sequence.
I distinctly remember bugs with non-Hayes modems where they would treat `+++ATH0` coming over the wire as a control, leading to BBS messages which could forcibly disconnect the unlucky user who read it.
In this particular case, IIRC Hayes had patented the known approach for detecting this and avoiding the disconnect, so rival modem makers were somewhat powerless to do anything better. I wonder if such a patent would still hold today...
What was patented was the technique of checking for a delay of about a second to separate the command from any data. It still had to be sent from the local side of the connection, so the exploit needed some way to get it echoed back (like ICMP).
DOS had a driver ANSI.SYS for interpreting terminal escape sequences, and it included a non-standard one for redefining keys. So if that driver was installed, 'type'ing a text file could potentially remap any key to something like "format C: <Return> Y <Return>".
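A sketch of such a sequence (the exact format here is an assumption recalled from the ANSI.SYS documentation: ESC [ code ; "string" p, where 0;59 is the extended scan code for F1 and 13 is carriage return). With ANSI.SYS loaded, merely displaying these bytes would redefine F1 to type "dir" and press Enter; a hostile text file could substitute something far worse.

```python
# Key-reassignment escape sequence as raw bytes. This demo string is
# harmless ("dir" + Enter); the attack described above would put a
# destructive command here instead.
remap = b'\x1b[0;59;"dir";13p'

assert remap.startswith(b"\x1b[")   # CSI introducer ANSI.SYS recognizes
assert remap.endswith(b"p")         # 'p' terminates a key reassignment
```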
Yes, that seems unnecessary. The overhead of trapping and rewriting every syscall instruction once can't be (much) greater than that of rewriting them all up front.
Even if you disallow executing anything outside of the .text section, you still need the syscall trap to protect against adversarial code which hides the instruction inside an immediate value:
foo: mov eax, 0xc3050f ;return a perfectly harmless constant
ret
...
call foo+1
(this could be detected if the tracing went by control flow instead of linearly from the top, but what if it's called through a function pointer?)
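The overlap is visible in the raw encoding (a small sketch; `mov eax, 0xc3050f` assembles to B8 followed by the little-endian immediate):

```python
# Bytes of "mov eax, 0xc3050f": opcode B8, then immediate 0F 05 C3 00.
# Entering one byte in (call foo+1), the CPU instead decodes
# 0F 05 (syscall) followed by C3 (ret) -- hidden in the immediate.
foo = bytes([0xB8, 0x0F, 0x05, 0xC3, 0x00])

assert foo[1:3] == b"\x0f\x05"   # syscall
assert foo[3] == 0xC3            # ret
```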
Thinking a bit more about it (and reading TFA more carefully), what's the point of rewriting the instructions anyway?
I first assumed it was redirecting them to a library in user mode somehow, but actually the syscall is replaced with "int3", which also goes to the kernel. The whole reason why the "syscall" instruction was introduced in the first place was that it's faster than the old software interrupt mechanism which has to load segment descriptors.
So why not simply use KVM to intercept syscall (as well as int 80h), and then emulate its effect directly, instead of replacing the opcode with something else? Should be both faster and also less obviously detectable.
Good point, an int3 is not going to be faster than a syscall, and if they implement the sandboxing policy in guest userspace it seems it would be quite easy to disable.
I think the point here is optimizing for the common case, the untrusted code is still running inside a VM, so you can still trap malicious or corner cases using a more heavy-handed method. The blog post does mention "self-healing" of JIT-generated code for instance.
It is possible to restrict the call-flow graph to avoid the case you described, the canonical reference here is the CFI and XFI papers by Ulfar Erlingsson et al. In XFI they/we did have a binary rewriter that tried to handle all the corner cases, but I wouldn't recommend going that deep, instead you should just patch the compiler (which funnily we couldn't do, because the MSVC source code was kept secret even inside MSFT, and GCC source code was strictly off-limits due to being GPL-radioactive...)
The follow-on posts describe where I plan to run the binaries. The idea is to run in a guest with no kernel and everything at ring 0, which makes SYSRET a dangerous thing to call; we don't have anything running at ring 3. Also, the SYSCALL instruction clobbers some registers. All in all, between the int3 and syscall approaches I counted around 20 extra instructions in my runtime (this is a guess, me trying to figure out what would happen). That is why int3 becomes faster for what I am trying to build. The toolchain approach suffers from the diversity of options you have to support, even if you ignore the issues you guys encountered. It might be easier with LLVM-based things, but there are still too many things to patch, and the moment you tell people to use your build environment it meets resistance.
I am currently aiming for Python, which is easy to do. The JIT work comes when I want to do JavaScript, which I keep pushing out because once I go down that path I have to worry about threading as well. Something I want to chase, but right now I'm trying to get something working.
Dedicated mail clients have existed for a lot longer than GMail has, work with any service using the POP3 or IMAP protocol, and don't run inside a web browser.
I recall we could dial up a super slow connection over telephone lines, get all our mail into such a client in less than 4 minutes over said slow line, and then hang up again.¹ Afterwards we would read all our mail offline with all the time in the world, carefully crafting replies and putting those into an "Outgoing" folder for the next time we could dial up a connection again (usually the next day). :)
¹) Back then you paid for Internet by the minute, or in the case of Deutsche Telekom, evening calls were billed in 4-minute increments, so you had to wait until after 21:00 to get the cheaper prices.
That worked because while the link may have been slow, it was circuit-switched and reliably delivered its 2400 bits per second. "Bad wifi" is unbelievably bad compared to an old dial-up link. It's so much worse than you're imagining.