Hacker News

Interesting. The important development is the porting tool which automatically infers the size of objects. Many people, including me [1], have proposed safer variants of C. The problem is converting legacy code. They've made some progress, but not enough to use yet.

The important result is in Table 1 of the actual paper.[2] The converter looks at pointers, and classifies them as "pointer", "array of known size", or "unknown". "Unknown" ranges from 49% to 76%. You can't convert an existing code base automatically with that low a success rate.

The big design mistake in C was "pointer" and "array" being the same thing syntactically. Trying to undo that has taken decades.

Their converter is on GitHub.[3] This is a Microsoft project.

The syntax they use is rather clunky:

    void copy(char *dst : bytecount(n),
              const char *src : bytecount(n),
              size_t n);
That's halfway between C and Pascal/Ada/Rust.

I proposed more C-like syntax like this:

    void copy(size_t n, char (&a)[n], const char (&b)[n])
It's annoying that you need the parentheses, but that's C/C++ syntax.

I wanted to put C++ style references into C so you could pass arrays as arrays, rather than degrading them to pointers. I also wanted to add slices of arrays, so you could pass a slice of an array as an array. Those are standard features of most later languages, and handle most of the cases for which pointer arithmetic is used. C doesn't have the expressive power to even talk about those things unambiguously. So they can't be checked either at compile time or run time.

The hard job is converting legacy code by automatically inferring the information that C source code lacks. It's good that they're trying. But a 50% success rate indicates they're just getting the easy hits. If anybody ever gets a good array size inference system, it would be a huge win. Then existing C could be converted to something better. Not necessarily "Checked C". Rust or Go would be options.

[1] http://www.animats.com/papers/languages/safearraysforc43.pdf

[2] http://www.cs.umd.edu/~mwh/papers/checkedc-incr.pdf

[3] https://github.com/Microsoft/checkedc-clang



> "Unknown" ranges from 49% to 76%.

Yeah, this is interesting. They're saying they can't determine whether a pointer targets an array buffer or not? They might want to take a look at the (long neglected) "C to SaferCPlusPlus" translator[1], which can do this. (It was an unexpectedly taxing undertaking, though.) It converts C arrays and allocated buffers used as arrays into memory-safe implementations of std::array<>s and std::vector<>s, so failure to properly identify them would generally result in output code that wouldn't compile.

The examples they give of problematic code in the paper:

    void f(int* a) {
        *(int**)a = a;
    }
and

    f1(((int*) 0x8f8000));
don't strike me as the kind you would often encounter in real-world code.

> The syntax they use is rather clunky

The output code of the "C to SaferCPlusPlus" translator replaces the types and declarations with macros[2] that can be redefined with a compile-time directive to either use the safe C++ implementation, or revert to the original unsafe native C implementation. The argument being that using macros instead of custom syntax makes the source code more versatile. And existing C programmers already "get" macros.

[1] shameless plug: https://github.com/duneroadrunner/SaferCPlusPlus-AutoTransla...

[2] https://github.com/duneroadrunner/SaferCPlusPlus/blob/master...


Saw this in the translated "SaferCPlusPlus" output examples.

    static void string_set(char** out, const char* in)
What happened there? Where are the array types? Wrong place to look?

If inference can't make a definitely good decision, maybe translators should guess, conservatively. That is, if it looks like something needs an array type parameter, make it an array type parameter with subscript checking. Then run tests on the translated program and see if that works. That's what humans do on such code. Machine learning has potential here. For any array in a working program, there must be some expression of some variables that expresses the size of the array. If humans can't find that expression, the program is unmaintainable and probably has a bug.

There are really 3 cases.

1. this is a pointer, and it's never subscripted or offset. That's a pointer to a single instance of something.

2. this is a pointer which is subscripted or offset, and we can tell from context how big the array is.

3. This is a pointer which is subscripted or offset, but auto-translation fails to figure out how big the array is supposed to be.

The problem is to convert (3) into (2).

I tend to think that a good metric for C code quality is how hard that is. If it's not obvious by looking how big something is supposed to be, there's probably a potential bug.

[1] https://github.com/duneroadrunner/SaferCPlusPlus-AutoTransla...


Thanks for noticing :) It's been quite a while since I worked on the code, but I believe the translator intentionally left types declared as "char *" unmodified, assuming that they were being used as strings [1] rather than "regular" array buffers. I'm guessing that dealing with strings would have been a lot more work, because it would require providing safe compatible replacements for all the standard C library string functions.

I think you should find that array buffers of other types, like "unsigned char" or "const unsigned char", and their associated pointer iterators are translated to their corresponding macros. I'd be interested if you find otherwise. If you're interested, the relevant code for the translator is in the "safercpp" subdirectory [2]. It's not super-well commented so if you have any questions feel free to post them in the "issues" section of the repository.

[1] https://github.com/duneroadrunner/SaferCPlusPlus-AutoTransla...

[2] https://github.com/duneroadrunner/SaferCPlusPlus-AutoTransla...


OK, here's a non-string function where the translator is trying to deal with C written like it's 1980:

    static unsigned countZeros(MSE_LH_ARRAY_ITERATOR_TYPE(const unsigned char) data,
        size_t size, size_t pos)
    {
      MSE_LH_ARRAY_ITERATOR_TYPE(const unsigned char)  start = data + pos;
      MSE_LH_ARRAY_ITERATOR_TYPE(const unsigned char)  end = start + 
          MAX_SUPPORTED_DEFLATE_LENGTH;
      if(end > data + size) end = data + size;
      data = start;
      while(data != end && *data == 0) ++data;
      /*subtracting two addresses returned as 32-bit number
          (max value is MAX_SUPPORTED_DEFLATE_LENGTH)*/
      return (unsigned)(data - start);
    }
What guarantees that the "while" loop will not run away and take "data" outside the array bounds?

I proposed a version of C with slices and references, where you could write that like this:

    static unsigned countZeros(const unsigned char &(data)[size],
        size_t size, size_t pos)
    {
      const unsigned char &(data1)[size-pos] = data[pos:size-pos]; // slice
      size_t cnt = 0;
      while (cnt < LENGTH(data1) && cnt < MAX_SUPPORTED_DEFLATE_LENGTH && data1[cnt] == 0) ++cnt;
      return(cnt);
    }
The "data" parameter has size info, so the language knows how big it is. The "data1" variable is a slice of "data". This eliminates the need for pointer arithmetic. Much pointer arithmetic in C, especially where you have a pointer partway into an array, is an attempt to emulate a slice.

Automatically extracting slice usage from code with pointer arithmetic is a tough problem. But not impossible. When you see code constructing something like

    data = start;
    while(data != end && *data == 0) ++data;
you have to recognize that as subscripting.

    while(data != end && *data == 0) ++data;
    return (unsigned)(data - start);
should become first

    data = start;
    size_t dataix = 0;
    while(&data[dataix] != end && data[dataix] == 0) ++dataix;
    return (unsigned)(&data[dataix] - start);
by substituting subscripting for pointer arithmetic.

Next, when you see an offset array being created, as in

    start = data + pos;
turn that into a slice:

    const unsigned char &(data1)[size-pos] = data[pos:size-pos]; // slice
The slice is the same pointer, but there's now valid size information associated with it.

If you do transformations like that, you get a version of C where subscript checking is possible. You can then hoist or prove out many of the subscript checks. Here, the compiler would be expected to understand that if an array subscript is less than LENGTH of the array, it's safe. LENGTH here, as I wrote in my paper, refers to the length of the array as known to the compiler from the array declaration. Here, array lengths can be expressions evaluated at declaration time. That's how length info gets passed around.

    const unsigned char &(data)[size]
as a parameter means "this is an array of 'size' elements". "size" comes in via another parameter. The function can assume "size" is valid, and all callers must check that, either at compile time or run time.

If you can't write an expression for the size of something, you have a big problem with your program.


> What guarantees that the "while" loop will not run away and take "data" outside the array bounds?

What do you mean "the array bounds"? The code is memory safe. "data" is an iterator that knows exactly what array/container it's pointing to, and that container knows its own size. Dereferences are bounds checked (by default).

This translated code is not intended to be performance optimal. The translator does not add, remove, or rearrange any of the original source code elements; it simply replaces some of them with macros defined as functionally equivalent, memory-safe C++ substitutes for the original element. Doing it this way has the benefit of allowing you to "disable" the memory safety mechanisms by reverting the macro definitions to the original (unsafe) elements.

I have not yet gotten around to addressing performance of the translated code. In order to preserve the ability to revert back to pure C code, there would need to be an additional set of macros (like maybe an "array view" macro) that could be mapped to their (safe) high performance C++ counterparts but that would be more restricted in their usage.

But at this point I think the value of that is questionable. If you need your code to be memory safe and high performance, the most expedient thing to do is to just accept the translated code as C++ code (or SaferCPlusPlus code) and re-optimize the performance bottlenecks as idiomatic SaferCPlusPlus code. SaferCPlusPlus is, along with Rust, the fastest [1] option for memory (and data race) safe programming.

And if you don't like the C++ language as a whole, just (define and) stick to a subset you're comfortable with, right? I mean, (I think your proposal is fine as an extension of C, but) I don't see the point in extending the C language with things like views/slices/spans when the C language is already extended with those. It's called C++ (or some subset thereof), right? And with C++ you can solve the memory (and data race) issues much more comprehensively and performantly (if that's a word :) than with any extension to C. No?

[1] https://github.com/duneroadrunner/SaferCPlusPlus-BenchmarksG...


> The big design mistake in C was "pointer" and "array" being the same thing syntactically.

Couldn't agree more. I actually tend to think that C is totally fine for me, but "forget" that this is really the C that is embedded in Objective-C. Meaning we have basic collection classes like NSData (bytes), NSArray (objects), NSString (well, strings, not the same as bytes) that take care of these things, yet are close enough to the metal that you can still largely treat them as C.

Well, maybe not all of them, but particularly NSData.


I never understood why it would be so hard to write ptr = &var[0] like in any sane systems programming language.

Is saving the typing of those extra four characters worth the CVE mess we got into?


That's assuming it was about saving characters. It's possible the motivation was to save bytes at runtime. On systems whose total RAM is measured in single or double digits of kilobytes, that's a strong motivator.

Another possible motivator could be simplifying the compiler implementation by making all arrays work the same way from a code generation point of view. Heap arrays have a size known only at runtime, so it makes sense to carry this size with the array, as the malloc() call does. But stack and global (and static) arrays have a size which is known at compile time, so they do not need a size carried around with them.

Back then, using fixed sized arrays all over the place was common practice. Look at DOS 8.3 character filenames for example. So, dynamic heap allocation might have been viewed as something people wouldn't use constantly. And compilers were a lot simpler, so creating two different implementations for arrays might have seemed excessive.


First of all, outside Bell Labs there were computers being programmed in high level languages since 1961, definitely much weaker computers than a PDP-11.

Secondly, ptr = &array[0] is something that would happen at compile time.

Third, C authors seem to have a predisposition to ignore state of the art in existing languages.


Not sure it's quite that easy. Wouldn't you have to have a size field so the size of the array is known? What integer is the size field? And now the array is a struct, and I dimly recall that passing structures by value came later.

Or do we do it all in the type-system? Except we don't really have much of one. And Pascal just showed what a dumb idea having the size of the array encoded in the type was, with functions that take arrays of different sizes incompatible with each other.

Anyway, I think it's really kind of the other way around: it's not that C has arrays and pointers and wanted there to be compatibility. C doesn't actually have arrays, it only has pointers. What passes as arrays is a minimal bit of syntactic sugar that doesn't affect the semantics, which are just plain pointer semantics.


It is so hard that a large majority of systems programming languages do it.

That Pascal example is getting tired by now. It only applies to the first edition of ISO Pascal (superseded by ISO Extended Pascal), and was improved upon by every systems language born afterwards.


Hmm, I guess I should have been more explicit: ...in the context of the rest of C...


Meaning the C language, or system languages derived from it?



