Hacker News
Checked C: extension to C that adds static and dynamic checking (github.com/microsoft)
129 points by ingve on March 14, 2018 | hide | past | favorite | 92 comments



From the technical paper:

    Checked C adds new types to C:
    • ptr<T>, the checked pointer to singleton type,
    • array_ptr<T>, the checked pointer to array type,
    • T checked[N], the checked array type,
Many people have been down that road. GCC had "fat pointers" for subscript checking years ago.[1] That never caught on. Walter Bright had a proposal.[2] I had a proposal. Microsoft had "Managed C++". It's technically possible, but retrofitting old code rarely happens.

[1] http://williambader.com/bounds/example.html [2] https://news.ycombinator.com/item?id=8509155


It only works when companies take the "my way or the highway" approach.

Hence Microsoft went with C++ and .NET Native, keeping C compatibility only as far as ANSI C++ requires it, and has supported C++ kernel code since Windows 8.

It's why the NDK is such a pain to use on Android, with its very tiny API surface.

And why all Objective-C improvements are now only about improving interoperability with Swift.

So maybe if Microsoft decides to also force Checked C on Windows, it might eventually work.


10-15 years ago, Microsoft had the clout to do that. They no longer do.


They still do; it's just that inside the HN bubble people think they don't.

Azure is a money-printing machine, the "Year of the Linux Desktop" will never happen, hybrid tablets/laptops with Windows are being bought instead of Android ones, IoT devices for healthcare and ticketing machines run on Windows, in the enterprise Linux mostly matters in the server room, and they own PC/Xbox gaming.

So just like Apple and Google, they can do whatever they feel like, because non-technical consumers aren't going to suddenly start buying alternative systems just because devs don't like Apple's, Google's, and Microsoft's roadmaps for which programming languages one is allowed to use on their systems.


> if Microsoft decides to also force Checked C on Windows

how does this happen without diverging from the history of backwards compatibility?


> It's technically possible, but retrofitting old code rarely happens.

Yeah. A more realistic solution might be auto-translation of existing code. Here[1] is an example that replaces declarations of arrays, and pointers used as array iterators, with macros that enable the code to be compiled as either unsafe C code or "safe" C++ code (bounds checked and iterator use-after-free safe).

Determining in general whether a pointer is being used as an array iterator or as a pointer to an array is not trivial; it sometimes has to be deduced from context. That may be why it doesn't seem to have been done already.

[1] shameless plug: https://github.com/duneroadrunner/SaferCPlusPlus-AutoTransla...


I highly recommend the Static Analyzer that clang already uses. I use this frequently in the C project I am currently working on, and it's very good at spotting memory leaks and other potential issues. I've had more than one occasion where I've looked at an issue it's reported and said, nah, this is a false positive...oh, no, wait.


It's apparently a Microsoft research project: https://www.microsoft.com/en-us/research/project/checked-c/?...

Very interesting to add additional compiler support to an existing language, in a sense. Sanitizers and static analyzers have made C/C++ much more bearable, and this adds to the pile!


If anyone is looking for a battle-tested (sometimes literally), production-ready, statically and dynamically checked C-like language, take a look at https://ada2012.org!


Ada is in practice almost completely maintained by one single (consulting) company, AdaCore, who release all of their OSS contributions, except for the GNAT compiler, under GPLv3 "to limit your freedom [of use/reuse]" (it's stated explicitly on their website).

They do NOT want an autonomous Ada community to pop up, they just want to attract programmers they can contract. Since they're basically the only noteworthy contributors for Ada, they're impossible to bypass, hence using Ada means submitting to Adacore's whims.

Ada is indeed a very well designed language, with some recent language revisions too. So have you ever wondered why it never got any traction? Well, that's why.

Every 6 months or so on HN or Reddit, someone rediscovers Ada, gets psyched and tells everyone ... and then discovers the licensing / controlled ecosystem issues, and walks away. And we don't hear about it. For about 6 months.


> They do NOT want an autonomous Ada community to pop up, they just want to attract programmers they can contract. Since they're basically the only noteworthy contributors for Ada, they're impossible to bypass, hence using Ada means submitting to Adacore's whims.

Disclaimer: I work at AdaCore. Both forces are in tension in the company: some people push for the community, some for commercial interests. Of course as a company we want both to coexist, which is the difficult balance to find. The community advocates are getting stronger and are pushing to make their vision understood. This is a long process, but it's making headway. Anyway, saying that we don't want an autonomous Ada community to pop up is dead wrong.

The libraries we release on GitHub are generally under GPLv3+runtime exception, so you can actually dynamically link against them in code with a different licence. https://github.com/AdaCore/gnatcoll-core for example.

It is very possible, if not ideal, to develop native software in Ada today, with whatever licence you choose.


Adacore is not the only compiler vendor in town.

https://www.ghs.com/products/ada_optimizing_compilers.html

https://www.ptc.com/en/products/developer-tools/objectada

https://www.ptc.com/en/products/developer-tools/apexada

Ada is pretty alive in European universities, and has been a constant presence at FOSDEM and at high-integrity computing conferences over the last decades.

Also, if you want the industry to care about any C or C++ feature, you have to buy your seat at the ANSI/ISO table.


You are correct, but I was speaking in terms of general-purpose programming, with an OSS/free implementation available, and to my knowledge all the other vendors only target the embedded and high-integrity spaces. And they cost money.

Moreover Adacore is the only one to have a general-purpose OSS (albeit kind of captive) ecosystem.

Your ANSI/ISO remark is off-topic: you don't need a seat there to use the extensive C++ ecosystem. With AdaCore it's either GPLv3 all the way or you pay up. Also, to underline their gatekeeper status even more: they could one day decide to stop releasing their compiler work to GNAT, without giving any reason, and there's nothing anyone could do about it.


> You don't need a seat there to use the extensive C++ ecosystem.

Thanks to GCC and Sun, and later Apple.

GCC was a toy compiler until Sun decided to start charging for their UNIX development SDK, which made many companies contribute to GCC's development.

Likewise, if it weren't for GPLv3, Apple would probably never have bothered to create Clang.

> With Adacore it's either GPLv3 all the way or you pay up

Actually, I do have issues with people earning money from the work of others for free without contributing anything back.

GPLv3 is quite appealing for free software.

Want to earn money without contributing anything back? Use commercial licensed software.

> Also, to underline their gatekeeper status even more: they could one day decide to stop releasing their compiler to GNAT, without giving any reason, and there's nothing anyone could do about it.

There is always the very latest GPLv3 version, that everyone willing to contribute can carry on using.


> Actually I do have issues with people earning money with the work from others for free without contributing anything back.

With the BSD license, you can fork it all you want, but you have to maintain that fork. For most people/companies the feature is not a competitive advantage. So they donate the code back, and it means they don't have to spend so much time maintaining a fork.

The way I see it: I'm a working stiff that's used a lot of open source in the past. I don't mind releasing things under a BSD license to help another working stiff, as long as it doesn't contain my company's "secret sauce."


> Ada is in practice almost completely maintained by one single (consulting) company, Adacore

One would think this is a reason not to inform people of the existence of Ada – but I actually see it the other way around.

With a large enough community, the influence of Adacore would diminish and we would see more independent contributions. Ada has all the right things in place to create a great FOSS community, with an open, published standard, democratic contribution model, GNU support and so on. It's just missing the people to make it truly free.


The problem is that said GNU support (which AFAIK is the one everyone interested in the language uses) is contributed by ... Adacore. They can decide to stop doing that should they wish to do so. I might be out of date, but their Ada compiler is the only one that's up-to-date, targets multiple platforms, and can be used for free. So they have a sort of de facto vendor lock-in.

Also, if you have to recreate a collection of libraries from scratch, I have the feeling people would rather contribute to a newer language in the same (or similar) space that is doing just that, such as Rust -- which I think is what is happening, even if Ada still has some distinctive features.


The GNAT compiler suffices for general programming in Ada though, so I'm not sure I understand your objection.

Also, Adacore is not the reason Ada didn't catch on. There were a lot of limitations in the standard library as of Ada95, because Ada focused so much on the embedded space. Because of this, some things that were easy in C/C++ were unnecessarily difficult in Ada.

Frankly, I think Rust achieves better ergonomics without Ada's verbosity, but Ada still has some great features that I wish were in other languages, like ranges.


I really want Ada to get its fair share of attention. It's an impressively well-thought-out language. It's like a team of academics and industry pros took a long hard look at C and C++ and then made something to fit the same niche (high-performance, real-time, and very important software), but made it almost impossible to write incorrect code.

Actually, that's exactly what happened.


Is it?

To start, it isn't really a C-like language, is it? It looks more like Pascal (or Modula), which is a different offshoot from ALGOL than C is.

Wikipedia tells me that Ada-0 was based on LIS from the mid-1970s, so a couple of years before the K&R C book came out.

The Ada '83 Rationale, at https://web.archive.org/web/19970407040255/http://sw-eng.fal... , mostly compares Ada to Pascal equivalents, though many other languages are mentioned.

The bibliography lists those languages: https://web.archive.org/web/19970407041806/http://sw-eng.fal... . You'll note that C is not present in the list.

The Ada 95 Rationale does reference C and C++ ("Ada 95 incorporates the benefits of Object Oriented languages without incurring the pervasive overheads of languages such as SmallTalk or the insecurity brought by the weak C foundation in the case of C++" - http://www.adahome.com/LRM/95/Rationale/rat95html/rat95-p1-2... ) but as you can see, it also refers to Smalltalk. The references include Modula (though not Modula-2) and Wirth's paper on "Type Extensions".

It comes across like they drew much more from languages other than C and C++, and were in the niche of "high performance, real time, and very important software", but from a different path than C/C++.


Sure, it's obvious it wasn't based on C/C++, but these days, among the most popular languages, it plays in the same sandbox. I'm sure if you'd run an unsupervised clustering over the Wikipedia pages for various programming languages, you'd get Ada in the same cluster as C, C++, Rust, D, and maybe Go.

...actually, this is something I'd really like to do if I had more time...


You say it's obvious, and I agree. But I want to know why kqr called it a "C-like language" and why inamberclad said it comes from "a team of academics and industry pros [who] took a long hard at C and C++ and then made something to fit the the same niche."

Those seem to be ahistorical comments.

It's no surprise that languages used in embedded and real-time systems would cluster together. But that doesn't reveal intent and influence.


Last I heard about Ada, there was the issue of having virtually no community that shares code freely, just commercial vendors selling theirs.

Has that situation improved?


There's a free compiler called GNAT, but the community is still rather small. I think it's something of a catch-22. Most of the resources you'll find on the internet are old (mid-90s) blogs and educational sites.

Here are some of the better ones I've found:

https://www.adacore.com/gems/

https://blog.adacore.com/

https://two-wrongs.com/tags.html#ada

The official reference manual is freely available too, but it's quite dense.


Given the parent comment :

>>> fit the same niche (high performance, real time, and very important software)

I don't see any community emerging from the web any time soon :-) To me this niche means "avionics, military, nuclear, medical, aerospace". Not your weekend project.


C and C++ have big communities and they're also not primarily targeted at weekend projects.


They used to be; using "scripting" languages as general-purpose languages is a relatively recent development. Up until the early 2000s, few people would have considered writing anything remotely CPU-intensive in Perl or JavaScript. For instance, I first started coding in C in high school because I wanted to write games for my calculator and the BASIC interpreter was too slow. I don't think there was an Ada compiler targeting my calculator, but there sure was a gcc port.

C's history is tightly coupled with that of Unix, it's always been rather open and "hacker-friendly".


> C's history is tightly coupled with that of Unix, it's always been rather open and "hacker-friendly".

I'm not sure I agree. Sure, it's reasonably easy to write your own implementation of something that closely resembles C, but C is also infamously dependent on implementation specifics and it, last I checked, had a paywalled standard. That's the opposite of open and hacker-friendly.

The primary reason C feels open is, I suspect, precisely its Unix coupling. If you want to read the sources to your Unix-like OS, you're likely to find sources written in C. You'll also find system libraries with C calling conventions. If you crack open your OS, you find C. It's easy to see the connection between "open", "hacker-friendly" and "C" – but the connection is between your OS and your knowledge of C – not C itself.


>You'll also find system libraries with C calling conventions. If you crack open your OS, you find C. It's easy to see the connection between "open", "hacker-friendly" and "C" – but the connection is between your OS and your knowledge of C – not C itself.

You're right (and it's a bummer that the C standard is behind a paywall) but don't you think that having a large open ecosystem is more important and significant than having an open technical standard, especially when several high quality open source compiler implementations already exist? Having an open standard won't do you much good if all the tooling and libraries are closed source.


FYI, I get a tls error on that link, in case you're affiliated with the webmaster.


I'm not affiliated, so not much I can do there. But it did pique my interest: I get the warning on my desktop computer but not on my laptop or my phone. All three are running a recent version of Firefox. No idea why that is.

Edit: Silly me. I know why. I visited the HTTP version of the site on those two devices...


Kind of a nitpick, but actually the first visit to https://ada2012.org greets you with a CERT_INVALID error from the browser.


One thing that C should have is an automatic string type (more like a byte[] than a String) that knows its own size. A huge amount of programming bugs arise from the lack of strings, and I can't see the reason why they could not be supported even in small systems. Delphi/Object Pascal had something like 3 or 5 function pointers that support the string type (create, free, resize, things like that).


The funny thing is that the memory allocator typically knows the size of the string.

You call free(foo) not free(foo, 12) or free(foo, strlen(foo));

But there's no standard way to ask for this size. Why doesn't strcpy just check with the allocator how much memory is available and refuse to write past that?


Because the programmer can do pointer arithmetic, treat non-char[] memory as char[] memory and probably other weird stuff that makes this more complicated than it sounds, I guess.


Just make a type where pointer arithmetic and other such uses are a compiler error?


Just use a language where types automatically know their size? I thought the point was to make existing code safer by exploiting information that's already there (e.g. the allocator knowing the size of things on the heap).


The point is that C could be made much safer to use with a few "simple" (at least seemingly) tweaks. C has great tooling support on all platforms, including embedded, so it would be a compelling advancement if the language standard were to include such a thing. The important thing would be to coerce such a string type to character arrays so it remains compatible with all libraries without casting.


Not every string is allocated from the heap. There is no way to know if a string passed to a function, for example, came from the heap or not.


That's true, although there's no reason the stack allocator couldn't record this information somewhere as well.

And if the memory allocator doesn't know about the string then you could just revert to the existing behaviour so you're no worse off.


Clang offers a memory sanitizer that does a lot of extra checking. People don't ship their software with the sanitizer activated because the performance hit is nontrivial.


You're free to do as you please here. You don't have to use malloc()/free(), just implement your own LIBRARY_Alloc()/LIBRARY_Free()/LIBRARY_GetMemorySizeBytes() :)


There's more ways to allocate memory than the heap and stack.


What are you referring to? sbrk? mmap? I typically use either the heap or the stack myself.


Something like memory pools or manual memory management. Yes the memory pool will come from the heap or stack but the heap/stack allocator will have no idea what is going on in there.


Static memory is at the bottom of the address space, below the heap. But that would really only be relevant in the case of string literals unless I'm mistaken.


You can have a static buffer:

    static char buf[128];
and use strcpy() at any point to copy into it, so I don't think statically allocated strings must be literals.


There is a distinction between const and non-const static memory, though, in that the former can be stored in Read-Only Memory. Still, static memory is generally used for many fewer purposes than the stack or the heap, and if I understand correctly, sbrk is still working with the heap, just more manually.


That's right. Allocators use sbrk and mmap to manage the heap. IIRC high-performance allocators tend to use mmap exclusively.


Okay but you don't have to implement a 100% solution to make 90% of existing code much safer.

How much C code is out there that only uses malloc/free? All of those programs could be made safer if the default string methods were able to prevent writing past the end of the array.


And all of them know how much memory they've allocated.


Yes, but unless you expose at the language level an interface for all types of allocators and enforce it then a whole lot of allocation techniques like memory pools couldn't be used with strcpy or would have to use char* type of strings again.

The point is not that allocators don't know how much memory has been allocated, but that C has no idea about the underlying allocator, so it cannot ask it about the length of the string as it was proposed.


That doesn't mean you shouldn't be able to, though.

It would be nice to be able to tell, in a portable way, whether something is allocated on the heap or the stack.

Likewise, being able to ask whether an object is thread-local or not. The language and runtime could support that, but studiously do not.

Just like the language could allow you to directly monkey with the ABI, but does not.


That something could be allocated by a custom allocator, like new or g_slice_alloc. You can monkey with the ABI, as numerous FFI libraries do, but it is not portable by definition, unless you're using, well, a portable FFI library. Personally I can't see why string, ABI, etc. libraries must be included in the language. Include GLib++/GObject++ then and use its actually large ecosystem -- what problem are you trying to solve by blaming the low level? That the low level is too low? Do you need a high-level low-level that everyone will obey next morning? Check Java?


> portable FFI library

I looked into that, not available on many many platforms.


strcpy has no access to the "allocator". It doesn't know about malloc, realloc, or free.


Yes -- and that has turned out to be a bad design decision. It's somewhat convenient to use methods like strcat() to build up strings, but it also leads to very insecure code.

I think the inclusion of strdup() shows that it was a mistake to make C strings completely allocator-agnostic.


Strings don’t have to fill the char array they live in. Overallocating a buffer is a common technique to avoid frequent reallocation.


We have tried to do this by adding a "null terminated string" type. Start in section 2.4 of the specification (https://github.com/Microsoft/checkedc/releases/download/v0.7...).


I've used something similar --- a null-terminated string, with a length prefix at a negative offset. Resizeable ones have an "allocated size" field before that one. This makes for strings that are both easy to manipulate and compatible with existing functions that only read a regular C string.



Good to see I'm not the only one who thought of that idea, although my implementation has a different (and IMHO better) field order, since it allows for statically allocated strings with only a length field --- a sort of subclassing.


So, BSTRs? They’re common in old-timey COM apis


BSTR size field is at positive offset, IIRC.


You do not RC :)


So in a sense implementing Pascal strings in C?


Yes. But also "abusing" C's permissiveness with pointers to get Pascal strings that interop with standard functions. The string gets passed around as a pointer to a char array, but the length function returns the value stored at that pointer minus sizeof(size_t). In most languages this wouldn't be possible.


The point is to keep C small and flexible. It supports whatever kind of string implementation needed by the present environment, platform and algorithm.

That said, several “high-level” string libraries for C exist that make it safer, more efficient, etc. Some of them are even pretty clever, allocating extra space right before the C pointer to a null terminated C string that it gives you, so it can do its own bookkeeping.


The main issue is that there's already a lot of standard library support for classical strings--things that are just a pointer to a chunk of bytes, and no more; rewriting those functions would be destructive. The C standards group could add to the standard library, I suppose, but generally they are a bit reluctant to do so.

It is not the case that C has a benevolent dictator with a particular vision and drive for the language. It's more like a small group of people who don't want to mess things up. And so not much changes. This is neither good nor bad to me--it simply is; and anyway, these days, you have quite a few options that do offer automatic strings that can also compile into machine code (Go, Rust I suppose?).


> rewriting those functions would be destructive

Getting rid of the standard string functions would be the best thing to happen to the C language in the last 25 years.

There are two types of C programs: those that scrupulously avoid the standard string functions, and brittle programs shot through with security holes.


At the point when you have to convert between string types and can no longer easily use much of libc without conversions, why not just use a different language that's already gone much farther in the name of safety, usability or other features?

In other words, if your pseudo-c candidate already suffers from most the interop problems that Rust, D, Nim and Go do, why not just use one of those and at least reap the other benefits they provide?


Anything to help legacy C code base is welcome. Hard part is knowing whether touching decades-old, tested code is worth the risk.


I mean, I guess it's the best thing if you want to wreak havoc and chaos? Which, perhaps, you do!

Your suggested dichotomy is, of course, a little bit false. But I'm sure you knew that when you wrote it down.

It's entirely possible to write secure programs in C, even with standard functions. Writing your own code does not somehow confer a level of security-consciousness that you lacked when sticking to strings.h. (It does give you a wonderful opportunity to write your own security holes that no one has discovered yet!)

I mentioned this somewhere else, but we're in a pretty good place right now with languages; we finally have really solid alternatives to C that can compile to machine code, in both Go and Rust.


> It's entirely possible to write secure programs in C, even with standard functions.

You realize that most standard string functions are outright banned by organizations that care about security. As in, you're not allowed to use them, not even if you pinky swear to be 'careful'.

https://msdn.microsoft.com/en-us/library/bb288454.aspx


I think Fortran uses a number followed by a byte array to represent strings. This was the big argument with C, which uses null terminators. In retrospect, maybe Fortran was right after all.

Edit: Actually it seems I confused Fortran and Pascal. Also, Fortran uses an internal length variable.


I think the failure of C to support a built-in array type is exactly like Steve Jobs refusing to give up one-button mice.

The reasons aren't technical or based on merit but on stubborn ideology.


Arrays are actually pointers to the first element... They get translated to pointer arithmetic: adding an offset to the first element. You can easily create a struct with one field for the size and another with the pointer to the first character


Semantically, arrays are different from pointers. They do, however, decay into pointers in many situations, which causes that misunderstanding.


A language with an array type, e.g. Java, distinguishes between a 'null' array reference and an array of zero length, which could substitute for an option type in languages that don't have one: SOME([]) vs NONE.

But a plain char * in C makes no such distinction.


There is a difference. An empty string in C is a non-null char* pointer pointing to a zero byte. Passing a null pointer to strlen will crash.


A zero-length array can't ever be a valid C string at all, since C strings have to be null-terminated. I don't think that is actually relevant to the claim you are replying to, which is about arrays and pointers, not strings.


That type of thinking led to C++, a language whose contributors do not know how to stop adding features.


Is there a whitepaper associated with this? I found the project page and the github page, but I want to see:

a) What's involved in getting the C code to be 'safe' - how much annotation overhead, etc

b) How this interacts with C Preprocessor

c) What the runtime overhead is (usually the killer)


There is a specification: https://github.com/Microsoft/checkedc/releases/download/v0.7...

We have been trying to publish a paper on this, perhaps one day we will be successful! Our runtime overhead is very good, and our interaction with the preprocessor is minimal.


Wonderful, thank you for the link.


I wish that when all these guys tried to create a better C/C++ they made it compatible such that I could create shared libraries usable from other languages.

C, not even C++, is still the only common denominator between languages. The fact that I cannot create a shared library in Golang or Python that is callable from the other is very disappointing.


I wonder if that has more to do with C++'s undefined name mangling than anything else. C exports are beautifully simple to read. C++ exports, on the other hand, are some sort of alternate-reality mind-bending shitstorm, due to the language not defining a name-mangling scheme.

Maybe I'm wrong and hopefully someone can correct me; I don't have deep export knowledge outside of simply using C exports, but it seems this decision has had long-standing repercussions for people wanting to use FFIs outside of C.


Anyone know what stage the implementation (https://github.com/Microsoft/checkedc-clang) is at, is it ready for use?


It is not ready for production use.

Do try it out though!


How does this compare to SAL, out of interest? Somewhat competing aims, no?


The GitHub repo is just a spec; you have to use their llvm/clang fork to use this extension. It would be nice to have this as an independent executable that does C checking, but apparently that's not the case here.


How does it compare to CCured ?



