Building a Language and Compiler for Machine Learning

eigenspace · on Dec 3, 2018

Mike Innes is one of the real rockstars in the Julia community. He has a knack for making surprisingly small and elegant packages which compose so naturally with the base language that they feel built-in.

After a `using Flux`, julia is suddenly a machine learning language, instead of a language with a machine learning library. I'd argue it shouldn't be surprising that he found his way to Julia because Julia is one of few languages that allows one to make such packages.

His other packages such as MacroTools.jl, Lazy.jl and Zygote.jl are also well worth checking out.

mark_l_watson · on Dec 4, 2018

As an old Lisp user, I am impressed by Flux (which I started using this weekend, after someone on HN recommended Flux to me) as transforming Julia, like building Lisp up to a new language for whatever problem you are working. I also appreciate how incredibly easy it was to get started with Flux which ‘just worked’ with CUDA 10 and the GPU in my laptop and the model zoo was great to get started. Really quality stuff!

ChrisFoster · on Dec 4, 2018

Hah, that someone was me, I'm glad it clicked for you! I had a similar experience of it "just working" when trying a few weeks ago. I particularly enjoyed being able to build and run Flux models interactively just like any other julia code. Also, having a hand written loss function just run on the GPU with no extra effort kind of blew my mind.

atrilumen · on Dec 4, 2018

How "ready" is Flux, for building something like Rasa?

https://rasa.com/

ChrisFoster · on Dec 5, 2018

My feeling is that Flux is a fantastic tool for playing with innovative models which can't be expressed in the usual frameworks. And quite possibly already the best thing that exists for open ended experimentation in ML.

On the other hand, for training and deploying models which are easily expressed in other frameworks you will find a lot more ready made pieces of infrastructure elsewhere.

atrilumen · on Dec 5, 2018

Thanks!

dnautics · on Dec 4, 2018

I think your major problem is going to be orchestration, which is not easy to solve.

stabbles · on Dec 4, 2018

Since the blog post does not have many code samples, this non-trivial AD example with Zygote.jl is worth sharing (it's from their readme):

    julia> using Zygote
    julia> fs = Dict("sin" => sin, "cos" => cos);
    julia> derivative(x -> fs[readline()](x), 1.0)
    cos
    -0.8414709848078965
    julia> -sin(1.0) # 'true' derivative in 1.0
    -0.8414709848078965

So Zygote can apply AD to an anonymous function that looks up a function in a hash table from user input.

https://github.com/FluxML/Zygote.jl

UncleOxidant · on Dec 4, 2018

That's pretty impressive.

I notice this in the zygote readme:

"The Julia compiler does not yet support all features needed to make Zygote fast, particularly in the presence of control flow."

Any idea of when the Julia compiler will support these features?

viksit · on Dec 4, 2018

Could you elaborate why this is so impressive? As I understand it, it's because things like TF or PyTorch can only AD specific things as part of a spec vs generic functions like this?

FridgeSeal · on Dec 4, 2018

It's impressive and super super cool, because now any function you can write in Julia, the AD can differentiate.

You don't have to hope that your AD/ML package has that function, you can just write it or find it in a package and punch it straight in. That's awesome.

0-_-0 · on Dec 4, 2018

How is that different from e.g. Tensorflow? You can compose any function you want that has differentiable components, and (re)define the derivatives of elementary or composed functions.

Setepenre · on Dec 5, 2018

No tensorflow does not allow you to do that. Tensorflow differentiate a TF graph. Flux differentiate Julia's IR.

Tensorflow has OP(eration)s that are differentiable that you can compose and that's it. If you want to implement something that is going to be differentiable you need to implement it using Tensorflow OPs or add your own OPs to [tensorflow] with their gradient.

With Flux you can take code written by a random guy on the internet that never thought about using his stuff in ML and Flux will be able to differentiate it anyway.

[1]: https://www.tensorflow.org/guide/extend/op

0-_-0 · on Dec 5, 2018

You mean like AutoGraph in Tensorflow?

https://github.com/tensorflow/tensorflow/tree/master/tensorf...

FridgeSeal · on Dec 5, 2018

Still a small subset of the language and features

They've also got to manually implement the Python -> autograph translation for a whole variety of language features (so any language features that get added or changed will break autograph until it's updated.

Flux gets this essentially for free, for the entire Julia language, without the need to manually build that language -> tensorflow translation layer. With the added benefit of Julia's non-trivial performance benefits.

byt143 · on Dec 5, 2018

Still a very restricted subset of python, can't autodiff custom types etc

twtw · on Dec 4, 2018

Yeah, it's impressive because it's differentiating the language - it doesn't require putting 'tf.' in front of everything and hoping tf has the function you want.

ChrisRackauckas · on Dec 3, 2018

As someone who works in merging differential equations and machine learning, I have found this kind of work essential for what I do. Pervasive AD that allows merging neural networks and diffeq solvers is allowing us to explore all of kinds of new models and new problems. Sure it doesn't impact vanilla machine learning all that much (though Zygote.jl does allow for a lot of optimizations that wouldn't be possible with tracing-based AD), but it definitely opens up a new wave of AI possibilities.

ArtWomb · on Dec 4, 2018

Neural ODE solver just announced as a Best Paper at NeurIPS 2018 conference. What's interesting is that not only do preliminary results yield significantly less error in prediction vs RNNs. Differentiable models allow for extrapolations well beyond range of observable test data set.

Neural Ordinary Differential Equations

https://arxiv.org/pdf/1806.07366.pdf

ChrisRackauckas · on Dec 4, 2018

I am working on very different approaches with very different applications, but yes the neural ODE (not a solver BTW) is a good example of this kind of application and the effectiveness of the approach. It's a fantastic paper if anyone hasn't read it.

YetAnotherNick · on Dec 3, 2018

Can you define what is pervasive automatic differentiation please.

byt143 · on Dec 3, 2018

AD arbitrary julia code with mutation, closures, built in and user defined functions, fancy data structures etc (given the algorithm itself can be differentiated).

YetAnotherNick · on Dec 4, 2018

Something like eager execution in tensorflow?

ChrisRackauckas · on Dec 4, 2018

Eager execution in TensorFlow doesn't work on all Python code or all Swift code. What I am essentially asking for is equivalent to sticking AD on the source code of say any part of SciPy and expecting it to work without hard coding workarounds.

Here's my deal. I wrote and maintain the differential equation solver library in Julia. This thing is a huge piece of code composed of almost 70 packages and took something like 5,000 commits and almost entirely pure-Julia. I will keep writing and maintaining this code as its own research project for many reasons. And, sometimes, I want to AD this code or stick it in a neural network.

It's a very non-standard application of this kind of code, so it wasn't built to do this from the start. But I am a greedy hacker. I don't want to have to re-write it onto some computational graph package (TensorFlow) or build interfaces. I want to just call some AD library function on any pure-Julia function in my package and have it output derivatives just like it was any numerical diff package. ForwardDiff.jl, ReverseDiff.jl, Flux.jl, etc. all have autodiffs that work on these routines. I find that magical. In fact, I was shocked that it can be faster than the standard way of calculating these derivatives via sensitivity analysis (we'll put a paper out on Arxiv in a few days about this). There's still a few hiccups that can be fixed up for non-ML applications, but these Julia differentiation tools have really impressed me.

ChrisFoster · on Dec 4, 2018

> faster than the standard way of calculating these derivatives via sensitivity analysis

This is quite shocking indeed, I'll look for the paper when it comes out. I'm particularly interested in whether this is true for the sensitivity functions used in shape optimization of composite materials. For example, when optimizing for a stiff, conductive material, eg https://www.sciencedirect.com/science/article/pii/S002076830...

ChrisRackauckas · on Dec 6, 2018

This is it: https://arxiv.org/abs/1812.01892

ChrisFoster · on Dec 12, 2018

Thanks! At a quick glance looks like this may not be useful for the problem I had in mind, but I'll have to read it in more detail.

byt143 · on Dec 4, 2018

Except much faster and much more generic. Eager execution is very limited in the subset of python it can differentiate, for example. There's more info in the OP blog post.

S4M · on Dec 4, 2018

> As someone who works in merging differential equations and machine learning

That sounds really interesting, could you expand on your work/research?

faitswulff · on Dec 4, 2018

A bit of googling yields http://chrisrackauckas.com/ which seems like a likely match.

jlebar · on Dec 4, 2018

As an XLA:GPU person I'm curious how the performance of Julia natively compiling to CUDA compares to using XLA:GPU.

In particular, is this a promising approach, or do you see it as a dead end compared to generating GPU code natively? If it's promising, are there things we need to do in XLA:GPU to make it less awful for you?

(Reasons you might want to use XLA:GPU include, you don't have to reinvent all our performance and correctness hacks for cudnn, and maybe our kernels run faster since we're targeting such a limited domain?)

KenoFischer · on Dec 4, 2018

We've been meaning to run this comparison, but haven't gotten around to it yet. I expect it to work and am hoping to see some performance benefits. It should be fairly straightforward to see it working, the only reason we haven't tried so far is that we only have xrt hooked up and the TF infeed ops are not open source, so the existing code doesn't just work. It should be straightforward to hook up the xla service instead, but it's a bit of additional code to write that we haven't gotten to.

UncleOxidant · on Dec 3, 2018

I'm tasked with running several ML algorithms on a new hardware accelerator. Currently there is an LLVM toolchain for that new hardware, but no Python support is expected for a while which means implementing a bunch of ML code in C or maybe C++ (not a very pleasant prospect). I'm wondering, since Julia has an LLVM backend would it be possible to emit LLVM IR from Julia which could then be fed into our LLVM toolchain?

One thing that comes to mind here: does Julia use some kind of primitives for various things like matrix multiplication that might be difficult to export at the LLVM-IR level?

KenoFischer · on Dec 3, 2018

Yes, Julia tends to be quite good at this kind of thing. Which level you want to operate at will depend on the details of the accelerator. Happy to give some pointers if you can give me a rough idea of the target architecture and what software already exists. My email is in my HN profile.

vchuravy · on Dec 4, 2018

Take a look at CUDAnative.jl (https://github.com/JuliaGPU/CUDAnative.jl) which uses the NVPTX LLVM backend to compile Julia code to the GPU. What you described sounds very similar to that and could definitely made to work.

byt143 · on Dec 4, 2018

I believe Julia has generic kernels for many linear algebra ops (like matrix multiply) that the compiler can use to generate code for different backends and types as a fallback to BLAS.

orbifold · on Dec 4, 2018

I would not dismiss C or C++ as viable avenues of progress. There are viable C++ / C only deep learning libraries https://pjreddie.com/darknet/, available that are easily ported and embedded C++ can be pretty pleasant to write. We currently use C++14 on a embedded Power processor with custom vector unit and 16kB of SRAM and it works like a charm.

shafte · on Dec 4, 2018

I'd be interested in a direct comparison with similar efforts undertaken by existing frameworks; for example Torch Script[1], which aims to produce a language which shares a syntactic frontend with Python while getting all the goodies that ahead-of-time compilation gives you (symbolic diff, operator fusion, etc).

Seems to me that the primary challenge for any "next-generation" framework or language is getting people to actually use the thing. Sharing a front-end with Python and a backend with PyTorch seems like a good way to bootstrap that.

[1] https://pytorch.org/docs/master/jit.html?highlight=torchscri...

lostmsu · on Dec 4, 2018

I wonder if this feature is in any way different from LINQ for expressions?

In C# you can say

  Expr<Func<float,float>> sin = Math.Sin;

And then write a function

  Expr Derivative(Expr expr) => ...

Which will take the above sin, and compute its derivative as another Expr, which can be later compiled using Expr.Compile()

In C# this has been introduced to make SQL bindings.

So far, the only difference I see is that in C# there's a distinction between expression trees and functions themselves, but in Julia there's not.

StefanKarpinski · on Dec 4, 2018

This isn't really a blog post about a specific language feature so your question doesn't make too much sense to me, which may be why it hasn't gotten any answers. In general, being able to map specific primitives to their derivatives is not sufficient for AD, although I'm sure AD is possible in C#, it's just considerably more involved than that.

pjmlp · on Dec 4, 2018

I love the work being done in Julia, as competition is good, and maybe it would make Python community be more supportive of the ongoing JIT attempts.

Myrmornis · on Dec 4, 2018

> Meanwhile, the idea of ML models fundamentally being differentiable algorithms – often called differentiable programming – has caught on.

> We need a language to write differentiable algorithms, and Flux takes Julia to be this language.

Recently on HN there was some discussion of this paper by Conal Elliott on automatic differentiation in a pure functional language (Haskell): https://arxiv.org/abs/1804.00746

This is a rather large and vague question but I'm curious whether people have comments on the relative merits of Julia vs a pure functional language for supporting "differentiable programming" for ML?

StefanKarpinski · on Dec 3, 2018

Also posted here: https://hackernews.hn/item?id=18593453. Maybe a mod could combine the two?

tehsauce · on Dec 3, 2018

This looks awesome, and makes me wonder if anyone done any real-time graphics experiments with Julia? With this great AD and gpu support, I would love to try using this on some graphics applications!

byt143 · on Dec 4, 2018

Check out Makie https://github.com/JuliaPlots/Makie.jl

And the cool thing is, Julia's wonderful generic programming facilities mentioned in the blogpost are used to make Makie generic to backend (GL, WebGL, cairo etc). Relies on this library https://github.com/JuliaPlots/AbstractPlotting.jl

Here's a cairo version of makie https://github.com/JuliaPlots/CairoMakie.jl

glemmaPaul · on Dec 4, 2018

This is gonna be very interesting when this can be combined with a distributed network of specialized CNNs for highly specialized tasks (if there isn't already)