I have played around with Pluto.jl, and colleagues of mine use it for research, but I keep going back to Jupyter. I tend to have long-running cells that pull information from external sources or train models, and accidentally triggering one of those cells wastes a lot of time on something that may not be reliably interruptible.
There is talk about putting in execution barriers that would help with this, at the risk of making Pluto more complicated for users: https://github.com/fonsp/Pluto.jl/discussions/298
The fact that Pluto only runs dependent cells on changes mostly solves this for me. For example, a cell can load things into the variable data, and then another cell can apply a function f(data). If I alter f, data is not reloaded and f(data) automatically runs.
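To make the behaviour described above concrete, here is a toy sketch in Python of "only dependents re-run". It is not how Pluto is implemented; the `Notebook` class and its methods are hypothetical names invented for illustration, and dependencies are declared by hand rather than inferred from the code.

```python
# Toy model of reactive re-execution. Hypothetical; not Pluto's implementation.
class Notebook:
    def __init__(self):
        self.cells = {}    # name -> (function, names of the cells it reads)
        self.values = {}   # name -> last computed value
        self.runs = []     # log of executed cells, in order

    def define(self, name, func, deps=()):
        """Add or redefine a cell, then run it and everything downstream."""
        self.cells[name] = (func, tuple(deps))
        self._run(name)

    def _run(self, name):
        func, deps = self.cells[name]
        self.values[name] = func(*(self.values[d] for d in deps))
        self.runs.append(name)
        # Re-run only the cells that read this one; upstream cells are untouched.
        # Naive on purpose: a cell with two changed inputs could run twice.
        for other, (_, other_deps) in list(self.cells.items()):
            if name in other_deps:
                self._run(other)


nb = Notebook()
nb.define("data", lambda: [1, 2, 3])            # stands in for a slow load
nb.define("f", lambda: (lambda xs: sum(xs)))    # the function being tweaked
nb.define("result", lambda f, data: f(data), deps=("f", "data"))

nb.runs.clear()
nb.define("f", lambda: (lambda xs: max(xs)))    # edit f only
print(nb.runs)               # ['f', 'result']
print(nb.values["result"])   # 3
```

Editing `f` re-runs `f` and `result`, while `data` is left alone, which is the property the comment relies on.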
That is fine if you are working sequentially, but often tasks involve going back to the original data and doing some wrangling.
data -> model(data) -> output(model)
So if you go back to mess around with the data, your model and output would be recomputed, which you need to do eventually but not while making iterative tweaks.
Another commenter suggested adding checkboxes, which is a good idea, although then you are managing a bunch of checkbox states.
> So if you go back to mess around with the data, your model and output would be recomputed, which you need to do eventually but not while making iterative tweaks.
On the other hand, not everyone remembers to re-run dependent cells. I’ve had many R notebooks handed in to me where the author didn’t check that it runs top to bottom in a fresh workspace.
I think the ideal user-friendly system would switch between automatic and manual recomputation depending on expected time of recomputation and expected time until the user triggers another recomputation (and clearly indicate which cells need recomputation to make them reflect the latest state of the system). If you’re editing a file path, for example, you don’t want the system to read or, worse, write that file after every key you press. Similarly, if you change one cell and within a second start editing a second one, you don’t want to start recomputation.
So, if the system thinks it takes T seconds to compute a cell, it could only start recomputation after f(T) seconds without user input.
Finding a good function f is left as an exercise for the reader. That’s where good systems will add value. A good system likely would need a more complex f, which also has ideas about how much file and network I/O the steps take and whether steps can easily be cancelled.
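As a minimal sketch of the idea, assuming the system already has some estimate T for a cell: the simplest candidate for f is a delay proportional to T, clamped between a floor and a ceiling. The function names, the factor, and the bounds below are all made-up illustrative values, not something any existing notebook system is known to use.

```python
def idle_delay(estimated_seconds, min_delay=0.3, max_delay=30.0, factor=0.5):
    """Hypothetical f(T): seconds of user inactivity to wait before recomputing.

    Proportional to the estimated cost T, clamped to [min_delay, max_delay].
    """
    return max(min_delay, min(max_delay, factor * estimated_seconds))


def should_recompute(estimated_seconds, seconds_since_last_input):
    """True once the user has been idle for at least f(T) seconds."""
    return seconds_since_last_input >= idle_delay(estimated_seconds)


print(idle_delay(0.1))           # 0.3  (cheap cell: react almost immediately)
print(idle_delay(10))            # 5.0
print(idle_delay(600))           # 30.0 (expensive cell: capped wait)
print(should_recompute(10, 2))   # False
print(should_recompute(10, 6))   # True
```

This version ignores everything the comment says a good f would need: I/O cost, side effects such as writing files, and whether the step can be cancelled.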
For the general case, I am pretty sure what you describe is the halting problem [1]. This does not mean that I believe some approximation is impossible (your “write to file” comment is particularly true). Just feeling the need to highlight that a clean, general solution is most likely not something that gets done in an afternoon.
Yeah, that “left as an exercise” was tongue-in-cheek. Even past executions do not tell you much. Change “n=10” to “n=12”, and who knows what will happen to execution times? The code might be O(n⁴), or contain an “if n=12” clause.
Looking at the world’s best reactive system, I think it never automatically fetches external data, only recalculates stuff it knows it can cancel, and has a decent idea of how much time each step will take.
Working in a nonlinear manner is the whole point of Pluto. You can modify some intermediate processing in the script and none of the upstream cells, like loading the data, will run again. I also don’t need to fish through the whole damn notebook to run all the cells my change impacts. If you really, really don’t want downstream stuff to run, you can either do some of the button tricks the other comments mentioned or copy (a subset of) the data. Usually I find I want to see the results of my change on everything downstream, though.
FWIW I've significantly improved my experience by breaking up my notebooks into smaller pieces such that each notebook only does "one thing", while using DVC to run them and keep track of intermediate results. In cases where the intermediate result was itself somewhat "exploratory", I have the notebook itself check for the existence of an intermediate result and load it from disk instead of recomputing it.
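A minimal sketch of that "check for an intermediate result and load it from disk" pattern, using only the Python standard library. The helper name `load_or_compute`, the pickle format, and the file name are assumptions for illustration, not the commenter's actual code.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(path, compute):
    """Return the cached result at `path` if it exists; otherwise compute and save it."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []

def expensive():
    # Stand-in for a long-running step such as training a model.
    calls.append(1)
    return {"weights": [0.1, 0.2]}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "model.pkl"
    first = load_or_compute(cache, expensive)    # computes and writes the file
    second = load_or_compute(cache, expensive)   # loads from disk, no recompute

print(len(calls))        # 1
print(first == second)   # True
```

The obvious weakness is that the cache never notices when its inputs change; you have to delete the file by hand, which is the bookkeeping a tool like DVC takes over.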
Execution barriers are a nice idea though. There is/was a Jupyter notebook extension for "initialization cells", but the whole notebook extension ecosystem seems kind of dead and it's unclear if Jupyter Lab will ever have equivalents.
https://github.com/fonsp/Pluto.jl/discussions/298