I really wish the Julia ecosystem would stop assuming that you always interact with your computer through the Julia REPL and start supporting proper command line interfaces. This is one of the big annoyances and mistakes of the R ecosystem, and I think it's unwise to carry that mistake over to Julia.
Also, big "ugh" to browser-based tooling. I want to browse webpages in my browser, I don't want to do my data science work there. We don't even have a good native client for Jupyter notebooks yet, let alone for this new Jupyter alternative that doesn't support the existing Jupyter kernel protocol.
In short: nice idea, but I'd rather see continued unification around Jupyter and a proper IDE that can at least emit and interact with Jupyter-compatible data.
On the other hand, the Jupyter notebook JSON format is bad for a variety of reasons (e.g. you need special tools for readable Git diffs) and I really wish we had all settled on R Markdown instead. But R has its own NIH tooling problem and nobody was ever going to adopt it because the R community itself (driven by RStudio) has little interest in sharing or interoperability with other languages.
Confession: after doing Data Science work for the past 4 years I STILL don't really understand why people like Jupyter.
R was my first programming language and I got really spoiled by RStudio, where everything "just works" and the "highlight code -> run in REPL" workflow is super smooth and tightly integrated. All I want is for that to work in other languages, but it seems like if you want it in Python you need to be running PyCharm or a similarly heavyweight IDE (seriously, despite all the VSCode hype there are still a ton of issues with just highlighting code and running it in an IPython terminal), and for Julia it just doesn't exist. If you really want a Jupyter-like workflow you can just use R Notebooks, which are literally just better in every way.
Well, R isn't the best language when it comes to building systems. Most R code is essentially one file written to produce an output once (for a paper, project, etc.). This means that people want a better language for building systems, which Python fits. That explains why people moved to Jupyter.
I don't like RStudio for the same reason I don't like Matlab. I already have my editor and terminal workflow. I don't want to use/learn a new tool for the privilege of using the language. Notebooks hit an acceptable middle ground where I can launch them via terminal. Notebooks have plenty of problems, though. Mainly, it's incredibly dumb that running cells out of order is even possible. This same problem is present in the RStudio workflow you seem to enjoy (highlight and run in the REPL), and you want it in other languages. If the code isn't written to run in a given order, a tool shouldn't allow it.
> Well, R isn't the best language when it comes to building systems. Most R code is essentially one file written to produce an output once (for a paper, project, etc.). This means that people want a better language for building systems, which Python fits. That explains why people moved to Jupyter.
I definitely agree that Python is a better general purpose computing language than R, but R's deployment story (i.e. packages) is much, much better than that of Python (pip/poetry/pipenv/conda/whatever came out this week). I honestly don't think that's the reason though, it's more that Python has much, much, much better developer mindshare.
Jupyter is a whole other world though. IPython was the best thing ever as a proper REPL for Python, and Jupyter was good for being able to do graphics with your code. That was all standard in the R world with Sweave (which I wrote my thesis in), so it didn't appear to add a lot of value (to me, at least).
> I don't like RStudio for the same reason I don't like Matlab. I already have my editor and terminal workflow. I don't want to use/learn a new tool for the privilege of using the language.
I am 100% with you on this, but RStudio is just a nicer interface over the tools for literate programming in R, and the wonderfulness of Rmd vs ipynb is a thing of joy (to me, at least).
> Mainly, it's incredibly dumb that running cells out of order is even possible. This same problem is present in the RStudio workflow you seem to enjoy (highlight and run in the REPL), and you want it in other languages. If the code isn't written to run in a given order, a tool shouldn't allow it.
So, this is a tricky one. I agree in principle, and I have a habit of continually re-running my documents to ensure that this doesn't cause problems, but there are definitely valid use cases for out-of-order execution. Consider that you may often fit a model (which can take ages) and iterate on the visualisation/analysis code; you don't want to re-run the modelling code every time you change a plot, which your solution would require.
Most of the tools claim to allow you to cache particular blocks, but I've never been able to get it to work reliably.
Yeah, I find that the out-of-order execution issue is common with people who have a software development mindset, but for data analysis/science it's basically the only sensible way to work. The "load data" command might be one line but take 3 minutes to run, while a huge chunk of code that plots the data might take 1 second, and I might want to tweak it 50 different ways before settling on something that I like/delivers insight. Producing a standalone script that develops the same insight you get from "playing" with the data is an afterthought in some cases.
As long as you're aware of the dangers, it's fine. Personally I try to keep modelling separate from analysis to avoid this issue, and set `:eval no` in Org for those cases where I've built the model inline with the analysis.
Unfortunately, it generally takes a couple of terrible situations before people learn the problems with this.
I agree that data analysis needs a tool that persists data while iterating over certain functions. But in this vein, said tool should aim to prevent the user from having to run the load_data() function more than once, not encourage it by allowing someone to permanently manipulate the output of load_data().
This is an option in many tools, but it doesn't tend to work that well in practice.
I do agree that this is the ideal though. (For example, if Pluto is always reactive, this workflow becomes much more difficult, since changing anything the model depends on will re-run the model.)
Spyder is basically an RStudio clone for Python, but I never had a great experience using it. Not really sure why, somehow I just ended up using Jupyter because that's what my coworkers all used. When doing "solo" stuff, it doesn't matter because I dislike every interface so I'm never happy anyway...
Pluto notebooks are Julia scripts, usable at the command line.
Edit: Pluto uses Julia's package manager; moreover, Manifest.toml can be used to pin all of your project's dependencies so the notebook is repeatable, from a code perspective.
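For anyone who hasn't looked inside one: a Pluto file is plain Julia source with structured comments, roughly like this (the version header and cell IDs below are illustrative):

```julia
### A Pluto.jl notebook ###
# v0.14.0

using Markdown
using InteractiveUtils

# ╔═╡ 0b3c5a2e-1111-4a2b-9c3d-000000000001
data = [1, 2, 3]

# ╔═╡ 0b3c5a2e-1111-4a2b-9c3d-000000000002
total = sum(data)

# ╔═╡ Cell order:
# ╠═0b3c5a2e-1111-4a2b-9c3d-000000000001
# ╠═0b3c5a2e-1111-4a2b-9c3d-000000000002
```

Since Pluto saves the cells in execution order, `julia notebook.jl` runs it top to bottom like any other script.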
How does RStudio have little interest in interoperability with other languages? They produce the reticulate package[1] to allow calling Python code from R, they have added support for Python to RMarkdown and RStudio[2], they let you host Python apps on their RStudio Connect product[3], and they sponsor Ursa Labs to work on the Arrow project for easy data interchange[4].
To me this seems like an improvement in the direction that you want, in particular that notebooks are reactive. All too often I get a Jupyter notebook from someone else and try to run it on my machine only to find that some intermediate step does not work any more, because the original developer ran something out of order or removed a critical step. A reactive notebook seems more likely to still work after a lot of changes are made while experimenting.
> I really wish the Julia ecosystem would stop assuming that you always interact with your computer through the Julia REPL and start supporting proper command line interfaces.
What does it even mean? What is a CLI for a programming language if not a REPL?
I also do not really get the complaint, but it is along the lines of people wanting to write `julia-pkg install Pluto` instead of `julia -e 'using Pkg; Pkg.add("Pluto")'`. It seems it is a big pet peeve for some people.
Yes. I agree with this complaint. The REPL is useful in some cases but in general I avoid interacting with it whenever possible. My impression is that workflow is highly task-dependent (perhaps obvious) but there are many of us who just want to write a script, run the script, and repeat.
The `-e` thing gets very messy quickly if you need to pass non-trivial data from the outside world into the application (have you ever tried to "parameterize" a Sed script?). It also doesn't compose well with other CLI tools.
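A sketch of the alternative (a hypothetical script, nothing assumed beyond Julia's built-in `ARGS`): parameters come in as arguments instead of being spliced into a shell-quoted one-liner:

```julia
#!/usr/bin/env julia
# Hypothetical CLI script: read parameters from ARGS rather than
# interpolating them into a `julia -e '...'` string.
function main(args)
    if isempty(args)
        println(stderr, "usage: greet.jl <name>")
        exit(1)
    end
    println("Hello, $(args[1])")
end

main(ARGS)
```

That composes with pipes, xargs, cron, etc. like any other executable, modulo the startup time discussed elsewhere in the thread.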
I think these are two perfectly reasonable things to be annoyed by.
The package manager that comes with Julia is actually way better than what is available in Python, and it has unmatched "foreign language dependencies" support. It just happens to be mostly used from the REPL, not the command line (hence the `-e` execute flag above).
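To be fair, the script-mode spelling isn't much longer than a dedicated CLI would be; a sketch (the package names are just examples):

```julia
using Pkg
Pkg.add("Pluto")      # same operation as `]add Pluto` in the REPL's pkg mode

# Binary / foreign-language dependencies ship as versioned "JLL" packages
# backed by prebuilt artifacts, e.g. a pinned zlib:
Pkg.add("Zlib_jll")
```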
> > I really wish the Julia ecosystem would stop assuming that you always interact with your computer through the Julia REPL and start supporting proper command line interfaces.
> What does it even mean? What is a CLI for a programming language if not a REPL?
I guess they mean that the julia interpreter should be a good unix citizen (which it currently is not). For example, while you can in theory create "julia scripts" by adding a julia shebang, this usage is not really well thought out and has several friction points. Most notably, a very slow startup time, even of several seconds if you import some common packages. This makes said julia scripts essentially unusable.
The usual response of the julia community to these complaints is that "you are holding it wrong", and that you should use julia inside the proper REPL. Some people do not like this answer, and there's a tiny bit of drama around that.
I think there needs to be a distinction between, on the one hand, Julia's startup time, which is an inevitable consequence of its compilation model and unlikely to change, and, on the other hand, the lack of command line functionality in Julia, e.g. around the package manager. The latter is much easier to amend.
But is this "compilation model" inherent to the language? It seems to be an implementation choice. One could conceivably build an independent interpreter for the same Julia language with fast startup.
Julia already comes with an interpreter, try starting your session with `julia --compile=min`.
One part of the ongoing effort to reduce latencies is to allow package authors to specify optimization levels on a per-module basis. This is great for plotting packages, for example, since they usually don't benefit much from overly aggressive optimizations, so spending less time optimizing code generally leads to a snappier experience. It is now even possible to opt into a module-specific fully interpreted mode, which can make a lot of sense for typical scripting tasks.
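For the curious, the per-module annotation looks roughly like this (module name made up; the commented-out line is the Julia 1.6 experimental spelling as I understand it):

```julia
module MyPlotting

# Lower the optimizer effort for this module only (Julia 1.5+);
# plotting code is latency-bound, not throughput-bound.
Base.Experimental.@optlevel 1

# Julia 1.6 can go further and mostly interpret a module:
# Base.Experimental.@compiler_options compile=min optimize=0 infer=false

quickplot(xs) = println("pretending to plot $(length(xs)) points")

end
```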
Plenty of people use the REPL in a terminal plus Sublime Text or Vim or whatever. I also dislike browser-based tooling and think Julia has done a good job avoiding RStudio-style dependencies.
But if your point is the inability to do `julia script.jl`, yeah, that's a pain point. Fortunately there has been some tooling to make running many jobs in a row easier: https://github.com/dmolina/DaemonMode.jl
How is it that I do `julia script.jl` all the time? Or by “inability” do you mean that it’s slow because of the startup time? If you need a utility that starts up instantly, create a sysimage.
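With PackageCompiler.jl that's something like the following (file names are placeholders, and you write the warmup script yourself):

```julia
using PackageCompiler

# `warmup.jl` is a script that exercises the code paths you care about;
# whatever it compiles gets baked into the image.
create_sysimage([:Pluto];
                sysimage_path = "pluto_sysimage.so",
                precompile_execution_file = "warmup.jl")
```

and then you launch with `julia --sysimage pluto_sysimage.so`.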
In contrast to interpreted languages, creating a sysimage is yet another step (in addition to installing a third party package).
In contrast to AOT-compiled languages, PackageCompiler.jl doesn't statically analyze your code. So you need a "precompile script" that hopefully hits all callable methods (such a script will have to be made manually). The resulting "binary" is also massive.
This preference seems to depend a lot on where you come from. Having come from Scheme / Lisp (same as some of the original Julia developers AFAIK), I find I prefer the REPL in Emacs for coding. I do use Jupyter quite a bit for running simulations, doing data analysis, etc. For me, the main reason to use Jupyter has been (i) interacting with sessions on remote machines without needing to bother with X, and (ii) being able to easily incorporate LaTeX and share whole documents (math + working code) to collaborators and students.
Is Julia different from Python in this regard? I use Python mostly by executing scripts, but it’s nice to have the REPL and IPython and Jupyter. With Julia I’m free to just run “julia script.jl”, aren’t I? There’s probably more to your complaint than I naively realize, though. Maybe Python has better IDE support?
Python has a decent command line argument parser in its standard library, and there are several even-better options in the 3rd party library ecosystem, e.g. https://pypi.org/project/click/.
I wish there was a plain text format as a base that everyone agreed on, no matter what UI or backend is used; that would suddenly make it usable in any text editor, and people could build tools and plugins that "just work" whether Jupyter or something else is used.
The closest we got was the org-mode file format, with human-readable data for everything, but it seems tightly coupled to Emacs unless you only want to use it as a Markdown replacement.
I use PyCharm's Jupyter plugin and it has seemed far better for me. I work in Python every day, but I'm more on the data engineering and app security/architecture side of things than straight-up Data Science. I don't use notebooks as often as I'd like, but I live in PyCharm.
Is VSCode's Jupyter extension much better than PyCharm's? I prefer PyCharm over VSCode for normal Python dev work, by a lot, but I get that it's personal preference, so I'm curious.
I tried this for the first time the other day and it was a great experience. Ironically the most cumbersome part continues to be Python environment management. I'll spare you my usual rant about that, but hopefully by Python 4 they'll find a solution.
Hear, hear! A simple web view inside a native application window is a huge improvement imho. If only JupyterLab provided a simple interface to access menu elements as well, you could easily have a nearly complete native experience.
I used Pluto for last year's Advent of Code. It's extremely good for these sorts of problems — rapid iteration with modest computational requirements.
Think of something you might use a spreadsheet for — Pluto has a similar feeling of instant feedback.
---
Some features that are missing:
- Some things are difficult to do with the keyboard; I used my mouse more than with other tools. The author doesn't like modal editing, but ideally these could be implemented with modifier keys (https://github.com/fonsp/Pluto.jl/issues/65)
- It's hard to understand what happens _within_ a cell — logging goes to the terminal rather than the notebook — and there aren't many introspection tools. This is an environment where transparency / introspection would be particularly helpful.
---
Pluto doesn't solve every problem, or completely replace notebooks; to respond to a couple of comments:
> I have many extremely long notebooks that would almost certainly crash if you tried to recompute the whole thing
Right, don't use Pluto for that! It's not one environment to rule them all.
> Many of the cells won't work at all because the inputs are long gone
That seems bad! Pluto will help you ensure that doesn't happen.
I have played around with Pluto.jl, and colleagues of mine use it for research, but I keep going back to Jupyter. I tend to have long running cells that are pulling information from external sources or training models, and triggering one of those cells accidentally will waste a lot of time running something that may not be reliably interrupted.
There is talk about putting in execution barriers that would help with this, at the risk of making Pluto more complicated for users:
The fact that Pluto only runs dependent cells on changes mostly solves this for me. For example, a cell can load things into the variable data, and then another cell can apply a function f(data). If I alter f, data is not reloaded and f(data) automatically runs.
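In cell terms, with a hypothetical `expensive_load`:

```julia
# Cell 1: only re-runs if this cell itself is edited
data = expensive_load("measurements.csv")   # hypothetical, slow

# Cell 2: editing `f` re-runs this cell and cell 3, but not cell 1
f(d) = sum(d) / length(d)

# Cell 3: the dependent result, updated automatically
f(data)
```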
That is fine if you are working sequentially, but often tasks involve going back to the original data and doing some wrangling.
data -> model(data) -> output(model)
So if you go back to mess around with the data, your model and output would be recomputed, which you will need to do eventually, but not while making iterative tweaks.
Another commenter suggested adding checkboxes, which is a good idea, although then you are managing a bunch of checkbox states.
> So if you go back to mess around with the data, your model and output would be recomputed, which you will need to do eventually, but not while making iterative tweaks.
On the other hand, not everyone remembers to re-run dependent cells. I've had many R notebooks handed in to me where the author didn't check that it runs top to bottom with a fresh workspace.
I think the ideal user-friendly system would switch between automatic and manual recomputation depending on expected time of recomputation and expected time until the user triggers another recomputation (and clearly indicate which cells need recomputation to make them reflect the latest state of the system). If you’re editing a file path, for example, you don’t want the system to read or, worse, write that file after every key you press. Similarly, if you change one cell and within a second start editing a second one, you don’t want to start recomputation.
So, if the system thinks it takes T seconds to compute a cell, it could only start recomputation after f(T) seconds without user input.
Finding a good function f is left as an exercise for the reader. That’s where good systems will add value. A good system likely would need a more complex f, which also has ideas about how much file and network I/O the steps take and whether steps can easily be cancelled.
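A strawman f, just to make the idea concrete (all the constants are invented):

```julia
# Wait longer before auto-running slow cells, clamped to a humane range.
# `T` is the cell's last observed runtime and `idle` is the time since
# the user last typed; both in seconds.
debounce_delay(T) = clamp(0.5T, 0.2, 10.0)
should_recompute(T, idle) = idle >= debounce_delay(T)
```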
For the general case, I am pretty sure what you describe is the halting problem [1]. This does not mean that I believe some approximation is impossible (your “write to file” comment is particularly true). Just feeling the need to highlight that a clean, general solution is most likely not something that gets done in an afternoon.
Yeah, that “left as an exercise” was tongue-in-cheek. Even past executions do not tell you much. Change “n=10” to “n=12”, and who knows what will happen to execution times? The code might be O(n⁴), or contain an “if n=12” clause.
Looking at the world's best reactive system, I think it never automatically fetches external data, only recalculates stuff it knows it can cancel, and also has a decent idea about how much time each step will take.
Working in a nonlinear manner is the whole point of Pluto. You can modify some intermediate processing in the script and none of the upstream cells, like loading the data, will run again. I also don't need to fish through the whole damn notebook to run all the cells my change impacts. If you really, really don't want downstream stuff to run, you can either do some of the button tricks the other comments mentioned or copy (a subset of) the data. Usually I find I want to see the results of my change on everything downstream, though.
FWIW I've significantly improved my experience by breaking up my notebooks into smaller pieces such that each notebook only does "one thing", while using DVC to run them and keep track of intermediate results. Or, in a case where the intermediate result was itself somewhat "exploratory", having the notebook itself check for the existence of an intermediate result and load it from disk instead of recomputing it.
Execution barriers are a nice idea though. There is/was a Jupyter notebook extension for "initialization cells", but the whole notebook extension ecosystem seems kind of dead and it's unclear if JupyterLab will ever have equivalents.
I'm always impressed by the quality of the Julia ecosystem. It seems to be in that sweet spot with sufficient use & contribution to be viable, but not so popular that quality suffers.
I love Julia and part of its charm is that everything is relatively new and so quite consistent, also helped by the community ethos and technical features that aid composition.
Python and R (especially R) have plenty of libraries that are high-quality, or even industry standard, but which are decades old and feel it. Python's NLTK is 20 years old, for example, and it can feel grating switching between NLTK and spaCy. R has three different object systems (four according to some), so you might be using some ancient battle-tested library alongside Hadley Wickham's latest cutting-edge libraries.
I don't get why people dislike reactivity. This feature alone makes Pluto superior to Jupyter. If you don't want recomputation of some dependent cells there are easy ways to avoid that. But there are no easy ways to add reactivity to Jupyter.
Besides that, Pluto can bind UI elements to your code. You can make simple interactive games that run in Pluto! How is that not awesome?
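For anyone who hasn't seen it, binding looks roughly like this (`Slider` comes from the PlutoUI package; each snippet below is its own cell):

```julia
# Cell 1:
using PlutoUI

# Cell 2: a slider whose current value is bound to the variable `n`
@bind n Slider(1:100)

# Cell 3: re-runs reactively every time the slider moves
"n = $n, n squared = $(n^2)"
```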
For those that are put off by the "weird" cell execution behavior, there is also https://github.com/compleathorseplayer/Neptune.jl, a non-reactive fork of Pluto that keeps basically all of Pluto's benefits, including multi-line cells without `begin` blocks, but drops the reactive behaviour. Running code blocks with inline results in VSCode also has some notebook feel to me.
Why would someone use Neptune instead of just using Jupyter? I see how Pluto has a new value proposition that Jupyter lacks (reactivity), but it looks to me like Neptune simply removes that value.
It's also an unmaintained fork. Forked in February and hasn't had a source commit since. None of the upstream patches are getting pulled into the fork. It just keeps updating its README and posting more advertising. If someone wants to do this project, they should do it correctly, but this is just not how you do that. You'd need to keep floating your patches over a changing master, not just freeze an old version and force all packages onto older, unpatched versions (HTTP.jl), etc.
It's funny because this is probably a really non-standard sentiment, but I really wish that they would make an Electron app out of this. Installing it is reasonably easy, but definitely beyond a lot of people who could get value from it.
I like the idea of Pluto, because I cannot stand the non-deterministic cells of Jupyter notebooks anymore. Reading this page is like having sex with someone you love. Where has Pluto been all this time? I have finally found all what was missing for a complete life! There's even things that I didn't know I needed because I didn't even have the language to express them! This is my favorite page on the internet and Pluto is my favorite thing ever. I can see no downside to this, no defects, even with a conscious effort to do so.
Yet, trying Pluto, it seems to be outrageously slow and clunky. Is it expected? Sometimes it takes a few seconds to do something. I'm not talking about the initialization (which is still a shame, but that's a different issue). I'm talking about running individual cells with simple code. This is unusable as of today, at least on my 3-year old laptop.
Pluto is quite fast for me - could you perhaps be hitting the first-run JIT startup time in Julia? Do the cells re-evaluate quickly, after whatever code they depend on has been JITted?
I'm talking about my second run of the notebook. On the first run it took one minute and a half just to open the notebook (it seemed it was downloading stuff, and then compiling).
I'm using 1.5.3. Gonna update to 1.6 and see what happens.
EDIT: just running it on julia's "master" branch (v1.7.0-DEV), the initialization seems to be slower, but then the cells run maybe marginally faster. Looks good, but I could not push this to my students yet...
I wouldn't recommend master to anyone but the experts in Julia; 1.6 is pretty stable though. Do remember that in Julia the second execution of a piece of code is much faster than the first. Thus, it may take some time to initialize, but after that it will run smoothly.
But yes, my tolerance towards this is higher as I am used to matlab.
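The classic way to see that first-call compilation cost for yourself:

```julia
f(x) = sum(abs2, x)

@time f(rand(10^6))   # first call: time includes JIT-compiling f
@time f(rand(10^6))   # second call: reuses the compiled method
```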
> my tolerance towards this is higher as I am used to matlab.
You may like Octave for that (my daily driver, which I want to replace with Julia in the future). It implements the same language as Matlab, but it is free software and, crucially, the startup time is negligible. For example, you can run Octave scripts inside a bash loop, as if it were a calculator.
An 'Excel' that is less opaque, easier to test and debug and backed by a more sane and powerful language is what a lot of the world is clamoring for. So yea, that would be great.
LOL, how often do you want your entire notebook to recompute just because you change something somewhere? Have you never tried pursuing a little side experiment in an existing notebook, or have ten abandoned false starts leading to one good result? I have many extremely long notebooks that would almost certainly crash if you tried to recompute the whole thing, and many of the cells won't work at all because the inputs are long gone. Some of these notebooks are years old. The datasets they have in memory aren't saved anywhere else. What possible motivation do I have to lose all of this precious state?
If I wanted a software-grade, rock-solid data pipeline, I would just copy-paste some code from an existing notebook and run it on Papermill.
> Some of these notebooks are years old. The datasets they have in memory aren't saved anywhere else.
That sounds dangerous to me. If your computer crashes or you introduce a bug to your notebook, you could lose all that data. Personally, I prefer my notebooks to be reproducible at any point.
These are usually small aggregates and summaries, so I just display them in notebook output. It does make it take a bit longer to scroll through the notebook to find something, but that's what being disciplined with organization is for.
Sorry, I'm not sure I'm following your argument. Are you saying your notebooks hold state that's easily reconstitutable, and so it's not actually such a big deal to regenerate your "precious state"?
No worries, apology accepted! You misunderstood what I wrote: parts of my state are small enough for me to print() them in a cell and use the output as reference.
The whole notebook doesn't recompute; only cells that depend on the cell that changed do. This is extremely powerful, because you never end up with stale cells showing incorrect values.
This is extremely counterproductive, because I want the results you're calling "stale" to use as reference or inspiration. I don't want to destroy old results just because I changed some parameter value to test an idea.
It improves reproducibility, consistency, and sharing, but reduces convenience for some operations. It's a trade-off in favor of programming in the large.
If you don't want to recompute dependent nodes, then use new names for your experiments rather than redefining old functions and variables. Yes, in some ways this is less convenient for you, but it's more convenient for people receiving your notebooks, that the notebook is always in a consistent state and reproducible.
Maybe it doesn't work well for your workflow, particularly if you're not sharing notebooks and you keep your notebooks small. On the other hand, if your workflow involves routinely leaving notebooks in an inconsistent state, you may find it saves you significant frustration with larger notebooks, where losing track of inconsistencies means losing work.
Also, if you hit a state that you really don't want to lose, you should probably do a quick git commit. You can always squash commits later if needed.
It might be worth changing your workflow, or it might not.
I think this is the interesting point though. Many people want to use Jupyter notebooks so that it looks reproducible, not to make it actually reproducible. God forbid it actually has to be re-run, it could have different results!
I think that's my main notebook gripe: they make it look like if you run the code you'll get these results, but that's not even close to the case. Many people abuse this. At this point, I pretty much assume anything in a Jupyter notebook isn't reproducible.
If you’re so attached to that data, you should probably do something to save it other than let it sit in RAM or maybe an old plot in a random notebook.
Then instead of reassigning new data to foo, just assign new data to foo2. You can still use the notebook to experiment, what you are doing is removing ambiguity.
> how often do you want your entire notebook to recompute just because you change something somewhere?
This is exactly what I want, always. In Jupyter I'm continuously doing the restart-kernel-and-re-run-all-cells dance. It is annoying, and I'd love another system optimized for that, like Pluto, without those stupid non-deterministic cells.
Also, big "ugh" to browser-based tooling. I want to browse webpages in my browser, I don't want to do my data science work there. We don't even have a good native client for Jupyter notebooks yet, let alone for this new Jupyter alternative that doesn't support the existing Jupyter kernel protocol.
Not only that, but Pluto also apparently has some obnoxious UX limitations that remind me of other less-than-usable wannabe-Jupyter-notebooks (e.g. Apache Zeppelin, Databricks): https://towardsdatascience.com/could-pluto-be-a-real-jupyter...
</cynical-angry-rant>