> The Taco Bell answer? xargs and wget. In the rare case that you saturate the network connection, add some split and rsync. A "distributed crawler" is really only like 10 lines of shell script.
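(For reference, the pipeline being described is roughly the following sketch; the file names, host name, and parallelism level are my own assumptions, not from the quoted comment:)

    # crawl a URL list with 8 parallel fetchers
    xargs -n 1 -P 8 wget -q < urls.txt

    # "distributed": shard the list and ship shards to other boxes
    split -l 10000 urls.txt shard.
    rsync shard.aa otherhost:crawl/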

As someone who has had to clean up the messes of people who started with this and built many-hundred-line dense bash scripts... please do not do this.

> I made most of a SOAP server using static files and Apache's mod_rewrite. I could have done the whole thing Taco Bell style if I had only manned up and broken out sed, but I pussied out and wrote some Python.

I feel sad for whoever inherited this person's systems.

"Write code as if whoever inherits it is a psychopath with an axe who knows where you live" is something I heard pretty early on in life and it's been pretty useful.



> "built many hundred line dense bash scripts... please do not do this"

Of course. But that's true of any language once you have many hundreds of lines of dense code.

His point is that if something can be done simply with built-in, proven tools, you should use them until you need something more.


No, it's not the same.

Most experienced programmers know a little bash and enough UNIX commands to get by. That's enough to write a script that handles the happy path, but not enough to handle all the error conditions correctly. There are all sorts of tricks you need to know that are commonly skipped (forgetting to use -print0, for example, and that's an easy one). The resulting script is probably okay if you run it interactively and check the output, but it will blow up or silently do the wrong thing on unexpected input in production. To properly review a bash script for errors, you need to be an expert.
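The -print0 case illustrates it well (the file layout here is hypothetical):

    # naive: a file named "my notes.txt" gets word-split into two arguments
    find . -name '*.txt' | xargs rm

    # correct: NUL-delimited, safe for any filename (even with newlines)
    find . -name '*.txt' -print0 | xargs -0 rm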

By contrast, Go programmers with a few months of experience typically know all of Go.

The older tool is not necessarily better if it has lots of obscure sharp edges that most people don't learn.


+1 - If you thought "it works on my machine" was bad with binaries, shell scripts are so much worse.

Just like Excel "programs", though, shell scripts can at least be easily mined for requirements for a real program.


This is a simple UNIX pipeline, not a multi-hundred-line spaghetti of Korn, C, Bash, or even Zsh shell scripts.

No builtins were used in the example, just core utilities deployed the way they were designed.

Reinventing the wheel is completely bogus, doubly so when you ultimately shell out to those same utilities, as is common when 'admins-cum-programmers' start getting their hands dirty with Python.


I'm generally in favor of the OP's idea, but the core utilities do differ across environments, and this will bite you sooner or later.

I was bitten recently by some pretty boring search-and-replace functionality differing between sed on OS X and sed on Debian. I would have had to pass a different argument to sed depending on which version I was running, so I switched to Perl for the task. It's an insidious category of bug, because you don't discover it until you try to run the script in another environment, and then you're potentially stuck debugging the script from the top down.
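The classic instance of this (not necessarily the exact one above; the file name is made up) is sed's in-place flag: GNU sed takes an optional backup suffix glued onto -i, while BSD/macOS sed requires a separate suffix argument, so no single invocation satisfies both:

    sed -i 's/foo/bar/' file.txt        # works with GNU sed, errors on macOS
    sed -i '' 's/foo/bar/' file.txt     # works on macOS, errors with GNU sed
    perl -pi -e 's/foo/bar/' file.txt   # portable wherever Perl is installed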


If you need something that is only used once or for a short time, in one place, script away!

I'm actually strongly in favor of scripts, but the web is eating everything. If it has to scale, put it on the web.


> By contrast, Go programmers with a few months of experience typically know all of Go.

Not really. Just to give some examples of things that new Go programmers don't know: what the limits of JSON serialization are, how reflection works, the functionality and limits of the "virtual inheritance", how the GC handles things like goroutines... There might be a selection bias, though.


Xargs with crawled data sounds like a nightmare. Allow me to link to example.com?$(rm -rf /).


xargs doesn't pass its arguments to the shell; it invokes exec directly:

    $ echo 'test   $(foo   bar)   test' | xargs echo
    test $(foo bar) test
    $ echo 'test   $(foo   bar)   test' | xargs python -c 'print(len(__import__("sys").argv))'
    5
The $ and ( are nothing special to xargs, nor to echo (or wget).
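The injection risk only reappears if you put a shell back in the loop yourself, e.g. (a sketch; urls.txt is assumed):

    # unsafe: each URL is interpolated into shell source, so $(...) executes
    xargs -I{} sh -c 'wget {}' < urls.txt

    # safe: each URL stays an ordinary argv entry
    xargs -n 1 wget < urls.txt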

Not to say that xargs is always great, just that this specific counterexample doesn't hold up.


It's also expensive in resources and, by extension, energy. Unless your data centre is powered by renewables, inefficient code can become an ethical issue.



