Delegating complex treatments to filters in shell programs (2015) (michipili.github.io)
18 points by pwg on May 12, 2017 | 5 comments


Yes, the criticised code is definitely bad. It glues two user-defined constructs together (an is-in-list test and a variable), and it uses unquoted variables, relying on shell word splitting...

But you don't necessarily need to shell out to awk / perl / grep / whatever.

    is_in_list() {
        local thing=$1
        shift
        while [ $# -gt 0 ] ; do
            [ "$1" = "$thing" ] && return 0
            shift
        done
        return 1
    }
Now just do `is_in_list "$thing" "$@"` where the positional array `"$@"` is the list (or by all means use unquoted `$COMPILER_VERSIONS`, i.e. unsafe shell-splitting).

Another possibility, where the list is given as a space-separated string, is a case statement. That approach is logically equivalent to the awk version, but doesn't fork, so it is much more performant.

    is_in_list() {
        case $2 in
        "$1"|*" $1 "*|*" $1"|"$1 "*) return 0;;
        *) return 1;;
        esac
    }
Not tested, but that's the idea.
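A quick check of that sketch (with the first alternative quoted so glob characters in the needle cannot act as a pattern; the version numbers are just made-up sample data):

```shell
is_in_list() {
    # $1 = needle, $2 = space-separated list
    case $2 in
    "$1"|*" $1 "*|*" $1"|"$1 "*) return 0;;
    *) return 1;;
    esac
}

is_in_list gcc-5 "gcc-4.9 gcc-5 clang" && echo yes    # element in the middle
is_in_list gcc   "gcc-4.9 gcc-5 clang" || echo no     # substring only: no match
```

Note that the word-boundary patterns correctly reject "gcc" even though it is a prefix of "gcc-4.9" and "gcc-5".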


Bash is all about shelling out: it's a shell. I would happily use grep here, and if I had an array:

  printf "%s\0" "$@" | grep -qzFx needle
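Wrapped as a function, that could look like the sketch below (GNU grep assumed for `-z`; `-F` and `-x` make it an exact-element match rather than a substring search):

```shell
is_in_list() {
    # NUL-separate the list elements so they may contain spaces or newlines
    needle=$1
    shift
    printf "%s\0" "$@" | grep -qzFx -- "$needle"
}

set -- gcc-4.9 gcc-5 clang
is_in_list gcc-5 "$@" && echo found      # exact element: match
is_in_list gcc   "$@" || echo not-found  # substring only: no match
```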
> much more performant

I believe that performance-critical code shouldn't be written in bash. I have seen pipelines be much faster than tweaked bash functions, simply because they run in parallel & thus the cost of forking is paid once, up-front.


I agree with you. I think we are just speaking about different things.

I was thinking more of the things that happen in a configure script, for example; not a "hot loop" like a big grep over millions of lines or something.

There are situations in shell code where the difference between a case-match or string-suffix-replace written in shell and a process spawn matters. Not only because forking a process for one simple string operation is wasteful (cost: about 1ms), but also because of the semantic problems that come with child processes (error handling).
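For instance, a suffix replacement that is often delegated to `sed` can be done with POSIX parameter expansion, in-process (the file names here are just illustrative values):

```shell
file="src/main.c"

# pure-shell suffix replace: no fork, and any failure is local to this shell
obj="${file%.c}.o"
echo "$obj"    # src/main.o

# the spawning equivalent pays the fork/exec cost on every call,
# and errors surface only through the child's exit status:
#   obj=$(printf '%s\n' "$file" | sed 's/\.c$/.o/')
```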


There's another advantage to using filters in pipelines: each part of the pipeline runs in parallel!

I recently rewrote some bash code to replace

  for subdir in $names
  do
    for parent in $(parents_of "$dir")
    do
      [ -d "$parent/$subdir" ] && …
    done
  done
with

  parents_of $dir | append_subdirs $names | is_dir | xargs …
I expected a modest speed improvement (due to not calculating the parents of a dir repeatedly), but was surprised to see it was twice as fast!
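The filter names above are from the comment; the bodies below are guesses at what they might contain, just to make the shape concrete (`append_subdirs` reading parent paths on stdin, `is_dir` passing through only paths that exist as directories):

```shell
# read parent paths on stdin, emit each "parent/subdir" combination
append_subdirs() {
    while IFS= read -r parent; do
        for sub in "$@"; do
            printf '%s/%s\n' "$parent" "$sub"
        done
    done
}

# keep only the candidate paths that are actually directories
is_dir() {
    while IFS= read -r path; do
        if [ -d "$path" ]; then
            printf '%s\n' "$path"
        fi
    done
}

# usage shape, as in the comment:
#   parents_of "$dir" | append_subdirs $names | is_dir | xargs ...
```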


I think the author's first sentence regarding "novice" shell programmers may be applicable to the other submission about shell scripts currently on the front page.

For example, in C the idioms often combine several instructions into a single line, i.e., "nesting". Kernighan suggested nesting in an early C tutorial:

   while ( putchar( getchar( ) ) != '\0' )
In the shell maybe it is better to test each "expression" on a line by itself. Or maybe not. I do this anyway.

More often I see on the web that other shell scripters prefer to nest as many commands as they can, perhaps to reduce the number of lines.

For example,

    variable=$(command1 $(command2));
This could be alternatively expressed as something like

   variable1=$(command2);
   # can now test variable1 before proceeding to next line
   variable2=$(command1 $variable1);
The result of nesting is subshells and added complexity that I am not sure reluctant, occasional shell scripters are prepared to think about.
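A concrete illustration of the difference, using `false` as a stand-in for a failing command:

```shell
# nested: the inner command's failure status is discarded,
# because the outer echo succeeds
out=$(echo "prefix-$(false)")
echo "nested exit status: $?"    # prints 0 -- the failure went unnoticed

# stepwise: the failure is visible before the next step runs
if ! inner=$(false); then
    echo "inner command failed"
fi
```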

And if I am not mistaken that was at least part of the problem that Jane Street had in the other submission.



