Parallel shells with xargs: Utilize all your CPU cores on UNIX and Windows


Introduction

One particular frustration with the UNIX shell is the inability to easily schedule multiple, concurrent tasks that fully utilize the CPU cores present on modern systems. The example of focus in this article is file compression, but the problem arises with many computationally intensive tasks, such as image/audio/media processing, password cracking and hash analysis, database Extract, Transform, and Load (ETL), and backup activities. It is understandably frustrating to wait for gzip * running on a single CPU core, while most of a machine’s processing power lies idle.

This can be understood as a weakness of the first decade of Research UNIX, which was not developed on machines with SMP (symmetric multiprocessing). The Bourne shell did not emerge from the 7th edition with any native syntax or controls for cohesively managing the resource consumption of background processes.

Utilities have haphazardly evolved to perform some of these functions. The GNU version of xargs is able to exercise some primitive control in allocating background processes, which is discussed at some length in the documentation. While the GNU extensions to xargs have proliferated to many other implementations (notably BusyBox, including the release for Microsoft Windows, example below), they are not POSIX.2-compliant, and likely will not be found on commercial UNIX.

Historic users of xargs will remember it as a useful tool for directories that contained too many files for echo * or other wildcards to be used; in this situation xargs is called to repeatedly batch groups of files with a single command. As xargs has evolved beyond POSIX, it has assumed a new relevance which is useful to explore.


Why is POSIX.2 this bad?

A clear understanding of the lack of cohesive job scheduling in UNIX requires some history of the evolution of these utilities.

The shell as defined by POSIX.2 has primitive job control features. This functionality originated from one source, the csh as written by Bill Joy and first distributed in 1978, and has not significantly progressed since that time, even after job control was absorbed by the Korn shell. Below is an example of [c]sh job management as implemented in bash, to which POSIX.2 shells remain constrained. In this session, ^Z and ^C indicate Control key combinations.

$ xz -9e users00.dat
^Z
[1]+  Stopped                 xz -9e users00.dat
$ bg
[1]+ xz -9e users00.dat &
$ xz -9e users01.dat
^Z
[2]+  Stopped                 xz -9e users01.dat
$ xz -9e users02.dat
^Z
[3]+  Stopped                 xz -9e users02.dat
$ jobs
[1]   Running                 xz -9e users00.dat &
[2]-  Stopped                 xz -9e users01.dat
[3]+  Stopped                 xz -9e users02.dat
$ bg 3
[3]+ xz -9e users02.dat &
$ jobs
[1]   Running                 xz -9e users00.dat &
[2]+  Stopped                 xz -9e users01.dat
[3]-  Running                 xz -9e users02.dat &
$ fg 2
xz -9e users01.dat
^C
$ jobs
[1]-  Running                 xz -9e users00.dat &
[3]+  Running                 xz -9e users02.dat &

In the above example, three compression commands have been launched, the second canceled, and the remainder pushed to the background.

To prompt discussion, here is a partial list of the obvious flaws in this design:

  • There is no reporting or allocation of available CPUs to take up jobs as resources become available.

  • Failed commands that return a non-zero exit status or otherwise terminate abnormally are not communicated well. Placing such cases in a failed queue for rerun would be helpful.

  • No global system scheduling of jobs is available. Any user can issue background jobs which overwhelm the machine, either on their own or in concert with others.

While SMP first appeared in computer systems marketed in 1962, and was firmly established with the release of the IBM System/370 that emerged the same year as the birth of UNIX, such powerful machines were not available to the developers in the “poverty” of what is known as Research UNIX. Systems with these capabilities would not become generally prevalent for many years.

“[The] UNIX system did not support multiprocessing… The IBM 3033AP processor met the requirement with approximately 15 times the computing power of a single PDP-11/70 processor.”

It appears that the first SMP-capable UNIX platform was the Sperry/UNIVAC 1100, an internal AT&T port begun in 1977. This port, and the later IBM effort on the System/370, both built upon OS components provided by the vendors (EXEC 8 and TSS), and did not appear to rely on general SMP implemented in the 7th edition kernel.

“Any configuration supplied by Sperry, including multiprocessor ones, can run the UNIX system.”

Since the csh could not have been written on a multiprocessing machine, and the intervening years prior to UNIX System V did not generally introduce SMP, shell job control likewise has no visibility of multiple processors, and was not designed to exploit them.

This lack of progress was cemented in POSIX.2 due to the UNIX wars, where these standards were issued as a defensive measure by a consortium led by IBM, HP, and DEC (among others), locking UNIX System V capabilities upon the industry for all time. For many, innovation beyond POSIX is not permitted.

When POSIX.2 was approved, all the major players had implemented SMP, but no motivation was found to expand the POSIX.2 standard shell beyond System V. This has left x86 server NUMA and embedded big.LITTLE equally underrepresented in any strictly conformant POSIX implementation.

The reason that issuing gzip processes in parallel remains a non-trivial task is due to codified defensive marketing.


GNU xargs

Due to the lack of modern job control within the POSIX.2 shell, one hack is available that provides expanded capability within GNU xargs. Other solutions include GNU parallel and pdsh, not presented here.

The classic xargs utility combines standard input and positional parameters to fork commands. A simple xargs example might be to list a few inode numbers:

$ echo /etc/passwd /etc/group | xargs stat -c '%i %n'
525008 /etc/passwd
525256 /etc/group

This basic invocation is incredibly useful when dealing with a number of files so large that their names exceed the maximum size of a shell command line. Below is an example from an ancient commercial UNIX of xargs used to work around shell memory failures:

$ uname -a
HP-UX localhost B.10.20 A 9000/800 862741461 two-user license
$ cd /directory/with/lots/of/files
$ chmod 644 *
sh: There is not enough memory available now.
$ ls | xargs chmod 644
$ echo *
sh: There is not enough memory available now.
$ ksh
$ what /usr/bin/ksh | grep Version
  Version 11/16/88
$ echo *
ksh: no space
$ /usr/dt/bin/dtksh
$ echo ${.sh.version}
Version M-12/28/93d
$ echo *
Pid 1954 received a SIGSEGV for stack growth failure.
Possible causes: insufficient memory or swap space,
or stack size exceeded maxssiz.
Memory fault
$ /usr/old/bin/sh
$ ls *
/usr/bin/ls: arg list too long
$ ls * *
no stack space

Good luck finding that in the manual.

There is a problem with POSIX xargs, in that it does not cope well with spaces or newlines in the filenames arriving on standard input. The only characters that cannot appear in a UNIX filename are the forward slash (/) and the NUL (zero) byte. A GNU extension, the -0 argument, sets the input delimiter to NUL, which simplifies file processing dramatically and greatly improves safety. GNU find has switches to exploit this feature in a pipeline. Realistically, an xargs lacking -0 is not worth using.
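
For instance, a hedged pipeline of this form (the directory and pattern are placeholders) safely hands arbitrary filenames, including those containing spaces or newlines, to chmod:

$ find /srv/media -type f -name '*.wav' -print0 | xargs -0 chmod 644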

The second major GNU extension allows for parallel processing with the -P # argument. By itself, this will not trigger parallel processing, but when combined with the -L 1 option, each input file is passed separately to the target program, with xargs running no more than the allotted number of process slots at once.
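
Combining these extensions gives a minimal sketch of the approach used throughout this article: each NUL-delimited filename is handed to a separate gzip process, with no more than four running at once (the path and the count of 4 are placeholders):

$ find /srv/logs -type f -name '*.log' -print0 | xargs -0 -L 1 -P 4 gzip -9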

Before launching our first parallel script, verify this program, which reports the number of CPU cores visible to Linux:

$ nproc
4

This number may not reflect only physical cores; it can also include SMT “hyperthreads,” which may be implemented in multiples per core. Some commands do not perform well when run in parallel on threads sharing a single physical core.
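
On Linux, one hedged way to check the topology is lscpu, whose summary reports threads per core, cores per socket, and the socket count:

$ lscpu | grep -E '^(Socket|Core|Thread)'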

Let’s now present a parallel compression script, flexible enough to generate several file formats. It is POSIX-compliant, and runs under Debian DASH and the BusyBox shell.

$ cat ~/ppack_lz
#!/bin/sh

PARALLEL="$(nproc --ignore=1)"

EXT="${0##*_}"

case "$EXT" in
 bz2) CMD='bzip2 -9' ;;
 gz)  CMD='gzip -9' ;;
 lz)  CMD='lzip -9' ;;
 xz)  CMD='xz -9e' ;;
 zst) CMD='zstd --rm --single-thread --ultra -22' ;;
esac

if [ -z "$1" ]
then echo "Specify files to pack into ${EXT} files."
else for x
     do printf '%s\0' "$x"
     done | nice xargs -0 -L 1 -P "$PARALLEL" $CMD
fi

A few notes on this example:

  • The script is configured to use all but one of the CPUs reported by nproc. Depending upon machine load, it might be better to set this manually.

  • The script detects the type of compression to perform from the characters after the underscore (_) in the script’s filename. If the script is named foo_bz2, then it will perform bzip2 processing instead of the lzip selected above by ppack_lz (a link-based example follows these notes).

  • The files to compress, specified as arguments to the script, are emitted by the for loop on its standard output, NUL-delimited, to be scheduled by xargs.
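
As referenced in the notes above, a hedged illustration of the naming trick: links to the same script (the names below are hypothetical) select different compressors without any edits.

$ ln -s ~/ppack_lz ~/ppack_xz
$ ln -s ~/ppack_lz ~/ppack_zst
$ ~/ppack_xz *.dat        # same script, now running xz -9e in parallel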

To observe this script in action, it is helpful to have a (nearly POSIX-compliant) shell function to search the output of the ps command:

psearch () { local xx_a xx_b xx_COLUMNS IFS='|'
 [ -z "$COLUMNS" ] && xx_COLUMNS=80 || xx_COLUMNS="$COLUMNS"

 ps -e -o user:7,pid:5,ppid:5,start,bsdtime,%cpu,%mem,args |
  while read xx_a
  do if [ -z "$xx_b" ]                  # emit the header line once
     then printf '%s\n' "${xx_b:=$xx_a}"
     else for xx_b                      # scan the line for each search argument
          do case "$xx_a" in
             *"$xx_b"*) printf '%s\n' "$(expr substr "$xx_a" 1 "$xx_COLUMNS")" ;;
             esac
          done
     fi
  done
}

With that monitor ready, we can run this script on a few WAV files with a quad-core CPU:

$ ~/ppack_lz *.wav

In another terminal, the xargs that is scheduling these commands is visible:

$ psearch lzip
USER      PID  PPID  STARTED     TIME %CPU %MEM COMMAND
cfisher 29995 29992 16:01:49     0:00  0.0  0.0 xargs -0 -L 1 -P 3 lzip -9
cfisher 30007 29995 16:02:10     0:27  100  2.8 lzip -9 track01.cdda.wav
cfisher 30046 29995 16:02:31     0:05 97.5  1.4 lzip -9 track02.cdda.wav
cfisher 30049 29995 16:02:33     0:04  108  1.2 lzip -9 track03.cdda.wav

As outlined in the xargs parallelism documentation, sending SIGUSR1 to the xargs process increases the number of parallel processes that it schedules, and SIGUSR2 decreases it. Additions take effect immediately, while reductions wait for existing processes to exit.
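
A hypothetical runtime adjustment, using the xargs PID (29995) visible in the psearch output above:

$ kill -USR1 29995    # add one parallel process slot
$ kill -USR2 29995    # remove one slot, once a running job exits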

The form of the xargs command above is constraining, in that the order of the predetermined arguments and the parameter supplied by xargs cannot be adjusted. A more nuanced version, allowing greater scripting flexibility, can be built with the POSIX -I option, but it requires a “meta script” that is generated at runtime.
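
Before presenting the full script, a minimal sketch of -I in isolation with GNU xargs: the replacement string positions each filename anywhere in the command line (the archive/ directory is a placeholder):

$ printf '%s\0' *.log | xargs -0 -I fname mv fname archive/fname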

$ cat ~/parallel-pack_gz
#!/bin/sh

PARALLEL="$(nproc --ignore=1)"

S="$(mktemp -t PARALLEL-XXXXXX)"
trap 'rm -f "$S"' EXIT

EXT="${0##*_}"

case "$EXT" in
 7z)  printf '#!/bin/sh\nexec 7za a -bso0 -bsp0 -mx=9 "${1}.7z" "$1"' ;;
 bz2) printf '#!/bin/sh\nexec bzip2 -9 "$1"' ;;
 gz)  printf '#!/bin/sh\nexec gzip -9 "$1"' ;;
 lz)  printf '#!/bin/sh\nexec lzip -9 "$1"' ;;
 xz)  printf '#!/bin/sh\nexec xz -9e "$1"' ;;
 zst) printf '#!/bin/sh\nexec zstd --rm --single-thread --ultra -22 "$1"' ;;
esac > "$S"

chmod 500 "$S"

if [ -z "$1" ]
then echo "Specify files to pack into ${EXT} files."
else for x
     do printf '%s\0' "$x"
     done | nice xargs -0 -P "$PARALLEL" -Ifname "$S" fname
fi

Above, a call to 7za has been added; it is contained in the p7zip package, which is available on many platforms (Red Hat users can find it in EPEL). The use of 7-zip comes with a few warnings: the program is itself multithreaded (consuming roughly 1.25 to 1.5 cores per invocation) and its memory demands are higher, so the number of parallel processes should be reduced. Furthermore, 7-zip can append to an existing archive (like Info-ZIP, which it is intended to replace); do not schedule multiple 7-zip processes to append to the same target file. The encryption options of 7-zip might be of particular interest in avoiding exposure under security-breach regulations covering backup media.

While the title of this article, “parallel shells,” is technically correct for the usage above, the exec in each generated meta-script replaces its shell immediately, which makes more efficient use of the process table.

With this flexible script in place, we perform a benchmark with pigz, the multithreaded gzip, against 80 files of 2 gigabytes in size (which in this case are Oracle database datafiles, containing table and index blocks at random). The base server is an (older) HP DL380 Gen8, with 8 available processor cores:

$ lscpu | grep name
Model name:          Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz

# time pigz -9v users*
users00.dat to users00.dat.gz
users01.dat to users01.dat.gz
users02.dat to users02.dat.gz
...
users77.dat to users77.dat.gz
users78.dat to users78.dat.gz
users79.dat to users79.dat.gz

real    45m51.904s
user    335m15.939s
sys     2m11.146s

During the run of pigz, the top utility reports the following process CPU utilization:

  PID USER  PR NI   VIRT  RES SHR S  %CPU %MEM    TIME+ COMMAND
11162 root  20  0 617616 6864 772 S 714.2  0.0 17:58.21 pigz -9v users01.dat...

Against this (ideal) benchmark, the xargs script was slightly faster, even running under nice CPU priority, with PARALLEL set to 8 on the same host:

$ time ~/parallel-pack_gz users*

real    44m42.107s
user    341m18.650s
sys     2m47.379s

During the run of xargs-orchestrated parallel gzip, the top report listed all the single-threaded processes scheduled on separate CPUs (note the priority level 30, reduced by nice, compared to 20 for pigz):

  PID USER  PR NI VIRT RES SHR S  %CPU %MEM   TIME+ COMMAND
14624 root  30 10 4624 828 424 R 100.0  0.0 0:09.85 gzip -9 users00.dat
14625 root  30 10 4624 832 424 R 100.0  0.0 0:09.86 gzip -9 users01.dat
...
14630 root  30 10 4624 832 424 R  99.3  0.0 0:09.76 gzip -9 users06.dat
14631 root  30 10 4624 824 424 R  98.0  0.0 0:09.69 gzip -9 users07.dat

In this ideal case, the number of files was evenly divisible by the number of CPUs, which helped parallel xargs defeat pigz; adding one more file would have caused xargs to lose this race.

Parallel versions of bzip2 (pbzip2), lzip (plzip), and xz (pixz) also exist, and zstd itself is able to use all CPU cores when its multithreading is enabled (it was explicitly disabled above with --single-thread). These multithreaded implementations may display different performance characteristics than those obtained with xargs. For 7za, xargs is the obvious method to increase machine utilization.
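
For comparison, zstd’s own multithreading can be re-enabled in place of xargs scheduling; a hypothetical invocation against one of the datafiles above:

$ zstd -T0 --ultra -22 --rm users00.dat    # -T0 spreads one file across all detected cores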

A significant I/O concern of parallel xargs scheduling on rotational media is fragmentation. While this is not a factor on SSDs, it is an issue on conventional storage that should be regularly addressed if possible, as can be observed with this result, matched by inode number:

# ls -li users46.dat.lz
2684590096 -rw-r--r--. 1 oracle dba 174653599 Jan 28 13:30 users46.dat.lz

# xfs_fsr -v
...
ino=2684590096
extents before:52 after:1 DONE ino=2684590096
...

This fragmentation on an XFS filesystem (the native filesystem for Red Hat and derivatives) is obvious, and care should be taken to regularly address fragmentation on filesystems where tools exist to remediate it (i.e., e4defrag, btrfs defrag). On the ZFS filesystem, where no such tools exist, parallel processing should be approached with great care, and only on datasets that lie within pools maintaining ample free space.
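
Where remediation tools do exist, a hedged cleanup pass might resemble the following, run as root with placeholder paths (e4defrag for ext4, the defragment subcommand for Btrfs):

# e4defrag -v /srv/backups
# btrfs filesystem defragment -r -v /srv/backups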

Due to this fragmentation issue, we forgo a parallel unpack, preferring a single-threaded, compression-agnostic approach:

$ cat unpack
#!/bin/sh

for x
do echo "$x"
   EXT="${x##*.}"

   case "$EXT" in
    bz2) bzip2 -cd "$x" ;;
    gz)  gzip -cd "$x" ;;
    lz)  lzip -cd "$x" ;;
    xz)  xz -cd "$x" ;;
    zst) zstd -cd "$x" ;;
   esac > "$(basename "$x" ".${EXT}")"
done
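
A hypothetical invocation, restoring the .dat files produced in the earlier benchmark (the compressed sources are left in place):

$ sh unpack users*.dat.gz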

Finally, this technique can be used with the BusyBox port for Windows, and likely with other (POSIX) shell implementations on the Win32/64 platform that supply GNU-style xargs. The BusyBox shell does not implement nice (remove it from the script), nor does nproc exist within it (set PARALLEL manually). BusyBox fully implements only gzip and bzip2 (an xz applet exists, but it does not implement a numeric quality setting). After refitting the script for bzip2, here is a demonstration on my laptop, testing against a copy of all the Cygwin .DLL files:

C:\Temp>busybox64 sh

C:/Temp $ time sh parallel-pack_bz2 dtest/*.dll
real    0m 58.70s
user    0m 0.00s
sys     0m 0.06s

C:/Temp $ exit

C:\Temp>dir dtest
 Volume in drive C is OSDisk
 Volume Serial Number is E44B-22EC

 Directory of C:\Temp\dtest

02/02/2021  11:10 AM    <DIR>          .
02/02/2021  11:10 AM    <DIR>          ..
02/02/2021  11:09 AM            40,957 cygaa-1.dll.bz2
02/02/2021  11:09 AM           263,248 cygakonadi-calendar-4.dll.bz2
02/02/2021  11:09 AM           289,716 cygakonadi-contact-4.dll.bz2
...
02/02/2021  11:10 AM           658,119 libtcl8.6.dll.bz2
02/02/2021  11:10 AM           489,135 libtk8.6.dll.bz2
02/02/2021  11:09 AM             5,942 Xdummy.dll.bz2
            1044 File(s)    338,341,460 bytes
               2 Dir(s) 133,704,908,800 bytes free
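
For reference, a minimal sketch of the BusyBox adjustments described above: nice and nproc are dropped, the slot count is set by hand (the value 4 is a placeholder), and bzip2 is hard-coded.

#!/bin/sh
# BusyBox variant of parallel-pack_bz2: no nproc, no nice.
PARALLEL=4                               # set to the local core count

S="$(mktemp -t PARALLEL-XXXXXX)"
trap 'rm -f "$S"' EXIT

printf '#!/bin/sh\nexec bzip2 -9 "$1"' > "$S"
chmod 500 "$S"

if [ -z "$1" ]
then echo 'Specify files to pack into bz2 files.'
else for x
     do printf '%s\0' "$x"
     done | xargs -0 -P "$PARALLEL" -Ifname "$S" fname
fi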

Conclusion

IBM wrote, regarding their UNIX port to the System/370:

UNIX… is the only operating system available that runs on everything from one-chip microcomputers to the largest general-purpose mainframes… This represents at least a two-orders-of-magnitude range in power and capacity… The ability of the UNIX system to gracefully span the range from microcomputers to high-end mainframes is a tribute to its initial design over a decade ago and to its careful evolution.

At the same time, we feel nostalgia for a job control facility (under the System/370 operating systems) that we do not understand.

While Linux may not reach quite as low as a PDP-11, it shares this property with the 7th edition to a great extent, while running on machines of unimaginable speed from the perspective of the 1970s. However, POSIX.2 requires that we remain in the 1970s with a number of our tools, likely driving users to less expansive competitors with better (job) tooling.

I began my own exposure to UNIX SMP on an Encore Multimax at university in the early 90s, and it is unreasonable to imagine even that machine’s userland being constrained by the requirements of POSIX.2. To accept, even now, the same restrictions upon modern SMP designs is, to an extent, anathema.

POSIX is regarded in many realms as an inviolable standard. To see it surpassed in small ways by SELinux and systemd provides some hope that we may overcome the limitations imposed upon us by the previous generation. Perhaps the obvious solution would involve systemd acquiring a new job scheduling system. Though it may be argued that portability overrules functionality, innovation must eventually overrule tradition. Portability is a useful pursuit, but capability and efficiency are not without value.

Kernel participation in an improved job scheduling system is not strictly required. A basic userspace implementation added to POSIX would likely be greeted with great pleasure by the user community (and hopefully with something better than SIGUSR1/2 for runtime adjustments). Today's POSIX does not allow for this, but it is time to leave the past behind.

To be forced into an obscure utility for parallel scripting due to the early poverty of UNIX is no reasonable position. An updated POSIX.2 standard for a capable shell and sundry userland utilities is long overdue.

Until such time, thank the FSF for a lateral approach.
