perl tip: reporting pipe consumption progress

Here’s a perl tip for those trying to report progress on external programs that don’t report that kind of information. The case this hack was designed for was gzip, but you’ll think of many other examples of this class of problem.

Perl is an incredibly flexible language. Of all its wondrous features, the ability to get a file handle to a process is one of the most arcane and least appreciated. It is, however, the key to reporting how much input is consumed by a process.

Here’s a concrete example of what I’m talking about. The compression utility gzip is a stream-oriented program that works on chunks of data, which it typically receives from standard input (STDIN). You can therefore feed gzip a file of any size and it should work, given enough disk space. The larger the file, the longer gzip takes to run (I suppose this makes the runtime O(n), linear time [so much for using my comp sci degree]).

Occasionally, you’d like to know how far along gzip is in compressing a large file. Gzip does not report this, though it does give you the compression ratio at the end of the run if you call it with the -v flag.

Without hacking gzip, you can create a perl wrapper around gzip that reports how many bytes of the source file gzip has consumed. The idea is that the source file is read by perl and fed to gzip. Keeping track of how many bytes are read in the perl script is simple. Here’s some code.

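use strict;
use warnings;

my $infile = shift @ARGV or die "usage: $0 <file>\n";
open GZIP, "|/bin/gzip -c > out.gz"
   or die "can't open process to gzip: $!";

# disable output buffering to see the progress report
$|++;

open IN, '<', $infile or die "Can't open $infile: $!";
my $original_size = -s $infile;

my ($buf, $sum);
my $chunk = 200;
# read() returns the number of bytes actually read, which is smaller
# than $chunk on the final read, so count that rather than $chunk
while (my $n = read(IN, $buf, $chunk)) {
    $sum += $n;
    print GZIP $buf;

    # \r moves the cursor back to the start of the line so each
    # report overwrites the previous one
    printf "progress: %.2f%%\r", ($sum/$original_size)*100;
}
print "\n";
close GZIP;
close IN;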

This short script expects to be called with the name of the file to compress. The output file name is hard coded to be “out.gz”, but it’s a simple matter of programming to make this more flexible. The magic begins when we open the process to gzip. Here, the GZIP file handle will be written to. The source file is then opened for reading. I chose to read the source file in very tiny chunks so that you can clearly see the progress indicator work. Here, 200 bytes are read from the source file and then fed to gzip. The number of bytes read is tracked and reported in a straightforward way.
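As one way to make this more flexible, here is a rough sketch that wraps the same idea in a subroutine. The name pipe_with_progress, its three arguments, and the 64 KB chunk size are my own inventions for illustration. It takes the shell command to pipe into and a callback that does the reporting, and it counts the bytes that read() says it actually got.

```perl
use strict;
use warnings;

# Hypothetical helper (not a standard function): feed $infile to a shell
# command through a pipe, calling $report->($bytes_so_far, $total_bytes)
# after every chunk. Returns the number of bytes fed to the command.
sub pipe_with_progress {
    my ($infile, $command, $report) = @_;

    my $total = -s $infile;

    open my $in, '<', $infile or die "Can't open $infile: $!";
    binmode $in;

    # '|-' opens a pipe that we write to; the command reads it on STDIN
    open my $out, '|-', $command or die "Can't start '$command': $!";
    binmode $out;

    my ($buf, $sum) = ('', 0);
    my $chunk = 64 * 1024;    # assumed chunk size, tune to taste

    # read() returns the number of bytes read, 0 at end of file
    while (my $n = read($in, $buf, $chunk)) {
        print {$out} $buf or die "Write to '$command' failed: $!";
        $sum += $n;
        $report->($sum, $total);
    }

    close $in;
    # closing a pipe waits for the command to finish and sets $?
    close $out or die "'$command' failed, wait status $?";
    return $sum;
}

# Example call (commented out so the file can be loaded as a library):
# pipe_with_progress('big.log', 'gzip -c > big.log.gz',
#     sub { printf "progress: %.2f%%\r", 100 * $_[0] / $_[1] });
```

Passing the command as a string means it goes through the shell, which is what makes the "> out.gz" style redirect work. The flip side is that you should not build that string from untrusted input.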

Two penetrating glimpses into the obvious. One: this script is built for some flavor of UNIX. Some modifications would be needed for Windows, including the use of binmode(IN), binmode(GZIP). Two: this is really just a specialized echo loop. While I’m not one to yammer on about coding patterns, I would say that nearly 90% of the code I write is some kind of echo loop, when you take away the business logic, error checking and other distractions.
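For what it’s worth, here is roughly what I mean by an echo loop once the gzip and progress parts are stripped away. This is only a sketch: the name echo_loop and the 8192 byte default are made up for illustration. It calls binmode on both handles, which is the Windows-friendly habit mentioned above and does no harm on UNIX.

```perl
use strict;
use warnings;

# Hypothetical helper: copy everything from one file handle to another
# in fixed-size chunks. Returns the number of bytes copied.
sub echo_loop {
    my ($in, $out, $chunk_size) = @_;
    $chunk_size ||= 8192;    # assumed default

    # treat both handles as raw bytes (matters on Windows)
    binmode $in;
    binmode $out;

    my $total = 0;
    my $buf;
    while (1) {
        my $n = read($in, $buf, $chunk_size);
        die "Read failed: $!" unless defined $n;    # undef means error
        last if $n == 0;                            # 0 means end of file

        # business logic would go here, between the read and the write
        print {$out} $buf or die "Write failed: $!";
        $total += $n;
    }
    return $total;
}

# Example call (commented out so the file can be loaded as a library):
# echo_loop(\*STDIN, \*STDOUT);
```

Everything interesting in the gzip wrapper is a small addition to this shape: the output handle happens to be a pipe, and a progress report sits next to the write.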

If you only learn one thing from a programming class, it should be the humble echo loop.