tee on steroids with awk
From time to time, it happens that I want to execute a command in the shell, and I want to work on its output.
Often, I want to filter some data, but I do not want to drop the unfiltered data; I want to store it somewhere else to be able to process it further in a separate step.
The easy solution is to store the content of the first program in a file (or in a buffer), and then filter it multiple times.
It works, and it is good enough that I’ve always used this approach, although when the output is big, processing the whole input multiple times has always bothered me.
I finally had a use case where I had a job that took a long time and created a lot of data. I wanted to filter it while the job was in progress.
I could have worked around it by using multiple file descriptors or periodically updating files, but I decided there had to be a simpler way.
I decided I needed something like tee, but with the option to store the content to different files based on filters.
For those who have never used tee, the most common usage looks like the following:
command 2>&1 | tee file.txt
The output of command is still displayed on the console, and it is also stored in file.txt.
I wanted something that worked like the following:
command 2>&1 | multi-tee "pattern1" file1.txt "pattern2" file2.txt ...
After evaluating my use case better, I noticed that a simple pattern was not enough; I also wanted to transform the data. If you want to look at the data while command is running, transforming it immediately is easier than processing it afterwards.
Thus, I actually wanted something like
command 2>&1 | multi-tee-transform "pattern1" "transform1" file1.txt "pattern2" "transform2" file2.txt ...
Otherwise, I would have needed to write something like
command 2>&1 | multi-tee "pattern1" file1.tmp.txt "pattern2" file2.tmp.txt ...
# while command is running, execute repeatedly
transform file1.tmp.txt file1.txt
transform file2.tmp.txt file2.txt
This works, but means that the same data is processed multiple times, as I see no simple way to ensure the absence of data races if some data is removed from file1.tmp.txt. The alternative is to remember the final position where file1.tmp.txt has been processed, but it also gets hairy.
I decided that I needed multi-tee-transform; multi-tee was not enough for many use-cases, but I was unsure how to implement the transform capabilities. For filtering, a regular expression seemed the obvious choice, but for transforming the text, a regex is not always that practical.
At one point, I remembered:
awk is the tool I always forget I have at my disposal, which is a pity as it is very flexible.
So I checked the documentation, and learned that awk is a suitable replacement for tee:
9. Input and output
[…] The output of print and printf can be redirected to a file or command by appending > file, >> file or | command to the end of the print statement. Redirection opens file or command only once, subsequent redirections append to the already open stream. […]
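The quoted behaviour is easy to check. The following is a minimal sketch; the file names trunc.txt and append.txt are made up for the illustration:

```shell
# Start from two files that already have some content.
printf 'old\n' > trunc.txt
printf 'old\n' > append.txt

# ">" truncates the file when awk opens it the first time; every later
# print in the same run appends to the already open stream.
printf 'a\nb\nc\n' | awk '{ print > "trunc.txt" }'

# ">>" never truncates: the new lines land after the existing content.
printf 'a\nb\nc\n' | awk '{ print >> "append.txt" }'

cat trunc.txt     # a, b, c
cat append.txt    # old, a, b, c
```

In tee terms, > corresponds to plain tee, which truncates the file, and >> corresponds to tee -a.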
So I tried it out, and in fact, a simple tee can be replaced easily with awk:
command 2>&1 | tee file.txt
# vs
command 2>&1 | awk '
{
print # print to stdout
print >> "file.txt" # print to file
}'
Since awk can also be used in a similar way to grep, it can be used for filtering content and storing it in different files:
command 2>&1 | awk '
{
print # print to stdout
print >> "file.txt" # print to file
if (/load/) {print >> "load.log"; fflush("load.log"); }
if (/analyzing/) {print >> "analyze.log"; }
if (/verifying/) {print >> "verify.log"; }
}'
In this example, I’ve also added fflush("load.log") because the pattern load did not occur often, maybe once every five minutes, and produced little data. Since I wanted to see the output in (nearly) real time, I wanted it to reach the file as fast as possible, rather than relying on buffering to increase performance. For the patterns analyzing and verifying, there was no need to flush: the output was big, so the data was flushed often enough without disabling buffering.
Thus, with a straightforward and simple syntax, awk is already the replacement for a multi-tee program, and an improvement for many scripts I wrote.
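If the command-line shape of the imagined multi-tee is still appealing, it can be approximated with a small shell function around awk. This is only a sketch: multi_tee is a hypothetical name, the mt-*.log files and input lines are invented, and patterns starting with a dash are not handled.

```shell
# multi_tee is a hypothetical helper, not an existing program.
# Usage: command 2>&1 | multi_tee REGEX FILE [REGEX FILE ...]
multi_tee() {
    awk '
    BEGIN {
        # Read the (regex, file) pairs from the command line.
        for (i = 1; i + 1 < ARGC; i += 2) {
            n++
            pattern[n] = ARGV[i]
            target[n]  = ARGV[i + 1]
        }
        # Drop the arguments so that awk reads stdin instead of
        # treating them as input files.
        ARGC = 1
    }
    {
        print                           # pass everything through, like tee
        for (k = 1; k <= n; k++)
            if ($0 ~ pattern[k])
                print >> target[k]      # append matching lines to their file
    }' "$@"
}

# Example with made-up input lines.
rm -f mt-load.log mt-analyze.log
printf 'load 0.42\nanalyzing a.dat\nverifying a.dat\nload 0.40\n' |
    multi_tee 'load' mt-load.log 'analyzing' mt-analyze.log
```

Writing the awk program inline, as in the examples here, stays more flexible, because each pattern can get its own action.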
But awk can be used for text editing too, so it can transform the data before writing it to a file, which makes it a replacement for multi-tee-transform as well!
command 2>&1 | awk '
{
print # print to stdout
print >> "file.txt" # print to file
if (/load/) {print >> "load.log"; fflush("load.log"); }
if (/analyzing/) {printf "%s - %s\n", $9, $5 >> "analyze.log";}
if (/verifying/) {printf "%s - %s\n", $7, $3 >> "verifying.log";}
}'
In this example, I’m only writing a subset of the output to analyze.log and verifying.log, but even such a simple transformation can bring a lot of benefits, instead of creating temporary files that need to be processed afterwards, or multiple times on demand.
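The same rules can also be written as awk pattern-action pairs instead of if statements. The following is a minimal sketch with invented input lines, since the real log format is not shown here; the field numbers only make sense for these made-up lines:

```shell
# Start clean so that rerunning the demo gives the same result.
rm -f analyze.log verifying.log load.log

# Invented sample lines standing in for the output of command.
printf '%s\n' \
    '2024-05-01 10:00:01 INFO analyzing chunk-17 size 2048 duration 35ms' \
    '10:00:02 verifying chunk-17 checksum ok in 12ms' \
    '10:00:03 load 0.42' |
awk '
            { print }                                        # every line goes to stdout
/load/      { print >> "load.log"; fflush("load.log") }      # rare lines: flush at once
/analyzing/ { printf "%s - %s\n", $9, $5 >> "analyze.log" }  # duration and chunk name
/verifying/ { printf "%s - %s\n", $7, $3 >> "verifying.log" }
'

cat analyze.log      # 35ms - chunk-17
cat verifying.log    # 12ms - chunk-17
```

A rule without a pattern runs for every line, and a rule with a pattern runs only for matching lines, so the behaviour is the same as with the chain of if statements.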
A minor variation of tee would be not to forward everything to stdout; for example:
command 2>&1 | awk '
{
if (/analyzing/) {printf "%s - %s\n", $9, $5 >> "analyze.log";}
if (/verifying/) {printf "%s - %s\n", $7, $3 >> "verifying.log";}
if (/load/) {print >> "load.log"; fflush("load.log"); }
else {print; print >> "file.txt";}
}'
But awk is so flexible that listing all possible use cases would not be practical.
Suffice to say, I had some tasks that consisted of analyzing some data and writing a summary report.
Some of them have been completely replaced by an awk script, thanks to its great flexibility.
As a bonus, some analysis can be done in real-time, instead of waiting for command to finish, and without transforming the data multiple times.
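As an illustration of such a report, here is a minimal sketch that keeps a summary file up to date while the input is still arriving. The input lines, the counters, and the file name summary.txt are all invented for the example:

```shell
rm -f summary.txt

# Invented sample lines standing in for the output of command.
printf 'analyzing a.dat\nverifying a.dat\nanalyzing b.dat\nload 0.42\n' |
awk '
# Rewrite the summary file with the current counters.
function report() {
    printf "analyzed: %d\nverified: %d\n", analyzed, verified > "summary.txt"
    # Closing the file makes the next ">" truncate it again, so the
    # file always holds only the latest totals.
    close("summary.txt")
}
/analyzing/ { analyzed++; report() }
/verifying/ { verified++; report() }
END         { report() }
'

cat summary.txt
```

Because the file is closed after every update, looking at summary.txt while the job is running shows the totals so far, at the cost of reopening the file for each matching line.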
I should really use awk more often!
If you have questions or comments, found typos or errors, or think the notes are not clear, just contact me.