Automated text editing

Normally, when I want to edit some text, I just open a text editor and change or add the text as I want to.

But sometimes there is a lot of text to change. Doing it by hand would take too much time, and errors would almost certainly slip in.

Another common situation is that there is some text you need to change repeatedly.

In both situations, automating the task is a sensible approach.

There are some great tools outside my text editor of choice for editing text, and I tend to forget quite often how to use them, or even that I have them at my disposal.

Here I want to document some common use-cases I’ve encountered.

Built-in functionalities in POSIX sh

The shell itself only has rudimentary tools for editing text.

Parameter expansion

As already noted, with parameter expansion, it is possible to replace empty or unset variables with a specific string.

unset var
printf '%s\n' "${var:-default value}"
# prints: "default value"

var= # or var=''
printf '%s\n' "${var:-default value}"
# prints: "default value"

var=value
printf '%s\n' "${var:-default value}"
# prints: "value"


unset var
printf '%s\n' "${var-default value}"
# prints: "default value"

var= # or var=''
printf '%s\n' "${var-default value}"
# prints: ""

var=value
printf '%s\n' "${var-default value}"
# prints: "value"

Substring extraction

A common operation is extracting a substring from a string.

For example, when converting between file formats, one might want to use the same file name, but with a different extension, and thus needs to extract the file name without extension from a string representation of a path:

my_path=... # might exist or not
filename="${my_path##*/}" # as alternative to: filename="$(basename -- "$my_path")"
last_extension="${my_path##*.}"
filename_without_last_extension="${filename%.*}"
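
For instance, a minimal sketch of deriving an output name for a converted file (the path here is hypothetical):

```shell
my_path=/tmp/docs/notes.md                  # hypothetical input path
filename="${my_path##*/}"                   # notes.md
filename_without_extension="${filename%.*}" # notes
printf '%s\n' "${filename_without_extension}.html"
# prints: "notes.html"
```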

String concatenation

This operation is easy: just put the variables next to each other:

VAR1="...";
VAR2="...";
RESULT1="$VAR1$VAR2"

In some cases, you might need to write ${VAR} instead of $VAR, for example, when the text directly following the variable could be parsed as part of its name:

VAR1="...";
RESULT1="${VAR1}someothertext"

Note that this is not always necessary:

VAR1="...";
RESULT1="$VAR1/someothertext"
RESULT2="text$VAR1"

In this example, $VAR1 is sufficient because a variable name cannot contain the / or $ character. There is thus no need to disambiguate where the variable name ends or begins.
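
A short sketch of what happens without the braces when the following text could be part of a variable name:

```shell
VAR1="abc"
printf '%s\n' "$VAR1_suffix"   # expands the (unset) variable VAR1_suffix
# prints: ""
printf '%s\n' "${VAR1}_suffix" # the braces end the variable name
# prints: "abc_suffix"
```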

String comparisons

# contains (v1)
if [ "${string#*"$substring"}" != "$string" ]; then :;
   # $substring is in $string
else :;
   # $substring is not in $string
fi

# contains (v2)
case "$string" in
  *"$substring"*) ;; # $substring is in $string
  *) ;; # $substring is not in $string
esac

# starts_with (v1)
case "$string" in
  "$prefix"*) ;; # string begins with prefix
  *) ;; # string does not begin with prefix
esac

# starts_with (v2)
if [ "${string#"$prefix"}" != "$string" ]; then :;
   # string begins with prefix
else :;
   # string does not begin with prefix
fi

# ends_with (v1)
case "$string" in
  *"$suffix") ;; # string ends with suffix
  *) ;; # string does not end with suffix
esac

# ends_with (v2)
if [ "${string%"$suffix"}" != "$string" ]; then :;
  # string ends with suffix
else :;
  # string does not end with suffix
fi

The disadvantage of using parameter expansion is that:

  • string needs to be repeated

  • I always have difficulty remembering which symbols to use

    • #* for contains

    • # for starts_with

    • % for ends_with

Thus, I either use case, or make a function with the parameter expansion.
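
Such functions could look like this (a sketch; the function names are my own):

```shell
# wrap the parameter expansions, so the symbols
# only need to be remembered once
contains()    { [ "${1#*"$2"}" != "$1" ]; }
starts_with() { [ "${1#"$2"}" != "$1" ]; }
ends_with()   { [ "${1%"$2"}" != "$1" ]; }

if contains "hello world" "lo w"; then
  printf '%s\n' "found"
fi
# prints: "found"
```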

cut

cut - remove sections from each line of files

— man cut

Perfect for extracting something based on a delimiter:

git rev-parse --abbrev-ref HEAD | cut -d_ -f1

It does not work well when the delimiter might be one of several characters (here _ or -), as cut accepts only a single delimiter.
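
One possible workaround, when more than one character acts as a delimiter, is to normalize the delimiters first, for example with tr:

```shell
# _ and - should both act as separators;
# map _ to - before calling cut
printf '%s\n' 'ABC-123_fix-crash' | tr '_' '-' | cut -d- -f1-2
# prints: "ABC-123"
```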

tr

tr - translate or delete characters

— man tr

It is not a tool that I use often.

It works very well if one wants to unconditionally map one character to another.

But if the character needs to be replaced only in some situations, or if the mapping depends on the content itself, tr is not the right tool.
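
Two typical invocations where tr shines:

```shell
# unconditionally map lowercase to uppercase
printf '%s\n' 'hello' | tr '[:lower:]' '[:upper:]'
# prints: "HELLO"

# delete all digits
printf '%s\n' 'a1b2c3' | tr -d '[:digit:]'
# prints: "abc"
```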

sed

sed - stream editor for filtering and transforming text

— man sed

sed has multiple features I like and have used:

  • can edit files in-place with -i

  • can apply multiple commands with -e, instead of creating one giant command or running sed multiple times

  • can capture content

  • multi-line support with -z

Unfortunately, other tools, editors in particular, might not use the same regex syntax.

So a command I wrote for sed cannot be used in vim directly.

On top of that, both -z and -i seem to be extensions, as they are not mentioned in the POSIX specification 🗄️
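
As a sketch of the multi-line support: with GNU sed, -z splits the input on NUL bytes instead of newlines, so a pattern can match across line endings:

```shell
printf 'first\nsecond\n' | sed -z 's/first\nsecond/one line/'
# prints: "one line"
```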

This is my cheat-sheet:

  • . matches any character

  • * to match zero or more instances

  • ^ matches the beginning of a string

  • \{n,m\} to match between n and m instances (inclusive)

  • $ matches the end of a string

  • \\, \$, \* match \, $ and * respectively

  • [ and ] for creating a list of characters where one is matched.

    • use - between characters to create a sequence in a list, like [a-z]

    • use ^ at the beginning of the list to reverse its meaning, like [^ab]

    • use ] at the beginning of the list to match ], like []ab]

  • \b matches a word boundary (a GNU extension)

  • \( and \) for starting and finishing a capture

  • \1, \2, …​ for inserting or matching the first, second, …​ capture

  • \s for denoting space (normal space, tabular space, …​) (a GNU extension)

  • \| is the or operator (a GNU extension)

I’ve mostly used sed in two situations.

The first one is for fixing problematic patterns. In this case, I want to edit multiple files and edit some specific patterns.

Together with sed, find is used for editing all the files I am interested in:

find . -name '*.cpp' -type f -exec \
 sed -i \
  -e "$pattern1" \
  -e "$pattern2" \
 {} +

The second situation is for editing the output of specific commands in a pipeline; an example would be extracting from the branch name the ticket ID of the bug tracker:

git rev-parse --abbrev-ref HEAD | sed 's/\([A-Z][A-Z]*-[0-9][0-9]*\)[_-].*/\1/'

If the branch name follows the pattern PROJECT-NUMBER_short-description or PROJECT-NUMBER-short-description, the command shown will return PROJECT-NUMBER.

awk

mawk - pattern scanning and text processing language

— man awk

awk is the tool I always forget I have at my disposal, which is a pity as it is very flexible.

Cutting text

As an alternative to cut

# cutting
awk -F'[' '{print $1}'
# vs
cut -d'[' -f1

But compared to cut, it is possible to split the text not only on one character, but also on a whole pattern:

# extract the leading ticket ID matching [A-Z]+-[0-9]+, or the word noticket
git rev-parse --abbrev-ref HEAD | awk '{ match($0, /^([A-Z]+-[0-9]+|noticket)/, m); print m[1] }'
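
Note that the three-argument form of match is a GNU awk extension; a portable sketch with POSIX awk uses the RSTART and RLENGTH variables instead (the branch name is hard-coded here for illustration):

```shell
printf '%s\n' 'ABC-123_fix-crash' |
  awk 'match($0, /^([A-Z]+-[0-9]+|noticket)/) { print substr($0, RSTART, RLENGTH) }'
# prints: "ABC-123"
```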

Counting / Dictionary

Counting things in awk is extremely easy:

awk '{count[$0]++} END {for (item in count) print count[item], item}'

But sometimes one wants to count based on a substring; with awk one can merge both operations instead of using different tools.

# note the usage of $1 instead of $0
awk -F'[' '{count[$1]++} END {for (item in count) print count[item], item}'

Split

Another trivial but still useful job that can be accomplished with awk:

# print $PATH line by line
echo "$PATH" | awk -F':' '{for(i=1;i<=NF;i++)print $i}'

Conditional printing

For example, print a line only if it has a minimal length:

awk 'length($0) > 5 {print $0}'

grep

awk can also be used in a similar way to grep, for filtering content based on a pattern:

printf 'aaa\nbbb\nccc\n' | awk '/^b/ {print $0}'

Colorizing output

This task might come a little bit out of the blue, but it is the task where I used different awk features.

For example, contrary to git, svn does not colorize the output.

Another example is the cl compiler from Microsoft, which does not support colorized output like gcc or clang.

This is not a deal-breaker, but it comes close. Having warnings and errors highlighted, especially on big projects, makes working from the command line much easier.

Thus, since the difference in experience when working from the console is big, I decided to write a couple of "colorizers".

This is, for example, what I’ve used for having a better output when working with CMake and MSVC:

colorize-build
#!/usr/bin/awk -f

BEGIN {
 count_vcxproj=0
 count_err=0;
 count_warn=0;
 cmake_count_warn=0;

 it_arr[0] = "/"
 it_arr[1] = "-"
 it_arr[2] = "\\"
 it_arr[3] = "|"
}

/Copyright \(C\)|Cmake does not need to run/{
 next;
}
/\.vcxproj -> /{
 count_vcxproj++;
 printf "[%s %d] %s                                                 \r", it_arr[count_vcxproj % 4], count_vcxproj, $1;
 next;
}
/: error C.| : error LNK|: error MSB|: fatal error LNK|^cl : Command line error/{
 count_err++;
 $NF="";
 print "\033[31m" $0 "\033[00m";
 next;
}
/: warning C.|: warning LNK|: warning MSB/{
 count_warn++;
 $NF="";
 print "\033[33m" $0 "\033[00m";
 next;
}
/^CMake Deprecation Warning|^CMake Warning|-- Could NOT find |WARNING: /{
 cmake_count_warn++;
 print "\033[33m" $0 "\033[00m";
 next;
}
/: note: /{
 print "\033[36m" $0 "\033[00m";
 next;
}
#gcc/clang syntax, FAILED seems to be of cmake
/: error: |^FAILED: |subcommand failed\./{
 print "\033[31m" $0 "\033[00m";
 next;
}
/^ninja: no work to do\./{
 print "\033[36m" $0 "\033[00m";
 next;
}
{print;}

END {
 if( count_vcxproj != 0) {
 print "\n\033[35mGone through " count_vcxproj " vcxproj.\033[00m";
 }
 if (count_err != 0 || count_warn != 0 || cmake_count_warn != 0) {
 print "\n\033[35m----------------------------------------";
 print   "\033[31m" count_err        " errors";
 print   "\033[33m" count_warn       " warnings";
 print   "\033[33m" cmake_count_warn " cmake warnings\033[00m";
 }
}

Each rule in the script matches a specific pattern, like warning C.*, and prints the line with colors. This is not perfect, as a warning could be split over multiple lines, but it is good enough.

As a bonus, I added an error and warning counter; ideally, those are not needed, but legacy projects tend to have a lot of warnings with newer compilers, and instead of disabling them, I prefer to try to fix them. The counter at the end is a quick way to ensure that warnings are getting fixed.

As another bonus, I found the standard output too verbose; in particular, I am not interested in all lines containing the literal `.vcxproj -> `, so I replaced them with a "spinner".

ex, vi

POSIX defines ex 🗄️ and vi 🗄️.

It is possible to define commands and use them for bulk operations or automated editing directly from the shell.
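
For example, a substitution over a whole file can be scripted by piping ex commands to ex in script mode (the file name is hypothetical):

```shell
# g/foo/s//bar/g replaces foo with bar on every line containing it,
# w writes the file, q quits
printf '%s\n' 'g/foo/s//bar/g' 'w' 'q' | ex -s file.txt
```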

Other tools

There are a lot of tools, for example Turing-complete languages like Perl or Python, and tools more dedicated to specific formats, like jq for JSON.

Normally, when I reach for "more complete languages" or more specialized tools, it is for much more constrained use-cases.

This is not to say that they are less useful; on the contrary, they have many advantages over the tools presented. Tools like awk and sed are small, performant, have a very concise syntax and cover a lot of use-cases. It might take some time to use them efficiently, but they are easy to use from the command line or "embed" in scripts.

Once I am using something else, normally those tools are replaced by functions and other language constructs. I am not aware of libraries for C, Java, Python, and other languages that provide a "drop-in" interface that works like awk and sed; it might be something interesting to have.

Conclusion

I should use awk and sed more often.

Sometimes I realize too late that I could have used sed instead of searching and editing some text manually over dozens of files, and sometimes I completely forget that I have awk in my toolchain.

All the programs presented are defined by POSIX, so they should be available on most Linux distributions. On Windows, there are multiple ports; normally I use those made available by Cygwin or WSL.


If you have questions or comments, found typos or errors, or the notes are not clear, then just contact me.