Chapter 32 The Shell (IV)

In research applications, it is impossible to anticipate all the types of manipulations we might want to carry out. For some tasks, we will need to write our own software to carry out specific operations of interest. Fortunately, many specialized tasks require only a brief program to achieve the goal. The Humdrum tools can be used in conjunction with user-developed software to carry out specific tasks.

Many users will already have some programming ability and will be able to apply this knowledge using their preferred programming language. For those users who have less programming background, it may be useful to learn some basic programming skills. While the shell provides a useful programming environment, for more complex tasks, it is better to use one of the many good programming languages.

For data manipulation tasks comparable to those described in this book, the most appropriate programming language include perl and awk. The awk programming language is especially useful for text processing, retrieving, transforming, reducing, and validating text data. The perl programming language provides even more extensive capabilities, but requires a somewhat greater effort to learn. For research-oriented programming, perl is the programming language of choice. However, for this brief introduction we will describe features of the awk programming language. Awk is a so-called “scripted” language. It is easy to learn but nevertheless quite powerful.

The awk Programming Language

Awk programs can be executed from the shell command line. A simple program is the following:

awk '{print "hello"}'

The awk command invokes the awk program interpreter. The material within the single quotes is the actual program. Once the program is started, it is is executed once each time you type the carriage return or ENTER key. To stop the program, simply type control-D (on UNIX systems) or control-Z (on DOS systems).

In the default configuration, an awk program will be executed once for each line of input. If no input file is specified, then “standard input” is assumed. That is, input will come from either data arriving through a pipeline, or data typed at the keyboard.

Automatic Parsing of Input Data

Each line of input data is automatically assigned to the awk variable $0. This means that the command

awk '{print $0}'

will simply echoe each line of input as the output. Similarly, the following command will print each line of input preceded by a colon and a space:

awk '{print ": " $0}'

For any input line, awk also automatically parses the data into individual tokens or fields. A token is deemed to be any sequence of characters that is separated from other tokens by any blank space such as spaces or tabs. The first data token is automatically assigned to an awk variable $1. The second data token is assigned to the variable $2, and so on. For example, suppose a program encountered the following input line:

243xyz 3    29   #%$ **     Ullyses 234-034

The variables would be automatically assigned as follows:

$1 = 234xyz
  $2 = 3
  $2 = 29
  $4 = #%$
  $5 = **
  $6 = Ullyses
  $7 = 234-034

Given this input, the command

awk '{print $2 + $3}'

will print the sum of $2 and $3, namely 32.

Arithmetic Operations

Suppose that we have two semits spines as input and we would like to print the semitone difference between the two parts for each sonority. Typically, the higher part is placed in the right-most spine, so it makes most sense to subtract $1 from $2. Negative numbers mean that the nominally lower part has crossed above the nominally higher part:

awk '{print $1 - $2}'

In addition to addition and subtraction, other possible arithemtic operators include the slash / for division, the asterisk (*) for multiplication, the caret ^ for exponentiation, and the percent sign % for modulo arithmetic. Parentheses can be used to clarify the order of operations. For example, the following command prints the product of the first and second tokens $1 * $2 divided by the third token raised to the fourth token power:

awk '{print ($1 * $2) / ($3^$4)}'

As we have already seen, character strings can also be included in print statements. For example, we might want to print the first and third input tokens separated by a tab:

awk '{print $1 "\t" $3}'

Conditional Statements

Often we’d like to avoid processing certain records. For example, we might wish to avoid processing barlines. The awk if statement can be used to restrict the operation to particular circumstances. Consider the following awk program:

awk '{if ($0 !~/^=/) print $1 - $2}'

The if condition is given in parentheses. The string given between the slashes /^=/ is a regular expression: in this case, it identifies any equals sign that occurs at the beginning of an input line. The tilde means “match” and the exclamation mark means “not”. Hence the program means: if the entire line $0 does not match !~ an equals sign occuring at the beginning of the line /^=/, then print the value of the first token minus the value of the second token print $1 - $2.

Awk also provide an else condition. The syntax is:

if (condition)
     [then] {do something}
     else {do something else instead}

For Humdrum inputs, we may want to avoid processing comments and interpretations. Whenever we encounter a comment or interpretation, we might simply echo the input record in the output:

awk '{if($0 ~/^[*!]/) {print $0} else {print $1 - $2}}'

Sometimes we might simply want to do nothing at all when we encounter a comment or interpretation:

awk '{if($0 ~/^[*!]/) {} else {print $1 - $2}}'

Recall that input tokens in awk are separated by any blank space such as spaces or tabs. This means that a Humdrum multiple-stop will be treated as containing two or more tokens. We can avoid this situation by explicitly telling awk to assign the “field separate” (FS) to the tab character. For example, the following program prints the value in the third spine of a Humdrum input. Without reassigning the field separator, the third token might be the third element of a multiple-stop in the first spine, or the second element of a multiple-stop appearing in the second spine.

awk '{FS="\t"; print $3}'

Notice the use of the semicolon to separate individual instructions.

Assigning Variables

Within an awk program, the user can assign and manipulate variables that store particular values. Variables may hold numerical values or they may hold character strings. In the following examples, the value 178 is assigned to the variable `A'; the value 2.2 is assigned to the variable `number'; and the character string “Dear Gail” is assigned to the variable `salutation':

A=178
number = 2.2
salutation = "Dear Gail"

Named variables can be used for various arithmetic operations. For example:

A=178+18
number = 2.2 + A
number_squared = number ^ 2

Manipulating Character Strings

Variables holding character strings can be concatenated together. In the following example, after the first three assignments, the variable saluation will contain the character string “Dear Craig”:

opening = "Dear"
space = " "
name = "Craig"
salutation = opening space name

Awk provides a number of built-in functions for manipulating text. One function (gsub) carries out global substitutions. The syntax is:

gsub("target-string","replacement-string",variable)

For example, the following instruction changes all occurrences of X to Y in a variable named string:

gsub("X","Y",string)

Suppose that we wanted to increment all measure numbers by 1. Let’s presume our input contains only a single spine. First we test for the presence of the equal sign at the beginning of the input record. If the input is not a barline, then we simple print the line in the output. Otherwise we: (1) assign the input to the variable barline, (2) eliminate all non-numeric characters using gsub, (3) add one to the remaining numeric value, and (4) output the new number preceded by the equal sign:

awk '{
   if ($0 !~/^=/) {print $0}
   else {
      barline = $1
      gsub("[^0-9]","",barline)
      barline = barline + 1
      print "=" barline
   }
}'

Notice that we are at liberty to add spaces, tabs, and newlines in order to improve the readability of our program.

The for Loop

Often we would like to repeat a process for several concurrent spines. For example, suppose we had four spines of solfa data and we want to output the total number of leading-tones for each sonority. Awk provides a for instruction that allows us to cycle through a series of values. The for-loop construction has the following syntax:

for (initial-value; condition-for-continuing; increment-action) {do something repeatedly}

In the case of counting the number of leading-tones for each of four spines, our program would be as follows:

awk '{
   count = 0
   for (i=1; i<=4; i++)
      {if ($i ~/ti/) count++}
   print count
   }'

The initial value for the for-loop is 1 (i=1); each time the loop is executed the value of i is incremented by 1 (i++); and the loop continues executing as long as i is less-than or equal to 4 (i<=4). The value $i will take successive values so that the loop will test whether each of $1, $2, $3 and $4 match the regular expression /ti/. For each match, the variable count is incremented by 1. Finally, the value of count is printed. The count is set to zero each time the program is run (that is, for each line of input).

It would be nice if our program could adapt to inputs containing any number of spines. For each line of input, awk automatically identifies the number of input tokens or fields and stores the value in the varible NF. Simply replacing the number 4 by NF will achieve our goal. In our revised program we have also added some comments to clarify our code. Like the shell, awk comments consist of material following the octothorpe character #:

awk '{
      # A program to count occurrences of the leading-tone.
      count = 0
      for (i=1; i<=NF; i++)
         {if ($i ~/ti/) count++}
      print count
   }'

A problem with the above script is that it will attempt to count occurrences of ti in Humdrum comments, interpretations, and barlines. We can improve our program by echoing these in the output without processing them. Another refinement makes use of the awk next instruction. Whenever a next statement is encountered, the program immediately moves on to the next input line and begins processing again from the start of the program.

awk '{
   # A program to count occurrences of the leading-tone.
   count = 0
   if ($0 ~/^[!*=]/) {print $0; next}
      for (i=1; i<=NF; i++)
         {if ($i ~/ti/) count++}
      print count
   }'

Although our output data will consist of a single column (spine) of numbers, it is possible that an input will contain more than one interpretation — and so cause the output to fail to conform to the Humdrum syntax. Rather than simply echoing any interpretation records, we might ensure that only a single interpretation is generated for the output. First, we might look for exclusive interpretations (beginning **) and output a suitable interpretation of our own (e.g., **leading-tones). In the case of tandem interpretations (beginning with only a single asterisk), we could output a single null interpretation. Similarly, when we encounter a barline, we might ensure that only one barline token is output. Finally, we should remain vigilant for spine-path terminators (*-) and ensure that our output is similarly properly terminated. The revised program is as follows:

awk '{
      # A program to count occurrences of the leading-tone.
      count = 0
      if ($0 ~/^**/) {print "**leading-tones"; next}
      if ($0 ~/^*-/) {print "*-"; next}
      if ($0 ~/^*[^*]/) {print "*"; next}
      if ($0 ~/^!!/) {print $0; next}
      if ($0 ~/^=/) {print $1; next}
      {
      for (i=1; i<=NF; i++)
         {if ($i ~/ti/) count++}
      print count
      }'

Of course there are many other features of the awk programming language that we have not described here. These features include associative arrays, built-in variables, string-processing functions, user-defined functions, system calls, begin and end blocks, other control-flow statements, and pipes and file manipulations.

Reprise

In this chapter we have introduced some features of the awk pattern/action language. A programming language, like awk or perl can be used to transform data in highly specific and specialized ways. The power of Humdrum is significantly enhanced when users are able to create their own specialized filters.