Chapter 10

Musical Uses of Regular Expressions


Now that you have a better understanding of regular expressions, let’s apply them. This chapter provides many examples of how regular expressions may be used to define musically useful patterns. In subsequent chapters, we’ll make frequent use of regular expressions.

The grep Command (Again)

Although regular expressions are used in a number of Humdrum commands, they are most frequently used in conjunction with the grep command encountered in Chapter 3. grep is a popular software tool that is available from a number of manufacturers and sources. Many versions of grep differ in the options provided. For example, the GNU version of grep distributed by the Free Software Foundation provides no fewer than 19 options. Some of the most common options for grep are identified in Table 10.1.

Table 10.1: Common options for the grep command.

-c        count the number of lines matching the regular expression
-f file   search for patterns that are specified in file
-i        ignore differences of upper- and lower-case
-l        just list the names of files containing a matching line
-n        prefix each output line with its line number
-h        suppress file-name prefixes (headers) in output when searching more than one file
-v        display all lines not matching the regular expression
-L        list names of files not containing the regular expression

Many of the predefined Humdrum representations make use of the “common system” for representing barlines. The following command counts the number of barlines in the file czech37.krn. Note that the caret anchor ^ is used to avoid inadvertent matches of the equals sign that might appear in Humdrum comments or interpretations.

grep -c ^= czech37.krn

Recall that the dollar sign $ can be used to anchor an expression to the end of the line. The following command determines whether numbered measure 9 is present in the file france12.krn; the dollar sign ensures that measure 9 is not mistaken for measure 90, 930, etc.

grep ^=9$ france12.krn
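The effect of the two anchors can be tried on a tiny kern fragment. The following is only a sketch: the file demo.krn and its contents are made up for illustration, and any single-spine kern file would serve equally well.

```shell
# Scratch copy of a tiny single-spine kern fragment (made-up data).
tmp=$(mktemp -d)
printf '%s\n' '**kern' '!! a comment with = in it' '=9' '4c' '4d' \
    '=90' '4e' '==' '*-' > "$tmp/demo.krn"

# Anchored at the start of the line: counts only barline records.
barlines=$(grep -c '^=' "$tmp/demo.krn")

# Anchored at both ends: matches measure 9 but not measure 90.
measure9=$(grep '^=9$' "$tmp/demo.krn")

# Unanchored, for comparison: the record =90 is matched as well.
loose=$(grep -c '=9' "$tmp/demo.krn")

echo "barlines: $barlines, measure 9 record: $measure9, loose matches: $loose"
rm -rf "$tmp"
```

Quoting the pattern, as done here, is a cautious habit rather than a requirement for these particular expressions.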

The asterisk means “zero or more” instances of the preceding expression. For example, the following regular expression will match any reference record or global comment in the file clara29:

grep '^!!!*' clara29

Suppose we want to list all of the global comments and reference records for all files in the current directory:

grep '^!!!*' *

Notice that the two asterisks serve different functions in the above command. The first asterisk means “zero or more instances” and is part of the regular expression passed to grep. The second asterisk means “all files in the current directory” and is expanded by the shell. The first asterisk is ‘protected’ from the shell by the single quotes. Otherwise, the first asterisk might be expanded by the shell to a list of all files in the current directory.
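The difference between the quoted and unquoted asterisk can be observed directly. In the sketch below, the contents of clara29 and the oddly named file ^!!!decoy are hypothetical, created only so that the shell has something to expand the unquoted word into. The sketch is meant to be run as a script; an interactive bash session may additionally apply history expansion to an unquoted !!.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' '!! a global comment' '!!!OTL: Example' '4c' > clara29
: > './^!!!decoy'      # hypothetical file whose name the unquoted word matches

# Quoted: grep receives the pattern ^!!!* unchanged.
matches=$(grep -c '^!!!*' clara29)

# Unquoted: the shell first expands the word against the file names.
set -- ^!!!*
unquoted_arg=$1

# Quoted again: the word is passed through untouched.
set -- '^!!!*'
quoted_arg=$1

echo "matches=$matches unquoted=$unquoted_arg quoted=$quoted_arg"
cd - >/dev/null && rm -rf "$tmp"
```

When no file name happens to match, most shells pass the unquoted word through unchanged, which is why the problem can go unnoticed for a long time.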

In regular expressions, the period character . matches any single character. For example, the expression A.B will match strings such as AXB and AAB etc. The following command identifies all eighth-notes containing at least one flat, and whose pitch lies within an octave of middle C.

grep 8.- *.krn

Frequently it is necessary to turn off the special meanings for metacharacters such as ^, $, and *. Recall that this can be done by inserting a backslash \ immediately prior to the metacharacter. In the kern representation the caret signifies an accent. In a monophonic input, we might count the number of notes that have a notated accent as follows:

grep -c '\^' danmark3.krn

In the following command we have used the backslash to escape the special meaning of the asterisk. The l option causes grep to output only the names of any files that contain a line matching the pattern. Hence, the following command identifies those files in the current directory that encode music in 9/8 meter:

grep -l '^\*M9/8' *

Recall that square brackets can be used to indicate character classes where any of the characters in the class can be used to match the expression. The following command identifies those files in the current directory that encode music in either 3/8 or 9/8 meter:

grep -l '\*M[39]/8' *
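The escaped metacharacters and the character class can be checked against three small files. This is a minimal sketch; the file names a.krn, b.krn and c.krn and their contents are assumptions made for the example.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' '**kern' '*M9/8' '8c^' '8d' '8e^' '*-' > a.krn
printf '%s\n' '**kern' '*M3/8' '8f' '*-' > b.krn
printf '%s\n' '**kern' '*M6/8' '8g' '*-' > c.krn

# Escaped caret: count the records carrying a literal ^ (accent).
accents=$(grep -c '\^' a.krn)

# Escaped asterisk: list the files with a 9/8 meter signature.
nine_eight=$(grep -l '^\*M9/8' *)

# Character class: list the files in either 3/8 or 9/8.
three_or_nine=$(grep -l '\*M[39]/8' *)

echo "accents: $accents"
echo "9/8: $nine_eight"
echo "3/8 or 9/8:" $three_or_nine
cd - >/dev/null && rm -rf "$tmp"
```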

One of the most frequently used regular expressions consists of the period followed by the asterisk .*. Recall that this expression will match any string including the null string (i.e. nothing at all). This expression commonly appears between two other character strings. For example, we can identify all files in the current directory whose instrumentation includes a trumpet:

grep -l '!!!AIN.*tromp' *

The .* expression is needed since we don’t know what other instruments might be listed following AIN and before tromp. Instrumentation reference records require that instrument codes appear in alphabetical order. This makes it easier to conduct searches for combinations of instruments. For example, we can identify all scores in the current directory whose instrumentation includes both trumpet and cornet as follows:

grep -l '!!!AIN.*cornt.*tromp' *

There are many variants on the use of the .* expression. The following command identifies all files that contain a record having the word Drei followed by the word Koenige. (Notice the use of the i option in order to ignore the case of the letters.)

grep -li 'Drei.*Koenige' *

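This command will match such strings as: Die Heiligen Drei Koenige, Drei Koenige, etc. Note that a title such as Dreikoenigslied would not be matched, since Koenigs lacks the final e; shortening the pattern to 'Drei.*Koenig' would include it.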

The `!!!AGN' reference record is used to encode genre-related keywords. The following command lists all files that are ballads.

grep -l '!!!AGN.*Ballad' *

List all files that have the word Amour in the title:

grep -li '!!!OTL.*Amour' *

List any works that bear a dedication:

grep -l '!!!ODE:' *

List those works that are in irregular meters:

grep -l '!!!AMT.*irregular' *

The L option for grep causes the output to contain a list of files not containing the regular expression. For example, we could identify those works that don’t bear any dedication:

grep -L '!!!ODE:' *

List those works not composed by Schumann:

grep -L '!!!COM: Schumann' *

Identify any works that don’t contain any double barlines:

grep -L '^==' *
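The L option can be tried on a handful of made-up files. The names w1.krn to w3.krn and the reference records in them are hypothetical. Note also that L is provided by GNU and BSD versions of grep but may be absent elsewhere.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' '!!!COM: Schumann, Robert' '!!!ODE: a friend' \
    '**kern' '4c' '==' '*-' > w1.krn
printf '%s\n' '!!!COM: Anonymous' '!!!ODE: a patron' \
    '**kern' '4d' '=' '*-' > w2.krn
printf '%s\n' '!!!COM: Schumann, Robert' \
    '**kern' '4e' '==' '*-' > w3.krn

# Files with no dedication record.
no_dedication=$(grep -L '!!!ODE:' *)

# Files not attributed to Schumann.
not_schumann=$(grep -L '!!!COM: Schumann' *)

# Files without a double barline (a single = does not match ^==).
no_double_bar=$(grep -L '^==' *)

echo "no dedication: $no_dedication"
echo "not Schumann:  $not_schumann"
echo "no double bar: $no_double_bar"
cd - >/dev/null && rm -rf "$tmp"
```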

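How many works in the current directory are in simple-triple meter? Since the c option reports a separate count for each file when several files are searched, we list the matching files with the l option and count them with wc:

grep -l '!!!AMT.*simple.*triple' * | wc -l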

When searching for more complex patterns it may be necessary to use grep more than once. Consider, for example, the problem of identifying works whose titles contain both the words Liebe and Tod. The first of the following commands will identify only those titles that contain Liebe followed by Tod, whereas the second command will identify only those titles that contain Tod followed by Liebe:

grep '!!!OTL.*Liebe.*Tod' *
grep '!!!OTL.*Tod.*Liebe' *

A better solution is to pipe the output between two grep commands. Recall that the vertical bar | conveys or “pipes” the output from one command to the input of a subsequent command. The following command passes all opus-title records OTL containing the word Liebe to a second grep, which passes only those records also containing the word Tod. Since both grep commands process the entire input line, it does not matter whether the word Tod precedes or follows the word Liebe:

grep '!!!OTL.*Liebe' * | grep 'Tod'
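The order-dependent and order-independent searches can be compared on three hypothetical titles. The file names t1.krn to t3.krn are assumptions chosen so that no file name itself contains Tod, since the second grep sees the file-name prefix as part of each line.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' '!!!OTL: Liebe und Tod' '**kern' '4c' '*-' > t1.krn
printf '%s\n' '!!!OTL: Tod und Liebe' '**kern' '4d' '*-' > t2.krn
printf '%s\n' '!!!OTL: Liebeslied'    '**kern' '4e' '*-' > t3.krn

# Single expression: only Liebe followed by Tod is found.
liebe_then_tod=$(grep -l '!!!OTL.*Liebe.*Tod' *)

# Two greps in a pipeline: the order of the two words no longer matters.
either_order=$(grep '!!!OTL.*Liebe' * | grep 'Tod')

echo "Liebe ... Tod only: $liebe_then_tod"
echo "either order:"
echo "$either_order"
cd - >/dev/null && rm -rf "$tmp"
```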

The v option for grep causes a “reverse” or “negative” output. Instead of outputting all records that match the specified regular expression, the v option causes only those records to be output that do not match the given regular expression. For example, the following command eliminates all comments from the file polska24.krn:

grep -v '^!' polska24.krn

Similarly, the following command eliminates all records that contain a whole-note rest:

grep -v 1r *

The v option is especially convenient in pipelines. For example, the following command identifies all those files whose instrumentation includes a cornet but not a trumpet:

grep '!!!AIN.*cornt' * | grep -v 'tromp'

The following command identifies those works in compound meters that are not also quadruple meters:

grep '!!!AMT.*compound' * | grep -v 'quadruple'

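Similarly, the following command identifies those notes that begin a phrase, but are not rests. The h option suppresses the file-name prefixes, which would otherwise be filtered as well (a name such as polska24.krn itself contains the letter r).

grep -h '^{' * | grep -v r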
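A short sketch of negative filtering in a pipeline follows. The files p1.krn and p2.krn are made up for the example. Because the file names here end in .krn, and so contain an r, the sketch uses the h option to drop the file-name prefixes before filtering on r.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' '!!!AIN: cornt tromp' '**kern' '{4c' '4d}' '{4r' '4e}' '*-' > p1.krn
printf '%s\n' '!!!AIN: cornt'       '**kern' '{8g' '8a}' '*-' > p2.krn

# Instrumentation records that list a cornet but no trumpet.
cornet_only=$(grep '!!!AIN.*cornt' * | grep -v 'tromp')

# Phrase-initial records that are not rests; -h keeps the "r" of
# ".krn" out of the lines seen by the second grep.
phrase_starts=$(grep -h '^{' * | grep -v r)

echo "cornet without trumpet: $cornet_only"
echo "phrase-initial notes:"
echo "$phrase_starts"
cd - >/dev/null && rm -rf "$tmp"
```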

German, French, Italian, and Neapolitan Sixths

In conjunction with the solfa command, grep can be used to search for various types of special chords. Suppose, for example, that we wanted to identify occurrences of augmented sixth chords. An augmented sixth chord is characterized by an augmented sixth interval occurring between the lowered sixth scale-degree and the raised fourth scale-degree. In Chapter 4, we saw that the solfa command represents pitches with respect to an encoded tonic pitch. In the solfa representation, the lowered sixth and raised fourth degrees will be represented as 6- and 4+ respectively. First we translate the input to the solfa representation, and then we search for records matching the appropriate regular expression:

solfa input | grep '6-.*4+'

Notice that the expression `6-.*4+' presumes that the lowered sixth degree is lower in pitch than the raised fourth degree. For augmented sixth chords, this is a reasonable presumption. In the unlikely situation that the raised fourth degree is lower in pitch than the lowered sixth degree, we would need to also search for the expression `4+.*6-'. Alternatively, we could use two separate grep commands, eliminating the constraint of order:

solfa input | grep '6-' | grep '4+'

Augmented sixth chords can be further classified as either German, French, or Italian sixths. The German sixth contains the lowered mediant whereas the French sixth contains the supertonic pitch; the Italian sixth contains neither:

solfa input | grep '6-.*4+' | grep '3-'      # German sixth
solfa input | grep '6-.*4+' | grep '2'       # French sixth
solfa input | grep '6-.*4+' | grep -v '[23]' # Italian sixth
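The three classifying pipelines can be exercised without running solfa by feeding grep a few hand-written records. The records below are not actual solfa output: they are hypothetical, tab-separated lines that merely use the 6-, 4+, 3- and 2 signifiers described above, ordered from the lowest voice to the highest.

```shell
tmp=$(mktemp -d)
# One chord per record: German, French, Italian, and a plain tonic triad.
printf '%s\t%s\t%s\t%s\n' \
    6- 1 3- 4+ \
    6- 1 2  4+ \
    6- 1 1  4+ \
    1  3 5  1 > "$tmp/chords"

# In basic regular expressions the + is an ordinary character.
aug6=$(grep -c '6-.*4+' "$tmp/chords")
german=$(grep '6-.*4+' "$tmp/chords" | grep -c '3-')
french=$(grep '6-.*4+' "$tmp/chords" | grep -c '2')
italian=$(grep '6-.*4+' "$tmp/chords" | grep -vc '[23]')

echo "augmented sixths: $aug6 (German $german, French $french, Italian $italian)"
rm -rf "$tmp"
```

Note that the French-sixth test on a bare 2 would also accept an altered second degree such as 2-; a stricter pattern might be needed for real data.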

A similar approach can be used to identify Neapolitan sixth chords. These chords are based on the lowered supertonic with the third of the chord (unaltered subdominant) in the bass.

solfa input | grep '4[^-+].*2-' | grep '6-' # Neapolitan sixth

Depending on the key, Neapolitan chords are sometimes notated enharmonically as a raised tonic chord. Suppose we were looking for such enharmonically spelled Neapolitan chords:

solfa input | grep '3+.*1+' | grep '5+'

Occasionally, Neapolitan chords are missing the fifth of the chord (the lowered sixth degree of the scale). We might search for an example of such a chord:

solfa input | grep '2-' | grep '4' | grep -v '6-'

AND-Searches Using the xargs Command

In some cases, we want to identify those files that match two entirely different patterns (in different records). Recall that the l option causes grep to output the filename rather than the matching record. If we could pass along these file names to another grep command, we could search those same files for yet another pattern.

The UNIX xargs command provides a useful way of transferring the output from one command to be used as final arguments for a subsequent command. For example, the following command takes each file whose opus title contains the word Liebe and counts the number of phrases.

grep -l '!!!OTL:.*Liebe' * | xargs grep -c '^{'

In this case the grep -l command outputs a list of names of files containing the string Liebe in an OTL reference record. The xargs command causes these filenames to be appended to the end of the following grep command. The grep -c command will thus be applied only to those files already identified by the previous grep as containing Liebe in the title.
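The hand-off from grep -l to xargs can be traced on three small files. The names x1.krn to x3.krn and their contents are assumptions made for this sketch. File names containing spaces would need extra care with xargs, which the sketch does not attempt.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' '!!!OTL: Liebeslied' '**kern' '{4c' '4d}' '{4e' '4f}' '*-' > x1.krn
printf '%s\n' '!!!OTL: Wanderlied' '**kern' '{4g' '4a}' '*-' > x2.krn
printf '%s\n' '!!!OTL: Alte Liebe' '**kern' '{4c' '4d}' '*-' > x3.krn

# Step 1 lists x1.krn and x3.krn; xargs appends those names to the
# second grep, which counts phrase beginnings in each of them.
counts=$(grep -l '!!!OTL:.*Liebe' * | xargs grep -c '^{')

echo "$counts"
cd - >/dev/null && rm -rf "$tmp"
```

When only a single file name is passed along, grep -c prints the bare count without a file-name prefix.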

A set of such pipelines can be used to answer more sophisticated questions. For example, are drinking songs more apt to be in triple meter?

grep -l '!!!AMT.*triple'  *   | xargs grep -l '!!!AGN.*Trinklied'
grep -l '!!!AMT.*duple'   *   | xargs grep -l '!!!AGN.*Trinklied'
grep -l '!!!AMT.*quadruple' * | xargs grep -l '!!!AGN.*Trinklied'

Similarly, the following commands determine whether files whose titles contain the word death are more apt to be in minor keys:

grep -li '!!!OTL.*death' * | xargs grep -c '^\*[a-g][#-]*:'
grep -li '!!!OTL.*death' * | xargs grep -c '^\*[A-G][#-]*:'

Note that the xargs command can be used again and again to continue propagating file names as arguments to subsequent searches. For example, the following command outputs the key signatures for all works originating from Africa that are written in 3/4 meter:

grep -l '!!!ARE.*Africa' * | xargs grep -l '^\*M3/4' \
     | xargs grep '^\*k\['

Similarly, the following command outputs the names of all files in the current directory that encode 17th century organ works containing passages in 6/8 meter:

grep -l '!!!ODT.*16[0-9][0-9]/' * | xargs grep -l \
     '!!!AIN.*organ' | xargs grep -l '\*M6/8'

Using the L option allows us to form even more complex criteria by excluding certain works. For example, the following variation of the above command outputs the names of all files in the current directory that encode 17th century organ works that do not contain passages in 6/8 meter:

grep -l '!!!ODT.*16[0-9][0-9]/' * | xargs grep -l \
     '!!!AIN.*organ' | xargs grep -L '\*M6/8'
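The chained AND-search, with and without the final exclusion, might be tried as follows. The four files and their dates, instrument codes and meters are invented for the sketch. Note that the first grep in each chain is given the file list * to search.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' '!!!ODT: 1680/03/02' '!!!AIN: organ' '**kern' '*M6/8' '8c' '*-' > o1.krn
printf '%s\n' '!!!ODT: 1699/11/20' '!!!AIN: organ' '**kern' '*M4/4' '4d' '*-' > o2.krn
printf '%s\n' '!!!ODT: 1722/01/15' '!!!AIN: organ' '**kern' '*M6/8' '8e' '*-' > o3.krn
printf '%s\n' '!!!ODT: 1650/05/05' '!!!AIN: tromp' '**kern' '*M6/8' '8f' '*-' > o4.krn

# 17th century AND organ AND containing 6/8.
with_68=$(grep -l '!!!ODT.*16[0-9][0-9]/' * | xargs grep -l \
    '!!!AIN.*organ' | xargs grep -l '\*M6/8')

# 17th century AND organ AND NOT containing 6/8.
without_68=$(grep -l '!!!ODT.*16[0-9][0-9]/' * | xargs grep -l \
    '!!!AIN.*organ' | xargs grep -L '\*M6/8')

echo "with 6/8: $with_68, without 6/8: $without_68"
cd - >/dev/null && rm -rf "$tmp"
```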

OR-Searches Using the grep -f Command

In effect, the above pipelines provide logical AND structures: e.g. identify works composed in the 17th century AND written for organ AND containing a passage in 6/8 meter. The f option for grep provides a way of creating logical OR searches. With the f option, we specify a file containing the patterns being sought. For example, we might create a file called criteria containing the following three regular expressions:

!!!ODT.*16[0-9][0-9]/
!!!AIN.*organ
\*M6/8

We would invoke grep as follows:

grep -l -f criteria *

The f option tells grep to fetch the file criteria and use the records in this file as regular expressions. A match is made if any of the regular expressions is found.
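An OR-search with a pattern file can be sketched as follows. The score files are hypothetical. As a precaution that is not part of the text, the sketch keeps the criteria file outside the directory being searched, so that * does not cause grep to search the pattern file itself.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/scores"
# One regular expression per line.
printf '%s\n' '!!!ODT.*16[0-9][0-9]/' '!!!AIN.*organ' '\*M6/8' > "$tmp/criteria"

cd "$tmp/scores" || exit 1
printf '%s\n' '!!!ODT: 1680/03/02' '!!!AIN: organ' '**kern' '*M4/4' '4c' '*-' > q1.krn
printf '%s\n' '!!!ODT: 1800/01/01' '!!!AIN: piano' '**kern' '*M4/4' '4d' '*-' > q2.krn
printf '%s\n' '!!!ODT: 1750/06/06' '!!!AIN: piano' '**kern' '*M6/8' '8e' '*-' > q3.krn

# A file is listed if ANY of the three expressions matches some record.
any_criterion=$(grep -l -f ../criteria *)

echo "$any_criterion"
cd - >/dev/null && rm -rf "$tmp"
```

Here q1.krn qualifies by date and instrument, q3.krn by meter alone, and q2.krn satisfies none of the criteria.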

The output would consist of a list of all files in the current directory that encode works composed in the 17th century OR written for organ OR in 6/8 meter. The f option is more typically used to specify several variations of the same idea. For example, suppose we were searching for D major triads in pitch data. We could use a file containing the following regular expressions:

[Dd].*[Ff]#.*[Aa]
[Dd].*[Aa].*[Ff]#
[Ff]#.*[Aa].*[Dd]
[Ff]#.*[Dd].*[Aa]
[Aa].*[Dd].*[Ff]#
[Aa].*[Ff]#.*[Dd]

Depending on the application, it may be easier to construct such pattern files than to use a lengthy pipeline. That is:

grep -f Dmajor *

may be less cumbersome than:

grep '[Dd]' * | grep '[Ff]#' | grep '[Aa]'

The f option can be combined with L. For example, suppose we wanted to identify all works in the current directory that are not in the keys of C major, G major, B-flat major or D minor. Our regular expression file would contain the following regular expressions:

^\*[CGd]:
^\*B-:

The corresponding command would be:

grep -L -f criteria *
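Combining f with L can be tried on five one-key files. The file names and the directory layout are assumptions made for this sketch; the pattern file is again kept outside the searched directory so that it is not itself listed.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/scores"
printf '%s\n' '^\*[CGd]:' '^\*B-:' > "$tmp/criteria"

cd "$tmp/scores" || exit 1
printf '%s\n' '**kern' '*C:'  '4c' '*-' > k1.krn   # C major
printf '%s\n' '**kern' '*d:'  '4d' '*-' > k2.krn   # D minor
printf '%s\n' '**kern' '*B-:' '4B-' '*-' > k3.krn  # B-flat major
printf '%s\n' '**kern' '*a:'  '4a' '*-' > k4.krn   # A minor
printf '%s\n' '**kern' '*D:'  '4d' '*-' > k5.krn   # D major

# List the files in which NONE of the key patterns matches any record.
other_keys=$(grep -L -f ../criteria *)

echo "$other_keys"
cd - >/dev/null && rm -rf "$tmp"
```

Since the patterns are case-sensitive, D major (*D:) is correctly kept apart from D minor (*d:).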

Another way of thinking of the f option is that it allows us to define equivalences. Consider, for example, the task of counting all of the notes in a kern melody that belong to a particular whole-tone pitch set. Let’s create two files, one called whole1 and the other called whole2. The file whole1 might contain the following regular expressions:

[Cc]([^-#Cc]|$)
[Dd]([^-#Dd]|$)
[Ee]([^-#Ee]|$)
[Ff]#([^#]|$)
[Gg]-([^-]|$)
[Gg]#([^#]|$)
[Aa]-([^-]|$)
[Aa]#([^#]|$)
[Bb]-([^-]|$)

Notice that the regular expressions have been carefully defined. The first regular expression defines a pattern consisting of either an upper- or lower-case letter C followed either by a character that is neither a sharp #, a flat -, nor another letter C, or by the end of the line $.

Recall that parenthesis grouping (…) is part of the extended regular expression syntax. Therefore, we should use the egrep command (equivalently, grep -E) rather than plain grep with the above expressions. We can count the number of notes in a monophonic kern input that belong to this whole-tone set:

egrep -c -f whole1 debussy

If the file whole2 contains regular expressions for the complementary pitch set, we could similarly count the number of pitches that belong to this alternative set:

egrep -c -f whole2 debussy
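The whole-tone count can be sketched on a short, made-up monophonic fragment; the file name debussy is reused here with invented contents. The sketch uses grep -E in place of egrep. It also adds one step that is not in the text: interpretation, comment and barline records are removed first, since a record such as **kern contains the letter e and would otherwise be counted as a note.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
# One extended regular expression per line.
printf '%s\n' '[Cc]([^-#Cc]|$)' '[Dd]([^-#Dd]|$)' '[Ee]([^-#Ee]|$)' \
    '[Ff]#([^#]|$)' '[Gg]-([^-]|$)' '[Gg]#([^#]|$)' \
    '[Aa]-([^-]|$)' '[Aa]#([^#]|$)' '[Bb]-([^-]|$)' > whole1

# Made-up melody: seven notes in the set; f, g and c# lie outside it.
printf '%s\n' '**kern' '4c' '4d' '4e' '4f#' '4g#' '4b-' \
    '4f' '4g' '4cc' '4c#' '*-' > debussy

# Count on the raw file: the **kern record is (wrongly) included.
raw=$(grep -E -c -f whole1 debussy)

# Count on data records only.
notes=$(grep -v '^[*!=]' debussy | grep -E -c -f whole1)

echo "raw count: $raw, notes in the whole-tone set: $notes"
cd - >/dev/null && rm -rf "$tmp"
```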

Reprise

The grep command is usually thought of as a way to find particular patterns in a file or input stream. However, the various options for grep (such as v, l, and L) allow grep to be used for other purposes. It can be used to isolate data, to count occurrences of patterns, to eliminate unwanted lines, to identify files for processing, and to avoid files that contain certain information.

We have seen how the xargs command can be used to carry out AND-searches where each work must conform to multiple criteria. We have also seen how the f option for grep can be used to permit OR-searches where a work needs to conform only to one of a set of possible criteria.

Although this chapter has focussed principally on the grep command, the ensuing chapters will show that regular expressions are used by a wide variety of commands. In Chapter 33, many more powerful examples will be discussed in conjunction with the find command.