Chapter 14

Stream Editing


Most computer users are familiar with editing an electronic document using an interactive word-processor or text editor. Stream editors are non-interactive editors that automatically process a given input according to a user-specified set of editing instructions. A stream editor can be used, for example, to automatically transform a document from British spelling to American spelling. Stream editors are especially useful when processing large numbers of documents — such as a series of files encoding some musical repertory. In this chapter we will introduce two stream editors: sed and humsed.

The sed and humsed Commands

The humsed command is simply a Humdrum version of the UNIX sed stream editor. The syntax and operation of sed and humsed are virtually identical. However, humsed will modify only Humdrum data records, whereas sed will modify any type of record, including Humdrum comments and interpretations. Both stream editors provide operations for substitution, insertion, deletion, transliteration, file-read and file-write. When used in combination, these operations can completely transform an input stream or document.

Simple Substitutions

The most frequently used stream-editing operation is substitution. Both humsed and sed designate substitutions by the lower-case letter s. Substitutions require two strings: the target string to be replaced, and the replacement string to be introduced. The syntax for substitutions is as follows:

s<delimiter><target string><delimiter><replacement string><delimiter><options>

No spaces are permitted between these elements. The delimiter can be almost any character (a backslash or a newline cannot be used); however, the same delimiter character must be used throughout the operation. The following substitution command causes occurrences of the letter A to be replaced by the letter B:

s/A/B/

Since the slash character / appears immediately following the s, it becomes the delimiter for the rest of the operation. In this case no option has been given at the end of the substitution. Since the delimiter can be any character, the above command is functionally identical to the following:

sxAxBx

If it is necessary to use the delimiter character (as a literal) within either the target string or the replacement string, it can be escaped by preceding it with a backslash.
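The equivalence of delimiters, and the escaping of a literal delimiter, can be tried out with a short sketch. Plain sed is used here on the assumption that it behaves like humsed for these one-line inputs; the sample strings and variable names are invented for illustration.

```shell
# Slash and x as delimiters give the same result.
out_slash=$(printf 'CAB\n' | sed 's/A/B/')
out_x=$(printf 'CAB\n' | sed 'sxAxBx')
echo "$out_slash $out_x"

# A literal slash in the target must be escaped when slash is the delimiter...
out_escaped=$(printf 'a/b\n' | sed 's/\//-/')
# ...or a different delimiter can be chosen so that no escaping is needed.
out_percent=$(printf 'a/b\n' | sed 's%/%-%')
echo "$out_escaped $out_percent"
```

Choosing a delimiter that does not occur in either string is usually easier to read than escaping.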

There are two ways to execute a substitution operation such as the one given above. One way is to give the substitution as a command-line argument to sed or humsed:

humsed s%A%B% filename

Alternatively, the operation can be placed in a file (for example, named revise.txt):

s%A%B%

Then the stream editor can be invoked to execute the operations contained in this file using the -f option:

humsed -f revise.txt inputfile

By default the output will be displayed on the screen. Using file-redirection > the output can be placed in some other file. Note that you should never redirect the output to the same file as the input — this will destroy the original input file. If necessary, send the output to a temporary file, and then use the UNIX mv command to rename the output.
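The temporary-file approach described above might look like the following sketch. Plain sed stands in for humsed, and the scratch directory, file names and file contents are invented for illustration.

```shell
# Work in a scratch directory so that nothing real is overwritten.
dir=$(mktemp -d)
printf 'A\nCAB\n' > "$dir/inputfile"

# Write the edited stream to a temporary file first, and rename it
# over the original only if the stream editor succeeded.
sed 's%A%B%' "$dir/inputfile" > "$dir/inputfile.tmp" \
    && mv "$dir/inputfile.tmp" "$dir/inputfile"

result=$(cat "$dir/inputfile")
echo "$result"
rm -r "$dir"
```

Chaining mv with && means a failed edit leaves the original file untouched.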

Suppose that you had encoded a musical work in the kern representation. Having finished the encoding, you realize that what you thought were pizzicato marks are really spiccato marks. In the kern representation, pizzicatos are indicated by the double quote " whereas spiccatos are represented by the lower-case letter s. We can change all pizzicato marks to spiccato marks using the following command:

humsed 's/"/s/g' inputfile

Since the double quote is interpreted as a special character by the UNIX shell, we have escaped the entire substitution operation by placing it in single quotes. (Alternatively, we could place a backslash immediately before the double-quote character.) Note also the presence of the g option at the end of the string. Permissible options include any positive integer or the letter g. Without any option, the sed and humsed substitute (s) operation will replace only the first occurrence of the string in each data record. The g option specifies a “global” substitution, in that all occurrences on a given data record are replaced. If the option consisted of the number `3', then only the third instance of the target string would be replaced on each line.
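The effect of the different options can be compared on a single line. This is a sketch using plain sed (assumed to treat the options as humsed does); the sample line is invented.

```shell
line='c" d" e" f"'

# No option: only the first double quote on the line is replaced.
first=$(printf '%s\n' "$line" | sed 's/"/s/')
# Numeric option: only the third double quote is replaced.
third=$(printf '%s\n' "$line" | sed 's/"/s/3')
# g option: every double quote on the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/"/s/g')

printf '%s\n' "$first" "$third" "$all"
```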

Selective Elimination of Data

The target string in substitution operations is actually a regular expression. This means that we can specify patterns using the full power of regular expression syntax. Frequently, it is useful to eliminate certain kinds of information from a file. For example, we can eliminate all sharps and flats from a kern-format file as follows:

humsed 's/[#-]//g' inputfile

Suppose we wanted to eliminate all beaming information in a score. In the kern representation, open and closed beams are represented by L and J respectively; partial beams are represented by K and k.

humsed 's/[JLkK]//g' inputfile

Alternatively, we might want to eliminate all data except for the beaming information:

humsed 's/[^JLkK]//g' inputfile

Sometimes we need to restrict the circumstances where the data are eliminated. For example, we might want to eliminate all measure numbers. Eliminating all numbers from a kern file will have the undesirable consequence of eliminating all note durations as well. Most humsed operations can be preceded by a regular expression delineated by slashes. This tells humsed to execute this substitution only if the data record matches the leading regular expression. For example, the following command eliminates measure numbers but not note durations:

humsed '/^=/sX[0-9]*XXg' inputfile

The operation may be interpreted as follows: look for lines that match a pattern where the first character in the line is an equals sign; if you find this pattern look for zero or more instances of any number between zero and nine, and replace that by an empty string; do this substitution for all numbers on the current data record.

Incidentally, Humdrum provides a num command that can be used to insert numbers in data records. The num command supports an elaborate set of options, but is not used often, so we won’t describe it here. The following command renumbers all of the barlines in an input so that the first measure begins with the number 72. (Refer to the Humdrum Reference Manual for details regarding num.)

humsed '/^=/sX=[0-9]*X=Xg' inputfile | num -n ^= -x == -p = -o 72

Suppose we wanted to eliminate all octave numbers from a pitch representation. In this case we want to delete all numbers except when they occur in conjunction with a barline. Our substitution should occur only when the current record does not match a leading equals sign:

humsed '/^[^=]/s%[0-9]%%g' inputfile
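The two address-restricted substitutions can be tried on a small sample. This is a sketch using plain sed, which accepts the same leading regular expression; the four sample records (two barlines and two pitch tokens) are invented.

```shell
sample=$(printf '=12\nC4\nD#5\n=13\n')

# Only records beginning with = lose their digits (measure numbers).
no_bar_numbers=$(printf '%s\n' "$sample" | sed '/^=/s/[0-9]*//g')
# Only records NOT beginning with = lose their digits (octave numbers).
no_octaves=$(printf '%s\n' "$sample" | sed '/^[^=]/s%[0-9]%%g')

printf '%s\n---\n%s\n' "$no_bar_numbers" "$no_octaves"
```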

Suppose we wanted to determine which of two MIDI performances exhibits more dynamic range; that is, which performance has a greater variability in key-down velocities. Recall from Chapter 7 that MIDI data tokens consist of three elements separated by slashes /. The third element is the key velocity. First, we want to eliminate key-up data tokens. These tokens can be distinguished by the minus sign associated with the second data element. An appropriate substitution is:

s%[0-9][0-9]*/-[0-9][0-9]*/[0-9]* *%%g

(That is, replace by nothing any data that matches the following: a numerical digit followed by zero or more digits, followed by a slash, followed by a minus sign, followed by a digit, followed by zero or more digits, followed by a slash, followed by zero or more digits, followed by zero or more spaces.)

Having isolated only the key-down data tokens, we now need to eliminate everything but the third data element, the MIDI key-down velocities:

s%[0-9][0-9]*/[0-9][0-9]*/%%g
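Applied in sequence, the two substitutions reduce each record to its key-down velocities. The following sketch uses plain sed on three invented records (a key-down, a key-up, and a record holding both); the token values are illustrative only.

```shell
velocities=$(printf '72/60/64\n72/-60/64\n72/-60/64 72/62/80\n' | sed \
    -e 's%[0-9][0-9]*/-[0-9][0-9]*/[0-9]* *%%g' \
    -e 's%[0-9][0-9]*/[0-9][0-9]*/%%g')

# The record that held only a key-up token is now an empty line.
printf '%s\n' "$velocities"
```

Records that contained only key-up tokens end up empty, so some later clean-up of blank lines may be needed before further processing.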

The stats Command

We can determine the range or variance of these velocity values by piping the output to the stats command. The stats command calculates basic statistical information for any input consisting of a column of numbers. A sample output from stats might appear as follows:

n:	124
total:	5700
mean:	45.9677
min:	9
max:	102
S.D.:	232.37

The value n indicates the total number of numerical values found in the input; the total specifies the sum of these numbers; the mean identifies the average; the min and max report the minimum and maximum values encountered, and the S.D. represents the standard deviation. The standard deviation provides a useful way of characterizing which performance has greater variability in key-down velocities.
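To make the reported quantities concrete, the same kind of summary can be sketched with awk. This is not the stats command itself. The chapter does not say whether stats uses the population or the sample formula for the standard deviation, so the population form is assumed here, and the eight input values are invented.

```shell
summary=$(printf '2\n4\n4\n4\n5\n5\n7\n9\n' | awk '
    NR == 1 { min = $1; max = $1 }
    {
        n++; total += $1; sumsq += $1 * $1
        if ($1 < min) min = $1
        if ($1 > max) max = $1
    }
    END {
        mean = total / n
        # Population standard deviation (an assumption, see above).
        sd = sqrt(sumsq / n - mean * mean)
        printf "n:\t%d\ntotal:\t%g\nmean:\t%g\n", n, total, mean
        printf "min:\t%g\nmax:\t%g\nS.D.:\t%g\n", min, max, sd
    }')
printf '%s\n' "$summary"
```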

Assuming that the above two stream-editing substitutions are kept in a file called revise we can compare the dynamic range for the two performances as follows:

extract -i '**MIDI' perform1 | grep -v ^= | humsed -f revise \
   | rid -GLId | stats
extract -i '**MIDI' perform2 | grep -v ^= | humsed -f revise \
   | rid -GLId | stats

The extract command has been added to ensure that we only process MIDI data; the grep command ensures that possible barlines are eliminated, and the rid command eliminates comments and interpretations prior to passing the data to the stats command.

Eliminate Everything But …

A common use for humsed is to eliminate signifiers that are not of interest. Stream editors like sed and humsed can be used to dramatically simplify a representation.

Did Monteverdi use equivalent numbers of sharps and flats? Or did he favor one accidental over the other? A simple way to determine this is to throw away everything but the sharps and flats. We can generate an inventory of just sharps and flats:

humsed 's/[^#-]//g' montev* | rid -GLId | sort | uniq -c

In some tasks, we might wish to transform a kern-format file so that only pitch-related information is preserved:

humsed 's/[^a-gA-G#-]//g' inputfile

In extreme cases, we may wish to eliminate all Humdrum data from an input. The following command replaces all data tokens by null tokens:

humsed 's/[^	][^	]*/./g' inputfile

(That is, globally replace every string consisting of a not-a-tab character followed by zero or more not-a-tab characters with a single period. The character following the caret inside each pair of brackets is a single literal tab.) This sort of command can be useful in generating a file that maintains the structure but not the content of some document. Incidentally, neither sed (in its default mode) nor humsed supports extended regular expressions, so we are not able to use the + metacharacter in the above substitution.
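Because a literal tab is easy to lose when a command is copied, one option is to build the pattern from a shell variable. The sketch below uses plain sed on two invented two-spine records; the variable name tab is an arbitrary choice.

```shell
# Hold a single literal tab character in a variable.
tab=$(printf '\t')

# Every run of non-tab characters becomes a null token (.).
nulls=$(printf '4c\t1\n8d\t2\n' | sed "s/[^$tab][^$tab]*/./g")
printf '%s\n' "$nulls"
```

The script is in double quotes here so that the shell expands the variable before sed sees it.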

Deleting Data Records

Sometimes it is useful to delete entire data records rather than simply eliminating certain kinds of information. The d operation causes lines to be deleted. Normally, it is preceded by a regular expression that identifies which records should be eliminated. Deleting barlines can be done using the following command:

humsed /^=/d inputfile

Note that this is functionally equivalent to:

grep -v ^= inputfile

In the general case, humsed /…/d is preferable to grep -v. Remember that humsed only manipulates Humdrum data records; it never touches comments or interpretations. The grep command has no such restriction. Consider, for example, the following command to eliminate grace notes (acciaccaturas) from a kern-format file.

humsed '/q/d' inputfile

By contrast, the command:

grep -v q inputfile

would also eliminate any comments or interpretation records containing the letter q.
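The difference can be seen on a small sample. Since humsed may not be installed everywhere, the sketch below approximates its behavior in plain sed by skipping records that begin with ! or * (comments and interpretations). That address is an approximation introduced here, not humsed's actual implementation, and the sample file is invented.

```shell
sample=$(printf '!! quick comment\n**kern\n4c\n8qd\n4e\n*-\n')

# grep -v removes every line containing q, including the comment.
with_grep=$(printf '%s\n' "$sample" | grep -v q)

# The delete is applied only to lines that do not start with ! or *.
with_sed=$(printf '%s\n' "$sample" | sed '/^[!*]/!{/q/d;}')

printf '%s\n---\n%s\n' "$with_grep" "$with_sed"
```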

Suppose that we wanted to know whether a melody still evokes a certain key perception even if we eliminate all the tonic pitches. First we translate the representation to scale degree and assemble this file with the original kern representation for the melody.

deg inputfile > temp
assemble inputfile temp | humsed '/1$/d' | midi | perform

Of course deleting all of the tonic notes will disrupt the original rhythm. An alternative is to replace all tonic pitches by rests:

deg inputfile > temp
assemble inputfile temp | humsed '/1$/s%[A-Ga-g#-][A-Ga-g#-]*%r%' | midi \
   | perform

Perhaps we might want to eliminate all the pitch information, and simply listen to the rhythmic structure of a work. That is, we might change all of the pitches in a work to a single pitch — in the following case, middle C:

humsed 's/[A-Ga-g#-][A-Ga-g#-]*/c/' inputfile | midi | perform

Adding Information

The substitute command can also be used to add information to points in a Humdrum input. For example, we might wish to add an explicit breath-mark (,) to the end of each phrase in a kern-format input:

humsed s/}/},/g inputfile

In the replacement string of a substitution, the ampersand (&) means “the matched string”; this is a standard stream-editing convention. Suppose we want to add a tenuto mark to every quarter-note in a work. The following substitution seeks the number 4 followed by any character that is not a digit or period. The matched string is replaced by itself (&) followed by a tilde (~), the kern signifier for a tenuto mark. The operation is placed in single quotes because the shell would otherwise treat the ampersand as a special character:

humsed 's/4[^0-9.]/&~/g' inputfile
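The behavior of the ampersand can be checked on a few tokens. This is a sketch using plain sed; the five sample tokens are invented, and the last one is included to show a limitation of the pattern.

```shell
tenuto=$(printf '4c\n8d\n4.e\n16f\n24g\n' | sed 's/4[^0-9.]/&~/g')
printf '%s\n' "$tenuto"
```

The dotted quarter (4.e) and the sixteenth (16f) are left alone, but 24g also receives a tilde, because the pattern does not check what precedes the 4. Depending on the repertory, a stricter pattern may be needed.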

Multiple Substitutions

Some tasks may require more than one substitution command. Multiple operations can be invoked by separating each operation by a semicolon. In the following example, we change all kern quarter-notes to eighth-note durations:

humsed 's/4[A-Ga-g]/8&/g; s/84/8/g' inputfile

The first substitution finds strings that match the number 4 followed by an upper- or lower-case letter from A to G. The matched string is then output preceded by the number 8. This operation will change all quarter notes to “eighty-fourth” durations; quarter rests are left untouched, since r is not in the character class. The ensuing substitution operation changes 84 to 8 and so completes the transformation.
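The two-step transformation can be traced on a handful of tokens. This is a sketch using plain sed, which accepts the same semicolon-separated operations; the sample tokens are invented.

```shell
halved=$(printf '4c\n4r\n8d\n4G\n' | sed 's/4[A-Ga-g]/8&/g; s/84/8/g')
printf '%s\n' "$halved"
```

Here 4c and 4G become 8c and 8G, while the quarter rest 4r passes through unchanged because r is not matched by [A-Ga-g].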

Switching Signifiers

In some situations, we will want to switch two or more signifiers — make all A’s B’s and all B’s A’s. These sorts of tasks require three substitutions and involve creating a unique temporary string. For example, the following command changes all kern up-bows to down-bows and vice versa.

humsed 's/u/ABC/g; s/v/u/g; s/ABC/v/g' inputfile

The first substitution changes down-bows u to the unique temporary string ABC. (In the kern representation ABC is an illegal pitch representation, so it is bound to be a unique character string.) The second substitution changes up-bows v to down-bows. The third substitution changes occurrences of the temporary string ABC to up-bows.
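The three-step swap can be verified on a few tokens. The sketch uses plain sed on invented data. It also shows the transliteration (y) operation of standard sed as a possible one-step alternative for single-character signifiers; whether that is preferable in a given humsed script is left as an assumption.

```shell
bowed=$(printf '4cu\n4dv\n4eu\n')

# Three substitutions with a unique temporary string.
swapped=$(printf '%s\n' "$bowed" | sed 's/u/ABC/g; s/v/u/g; s/ABC/v/g')
# Transliteration swaps the two characters in a single pass.
swapped_y=$(printf '%s\n' "$bowed" | sed 'y/uv/vu/')

printf '%s\n---\n%s\n' "$swapped" "$swapped_y"
```

The temporary-string approach remains necessary when the signifiers being exchanged are longer than one character.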

Executing from a File

When several instructions are involved in stream editing, it can be inconvenient to type multiple operations on the command line. It is easier to place the editing instructions in a file, and use the -f option (with either sed or humsed) to execute from the file. Consider, for example, the task of rhythmic diminution, where the durations of notes are halved. We might create a file called diminute containing the following operations:

s/[0-9][0-9]*/&XXX/g
s/64XXX/128/g
s/32XXX/64/g
s/16XXX/32/g
s/8XXX/16/g
s/4XXX/8/g
s/2XXX/4/g
s/1XXX/2/g
s/0XXX/1/g

Each substitution command is applied (in order) to every line or data record in the file. The first substitution adds the unique string XXX to every number. The ensuing substitutions transform these numbers to appropriate diminution values. We can execute these commands as follows:

humsed -f diminute inputfile
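The whole procedure can be rehearsed in a scratch directory. In this sketch plain sed stands in for humsed, the first pattern is written as [0-9][0-9]* (an unescaped asterisk), and the four input tokens are invented.

```shell
dir=$(mktemp -d)

# Store the editing instructions in a file named diminute.
cat > "$dir/diminute" <<'EOF'
s/[0-9][0-9]*/&XXX/g
s/64XXX/128/g
s/32XXX/64/g
s/16XXX/32/g
s/8XXX/16/g
s/4XXX/8/g
s/2XXX/4/g
s/1XXX/2/g
s/0XXX/1/g
EOF

# Execute the instructions from the file with the -f option.
diminished=$(printf '4c\n8d\n16e\n2f\n' | sed -f "$dir/diminute")
printf '%s\n' "$diminished"
rm -r "$dir"
```

Because the longer numbers are handled first, a freshly written value such as 8 carries no XXX marker and is not halved a second time by a later line.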

Writing to a File

A useful feature of humsed is the “write” or w operation. This operation causes a line to be written to the end of a specified file. Suppose, for example, we wanted to collect all seventh chords into a separate file called sevenths. With a harm-format input, the appropriate command would be:

humsed '/7/w sevenths' inputfile.hrm

Each line containing the number 7 will be written to a file named sevenths.

Similarly, we could copy all sonorities containing pauses to the file pauses.

humsed '/;/w pauses' inputfile
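The write operation can be tried in a scratch directory. This sketch uses plain sed on four invented chord symbols; note that the matching lines are copied to the named file while the full input still passes through to the output.

```shell
dir=$(mktemp -d)

# Lines containing 7 are written to the file; all lines are still output.
passed=$(printf 'I\nV7\nvi\nii7\n' | sed "/7/w $dir/sevenths")
sevenths=$(cat "$dir/sevenths")

printf '%s\n---\n%s\n' "$passed" "$sevenths"
rm -r "$dir"
```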

Of course there are other ways of achieving the same goal:

yank -m ';' 0 inputfile > pauses

Or even:

grep ';' inputfile | grep -v '^[!*]' > pauses

In some cases, a stream editor can be used to eliminate or modify data that will confound subsequent processing. For example, suppose we wanted to count the number of phrases that begin on the subdominant and the number of phrases that end on the subdominant. The deg command will allow us to identify subdominant pitches (via the number `4'). Since we would like to maintain the phrase indicators, we will avoid the -x option for deg. However, without the -x option, deg will pass all of the non-pitch-related signifiers, including the duration data, which is encoded as numbers. Hence, we will not be able to distinguish the subdominant (4) pitch from a kern quarter-note (4). The problem is resolved by first eliminating all of the duration information (numbers) from the original input:

humsed 's/[0-9.]//g' input.krn | deg | egrep -c '({.*4)|(4.*{)'
humsed 's/[0-9.]//g' input.krn | deg | egrep -c '(}.*4)|(4.*})'

As another example, suppose we want to identify the number of notes per syllable in the text of a vocal work:

extract -i '**kern' inputfile | humsed 's/X//g' > tune
extract -i '**silbe' inputfile | humsed 's/[a-zA-Z][a-zA-Z]*/X/' > lyrics
assemble tune lyrics | cleave -i '**kern,**silbe' -o '**new' \
   > combined
context -b X -o '[r=]' combined | rid -GLId | awk '{print NF}'

Similarly, we can identify the number of notes per word rather than per syllable:

extract -i '**kern' inputfile > tune
extract -i '**silbe' inputfile | humsed 's/^[^-].*[^-]$/BEGIN_END/; s/-.*[^-]$/END/; s/^[^-].*-/BEGIN/' > lyrics
assemble tune lyrics | cleave -i '**kern,**silbe' -o '**new' \
     > combined
context -b BEGIN -e END -o '[r=]' combined | rid -GLId \
     | awk '{print NF}'

Reading a File as Input

Another useful feature is the humsed “read” or r operation. Whenever a leading regular expression is matched, a file is read in at that point. Suppose, for example, that we want to annotate a file with Humdrum comments identifying the presence of cadential 6-4 chords. First, we might create a file — comment.6-4 — containing the following Humdrum comment:

!! A likely cadential 6-4 progression.

We can use the Humdrum pattern command (to be described in Chapter 21), as follows:

File template:

  .*
Ic
^\.  *
=    *
V[^I]

Command:

pattern -f template inputfile > output
humsed '/cadential-64/r comment.6-4' output > commented.output
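The read operation itself can be tried without the pattern command. In the sketch below plain sed stands in for humsed, and the three input records, including the marker string cadential-64 on the second record, are invented rather than actual pattern output.

```shell
dir=$(mktemp -d)
printf '!! A likely cadential 6-4 progression.\n' > "$dir/comment.6-4"

# The file is read in immediately after each record matching the address.
annotated=$(printf 'I\nIc\tcadential-64\nV\n' \
    | sed "/cadential-64/r $dir/comment.6-4")

printf '%s\n' "$annotated"
rm -r "$dir"
```

The contents of the file appear after the matching record, not before it.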

Reprise

The sed and humsed commands provide stream editors that can automatically edit a data stream. We’ve seen that multiple operations can be carried out, either from the command line or from a file containing editing instructions. It should be noted that the sed and humsed commands provide many more editing facilities than those discussed in this chapter. Some 25 operations are provided by sed and humsed. For example, segments of text can be stored in various buffers, the contents of these buffers modified, and the results placed anywhere in the output text. Markers can be set at particular points and conditional branch statements executed. Stream-editing scripts have been written to execute programs of considerable complexity. However, for most tasks, the simple substitute (s) and delete (d) operations are the most useful. For further information about stream editing using sed, refer to the book on sed and awk written by Dale Dougherty (listed in the bibliography).