infot -b [-H] [-x regexp] [inputfile ...]
infot -n [-H] [-x regexp] [inputfile ...]
infot -p [-H] [-x regexp] [inputfile ...]
infot -s [-x regexp] [inputfile ...]
In conjunction with other Humdrum tools (notably the context and humsed commands), infot permits sophisticated information-theoretic analyses to be carried out, including calculations of information flow, short-term conditional probabilities, and longer-term m-dependency analyses. Alternatively, a simple set of summary statistics can be requested. In most cases, users will want to use infot to generate outputs that are suitable for further processing.
Input to infot is restricted to a single spine. However, the input data tokens may contain multiple-stops representing complex contextual information (such as produced by the context command).
For the entire input, infot tabulates the total number of occurrences of each unique data record (hereafter referred to as `states'). For the -n, -p and -b options, infot outputs a two-column list where the left column indentifies each unique state and the right column provides one of several corresponding calculated measures. With the -n option, this measure is merely an integer count of the number of occurrences of each corresponding state. With the -p option, this measure is a probability of occurrence for each state. With the -b option, this measure identifies the information content for the corresponding state in bits.
Information content (H) in bits is calculated according to the classic equation devised by Shannon and Weaver (see REFERENCES):
.EQ delim $$ .EN .EQ H~ =~ sum from i=1 to N ~p sub i ~log sub 2 ~1 over { p sub i } .EN
where $H$ is the average information (in bits), $N$ is the number of possible unique states in the repertoire, and $p sub i$ is the probability of occurrence of state $i$ from the repertoire. .EQ delim off .EN
Note that the outputs produced by infot do not conform to the Humdrum syntax.
Options are specified in the command line.
-b output information (in bits) for each unique data token -h displays a help screen summarizing the command syntax -H format output as humsed commands -n output frequency count for each unique data token -p output probability value for each unique data token -s output information-related summary statistics -x regexp exclude tokens matching regexp from calculations
With the -n option, infot outputs a two-column list where the left column indentifies each unique state present in the input, and the right column provides an integer count indicating the number of occurrences for the corresponding state.
With the -p option, infot outputs a two-column list where probabilities of occurrence are output in the right-hand column, rather than counts.
With the -b option, infot outputs the information (in bits) as calculated according to the Shannon-Weaver equation.
**foo
A
B2
C-c
A
B2
A
A
B2
C-c
A
A
X Y
*-
A simple command invocation would use the -n option to count
the number of occurrences of each unique data token (or state):
infot -n input
The corresponding output is:
The tallies indicate that state `A' occurs 6 times,
and that the least common state (`X Y') occurs just once.
If we had invoked the -p option, the counts would
be replaced by probabilities.
The command:
A 6 B2 3 C-c 2 X Y 1
infot -p input
produces the following output:
Alternatively, the -b option:
A 0.500 B2 0.250 C-c 0.167 X Y 0.083
infot -b input
would output information measures for each state, in bits:
In the case of the -s option, summary statistics would be output,
rather than a two-column list.
For the above input, the following summary statistics would be generated:
A 1.000 B2 2.000 C-c 2.585 X Y 3.585
The first line of output merely indicates the number of unique states
found in the input (in this case just 4).
The fifth output line indicates the average information conveyed
per state (in bits).
The fourth output line indicates the theoretical maximum average information
per state that could be communicated by a system having four states.
The third line indicates the maximum possible information that
could be communicated in a message of the same length as the input
-- given the theoretical maximum average information.
(Since there are 12 data records, this value is simply 12 x 2 bits,
or 24 bits.)
The second output line gives the actual total information for the
given input message.
(This is always less-than, or equal-to the maximum theoretical value.)
The final line indicates the amount of redundancy -- as a percentage.
That is, this value contrasts the actual information conveyed with
the theoretical maximum.
Total number of unique states in message: 4 Total information of message (in bits): 20.7549 Total possible information for message: 24 Info per state for equi-prob distrib: 2 Average information conveyed per state: 1.72957 Percent redundancy evident in message: 13.5213
In general, note that as the probabilities of the input states approach equivalence, the redundancy approaches zero and the average information content approaches the theoretical maximum.
Consider now an example where a large number of messages from a repertoire
(dubbed repertoire
) is passed to infot:
infot -b repertoire
Suppose that the following output is produced:
This result indicates that, although there might have been hundreds
of data tokens processed in the repertoire, only five different
unique states were present.
The greatest information content (lowest probability)
is associated with the state
ABC 3.124 BAC 1.306 C C D 1.950 X 5.075 XYZ 19.334 XYZ
(19.334 bits),
whereas the lowest information content (highest probability)
is associated with the state BAC
(1.306 bits).
Notice that the multiple-stop C C D
is treated as a single state.
Now imagine we had another message presumed to belong to the same
repertoire as our input.
We would like to trace how the information increases and decreases
over the course of this new `message'.
This goal involves a two-part operation.
First, we re-invoke
infot
adding the
-H
option, and redirect the output to a file replace
:
infot -bH repertoire > replace
This causes
infot
to produce as output a set of
humsed
commands.
Given the identical repertoire
input, the following
output is sent to the file replace
:
s/^ABC$/3.124/g; s/^ABC /3.124/g; s/ ABC$/3.124/g; s/ ABC /3.124/g
s/^BAC$/1.306/g; s/^BAC /1.306/g; s/ BAC$/1.306/g; s/ BAC /1.306/g
s/^C C D$/1.95/g; s/^C C D /1.95/g; s/ C C D$/1.95/g; s/ C C D /1.95/g
s/^X$/5.075/g; s/^X /5.075/g; s/ X$/5.075/g; s/ X /5.075/g
s/^XYZ$/19.334/g; s/^XYZ /19.334/g; s/ XYZ$/19.334/g; s/ XYZ /19.334/g
Although these commands may appear somewhat cryptic, they merely instruct the Humdrum stream editor (humsed) to replace all occurrences of the five data tokens (in any input file) by the corresponding numerical values -- in this case, values that represent the number of bits of information.
The following file (called input
) contains the
message of interest:
This file can be transformed so that the data tokens are replaced
by corresponding information values as determined from the
original repertoire.
This is done by invoking the
humsed
command, and providing it with the substitution commands held in the
file
**bar BAC BAC C C D . = * C C D XYZ X ABC BAC *- replace
:
humsed -f replace input > output
The resulting output file
would be as follows:
Notice that input data tokens which do
not appear in the probability list (such as the equals-signs)
remain unmodified.
**bar 1.306 1.306 1.950 . = * 1.950 19.334 5.075 3.124 1.306 *-
Several interpretations may be made about this message. For example, the above passage appears to show a pattern of initially low information that increases and then decreases toward the end of the passage. This suggests that the beginning and ending of this passage are more highly constrained or stereotypic than the middle part of the passage.
Summing together the individual information values for this passage, the total message conveys 35.35 bits. For five states, the maximum average information is 2.322 bits per state, and so the expected maximum for a message consisting of 8 items would be 8 x 2.322 or 18.58 bits. This suggests that this message is considerably less banal, (less redundant or more unique) than a typical message from the original repertoire. In particular, the occurrence of the state `XYZ' has a low probability of occurrence -- and is likely to be a distinctive feature of this passage.
In the above examples, only simple (zeroth-order) probabilities have been examined. Higher-order and m-dependency probabilities may be measured by reformulating the input using the context command.
Moles, A. Information Theory and Esthetic Perception, Urbana: University of Illinois Press, 1968.
Pinkerton, R.C. "Information theory and melody." Scientific American, Vol. 194 (1956) pp. 77-86.
Snyder, J.L. "Entropy as a measure of musical style: The influence of a priori assumptions." Music Theory Spectrum, Vol. 12, No. 1 (1990) pp. 121-160.
Wong, A. K. C., & Ghahraman, D. A statistical analysis of interdependence in character sequences. Information Sciences, Vol. 8 (1975) pp. 173-188.
Youngblood, J.E. "Style as information." Journal of Music Theory, Vol. 7 (1962) pp. 137-162.