Editing Distance Clusters

A greedy string classification algorithm

This algorithm attempts to classify strings by their editing distance. It first computes the case-insensitive editing distances between all input strings in O(n²ab), where a and b the maximum string lengths. Normalized editing distances are stored in a square matrix, M. Normalized, in this context, means that the editing distance is divided by the length of the longer of the two strings. The value Mi,j = d means that the normalized editing distance between elements i and j is equal to d. The diagonal of this matrix should consist of all zeros.

Next, the algorithm sorts the columns of this distance matrix by the simple invariant that the element immediately to the right of the diagonal is the least element on that row right of the diagonal. Symbolically, the invariant states that for matrix M of size m×n, Mi,i+1 ≤ Mi,j for all i+1 < jn. This "sorting" is basically a selection sort and occurs in Θ(n²).

Finally, the algorithm assigns clusters by the normalized editing distance. This is a simple comparison. The threshold for differentiating clusters configurable by the user. If Mi,i+1 > threshold (more different) then the elements i and i+1 are classified in different clusters.

My goal with all of this is to automatically classify Syslog messages. This application looks promising! Try pasting the below test inputs into the text area above to see what interesting clusters you can come up with.

Test inputs and recommended thresholds

"Nice" clusters (0.50)

Even nicer clusters (.40)

Low variation (.45)

High variation (.75)

Very high variation (.80)

Long input (notice Grover Cleveland) (.50)

Interface state change Syslog messages (.20)

Router adjacency change Syslog messages (.20)

Spanning tree state change Syslog messages (.20)