This algorithm attempts to classify strings by their editing distance. It first computes the case-insensitive editing distances between all input strings in O(n²ab), where a and b the maximum string lengths. Normalized editing distances are stored in a square matrix, M. Normalized, in this context, means that the editing distance is divided by the length of the longer of the two strings. The value Mi,j = d means that the normalized editing distance between elements i and j is equal to d. The diagonal of this matrix should consist of all zeros.
Next, the algorithm sorts the columns of this distance matrix by the simple invariant that the element immediately to the right of the diagonal is the least element on that row right of the diagonal. Symbolically, the invariant states that for matrix M of size m×n, Mi,i+1 ≤ Mi,j for all i+1 < j ≤ n. This "sorting" is basically a selection sort and occurs in Θ(n²).
Finally, the algorithm assigns clusters by the normalized editing
distance.
This is a simple comparison.
The threshold for differentiating clusters configurable by the user.
If Mi,i+1 > threshold
(more
different) then the elements i and i+1 are classified in
different clusters.
My goal with all of this is to automatically classify Syslog messages. This application looks promising! Try pasting the below test inputs into the text area above to see what interesting clusters you can come up with.