Peter Blair

FBL n-gram analyzer

The ComplaintGroup application is a simple parser of ARF compliant FBL complaints, which normalizes the email complaints and generates a 6-tuple n-gram version of the message. These n-grams are stored in a Redis database, keyed by the file in which they can be found. An inverse index also exists that allow you to find all messages containing a particular n-gram word.

Upon processing all messages into the Redis backend, a word comparison application is run, which pulls the set of all n-grams for a given filename, retrieves all files in which each n-gram exists, then executes a set intersection operation on both files, where the set is the n-grams contained within the file.

The cardinality of the resultant set is divided by the sum of the cardinality of the two compared sets, with the result divided by 2. This gives us a rough percentage of how similar are these two files

A Running Example

The application requires a list of ARF messages. This will be the set of complaint files:
\(C = \textit{The set of all complaints}\)

This set \(C\) is provided by a filename to the process.sh shell script. It will translate the filename string into a unique identity integer for subsequent reference. The filename will be normalized, with each regular word being tokenized and mapped into unique identity integers. The resulting string of normalized integers is saved to the normalized directory, under the filename of the file’s unique ID.

Once all files have been normalized, the process.sh script will process the normalized files, generating a 6 word n-gram, sliding down the entire normalized file window, one word at a time. Each “:” separated n-gram word is stored in a flat file, keyed by the name of the normalized file in which it is contained.

A subsequent script passes over this flat file, inserting each n-gram word into the redis database, referencing which normalized file in which it is contained, and referencing all normalized file IDs containing that particular n-gram word. This is recorded within a set as it allows us to use Redis’ set intersection functions.

Once all files have been normalized, and all normalized files tokenized into 6 word n-grams, and all n-grams recorded, we are able to evaluate the similarity of any two given files.

\(F = \textit{Set of all filenames}\)
\(F_g = \textit{Set of all filenames containing the n-gram g}\)
\(N = \textit{Set of all n-grams}\)
\(N_f = \textit{Set of all n-gram contained within the fileid f}\)

for normalized-file NF:
  for ngram in Nf:
    G = Set of all normalized files containing the n-gram
    for target-file in G:
      Nt = Set of all ngrams within target file
      # get the intersect between Nf and Nt (Nt = Set of ngrams within target file)
      c = | Nf intersect Nt |

      percent = c / | Nf | + | Nt | / 2

Or,
\(percent = \frac{\frac{\left|{N_f \cap N_t}\right|}{\left|{N_f}\right| + \left|{N_t}\right|}}{2}\)

Please, browse the code here, and all pull requests are welcome!

Caveats:

  • You need to have working ARF parsing libs installed for Perl. There’s a line in normalize_email.pl which needs to be updated at the top.
  • The commands take args on the command line, and not stdin, so use of “xargs” is required

Example:
To process a list of complaints:

pblair@slowpoke:~/complaintgroup/bin$ tail -10 ../var/outstanding.complaints | xargs ./convert_filename_to_index.pl
/data/staff/abuse/fbl/aol/2011-09-22-02/10 => 756
/data/staff/abuse/fbl/aol/2011-09-22-02/12 => 757
/data/staff/abuse/fbl/aol/2011-09-22-02/8 => 761
/data/staff/abuse/fbl/aol/2011-09-22-02/9 => 762
/data/staff/abuse/fbl/aol/2011-09-22-03/1 => 763
/data/staff/abuse/fbl/aol/2011-09-22-04/1 => 773
/data/staff/abuse/fbl/aol/2011-09-22-04/20 => 774
/data/staff/abuse/fbl/aol/2011-09-22-04/3 => 775
/data/staff/abuse/fbl/aol/2011-09-22-04/4 => 776
/data/staff/abuse/fbl/aol/2011-09-22-04/5 => 777
/data/staff/abuse/fbl/aol/2011-09-22-05/1 => 778
/data/staff/abuse/fbl/aol/2011-09-22-06/1 => 779
/data/staff/abuse/fbl/aol/2011-09-22-07/18 => 782

Great, now let’s process those integers:

pblair@slowpoke:~/complaintgroup/bin$ cat ../var/outstanding.complaints | tail -20 | xargs ./convert_filename_to_index.pl | awk '{print $NF}' | xargs ./compare_words.pl | awk '{if($NF > 60){print $0}}'
12804 / 12804 == 100.0000
12807 / 2445 == 60.9756
12807 / 2643 == 63.4146
12807 / 3096 == 81.0127
12807 / 2455 == 60.2410
12807 / 12557 == 64.1975
12807 / 12598 == 73.1707
12807 / 2309 == 74.0741
12807 / 2664 == 61.9048
12807 / 6706 == 71.6049
12807 / 5933 == 74.3590
12807 / 12801 == 67.5000
12807 / 12807 == 100.0000
12807 / 12087 == 65.0602
12807 / 3924 == 62.6506
12807 / 2946 == 70.0000
12807 / 3279 == 64.1026
12810 / 12810 == 100.0000
12813 / 12813 == 100.0000
12816 / 12816 == 100.0000



Get Adobe Flash playerPlugin by wpburn.com wordpress themes