METHOD OF IDENTIFYING DOCUMENTS WITH SIMILAR PROPERTIES UTILIZING PRINCIPAL COMPONENT ANALYSIS
Abstract
The present invention generally provides methods and systems for characterizing texts, for example, for identifying textual documents by language, topic, author, or other attributes. In some embodiments, a method of the invention can include creating an n-gram frequency spectrum for a document under analysis, preferably selecting a subset of the n-gram frequency spectrum, transforming the n-gram frequency spectrum into principal component space, and identifying one or more attributes of the document according to its similarity to (or distinction from) reference documents in the principal component space.
30 Claims
1. A method of characterizing a text, comprising
determining frequency distribution for a plurality of n-grams in at least a segment of a text,
applying a principal component transformation to said frequency distribution to obtain a principal component vector in a principal component space corresponding to said text segment.
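By way of illustration only, the two steps of claim 1 might be sketched as follows. The sketch assumes character n-grams, relative-frequency normalization, and principal axes estimated from a handful of reference segments by power iteration with deflation; none of these choices is stated in the claim. The function names, the bigram vocabulary, and the sample segments are hypothetical.

```python
import math
import random
from collections import Counter


def ngram_frequencies(text, n, vocabulary):
    """Relative frequency of each vocabulary n-gram among the character n-grams of text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values()) or 1  # guard against text shorter than n
    return [counts[g] / total for g in vocabulary]


def fit_principal_axes(rows, k, iterations=300, seed=0):
    """Return (mean, axes) for at least two rows: column means and up to k unit axes.

    The axes are leading eigenvectors of the sample covariance, found by power
    iteration with deflation. Fewer than k axes are returned if variance runs out.
    """
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (len(rows) - 1)
            for b in range(dim)] for a in range(dim)]
    rng = random.Random(seed)
    axes = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        norm = 0.0
        for _ in range(iterations):
            w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        if norm < 1e-12:
            break  # no variance left in the remaining directions
        cov_v = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        eigenvalue = sum(x * y for x, y in zip(v, cov_v))
        axes.append(v)
        # Deflate so the next pass converges to the next-largest component.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(dim)]
               for a in range(dim)]
    return mean, axes


def project(frequencies, mean, axes):
    """Principal component vector: centered frequencies dotted with each axis."""
    centered = [f - m for f, m in zip(frequencies, mean)]
    return [sum(c * a for c, a in zip(centered, axis)) for axis in axes]


# Hypothetical reference segments and bigram vocabulary, for illustration only.
reference_segments = [
    "the weather is fine and the river is high",
    "there is nothing in the other theatre",
    "der wetter bericht und der fluss ist hoch",
    "die kinder gehen in den garten und singen",
]
vocabulary = ["th", "he", "er", "en", "de"]
rows = [ngram_frequencies(s, 2, vocabulary) for s in reference_segments]
mean, axes = fit_principal_axes(rows, k=2)
vector = project(ngram_frequencies("the other river", 2, vocabulary), mean, axes)
```

Here `vector` plays the role of the principal component vector for the text segment. A production implementation would more likely rely on a linear algebra library for the eigendecomposition.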
20. A method of comparing two textual documents, comprising
for each of at least two textual documents, determining frequency distribution for a plurality of n-grams in at least a segment of said document to generate a frequency histogram of said n-grams,
for each document, applying a principal component transformation to said frequency histogram to obtain a principal component vector, and
comparing at least an attribute of said documents based on a comparison of said principal component vectors.
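The comparison recited in claim 20 might be sketched as below. The sketch assumes the principal axes and mean histogram have already been estimated from reference texts and are simply passed in, and it uses the angle between the two principal component vectors as the comparison, which is one possible choice rather than anything the claim requires. The `mean` and `axes` values, the vocabulary, and the two sample documents are hypothetical placeholders.

```python
import math
from collections import Counter


def ngram_histogram(text, n, vocabulary):
    """Relative frequency of each vocabulary n-gram among the character n-grams of text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values()) or 1
    return [counts[g] / total for g in vocabulary]


def to_principal_space(histogram, mean, axes):
    """Center the histogram and take its coordinate along each principal axis."""
    centered = [h - m for h, m in zip(histogram, mean)]
    return [sum(c * a for c, a in zip(centered, axis)) for axis in axes]


def angle_between(u, v):
    """Angle in radians between two non-zero vectors."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("angle is undefined for a zero-length vector")
    cosine = sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
    return math.acos(max(-1.0, min(1.0, cosine)))  # clamp rounding error


def compare_documents(doc_a, doc_b, n, vocabulary, mean, axes):
    """Angle between the principal component vectors of two documents."""
    vec_a = to_principal_space(ngram_histogram(doc_a, n, vocabulary), mean, axes)
    vec_b = to_principal_space(ngram_histogram(doc_b, n, vocabulary), mean, axes)
    return angle_between(vec_a, vec_b)


# Placeholder basis: in practice mean and axes would come from a PCA fit
# on reference texts. These values are made up for illustration.
s = 1.0 / math.sqrt(2.0)
vocabulary = ["th", "de", "en"]
mean = [0.05, 0.05, 0.05]
axes = [[s, -s, 0.0], [0.0, 0.0, 1.0]]

angle = compare_documents("the other theme then", "der garten und den denken",
                          2, vocabulary, mean, axes)
```

A small angle suggests the two documents share the attribute captured by the chosen n-grams; a threshold on that angle would have to be tuned on reference data.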
25. A method of selecting a plurality of n-grams for processing a text, comprising
determining, for each of a plurality of n-gram groupings, frequency distribution for at least two reference texts, wherein one text exhibits an attribute of interest and the other lacks said attribute,
for each n-gram grouping, performing a principal component transformation on the frequency distributions of that grouping for said texts so as to generate a plurality of principal component vectors for said texts,
for each n-gram grouping, determining value of a metric based on angles between the principal component vectors associated with one of said reference texts relative to the principal component vectors associated with the other text,
rank ordering said n-gram groupings based on values of the metric corresponding thereto.
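A minimal sketch of the selection procedure of claim 25 follows. It makes several assumptions the claim does not state: each reference text is split into segments, one principal component vector is produced per segment, the principal component analysis is centered on the pooled mean of both texts, the metric is the minimum angle between any vector of one text and any vector of the other, and a larger minimum angle ranks higher. All function names, groupings, and sample segments are hypothetical.

```python
import math
import random
from collections import Counter


def frequencies(segment, grouping):
    """Relative frequencies of the grouping's n-grams (all assumed to share one length)."""
    n = len(grouping[0])
    counts = Counter(segment[i:i + n] for i in range(len(segment) - n + 1))
    total = sum(counts.values()) or 1
    return [counts[g] / total for g in grouping]


def principal_vectors(rows, k=2, iterations=300, seed=0):
    """Project each row onto up to k leading principal axes of the pooled rows."""
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (len(rows) - 1)
            for b in range(dim)] for a in range(dim)]
    rng = random.Random(seed)
    axes = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        norm = 0.0
        for _ in range(iterations):  # power iteration
            w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        if norm < 1e-12:
            break  # no variance left
        cov_v = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        eigenvalue = sum(x * y for x, y in zip(v, cov_v))
        axes.append(v)
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(dim)]
               for a in range(dim)]  # deflation
    return [[sum(c[j] * axis[j] for j in range(dim)) for axis in axes]
            for c in centered]


def min_angle(vectors_a, vectors_b):
    """Smallest angle (radians) between any vector of A and any of B; 0.0 if undefined."""
    best = None
    for u in vectors_a:
        for v in vectors_b:
            norm_u = math.sqrt(sum(x * x for x in u))
            norm_v = math.sqrt(sum(x * x for x in v))
            if norm_u == 0.0 or norm_v == 0.0:
                continue
            cosine = sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
            angle = math.acos(max(-1.0, min(1.0, cosine)))
            best = angle if best is None else min(best, angle)
    return 0.0 if best is None else best


def rank_groupings(groupings, segments_with, segments_without):
    """Return (metric, name) pairs, largest minimum angle first."""
    split = len(segments_with)
    scored = []
    for name, grouping in groupings.items():
        rows = [frequencies(s, grouping) for s in segments_with + segments_without]
        vectors = principal_vectors(rows)
        scored.append((min_angle(vectors[:split], vectors[split:]), name))
    return sorted(scored, reverse=True)


# Hypothetical data: the attribute of interest is "written in English".
groupings = {"th-he": ["th", "he"], "zz-qq": ["zz", "qq"]}
segments_with = ["the thin thing then", "that other theme there"]
segments_without = ["der die das den dem", "und den der dem des"]
ranking = rank_groupings(groupings, segments_with, segments_without)
```

A grouping whose n-grams never occur in either text yields no variance and receives a metric of 0.0, so it falls to the bottom of the ranking. Because the pooled data are centered, two well-separated texts tend to produce angles near π; an uncentered variant would give a different spread of angles.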
29. A system for processing textual data, comprising
a module for determining for each of a plurality of n-gram groupings occurrence frequency distribution corresponding to n-gram members of said grouping for at least two reference texts, wherein one text exhibits an attribute of interest and the other lacks said attribute,
an analysis module receiving said frequency distribution and applying a principal component transformation to said distribution so as to generate a plurality of principal component vectors corresponding to said reference texts for each n-gram grouping,
said analysis module determining for each n-gram grouping a minimum angle between the principal component vectors of said texts corresponding to that grouping,
wherein said analysis module rank orders said n-gram groupings based on the minimal angles corresponding thereto.
Specification