METHOD OF IDENTIFYING DOCUMENTS WITH SIMILAR PROPERTIES UTILIZING PRINCIPAL COMPONENT ANALYSIS

US 20080281581A1
Filed: 05/07/2008
Published: 11/13/2008
Est. Priority Date: 05/07/2007
Status: Abandoned Application


Abstract

The present invention generally provides methods and systems for characterizing texts, for example, for identifying textual documents by language, topic, author, or other attributes. In some embodiments, a method of the invention can include creating an n-gram frequency spectrum for a document under analysis, preferably selecting a subset of the n-gram frequency spectrum, transforming the n-gram frequency spectrum into principal component space, and identifying one or more attributes of the document according to its similarity to (or distinction from) reference documents in the principal component space.
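
As a rough illustration of the first step named in the abstract (creating an n-gram frequency spectrum and selecting a subset of it), the following minimal sketch counts character n-grams and normalizes the counts over a chosen subset. The function name `ngram_spectrum`, the `vocabulary` parameter, and the choice of character-level n-grams are assumptions made for this example; the application does not prescribe them.

```python
from collections import Counter

def ngram_spectrum(text, n=2, vocabulary=None):
    """Relative frequency of character n-grams in `text`.

    `vocabulary` (hypothetical parameter) is the selected subset of
    n-grams; when omitted, every n-gram found in the text is used.
    Returns a list of frequencies in vocabulary order.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    if vocabulary is None:
        vocabulary = sorted(counts)
    total = sum(counts[g] for g in vocabulary)
    if total == 0:
        # None of the selected n-grams occur in the text.
        return [0.0] * len(vocabulary)
    return [counts[g] / total for g in vocabulary]

# "the theme" contains "th" twice, "he" twice, and "em" once.
spectrum = ngram_spectrum("the theme", n=2, vocabulary=["th", "he", "em"])
print(spectrum)  # [0.4, 0.4, 0.2]
```

Normalizing over the selected subset rather than over all n-grams is one possible design choice; it keeps spectra of texts with different lengths comparable.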

30 Claims

1. A method of characterizing a text, comprising: determining frequency distribution for a plurality of n-grams in at least a segment of a text, applying a principal component transformation to said frequency distribution to obtain a principal component vector in a principal component space corresponding to said text segment.
  - 2. The method of claim 1, further comprising comparing said principal component vector with one or more predefined decision rules to determine an attribute of said text segment.
  - 3. The method of claim 1, wherein said one or more decision rules are based on assigning different attributes to different regions in principal component space.
  - 4. The method of claim 2, wherein said attribute corresponds to an authorship of said text segment.
  - 5. The method of claim 2, wherein said attribute corresponds to language of said text segment.
  - 6. The method of claim 2, wherein said attribute corresponds to a topic of said text segment.
  - 7. The method of claim 2, wherein at least one of said decision rules is based on an angle between the principal component vector corresponding to said text segment and said reference principal component vector.
  - 8. The method of claim 7, wherein said reference principal component vector is associated with text authored by a known individual.
  - 9. The method of claim 8, further comprising identifying said individual as the author of the text segment if said angle is less than a predefined value.
  - 10. The method of claim 7, wherein said reference principal component vector is associated with text written in a given language.
  - 11. The method of claim 10, further comprising identifying said given language as the language of the text segment if said angle is less than a predefined value.
  - 12. The method of claim 1, wherein said n-grams comprise digrams.
  - 13. The method of claim 1, wherein said n-grams comprise individual characters.
  - 14. The method of claim 2, further comprising: determining, for each of a plurality of n-gram groupings, frequency distribution for at least two reference texts, wherein one text exhibits an attribute of interest and the other lacks said attribute, performing a principal component transformation on each of the frequency distributions so as to generate a plurality of principal component vectors corresponding to said texts for each n-gram grouping, defining a metric based on said principal component transformation to rank order said n-gram groupings, rank ordering said n-gram groupings based on values of the metric corresponding thereto.
  - 15. The method of claim 14, further comprising selecting an n-gram grouping having the highest rank.
  - 16. The method of claim 15, further comprising utilizing said n-gram grouping to characterize the text.
  - 17. The method of claim 14, wherein said metric comprises a minimum angle between the principal component vectors corresponding to said two reference texts.
  - 18. The method of claim 17, further comprising assigning a higher rank to an n-gram grouping having a larger minimum angle.
  - 19. The method of claim 18, further comprising selecting one or more n-gram groupings having the highest ranks as said plurality of distinct n-grams for characterizing said text segment and utilizing at least one of the principal component vectors associated with one of said reference texts as said reference principal component vector.
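
The angle-based decision rule of claims 7 through 11 can be sketched as follows. This is a minimal illustration that assumes the principal component vectors have already been computed; the function names, the dictionary of labeled reference vectors, and the 15-degree threshold are hypothetical choices for the example, not values taken from the application.

```python
import math

def angle_between(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)

def classify_by_angle(vector, references, max_angle):
    """Return the label of the closest reference vector if its angle to
    `vector` is below `max_angle`; otherwise return None.

    `references` maps a label (an author, a language, ...) to a
    reference principal component vector.
    """
    best_label, best_angle = None, None
    for label, reference in references.items():
        angle = angle_between(vector, reference)
        if best_angle is None or angle < best_angle:
            best_label, best_angle = label, angle
    if best_angle is not None and best_angle < max_angle:
        return best_label
    return None

# Hypothetical two-dimensional reference vectors.
references = {"english": [1.0, 0.0], "french": [0.0, 1.0]}
threshold = math.radians(15)
near = classify_by_angle([0.9, 0.1], references, threshold)
ambiguous = classify_by_angle([1.0, 1.0], references, threshold)
```

Here `near` is assigned the label of the reference it lies close to, while `ambiguous` sits 45 degrees from both references and is left unlabeled.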

20. A method of comparing two textual documents, comprising: for each of at least two textual documents, determining frequency distribution for a plurality of n-grams in at least a segment of said document to generate a frequency histogram of said n-grams, for each document, applying a principal component transformation to said frequency histogram to obtain a principal component vector, and comparing at least an attribute of said documents based on a comparison of said principal component vectors.
  - 21. The method of claim 20, further comprising determining an angle between said principal component vectors.
  - 22. The method of claim 21, further comprising comparing authorship of said documents based on said angle.
  - 23. The method of claim 22, further comprising the step of characterizing the documents as having the same author if said angle is less than a predefined value.
  - 24. The method of claim 21, further comprising comparing language of said documents based on said angle.
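
The comparison in claims 20 through 24 can be sketched end to end once frequency histograms are available. The sketch below is only one possible reading: it assumes the principal axes are fitted on a small set of mean-centered reference histograms, and it finds them by power iteration with deflation so that only the standard library is needed. All function names and the sample histograms are hypothetical.

```python
import math
import random

def fit_principal_axes(rows, k, iterations=200, seed=0):
    """Return (mean, axes): column means of `rows` (needs >= 2 rows) and
    the top-k principal axes, via power iteration with deflation."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    axes = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left to extract
                break
            v = [x / norm for x in w]
        cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        eigenvalue = sum(v[i] * cov_v[i] for i in range(d))
        axes.append(v)
        # Deflate: remove the component just found from the covariance.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return mean, axes

def to_pc_vector(histogram, mean, axes):
    """Project a mean-centered histogram onto the principal axes."""
    centered = [x - m for x, m in zip(histogram, mean)]
    return [sum(a * c for a, c in zip(axis, centered)) for axis in axes]

def angle_between(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norms == 0:
        raise ValueError("angle is undefined for a zero vector")
    return math.acos(max(-1.0, min(1.0, dot / norms)))

# Hypothetical reference histograms over three n-grams.
reference = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1],
             [0.1, 0.3, 0.6], [0.1, 0.4, 0.5]]
mean, axes = fit_principal_axes(reference, k=2)
doc_a = to_pc_vector([0.55, 0.35, 0.10], mean, axes)
doc_b = to_pc_vector([0.58, 0.32, 0.10], mean, axes)
doc_c = to_pc_vector([0.10, 0.35, 0.55], mean, axes)
angle_ab = angle_between(doc_a, doc_b)
angle_ac = angle_between(doc_a, doc_c)
```

In this toy data `doc_a` and `doc_b` have similar histograms and end up separated by a small angle, while `doc_c` points in nearly the opposite direction. A rule in the spirit of claim 23 would then compare `angle_ab` against a predefined value.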

25. A method of selecting a plurality of n-grams for processing a text, comprising: determining, for each of a plurality of n-gram groupings, frequency distribution for at least two reference texts, wherein one text exhibits an attribute of interest and the other lacks said attribute, for each n-gram grouping, performing a principal component transformation on the frequency distributions of that grouping for said texts so as to generate a plurality of principal component vectors for said texts, for each n-gram grouping, determining value of a metric based on angles between the principal component vectors associated with one of said reference texts relative to the principal component vectors associated with the other text, rank ordering said n-gram groupings based on values of the metric corresponding thereto.
  - 26. The method of claim 25, wherein said metric comprises a minimum angle between the principal component vectors of said two texts.
  - 27. The method of claim 25, further comprising assigning a higher rank to an n-gram grouping having a larger minimum angle.
  - 28. The method of claim 27, further comprising selecting one or more n-gram groupings having the highest ranks for processing the text.
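
The ranking step of claims 25 through 28 can be sketched as below, assuming the principal component vectors for the two reference texts have already been computed for each n-gram grouping. The grouping names, the sample vectors, and the function names are hypothetical and serve only to illustrate ranking by the larger minimum angle.

```python
import math
from itertools import product

def angle_between(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norms == 0:
        raise ValueError("angle is undefined for a zero vector")
    return math.acos(max(-1.0, min(1.0, dot / norms)))

def min_angle(vectors_a, vectors_b):
    """Smallest angle between any vector of text A and any of text B."""
    return min(angle_between(u, v) for u, v in product(vectors_a, vectors_b))

def rank_groupings(pc_vectors_by_grouping):
    """Rank n-gram groupings so that a larger minimum angle ranks higher.

    `pc_vectors_by_grouping` maps a grouping name to a pair
    (vectors for the text with the attribute, vectors for the text without).
    Returns a list of (name, minimum angle), best grouping first.
    """
    scored = [(name, min_angle(a, b))
              for name, (a, b) in pc_vectors_by_grouping.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical principal component vectors for two candidate groupings.
groupings = {
    "grouping_1": ([[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]),
    "grouping_2": ([[1.0, 0.0], [0.8, 0.2]], [[0.9, 0.1], [0.7, 0.3]]),
}
ranking = rank_groupings(groupings)
best_grouping = ranking[0][0]
```

In this example `grouping_1` separates the two reference texts by a wide angle and therefore ranks first, while `grouping_2` leaves them nearly parallel.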

29. A system for processing textual data, comprising: a module for determining for each of a plurality of n-gram groupings occurrence frequency distribution corresponding to n-gram members of said grouping for at least two reference texts, wherein one text exhibits an attribute of interest and the other lacks said attribute, an analysis module receiving said frequency distribution and applying a principal component transformation to said distribution so as to generate a plurality of principal component vectors corresponding to said reference texts for each n-gram grouping, said analysis module determining for each n-gram grouping a minimum angle between the principal component vectors of said texts corresponding to that grouping, wherein said analysis module rank orders said n-gram groupings based on the minimal angles corresponding thereto.
  - 30. The system of claim 29, wherein said analysis module is configured to assign, for any two n-gram groupings, a higher rank to the grouping having a greater minimum angle.

Current Assignee: SPARTA, Inc. (New Port Group Holdings, Inc.)
Original Assignee: SPARTA, Inc. (New Port Group Holdings, Inc.)
Inventors: Henshaw, Philip D.; Trepagnier, Pierre C.
Application Number: US12/116,735
Publication Number: US 20080281581A1
US Class Current: 704/9
CPC Class Codes: G06F 16/353 into predefined classes
