System and method for comparative analysis of textual documents

US 8,868,405 B2
Filed: 01/27/2004
Issued: 10/21/2014
Est. Priority Date: 01/27/2004
Status: Active Grant

Abstract

A system and method are presented for the comparative analysis of textual documents. In an exemplary embodiment of the present invention the method includes accessing two or more documents, performing a linguistic analysis on each document, outputting a quantified representation of a semantic content of each document, and comparing the quantified representations using a defined metric. In exemplary embodiments of the present invention such a metric can measure relative semantic closeness or distance of two documents. In exemplary embodiments of the present invention the semantic content of a document can be expressed as a semantic vector. The format of a semantic vector is flexible, and in exemplary embodiments of the present invention it and any metric used to operate on it can be adapted and optimized to the type and/or domain of documents being analyzed and the goals of the comparison.
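
As a rough illustration of the "semantic vector" the abstract refers to, the following Python sketch builds one per document. It is a minimal sketch under stated assumptions, not the patented method: the `Component` class, the `semantic_vector` function, the regex tokenizer standing in for real linguistic analysis, and the caller-supplied weight table are all hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One component of a semantic vector (hypothetical layout)."""
    term: str
    weight: float    # importance of the term within the document
    frequency: int   # number of occurrences of the term


def semantic_vector(text, weights=None, default_weight=1.0):
    """Map each term to a Component.

    Assumption: lowercase word counting stands in for linguistic analysis,
    and weights are supplied by the caller rather than derived from the text.
    """
    weights = weights or {}
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {
        term: Component(term, weights.get(term, default_weight), freq)
        for term, freq in counts.items()
    }


vec = semantic_vector(
    "The valve controls flow. The valve opens.",
    weights={"valve": 3.0},
)
```

A real system would replace the tokenizer with syntactic and semantic analysis and would derive the weights from document characteristics.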

32 Claims

1. A computer-implemented method of comparing the semantic content of two or more documents, comprising:
- accessing a plurality of documents;
  
  performing a linguistic analysis on each document;
  
  defining a semantic vector for each document based on the linguistic analysis, said semantic vector having multiple components, wherein each component of said semantic vector has at least:
  
  a weighting factor relating to an importance, based on characteristics of the document, of said term; and
  
  a frequency value relating to a number of occurrences of said term;
  
  processing the semantic vector by a digital computer; and
  
  comparing a semantic vector of an identified document to the semantic vector for each document in the plurality of documents to determine at least one document semantically similar to the identified document, and wherein the comparing of the semantic vectors includes using a defined metric, wherein said defined metric is related to:
  
  Sqrt(f1²+f2²+f3²+f4²+ ... +f(N−1)²+fN²)*100n², wherein f is a difference in frequency of a common term between documents and n is the number of terms those documents have in common if the component has a weighting factor;
  
  or Sqrt(sum((w-Delta)^2*w-Avg))/(Log(n)^3*1000), wherein w-Delta is the difference in weight between two common terms, w-Avg is the average weight between two common terms, and n is the number of common terms if the component has a frequency value.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11):
  - 2. The method of claim 1, wherein the linguistic analysis comprises sentence analysis.
  - 3. The method of claim 2, wherein the sentence analysis comprises a syntactic analysis and a semantic analysis.
  - 4. The method of claim 1, wherein each component of the semantic vector for at least one of the documents comprises multiple dimensions.
  - 5. The method of claim 1, wherein each component of the semantic vector for at least one of the documents further comprises a subordinate concept value.
  - 6. The method of claim 1, wherein some of the components of the semantic vector for at least one of the documents have main term-subordinate term pairs as their first value.
  - 7. The method of claim 1, wherein the semantic vector comprises a multi-dimensional vector defined by the content of a semantic net.
  - 8. The method of claim 7, wherein the content of the semantic net is augmented by relative weights, strengths, or frequencies of occurrence of the features within the semantic net.
  - 9. The method of claim 1, wherein said term comprises at least one of a word or a phrase.
  - 10. The method of claim 1, further comprising comparing the semantic vectors based on a defined algorithm.
  - 11. The method of claim 10, wherein an output of said defined algorithm is a measure of at least one of semantic distance, semantic similarity, semantic dissimilarity, degree of patentable novelty and degree of anticipation.
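
The comparison step of claim 1 (comparing the vector of an identified document against every document in the collection) can be sketched as a ranking loop. This is an illustrative sketch only: the names `most_similar` and `common_term_distance` are hypothetical, vectors are simplified to term-to-frequency dictionaries, and the metric is a plain Euclidean norm over shared terms used as a stand-in, not the scaled formula recited in the claim.

```python
import math


def common_term_distance(a, b):
    """Stand-in metric: Euclidean norm of frequency differences over shared terms."""
    common = a.keys() & b.keys()
    if not common:
        return math.inf  # no shared terms: treat as maximally distant
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in common))


def most_similar(identified, corpus, metric=common_term_distance):
    """Return (name, distance) pairs sorted from closest to farthest."""
    scored = ((name, metric(identified, vec)) for name, vec in corpus.items())
    return sorted(scored, key=lambda pair: pair[1])


query = {"valve": 4, "flow": 2}
corpus = {
    "doc_a": {"valve": 4, "flow": 1, "pump": 3},
    "doc_b": {"valve": 1, "flow": 2},
    "doc_c": {"engine": 5},
}
ranking = most_similar(query, corpus)
```

Because the metric is passed in as a parameter, a different distance function can be swapped in without changing the ranking loop.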

12. A computer-implemented method of comparing two or more documents, comprising:
- linguistically analyzing a plurality of documents to identify at least one term group in each document, each term group comprising a main term and at least one subordinate term semantically related to the main term;
  
  generating a semantic vector associated with each document, the semantic vector comprising a plurality of components, each component including:
  
  a term group as a scalar in the document;
  
  a frequency value relating to a number of occurrences of the term group; and
  
  processing the semantic vector by a digital computer; and
  
  comparing a semantic vector of an identified document to the semantic vector for each document in the plurality of documents to determine at least one document semantically similar to the identified document using a defined metric, wherein said metric measures the semantic distance between documents as a function of at least the frequency values included in the semantic vectors for the documents, and wherein said metric is related to:
  
  Sqrt(f1²+f2²+f3²+f4²+ ... +f(N−1)²+fN²)*100n², wherein f is a difference in frequency of a common term between the plurality of documents and n is the number of terms those documents have in common.
- Dependent claims (13, 14, 15, 16, 17, 18, 19, 20, 21):
  - 13. The method of claim 12, wherein the main term includes synonyms of the main term.
  - 14. The method of claim 12, wherein one or more of said two or more documents are located using an autonomous software or 'bot program.
  - 15. The method of claim 14, wherein the 'bot program automatically analyzes each document in a defined domain or network by executing a series of rules and assigning an overall score to the document.
  - 16. The method of claim 15, wherein all documents with a score above a defined threshold are linguistically analyzed.
  - 17. The method of claim 12, wherein the semantic vector is a quantification of the semantic content of each document.
  - 18. The method of claim 12, wherein each component has multiple dimensions.
  - 19. The method of claim 12, wherein the at least one subordinate term includes synonyms of one of the subordinate terms.
  - 20. The method of claim 12, wherein one or more of the at least one subordinate term or the main term comprises a phrase.
  - 21. The method of claim 12, wherein the weighting factor comprises a plurality of different weighting factors and each of the different weighting factors relates to the importance of the main term or a subordinate term in the term group.
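
Claims 12, 13, and 19 describe term groups made of a main term, subordinate terms, and their synonyms, each with a frequency value. One possible way to count such a group is sketched below. The claims do not say how an occurrence of a term group is detected, so the rule used here (a sentence containing at least one main-term form and at least one subordinate-term form counts once) is an assumption, as are the `TermGroup` class and the `term_group_frequency` function.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TermGroup:
    """Hypothetical term group: each set holds a term and its synonyms."""
    main: frozenset
    subordinate: frozenset


def term_group_frequency(text, group):
    """Count sentences that mention both a main term and a subordinate term.

    Assumption: sentence-level co-occurrence approximates a semantic relation.
    """
    count = 0
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = set(re.findall(r"[a-z]+", sentence))
        if words & group.main and words & group.subordinate:
            count += 1
    return count


group = TermGroup(
    main=frozenset({"car", "automobile"}),
    subordinate=frozenset({"engine", "motor"}),
)
text = "The car has an engine. The automobile uses a motor. The car is red."
freq = term_group_frequency(text, group)
```

Here the first two sentences each pair a main-term form with a subordinate-term form, while the third mentions only the main term and is not counted.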

22. A system for comparing two or more documents, comprising:
- a document inputter, arranged to access a plurality of documents;
  
  a semantic analyzer, arranged to perform a linguistic analysis on each document to identify at least one term group in the document, each term group comprising a main term and at least one subordinate term semantically related to the main term;
  
  a semantic quantifier, arranged to output a quantified representation of a semantic content of each document, the quantified representation based at least in part on:
  
  a term group as a scalar in the document; and
  
  a weighting factor relating to an importance, based on characteristics of the document, of at least part of the term group; and
  
  a comparator, arranged to compare the quantified representations using a defined metric, wherein said defined metric measures the semantic distance between documents as a function of at least the weighting factors associated with the quantified representations for the documents to determine at least one document in the plurality of documents semantically similar to an identified document, and wherein said defined metric is related to:
  
  Sqrt(sum((w-Delta)^2*w-Avg)/Log(n)^3*1000), wherein w-Delta is the difference in weight between two common terms, w-Avg is the average weight between two common terms, and n is the number of common terms, between two documents.

23. A system for comparing two or more documents, comprising:
- a document inputter, arranged to access a plurality of documents;
  
  a semantic analyzer, arranged to perform a linguistic analysis on each document to identify at least one term group in the document, each term group comprising a main term and at least one subordinate term semantically related to the main term;
  
  a semantic vector generator, arranged to output a semantic vector associated with each document, each semantic vector comprising a plurality of components, each component including:
  
  a term group as a scalar in the document;
  
  a frequency value relating to a number of occurrences of the term group; and
  
  a comparator, arranged to compare the semantic vectors using a defined metric, wherein said metric measures the semantic distance between documents as a function of at least the frequency values included in the semantic vectors for the documents to determine at least one document in the plurality of documents semantically similar to an identified document, and wherein said metric is related to:
  
  Sqrt(f1²+f2²+f3²+f4²+ ... +f(N−1)²+fN²)*100n², wherein f is a difference in frequency of a common term between the plurality of documents and n is the number of terms those documents have in common.

24. A computer program product comprising a computer usable medium device having computer readable program code means embodied therein, the computer readable program code means in said computer program product comprising means for causing a computer to:
- access a plurality of documents;
  
  perform a linguistic analysis on each document to identify at least one term group in the document, each term group comprising a main term and at least one subordinate term semantically related to the main term;
  
  output a quantified representation of a semantic content of each document, the quantified representation based at least in part on:
  
  a term group as a scalar in the document;
  
  a frequency value relating to a number of occurrences of the term group; and
  
  compare the quantified representations using a defined algorithm, wherein said defined metric measures the semantic distance between documents as a function of at least the frequency values associated with the quantified representations for the documents to determine at least one document in the plurality of documents semantically similar to an identified document, and wherein said metric is related to:
  
  Sqrt(f1²+f2²+f3²+f4²+ ... +f(N−1)²+fN²)*100n², wherein f is a difference in frequency of a common term between the plurality of documents and n is the number of terms those documents have in common.

25. A computer program product comprising a computer usable medium device having computer readable program code means embodied therein, the computer readable program code means in said computer program product comprising means for causing a computer to:
- linguistically analyze a plurality of documents to identify at least one term group in the document, each term group comprising a main term and at least one subordinate term semantically related to the main term;
  
  generate a semantic vector associated with each document, each semantic vector comprising a plurality of components, each component including:
  
  a term group as a scalar in the document; and
  
  a weighting factor relating to an importance, based on characteristics of the document, of at least part of the term group; and
  
  compare the semantic vectors using a defined metric, wherein said metric measures the semantic distance between semantic vectors as a function of at least the weighting factors included in the semantic vectors to determine at least one document in the plurality of documents semantically similar to an identified document, and wherein said defined metric is related to:
  
  Sqrt(sum((w-Delta)^2*w-Avg))/(Log(n)^3*1000), wherein w-Delta is the difference in weight between two common terms, w-Avg is the average weight between two common terms, and n is the number of common terms, between two documents.
- Dependent claims (26, 27, 28, 29):
  - 26. The computer program product of claim 25, wherein the computer readable program code means in said computer program product further comprises means for causing a computer to:
    - identify one or more of said two or more documents using an autonomous software or 'bot program.
  - 27. The computer program product of claim 26, wherein said 'bot program automatically analyzes each document in a defined domain or network by executing a series of rules and assigning an overall score to the document.
  - 28. The computer program product of claim 25, wherein the semantic vector is a quantification of the semantic content of each document.
  - 29. The computer program product of claim 25, wherein an output of said defined metric is a measure of at least one of semantic distance, semantic similarity, semantic dissimilarity, degree of patentable novelty and degree of anticipation.
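
The 'bot behavior recited in claims 14 to 16 and 26 to 27 (execute a series of rules, assign an overall score, and pass on only documents scoring above a threshold) might look like the following. This is a hedged sketch: the rule set, the point values, the `score_document` function, and the threshold are invented for illustration and do not come from the patent.

```python
def score_document(text, rules):
    """Sum the points of every rule whose predicate matches the text."""
    return sum(points for predicate, points in rules if predicate(text))


# Hypothetical rules: (predicate, points awarded when the predicate holds).
rules = [
    (lambda t: "patent" in t.lower(), 2.0),
    (lambda t: len(t.split()) >= 5, 1.0),
    (lambda t: "claim" in t.lower(), 1.5),
]

THRESHOLD = 3.0  # assumed cut-off; only scores strictly above it pass

docs = {
    "a": "This patent includes one independent claim and more.",
    "b": "Short note about a patent.",
    "c": "Weather report.",
}

# Documents selected for linguistic analysis.
selected = [name for name, text in docs.items()
            if score_document(text, rules) > THRESHOLD]
```

Document "b" scores exactly 3.0 and is excluded, because claim 16 speaks of scores above the threshold rather than at it.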

30. A system for comparing two or more documents, comprising:
- a document inputter, arranged to access two or more documents;
  
  a semantic analyzer, arranged to perform a linguistic analysis on each document;
  
  a semantic vector generator, arranged to output a semantic vector associated with each document; and
  
  a comparator, arranged to compare the semantic vectors using a defined metric, wherein said defined metric is one of:
  
  [Sqrt(f1²+f2²+f3²+f4²+ ... +f(N−1)²+fN²)/n]*100, wherein f is a difference in frequency of a common term between two documents and n is the number of terms those documents have in common;
  
  or Sqrt(sum((w-Delta)^2*w-Avg))/(Log(n)^3*1000), wherein w-Delta is the difference in weight between two common terms, w-Avg is the average weight between two common terms, and n is the number of common terms, between two documents.
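
The two metrics named in claim 30 can be sketched in Python as follows. Several readings are assumptions rather than statements of the claim: the frequency metric is taken as the square root of the summed squared differences, divided by n, times 100; `Log` is taken as base 10; "w-Delta" is taken as the difference between the two documents' weights for the same common term; and "w-Avg" is taken as the mean of those two weights. The function names and the dictionary representation are hypothetical.

```python
import math


def frequency_distance(freq_a, freq_b):
    """Assumed reading: [sqrt(sum of f_i^2) / n] * 100 over common terms."""
    common = freq_a.keys() & freq_b.keys()
    n = len(common)
    if n == 0:
        raise ValueError("documents share no terms")
    total = sum((freq_a[t] - freq_b[t]) ** 2 for t in common)
    return math.sqrt(total) / n * 100


def weight_distance(w_a, w_b):
    """Assumed reading: sqrt(sum(delta^2 * avg)) / (log10(n)^3 * 1000)."""
    common = w_a.keys() & w_b.keys()
    n = len(common)
    if n < 2:
        # log10(1) is 0, so the expression is undefined for fewer than 2 terms
        raise ValueError("need at least two common terms")
    total = sum(
        (w_a[t] - w_b[t]) ** 2 * ((w_a[t] + w_b[t]) / 2) for t in common
    )
    return math.sqrt(total) / (math.log10(n) ** 3 * 1000)


d_freq = frequency_distance(
    {"valve": 4, "flow": 2, "pump": 1},
    {"valve": 1, "flow": 6, "seal": 2},
)

terms = [f"t{i}" for i in range(10)]
d_weight = weight_distance({t: 3.0 for t in terms}, {t: 1.0 for t in terms})
```

Both functions look only at terms the two documents share, so a caller may also want to consider how many terms are not shared before trusting a small distance.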

31. A computer-implemented method of comparing two or more documents, comprising:
- linguistically analyzing a plurality of documents;
  
  generating a semantic vector associated with each document;
  
  processing the semantic vector by a digital computer; and
  
  comparing the semantic vectors using a defined metric, wherein said defined metric is one of:
  
  [Sqrt(f1²+f2²+f3²+f4²+ ... +f(N−1)²+fN²)/n]*100, wherein f is a difference in frequency of a common term between documents and n is the number of terms those documents have in common;
  
  or Sqrt(sum((w-Delta)^2*w-Avg))/(Log(n)^3*1000), wherein w-Delta is the difference in weight between two common terms, w-Avg is the average weight between two common terms, and n is the number of common terms, between documents to determine at least one document in the plurality of documents semantically similar to an identified document.

32. A computer program product comprising a computer usable medium device having computer readable program code means embodied therein, the computer readable program code means in said computer program product comprising means for causing a computer to:
- access two or more documents;
  
  perform a linguistic analysis on each document;
  
  output a quantified representation of a semantic content of each document; and
  
  compare the quantified representations using a defined algorithm, wherein said defined algorithm is one of:
  
  [Sqrt(f1²+f2²+f3²+f4²+ ... +f(N−1)²+fN²)/n]*100, wherein f is a difference in frequency of a common term between two documents and n is the number of terms those documents have in common;
  
  or Sqrt(sum((w-Delta)^2*w-Avg))/(Log(n)^3*1000), wherein w-Delta is the difference in weight between two common terms, w-Avg is the average weight between two common terms, and n is the number of common terms, between two documents.

Current Assignee: Ent Services Development Corporation LP (DXC Technology Company)
Original Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors: Kasravi, Kas; Novinger, Walter B.
Primary Examiner(s): SAINT CYR, LEONARD
Application Number: US10/766,308
Publication Number: US 20050165600A1
Time in Patent Office: 3,920 Days
Field of Search: None
US Class Current: 704/9
CPC Class Codes:
G06F 16/3344   using natural language anal...
G06F 40/194   Calculation of difference b...
G06F 40/30   Semantic analysis