System and method for comparative analysis of textual documents
Abstract
A system and method are presented for the comparative analysis of textual documents. In an exemplary embodiment of the present invention, the method includes accessing two or more documents, performing a linguistic analysis on each document, outputting a quantified representation of a semantic content of each document, and comparing the quantified representations using a defined metric. In exemplary embodiments of the present invention, such a metric can measure relative semantic closeness or distance of two documents. In exemplary embodiments of the present invention, the semantic content of a document can be expressed as a semantic vector. The format of a semantic vector is flexible, and in exemplary embodiments of the present invention both the vector and any metric used to operate on it can be adapted and optimized to the type and/or domain of documents being analyzed and to the goals of the comparison.
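The four steps named in the abstract (access, analyze, quantify, compare) can be sketched as a small pipeline with a swappable metric. This is only an illustration under stated assumptions: the names `analyze`, `quantify`, `euclidean_on_common_terms`, and `compare` are hypothetical, plain word tokenization stands in for the linguistic analysis, and the default metric is a placeholder rather than one of the claimed metrics.

```python
import re
from collections import Counter
from math import sqrt
from typing import Callable, Dict, List

SemanticVector = Dict[str, int]
Metric = Callable[[SemanticVector, SemanticVector], float]


def analyze(text: str) -> List[str]:
    """Stand-in for linguistic analysis: lowercase alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def quantify(terms: List[str]) -> SemanticVector:
    """Quantified representation: term -> number of occurrences."""
    return dict(Counter(terms))


def euclidean_on_common_terms(a: SemanticVector, b: SemanticVector) -> float:
    """Placeholder metric (an assumption): Euclidean distance over shared terms."""
    common = a.keys() & b.keys()
    return sqrt(sum((a[t] - b[t]) ** 2 for t in common))


def compare(doc_a: str, doc_b: str, metric: Metric = euclidean_on_common_terms) -> float:
    """Run both documents through the same steps, then apply the chosen metric."""
    return metric(quantify(analyze(doc_a)), quantify(analyze(doc_b)))
```

Passing a different callable as `metric` is one way to reflect the abstract's point that the metric can be adapted to the document type and the goal of the comparison.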
32 Claims
1. A computer-implemented method of comparing the semantic content of two or more documents, comprising:
accessing a plurality of documents;
performing a linguistic analysis on each document;
defining a semantic vector for each document based on the linguistic analysis, said semantic vector having multiple components, wherein each component of said semantic vector has at least:
a weighting factor relating to an importance, based on characteristics of the document, of said term; and
a frequency value relating to a number of occurrences of said term;
processing the semantic vector by a digital computer; and
comparing a semantic vector of an identified document to the semantic vector for each document in the plurality of documents to determine at least one document semantically similar to the identified document, and wherein the comparing of the semantic vectors includes using a defined metric, wherein said defined metric is related to:
Sqrt(f1^2+f2^2+f3^2+f4^2+ ... +f(N−1)^2+fN^2)*100/n^2, wherein f is a difference in frequency of a common term between documents and n is the number of terms those documents have in common if the component has a weighting factor; or
Sqrt(sum((w−Delta)^2*w−Avg))/(Log(n)^3*1000), wherein w−Delta is the difference in weight between two common terms, w−Avg is the average weight between two common terms, and n is the number of common terms if the component has a frequency value.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
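Claim 1 describes a semantic vector whose components each carry a weighting factor and a frequency value for a term. A minimal sketch of such a structure follows. The `Component` class, the `build_semantic_vector` and `common_terms` helpers, and in particular the rule that title terms receive double weight are hypothetical choices made for illustration. The claim only requires that the weight reflect importance based on characteristics of the document.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Component:
    term: str
    weight: float    # importance of the term within the document
    frequency: int   # number of occurrences of the term


def build_semantic_vector(body_terms: List[str], title_terms: List[str]) -> Dict[str, Component]:
    """Build one component per distinct term found in the document body."""
    counts = Counter(body_terms)
    title = set(title_terms)
    vector: Dict[str, Component] = {}
    for term, freq in counts.items():
        # Hypothetical weighting rule: a term that also occurs in the title counts double.
        weight = 2.0 if term in title else 1.0
        vector[term] = Component(term, weight, freq)
    return vector


def common_terms(a: Dict[str, Component], b: Dict[str, Component]) -> List[str]:
    """Terms present in both vectors; both claimed metrics operate on these."""
    return sorted(a.keys() & b.keys())
```

Keeping weight and frequency side by side in one component lets a comparator choose either the frequency-based or the weight-based metric without rebuilding the vector.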
12. A computer-implemented method of comparing two or more documents, comprising:
linguistically analyzing a plurality of documents to identify at least one term group in each document, each term group comprising a main term and at least one subordinate term semantically related to the main term;
generating a semantic vector associated with each document, the semantic vector comprising a plurality of components, each component including:
a term group as a scalar in the document;
a frequency value relating to a number of occurrences of the term group; and
processing the semantic vector by a digital computer; and
comparing a semantic vector of an identified document to the semantic vector for each document in the plurality of documents to determine at least one document semantically similar to the identified document using a defined metric, wherein said metric measures the semantic distance between documents as a function of at least the frequency values included in the semantic vectors for the documents, and wherein said metric is related to:
Sqrt(f1^2+f2^2+f3^2+f4^2+ ... +f(N−1)^2+fN^2)*100/n^2 wherein f is a difference in frequency of a common term between the plurality of documents and n is the number of terms those documents have in common.
(Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20, 21)
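Claim 12 builds its vector from term groups, each made of a main term and one or more semantically related subordinate terms. The sketch below shows one possible way to count occurrences per group. The `TERM_GROUPS` table and the helper names are hypothetical, and counting every occurrence of a subordinate term toward its main term is an assumption about how the group frequency is tallied.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical term groups: main term -> subordinate terms semantically related to it.
TERM_GROUPS: Dict[str, List[str]] = {
    "vehicle": ["car", "truck", "automobile"],
    "engine": ["motor"],
}


def group_index(groups: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every main and subordinate term to the main term of its group."""
    index: Dict[str, str] = {}
    for main, subordinates in groups.items():
        index[main] = main
        for sub in subordinates:
            index[sub] = main
    return index


def term_group_frequencies(tokens: List[str],
                           groups: Dict[str, List[str]] = TERM_GROUPS) -> Dict[str, int]:
    """Count occurrences per term group; tokens outside every group are ignored."""
    index = group_index(groups)
    return dict(Counter(index[t] for t in tokens if t in index))
```

For example, the tokens `car`, `truck`, and `vehicle` all contribute to the single `vehicle` component, so two documents that use different words from the same group still share a common term.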
22. A system for comparing two or more documents, comprising:
a document inputter, arranged to access a plurality of documents;
a semantic analyzer, arranged to perform a linguistic analysis on each document to identify at least one term group in the document, each term group comprising a main term and at least one subordinate term semantically related to the main term;
a semantic quantifier, arranged to output a quantified representation of a semantic content of each document, the quantified representation based at least in part on:
a term group as a scalar in the document; and
a weighting factor relating to an importance, based on characteristics of the document, of at least part of the term group; and
a comparator, arranged to compare the quantified representations using a defined metric, wherein said defined metric measures the semantic distance between documents as a function of at least the weighting factors associated with the quantified representations for the documents to determine at least one document in the plurality of documents semantically similar to an identified document, and wherein said defined metric is related to:
Sqrt(sum((w−Delta)^2*w−Avg)/Log(n)^3*1000), wherein w−Delta is the difference in weight between two common terms, w−Avg is the average weight between two common terms, and n is the number of common terms, between two documents.
23. A system for comparing two or more documents, comprising:
a document inputter, arranged to access a plurality of documents;
a semantic analyzer, arranged to perform a linguistic analysis on each document to identify at least one term group in the document, each term group comprising a main term and at least one subordinate term semantically related to the main term;
a semantic vector generator, arranged to output a semantic vector associated with each document, each semantic vector comprising a plurality of components, each component including:
a term group as a scalar in the document;
a frequency value relating to a number of occurrences of the term group; and
a comparator, arranged to compare the semantic vectors using a defined metric, wherein said metric measures the semantic distance between documents as a function of at least the frequency values included in the semantic vectors for the documents to determine at least one document in the plurality of documents semantically similar to an identified document, and wherein said metric is related to:
Sqrt(f1^2+f2^2+f3^2+f4^2+ ... +f(N−1)^2+fN^2)*100/n^2 wherein f is a difference in frequency of a common term between the plurality of documents and n is the number of terms those documents have in common.
24. A computer program product comprising a computer usable medium device having computer readable program code means embodied therein, the computer readable program code means in said computer program product comprising means for causing a computer to:
access a plurality of documents;
perform a linguistic analysis on each document to identify at least one term group in the document, each term group comprising a main term and at least one subordinate term semantically related to the main term;
output a quantified representation of a semantic content of each document, the quantified representation based at least in part on:
a term group as a scalar in the document;
a frequency value relating to a number of occurrences of the term group; and
compare the quantified representations using a defined algorithm, wherein said defined metric measures the semantic distance between documents as a function of at least the frequency values associated with the quantified representations for the documents to determine at least one document in the plurality of documents semantically similar to an identified document, and wherein said metric is related to:
Sqrt(f1^2+f2^2+f3^2+f4^2+ ... +f(N−1)^2+fN^2)*100/n^2 wherein f is a difference in frequency of a common term between the plurality of documents and n is the number of terms those documents have in common.
25. A computer program product comprising a computer usable medium device having computer readable program code means embodied therein, the computer readable program code means in said computer program product comprising means for causing a computer to:
linguistically analyze a plurality of documents to identify at least one term group in the document, each term group comprising a main term and at least one subordinate term semantically related to the main term;
generate a semantic vector associated with each document, each semantic vector comprising a plurality of components, each component including:
a term group as a scalar in the document; and
a weighting factor relating to an importance, based on characteristics of the document, of at least part of the term group; and
compare the semantic vectors using a defined metric, wherein said metric measures the semantic distance between semantic vectors as a function of at least the weighting factors included in the semantic vectors to determine at least one document in the plurality of documents semantically similar to an identified document, and wherein said defined metric is related to:
Sqrt(sum((w−Delta)^2*w−Avg))/(Log(n)^3*1000), wherein w−Delta is the difference in weight between two common terms, w−Avg is the average weight between two common terms, and n is the number of common terms, between two documents.
(Dependent claims: 26, 27, 28, 29)
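The weight-based metric recited in claim 25 can be sketched directly from its definitions: for each common term, the squared weight difference is multiplied by the average weight, the products are summed, and the square root of the sum is divided by Log(n)^3*1000. The function name `weight_distance` is hypothetical. The claim does not state the base of the logarithm, so the natural logarithm used here is an assumption. The expression is undefined when there are fewer than two common terms (Log(1) is zero), and this sketch chooses to raise an error in that case.

```python
from math import log, sqrt
from typing import Dict


def weight_distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Sqrt(sum(wDelta^2 * wAvg)) / (Log(n)^3 * 1000) over the common terms."""
    common = a.keys() & b.keys()
    n = len(common)
    if n < 2:
        # log(1) == 0 would divide by zero; log(0) is undefined.
        raise ValueError("metric is undefined for fewer than two common terms")
    total = 0.0
    for term in common:
        w_delta = a[term] - b[term]        # difference in weight for this common term
        w_avg = (a[term] + b[term]) / 2    # average weight for this common term
        total += w_delta ** 2 * w_avg
    # Natural logarithm is an assumption; the claim only says "Log".
    return sqrt(total) / (log(n) ** 3 * 1000)
```

Two documents that assign identical weights to all of their common terms get a distance of zero, and the Log(n)^3 divisor shrinks the distance as the number of common terms grows.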
30. A system for comparing two or more documents, comprising:
a document inputter, arranged to access two or more documents;
a semantic analyzer, arranged to perform a linguistic analysis on each document;
a semantic vector generator, arranged to output a semantic vector associated with each document; and
a comparator, arranged to compare the semantic vectors using a defined metric, wherein said defined metric is one of:
[Sqrt(f1^2+f2^2+f3^2+f4^2+ ... +f(N−1)^2+fN^2)/n]*100, wherein f is a difference in frequency of a common term between two documents and n is the number of terms those documents have in common; or
Sqrt(sum((w−Delta)^2*w−Avg))/(Log(n)^3*1000), wherein w−Delta is the difference in weight between two common terms, w−Avg is the average weight between two common terms, and n is the number of common terms, between two documents.
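The frequency-based alternative in claim 30 can be sketched in a few lines, reading the expression as the square root of the summed squared frequency differences, divided by n and multiplied by 100. The placement of the closing parenthesis after the last squared term is an assumption, as is the function name `frequency_distance` and the decision to raise an error when the documents share no terms.

```python
from math import sqrt
from typing import Dict


def frequency_distance(a: Dict[str, int], b: Dict[str, int]) -> float:
    """[Sqrt(f1^2 + ... + fN^2) / n] * 100 over the terms both documents contain."""
    common = a.keys() & b.keys()
    n = len(common)
    if n == 0:
        raise ValueError("metric is undefined when the documents share no terms")
    # f is the per-term frequency difference; its sign is irrelevant once squared.
    squared = sum((a[term] - b[term]) ** 2 for term in common)
    return sqrt(squared) / n * 100


doc_a = {"engine": 5, "fuel": 1, "piston": 9}
doc_b = {"engine": 2, "fuel": 5, "valve": 1}
print(frequency_distance(doc_a, doc_b))  # differences 3 and 4 over 2 common terms -> 250.0
```

Terms that appear in only one document do not contribute to the sum, so the result depends entirely on the overlap between the two vectors.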
31. A computer-implemented method of comparing two or more documents, comprising:
linguistically analyzing a plurality of documents;
generating a semantic vector associated with each document;
processing the semantic vector by a digital computer; and
comparing the semantic vectors using a defined metric, wherein said defined metric is one of:
[Sqrt(f1^2+f2^2+f3^2+f4^2+ ... +f(N−1)^2+fN^2)/n]*100, wherein f is a difference in frequency of a common term between documents and n is the number of terms those documents have in common; or
Sqrt(sum((w−Delta)^2*w−Avg))/(Log(n)^3*1000), wherein w−Delta is the difference in weight between two common terms, w−Avg is the average weight between two common terms, and n is the number of common terms, between documents to determine at least one document in the plurality of documents semantically similar to an identified document.
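The final step of claim 31, determining at least one document in the plurality that is semantically similar to an identified document, amounts to scoring every candidate with the chosen metric and keeping the closest ones. The sketch below assumes that smaller metric values mean greater similarity. The name `most_similar`, the `k` parameter, and the type aliases are hypothetical, and the metric is passed in as a callable so that either claimed formula could be supplied.

```python
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]
Metric = Callable[[Vector, Vector], float]


def most_similar(identified: Vector,
                 corpus: Dict[str, Vector],
                 metric: Metric,
                 k: int = 1) -> List[Tuple[str, float]]:
    """Return the k documents in the corpus closest to the identified document."""
    scored = [(doc_id, metric(identified, vector)) for doc_id, vector in corpus.items()]
    # Smaller distance is treated as more similar (an assumption about the metric).
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]
```

A caller would pass the semantic vector of the identified document, a mapping from document identifiers to their vectors, and a distance function. Documents for which the metric is undefined, such as those with no common terms, would need to be filtered out beforehand.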
32. A computer program product comprising a computer usable medium device having computer readable program code means embodied therein, the computer readable program code means in said computer program product comprising means for causing a computer to:
access two or more documents;
perform a linguistic analysis on each document;
output a quantified representation of a semantic content of each document; and
compare the quantified representations using a defined algorithm, wherein said defined algorithm is one of:
[Sqrt(f1^2+f2^2+f3^2+f4^2+ ... +f(N−1)^2+fN^2)/n]*100, wherein f is a difference in frequency of a common term between two documents and n is the number of terms those documents have in common; or
Sqrt(sum((w−Delta)^2*w−Avg))/(Log(n)^3*1000), wherein w−Delta is the difference in weight between two common terms, w−Avg is the average weight between two common terms, and n is the number of common terms, between two documents.
Specification