System, method, and program product for identifying and describing topics in a collection of electronic documents
Abstract
To identify and describe one or more topics in one or more documents in a document set, a term set process creates a basic term set from the document set, where the term set comprises one or more basic terms of one or more words in the document. A document vector process then creates a document vector for each document. The document vector has a document vector direction representing what the document is about. A topic vector process then creates one or more topic vectors from the document vectors. Each topic vector has a topic vector direction representing a topic in the document set. A topic term set process creates a topic term set for each topic vector that comprises one or more of the basic terms describing the topic represented by the topic vector. Each of the basic terms in the topic term set is associated with the relevancy of the basic term. A topic-document relevance process creates a topic-document relevance for each topic vector and each document vector. The topic-document relevance represents the relevance of the document to the topic. A topic sentence set process creates a topic sentence set for each topic vector that comprises one or more topic sentences that describe the topic represented by the topic vector. Each of the topic sentences is then associated with the relevance of the topic sentence to the topic represented by the topic vector.
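As an illustration of the topic-document relevance process, which claim 1 recites as a cosine between a topic vector and a document vector, the following is a minimal sketch. The function names, the zero-vector handling, and the two-dimensional sample vectors are assumptions made for the example and do not come from the patent text.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors.

    Returning 0.0 for an all-zero vector is an assumption of this sketch.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def topic_document_relevance(topic_vectors, document_vectors):
    """Return a table where relevance[t][d] is cosine(topic t, document d)."""
    return [[cosine(t, d) for d in document_vectors] for t in topic_vectors]

# Hypothetical vectors, for illustration only.
topics = [[1.0, 0.0], [0.0, 1.0]]
docs = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]
relevance = topic_document_relevance(topics, docs)
```

Because the cosine depends only on direction, a long document and a short document that point the same way receive the same relevance to a topic.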
16 Claims
-
1. A computer system for identifying and describing one or more topics in one or more documents in a document set, the system comprising:
-
one or more central processing units and one or more memories;
a term set process that creates a basic term set from the document set being a set of one or more basic terms of one or more words;
a document vector process that creates a document vector for each document that has a document vector direction representing what the document is about;
a topic vector process that creates one or more topic vectors from the document vectors, each topic vector having topic vector direction representing a topic in the document set;
a topic term set process that creates a topic term set for each topic vector that comprises one or more of the basic terms describing the topic represented by the topic vector, each of the basic terms in the topic term set associated with relevancy of the basic term;
a topic-document relevance process that creates a topic-document relevance for each topic vector and each document vector, the topic-document relevance representing relevance of the document to the topic, wherein the topic-document relevance for a given topic vector and a given document vector is determined using corresponding ones of the topic vector directions and the document vector directions, where the topic-document relevance for a given topic vector and a given document vector is determined by computing a cosine between the given topic vector and the given document vector; and
a topic sentence set process that creates a topic sentence set for each topic vector that comprises of one or more topic sentences that describe the topic represented by the topic vector, each of the topic sentences associated with relevance of the topic sentence to the topic represented by the topic vector.
-
2. A system, as in claim 1, further comprising:
-
a term-document vector process that creates a term-document vector for each document in the document set, the term-document vector having one or more term-document vector elements, each of the term-document vector elements being the relevancy of each basic term to the document;
a conversion matrix process that creates a matrix of one or more conversion vectors, called the conversion matrix, from the term-document vectors;
a document vector conversion process that creates a document vector for each document by multiplying the conversion matrix and the term-document vector for the document;
a term-sentence vector process that creates a term-sentence vector for each sentence in each document in the document set, the term-sentence vector having one or more term-sentence vector elements, each of the term-sentence vector elements being the relevance of each basic term to the sentence;
a sentence vector process that creates a sentence vector for each sentence by multiplying the conversion matrix and the term-sentence vector for the sentence; and
an interpretation matrix process that creates an interpretation matrix from the conversion matrix.
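The conversion steps above (a document vector or sentence vector obtained by multiplying the conversion matrix by a term vector) can be sketched as follows. How the conversion matrix is derived from the term-document vectors is not stated in the claim, so the matrix here is a hypothetical placeholder; the helper name `mat_vec` and the sample values are likewise assumptions.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (a list of rows) by a column vector (a list of numbers)."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("each matrix row needs one entry per vector element")
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

# Hypothetical conversion matrix: two conversion vectors (rows) over three basic terms.
conversion_matrix = [
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Hypothetical term-document vector: relevancy of each basic term to one document.
term_document_vector = [2.0, 1.0, 0.0]
document_vector = mat_vec(conversion_matrix, term_document_vector)

# Hypothetical term-sentence vector: relevance of each basic term to one sentence.
term_sentence_vector = [0.0, 1.0, 1.0]
sentence_vector = mat_vec(conversion_matrix, term_sentence_vector)
```

Using the same conversion matrix for documents and sentences places both in one vector space, which is what allows sentences to be compared with topic vectors later.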
-
-
3. A system, as in claim 2, where the topic term set process creates a topic term set for each topic vector by selecting one or more of the basic terms having large values of the products of the interpretation matrix and the topic vector.
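One way to read claim 3 is as a ranking: multiply the interpretation matrix by the topic vector to get one score per basic term, then keep the terms with the largest scores. The sketch below assumes the interpretation matrix has one row per basic term and interprets "large values" as the top `k` scores; both choices, along with the sample terms and numbers, are assumptions for illustration.

```python
def topic_terms(interpretation_matrix, topic_vector, basic_terms, k=2):
    """Return the k (term, score) pairs with the largest scores.

    Each score is the dot product of a term's row in the interpretation
    matrix with the topic vector. Top-k selection is an assumption here.
    """
    scores = [
        sum(m * x for m, x in zip(row, topic_vector))
        for row in interpretation_matrix
    ]
    ranked = sorted(zip(basic_terms, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical data: three basic terms, a two-dimensional topic space.
basic_terms = ["engine", "wheel", "election"]
interpretation_matrix = [
    [0.9, 0.0],
    [0.7, 0.1],
    [0.0, 1.0],
]
topic_vector = [1.0, 0.0]
top = topic_terms(interpretation_matrix, topic_vector, basic_terms, k=2)
```

The score kept with each term can serve as the relevancy that claim 1 associates with each basic term in the topic term set.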
-
4. A system, as in claim 2, where the topic sentence set process creates a topic sentence set for each topic vector by selecting the sentence represented by the sentence vector that has a large sentence vector magnitude and a large direction similarity to the topic vector.
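Claim 4 names two criteria for a topic sentence, a large sentence vector magnitude and a large direction similarity to the topic vector, but does not say how they are combined. The sketch below assumes, purely for illustration, that the two are multiplied (magnitude times cosine); the function names and sample vectors are hypothetical.

```python
import math

def sentence_score(sentence_vector, topic_vector):
    """Score = sentence magnitude * cosine(sentence, topic).

    Multiplying the two criteria is an assumption of this sketch.
    """
    dot = sum(s * t for s, t in zip(sentence_vector, topic_vector))
    s_mag = math.sqrt(sum(s * s for s in sentence_vector))
    t_mag = math.sqrt(sum(t * t for t in topic_vector))
    if s_mag == 0.0 or t_mag == 0.0:
        return 0.0
    return s_mag * (dot / (s_mag * t_mag))

def pick_topic_sentence(sentence_vectors, topic_vector):
    """Return the index of the highest-scoring sentence vector."""
    return max(
        range(len(sentence_vectors)),
        key=lambda i: sentence_score(sentence_vectors[i], topic_vector),
    )

# Hypothetical sentence vectors: short but aligned, long and nearly aligned, long but orthogonal.
sentence_vectors = [[0.1, 0.0], [2.0, 0.5], [0.0, 3.0]]
topic_vector = [1.0, 0.0]
best = pick_topic_sentence(sentence_vectors, topic_vector)
```

With this scoring, a sentence must satisfy both criteria at once: the third vector is the longest but is orthogonal to the topic, and the first is perfectly aligned but very short, so neither is selected.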
-
5. A system, as in claim 1, comprising a document map for each topic vector, the document map having one or more document image objects, each of the document image objects associated with each document, each of the document image objects positioned and colored based on the topic-document relevancy.
-
6. A system, as in claim 5, where more than one document map is displayed and when a mouse or other pointing device is over a document image object in a document map, the color of the document image objects representing the document is changed in all the document maps.
-
7. A system, as in claim 5, where a document map has an indicator image object that indicates the direction of positional change of the document image object which represents larger relevance of the document to the topic.
-
8. A system, as in claim 5, where full text of the document is displayed when the corresponding document image object in the document map is selected with a mouse or other pointing device.
-
9. A system, as in claim 1, comprising a topic sentence frame for each topic vector, the topic sentence frame in which one or more topic sentences of the topic sentence set for the topic vector are displayed.
-
10. A system, as in claim 9, where a document map and a sentence frame for each topic vector are displayed.
-
11. A system, as in claim 10, where each topic sentence displayed in the sentence frame is associated with an image object, called a sentence image object, and when a mouse or other pointing device is over the sentence image object, the color of the document image objects representing the document containing that sentence is changed.
-
12. A system, as in claim 10, where full text of the document is displayed when the sentence image object associated with the sentence contained in that document is clicked with a mouse or other pointing device.
-
13. A system, as in claim 12, where the sentence associated with the clicked sentence image object is highlighted when full text of the document is displayed.
-
14. A method, executing on a computer, for identifying and describing one or more topics in one or more documents in a document set, the method comprising the steps of:
-
creating a basic term set from the document set being a set of one or more basic terms of one or more words;
creating a document vector for each document that has a document vector direction representing what the document is about;
creating one or more topic vectors from the document vectors, each topic vector having topic vector direction representing a topic in the document set;
creating a topic term set for each topic vector that comprises one or more of the basic terms describing the topic represented by the topic vector, each of the basic terms in the topic term set associated with relevancy of the basic term;
creating a topic-document relevance for each topic vector and each document vector, the topic-document relevance representing relevance of the document to the topic, wherein the topic-document relevance for a given topic vector and a given document vector is determined using corresponding ones of the topic vector directions and the document vector directions, where the topic-document relevance for a given topic vector and a given document vector is determined by computing a cosine between the given topic vector and the given document vector; and
creating a topic sentence set for each topic vector that comprises of one or more topic sentences that describe the topic represented by the topic vector, each of the topic sentences associated with relevance of the topic sentence to the topic represented by the topic vector.
-
-
15. A computer system for identifying and describing one or more topics in one or more documents in a document set, the system comprising:
-
means for creating a basic term set from the document set being a set of one or more basic terms of one or more words;
means for creating a document vector for each document that has a document vector direction representing what the document is about;
means for creating one or more topic vectors from the document vectors, each topic vector having topic vector direction representing a topic in the document set;
means for creating a topic term set for each topic vector that comprises one or more of the basic terms describing the topic represented by the topic vector, each of the basic terms in the topic term set associated with relevancy of the basic term;
means for creating a topic-document relevance for each topic vector and each document vector, the topic-document relevance representing relevance of the document to the topic, wherein the topic-document relevance for a given topic vector and a given document vector is determined using corresponding ones of the topic vector directions and the document vector directions, where the topic-document relevance for a given topic vector and a given document vector is determined by computing a cosine between the given topic vector and the given document vector; and
means for creating a topic sentence set for each topic vector that comprises of one or more topic sentences that describe the topic represented by the topic vector, each of the topic sentences associated with relevance of the topic sentence to the topic represented by the topic vector.
-
-
16. A computer program product, for identifying and describing one or more topics in one or more documents in a document set, which performs the steps of:
-
creating a basic term set from the document set being a set of one or more basic terms of one or more words;
creating a document vector for each document that has a document vector direction representing what the document is about;
creating one or more topic vectors from the document vectors, each topic vector having topic vector direction representing a topic in the document set;
creating a topic term set for each topic vector that comprises one or more of the basic terms describing the topic represented by the topic vector, each of the basic terms in the topic term set associated with relevancy of the basic term;
creating a topic-document relevance for each topic vector and each document vector, the topic-document relevance representing relevance of the document to the topic, wherein the topic-document relevance for a given topic vector and a given document vector is determined using corresponding ones of the topic vector directions and the document vector directions, where the topic-document relevance for a given topic vector and a given document vector is determined by computing a cosine between the given topic vector and the given document vector; and
creating a topic sentence set for each topic vector that comprises of one or more topic sentences that describe the topic represented by the topic vector, each of the topic sentences associated with relevance of the topic sentence to the topic represented by the topic vector.
-
Specification