Automatically summarising topics in a collection of electronic documents
Abstract
At least one topic in at least one document of a document set is automatically detected and summarised, where each document has a plurality of terms and a plurality of sentences comprising those terms. The terms and the sentences are represented as a plurality of vectors in a two-dimensional space. First, the documents are pre-processed to extract a plurality of significant terms and to create a plurality of basic terms. Next, the documents and the basic terms are formatted. The basic terms and the sentences are then reduced and used to create a matrix, which in turn is used to correlate the basic terms. A two-dimensional coordinate associated with each of the correlated basic terms is transformed to an n-dimensional coordinate, and the reduced sentence vectors are clustered in the n-dimensional space. Finally, the magnitudes of the reduced sentence vectors are used to summarise the topics.
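The first half of the abstract (pre-processing, reduction, and matrix creation) can be illustrated with a minimal Python sketch. It is not the patented implementation: the stop-word list, the plural-stripping rule used to form "basic terms", the `min_sentences` threshold, and all function names are assumptions introduced here for illustration, since the abstract does not say how terms are selected or reduced.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the abstract does not say how significant terms are chosen.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def split_sentences(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def basic_terms(sentence):
    """Lower-case, drop stop words, and strip a trailing 's' (a crude stand-in for stemming)."""
    words = re.findall(r"[a-z]+", sentence.lower())
    significant = [w for w in words if w not in STOP_WORDS]
    return [w[:-1] if len(w) > 3 and w.endswith("s") else w for w in significant]


def build_matrix(text, min_sentences=2):
    """Return (terms, sentences, matrix) where matrix[i][j] counts term i in sentence j."""
    sentences = split_sentences(text)
    term_lists = [basic_terms(s) for s in sentences]

    # Reduce the basic terms: keep those that occur in at least min_sentences sentences.
    sentence_freq = Counter(t for terms in term_lists for t in set(terms))
    kept_terms = sorted(t for t, n in sentence_freq.items() if n >= min_sentences)
    kept_set = set(kept_terms)

    # Reduce the sentences: keep those that contain at least one retained term.
    kept = [(s, terms) for s, terms in zip(sentences, term_lists)
            if any(t in kept_set for t in terms)]
    kept_sentences = [s for s, _ in kept]

    # Term-by-sentence matrix: one row per retained term, one column per retained sentence.
    matrix = [[terms.count(t) for _, terms in kept] for t in kept_terms]
    return kept_terms, kept_sentences, matrix


terms, sentences, matrix = build_matrix(
    "Cats chase mice. Mice fear cats. The weather is sunny."
)
print(terms)      # retained basic terms
print(sentences)  # retained sentences
print(matrix)     # term-by-sentence counts
```

In this toy input the third sentence shares no retained term with the others, so the sentence-reduction step drops it before the matrix is built.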
148 Citations
11 Claims
1. A method of detecting and summarising at least one topic in at least one document of a document set, each document in said document set having a plurality of terms and a plurality of sentences comprising said plurality of terms, wherein said plurality of terms and said plurality of sentences are represented as a plurality of vectors in a two-dimensional space, said method comprising the steps of:
pre-processing said at least one document to extract a plurality of significant terms and to create a plurality of basic terms;
formatting said at least one document and said plurality of basic terms;
reducing said plurality of basic terms;
reducing said plurality of sentences;
creating a matrix of said reduced plurality of basic terms and said reduced plurality of sentences;
utilising said matrix to correlate said plurality of basic terms;
transforming a two-dimensional coordinate associated with each of said correlated plurality of basic terms to an n-dimensional coordinate;
clustering said reduced plurality of sentence vectors in said n-dimensional space; and
associating magnitudes of said reduced plurality of sentence vectors with said at least one topic.
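The correlation and transformation steps of claim 1 do not name a specific technique, so the following Python sketch is only one possible reading. It assumes that basic terms are correlated by cosine similarity between rows of the term-by-sentence matrix, and that each term's n-dimensional coordinate is its row of the resulting correlation matrix (so n equals the number of basic terms). The function names `correlate_terms` and `sentence_vectors` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / denom if denom else 0.0


def correlate_terms(matrix):
    """Term-by-term correlation: cosine similarity between rows of the matrix."""
    return [[cosine(r1, r2) for r2 in matrix] for r1 in matrix]


def sentence_vectors(matrix, correlation):
    """Map each sentence (a matrix column) into n-dimensional term space.

    Assumption: a term's n-dimensional coordinate is its correlation row, and a
    sentence vector is the count-weighted sum of the coordinates of its terms.
    """
    n_terms = len(matrix)
    n_sentences = len(matrix[0]) if matrix else 0
    vectors = []
    for j in range(n_sentences):
        vec = [0.0] * n_terms
        for i in range(n_terms):
            for k in range(n_terms):
                vec[k] += matrix[i][j] * correlation[i][k]
        vectors.append(vec)
    return vectors


# Rows: three illustrative basic terms; columns: three sentences.
matrix = [
    [1, 1, 0],  # "cat"
    [1, 1, 0],  # "mice"
    [0, 0, 2],  # "stock"
]
correlation = correlate_terms(matrix)
for vec in sentence_vectors(matrix, correlation):
    print([round(x, 3) for x in vec])
```

Under these assumptions, terms that always co-occur ("cat" and "mice" here) reinforce each other, so sentences about the same topic point in nearly the same direction in the n-dimensional space.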
10. A system for detecting and summarising at least one topic in at least one document of a document set, each document in said document set having a plurality of terms and a plurality of sentences comprising said plurality of terms, wherein said plurality of terms and said plurality of sentences are represented as a plurality of vectors in a two-dimensional space, said system comprising:
means for pre-processing said at least one document to extract a plurality of significant terms and to create a plurality of basic terms;
means for formatting said at least one document and said plurality of basic terms;
means for reducing said plurality of basic terms;
means for reducing said plurality of sentences;
means for creating a matrix of said reduced plurality of basic terms and said reduced plurality of sentences;
means for utilising said matrix to correlate said plurality of basic terms;
means for transforming a two-dimensional coordinate associated with each of said correlated plurality of basic terms to an n-dimensional coordinate;
means for clustering said reduced plurality of sentence vectors in said n-dimensional space; and
means for associating magnitudes of said reduced plurality of sentence vectors with said at least one topic.
11. Computer readable code stored on a computer readable storage medium for detecting and summarising at least one topic in at least one document of a document set, each document in said document set having a plurality of terms and a plurality of sentences comprising said plurality of terms, said computer readable code comprising:
first processes for pre-processing said at least one document to extract a plurality of significant terms and to create a plurality of basic terms;
second processes for formatting said at least one document and said plurality of basic terms;
third processes for reducing said plurality of basic terms;
fourth processes for reducing said plurality of sentences;
fifth processes for creating a matrix of said reduced plurality of basic terms and said reduced plurality of sentences;
sixth processes for utilising said matrix to correlate said plurality of basic terms;
seventh processes for transforming a two-dimensional coordinate associated with each of said correlated plurality of basic terms to an n-dimensional coordinate;
eighth processes for clustering said reduced plurality of sentence vectors in said n-dimensional space; and
ninth processes for associating magnitudes of said reduced plurality of sentence vectors with said at least one topic.
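The final two processes (clustering the sentence vectors and associating their magnitudes with topics) can be sketched as follows. The claims do not specify a clustering algorithm or how magnitudes are used, so this sketch assumes a greedy single-pass clustering with a hypothetical cosine `threshold`, treats each cluster as one topic, and picks the sentence with the largest Euclidean norm as that topic's summary. All names are illustrative.

```python
import math


def norm(v):
    """Euclidean magnitude of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero magnitude."""
    denom = norm(u) * norm(v)
    return sum(a * b for a, b in zip(u, v)) / denom if denom else 0.0


def cluster(vectors, threshold=0.8):
    """Greedy single-pass clustering (an assumption, not the claimed method).

    Each vector joins the first cluster whose seed (first member) is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters = []  # lists of vector indices; index 0 of each list is the seed
    for idx, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vectors[members[0]], vec) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


def summarise(sentences, vectors, threshold=0.8):
    """Return one sentence per cluster: the member with the largest magnitude."""
    return [
        sentences[max(members, key=lambda i: norm(vectors[i]))]
        for members in cluster(vectors, threshold)
    ]


sentences = ["Cats chase mice all day.", "Mice fear cats.", "Stock prices fell."]
vectors = [[2.0, 2.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
print(cluster(vectors))
print(summarise(sentences, vectors))
```

Using direction (cosine) to group sentences and magnitude to rank them keeps the two roles separate: direction decides which topic a sentence belongs to, and magnitude decides how strongly it represents that topic.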
Specification