Information data retrieval, where the data is organized in terms, documents and document corpora
Abstract
The invention relates to improved solutions for information retrieval, wherein the information is represented by digitized text data. This data is further presumed to be organized in terms (431-438), documents and document corpora, where each document contains at least one term (431-438) and each document corpus contains at least one document. Based on a concept vector (420-424), which conceptually classifies the contents of each document, a term-to-concept vector is generated for each term (431-438) in the document corpus. The term-to-concept vector describes a relationship between the term (431) and each of the concept vectors (420-424). On the basis of the term-to-concept vectors for the document corpus, a term-term matrix is generated which describes a term-to-term relationship between all the terms (431-438) in the document corpus. The term-term matrix may then be processed and used for retrieving information from the document corpus, for example the fact that a first term (431) is related to a second term (436).
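The pipeline described in the abstract can be illustrated with a minimal sketch. Nothing below is taken from the patent itself: the hand-assigned concept vectors, the dot-product weighting used to build the term-to-concept vectors, and the cosine similarity used for the term-term matrix are all assumptions chosen for illustration, as are the function names `term_to_concept_vectors` and `term_term_matrix`.

```python
from collections import Counter
from math import sqrt


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = sqrt(dot(u, u)) * sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def term_to_concept_vectors(docs, concepts):
    """Build one vector per term with one entry per concept vector.

    ASSUMPTION: entry j is the sum, over the documents d containing the
    term, of tf(term, d) * dot(concepts[d], concepts[j]).
    """
    vectors = {}
    for d, doc in enumerate(docs):
        for term, tf in Counter(doc).items():
            row = vectors.setdefault(term, [0.0] * len(concepts))
            for j, c_j in enumerate(concepts):
                row[j] += tf * dot(concepts[d], c_j)
    return vectors


def term_term_matrix(t2c):
    """ASSUMPTION: term-to-term relationship = cosine of the two term vectors."""
    terms = sorted(t2c)
    return {a: {b: cosine(t2c[a], t2c[b]) for b in terms} for a in terms}


# Toy corpus: each document is a list of terms.
docs = [["cat", "pet"], ["dog", "pet"], ["stock", "market"]]
# Hypothetical concept vectors, one per document (axes: animals, finance).
concepts = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

t2c = term_to_concept_vectors(docs, concepts)
matrix = term_term_matrix(t2c)
print(matrix["cat"])
```

In this toy corpus "cat" and "dog" never share a document, yet they come out as closely related because their documents have the same concept vector, which is the kind of indirect term-to-term relationship the abstract describes.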
26 Claims
1. A method of processing digitized textual information, the information being organized in terms, documents and document corpora, where each document contains at least one term and each document corpus contains at least one document, the method comprising:
- generating a concept vector for each document in a document corpus, the concept vector conceptually classifying the contents of the document in a relatively compact format;
- generating, for each term in the document corpus, a term-to-concept vector describing a relationship between the term and each of the concept vectors, the term-to-concept vectors being generated on the basis of the concept vectors;
- receiving the term-to-concept vectors for the document corpus and, on the basis thereof, generating a term-term matrix describing a term-to-term relationship between the terms in the document corpus; and
- processing the term-term matrix into processed textual information.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
15. A computer program directly loadable into the internal memory of a digital computer, comprising software for performing a method of processing digitized textual information, the information being organized in terms, documents and document corpora, where each document contains at least one term and each document corpus contains at least one document, the method comprising:
- generating a concept vector for each document in a document corpus, the concept vector conceptually classifying the contents of the document in a relatively compact format;
- generating, for each term in the document corpus, a term-to-concept vector describing a relationship between the term and each of the concept vectors, the term-to-concept vectors being generated on the basis of the concept vectors;
- receiving the term-to-concept vectors for the document corpus and, on the basis thereof, generating a term-term matrix describing a term-to-term relationship between the terms in the document corpus; and
- processing the term-term matrix into processed textual information.
16. A computer readable medium, having a program recorded thereon, where the program is to make a computer perform a method of processing digitized textual information, the information being organized in terms, documents and document corpora, where each document contains at least one term and each document corpus contains at least one document, the method comprising:
- generating a concept vector for each document in a document corpus, the concept vector conceptually classifying the contents of the document in a relatively compact format;
- generating, for each term in the document corpus, a term-to-concept vector describing a relationship between the term and each of the concept vectors, the term-to-concept vectors being generated on the basis of the concept vectors;
- receiving the term-to-concept vectors for the document corpus and, on the basis thereof, generating a term-term matrix describing a term-to-term relationship between the terms in the document corpus; and
- processing the term-term matrix into processed textual information.
17. A search engine for processing an amount of digitized textual information and extracting data therefrom, the information being organized in terms, documents and document corpora, where each document contains at least one term and each document corpus contains at least one document, comprising:
- an interface adapted to receive a query from a user; and
- a processing unit adapted to process a document corpus on the basis of the query and return processed textual information relevant to the query, said process involving generating a concept vector for each document in the document corpus, the concept vector conceptually classifying the contents of the document in a relatively compact format, and generating, for each term in the document corpus, a term-to-concept vector describing a relationship between the term and each of the concept vectors, wherein the processing unit in turn comprises:
  - a processing module adapted to receive the term-to-concept vectors for the document corpus and, on the basis thereof, generate a term-term matrix describing a term-to-term relationship between the terms in the document corpus; and
  - an exploring module adapted to receive the query and the term-term matrix and, on the basis of the query, process the term-term matrix into the processed textual information.
Dependent claims: 18, 19, 20, 21, 22
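As an illustration of the exploring module in claim 17, the following sketch takes a query and a precomputed term-term matrix and returns the terms most strongly related to the query. The function name `explore`, the whitespace tokenization, the additive scoring and the sample matrix values are all hypothetical; the claim does not prescribe any of them.

```python
def explore(query, matrix, top_n=3):
    """Rank terms by their summed relationship to the query terms.

    ASSUMPTIONS: the query is split on whitespace, unknown words are
    ignored, and relationship weights are simply added together.
    """
    query_terms = [t for t in query.lower().split() if t in matrix]
    scores = {}
    for q in query_terms:
        for term, weight in matrix[q].items():
            if term not in query_terms:  # do not return the query itself
                scores[term] = scores.get(term, 0.0) + weight
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Hypothetical term-term matrix (made-up symmetric weights).
matrix = {
    "cat":   {"cat": 1.0, "dog": 0.9, "pet": 0.8, "stock": 0.0},
    "dog":   {"cat": 0.9, "dog": 1.0, "pet": 0.8, "stock": 0.0},
    "pet":   {"cat": 0.8, "dog": 0.8, "pet": 1.0, "stock": 0.0},
    "stock": {"cat": 0.0, "dog": 0.0, "pet": 0.0, "stock": 1.0},
}

related = explore("cat", matrix, top_n=2)
print(related)
```

Here the interface is reduced to a plain function argument; a real search engine would place a user-facing front end in front of `explore`.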
23. A method of processing digitized textual information, the information being organized in terms, documents and document corpora, where each document contains at least one term and each document corpus contains at least one document, the method comprising:
- identifying a particular document corpus;
- filtering the identified document corpus, wherein a number of documents fulfilling at least one specified criterion are selected; and
- producing a new document corpus exclusively containing the selected documents.
Dependent claims: 24, 25, 26
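The filtering step of claim 23 can be sketched as follows. The document fields (`text`, `year`), the example criteria and the function name `filter_corpus` are hypothetical, and the sketch assumes that a document must satisfy every supplied criterion to be selected; the claim itself only requires that at least one criterion is specified.

```python
def filter_corpus(corpus, *criteria):
    """Return a new corpus containing only the selected documents.

    ASSUMPTION: a document is selected when it fulfils all of the
    supplied criteria; at least one criterion must be given.
    """
    if not criteria:
        raise ValueError("at least one criterion must be specified")
    return [doc for doc in corpus if all(check(doc) for check in criteria)]


# Hypothetical corpus: each document is a dict with made-up fields.
corpus = [
    {"text": "cats are popular pets", "year": 2001},
    {"text": "stock markets fell", "year": 1999},
    {"text": "dogs are loyal pets", "year": 1998},
]

# Two example criteria: mentions pets, and published in 2000 or later.
mentions_pets = lambda doc: "pets" in doc["text"].split()
recent = lambda doc: doc["year"] >= 2000

new_corpus = filter_corpus(corpus, mentions_pets, recent)
print(new_corpus)
```

The original corpus is left untouched, so the new corpus contains exclusively the selected documents and can be processed separately.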
Specification