Information data retrieval, where the data is organized in terms, documents and document corpora
Abstract
The invention relates to improved solutions for information retrieval, wherein the information is represented by digitized text data. This data is further presumed to be organized in terms (431-438), documents and document corpora, where each document contains at least one term (431-438) and each document corpus contains at least one document. Based on a concept vector (420-424), which conceptually classifies the contents of each document, a term-to-concept vector is generated for each term (431-438) in the document corpus. The term-to-concept vector describes a relationship between the term (431) and each of the concept vectors (420-424). On basis of the term-to-concept vectors for the document corpus, a term-term matrix is generated which describes a term-to-term relationship between all the terms (431-438) in the document corpus. The term-term matrix may then be processed and used for retrieving information from the document corpus, such as the fact that a first term (431) is related to a second term (436).
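The term-term matrix step can be illustrated with a minimal Python sketch. It follows the rule stated in the claims: the relation vector of two terms is the component-wise minimum of their term-to-concept vectors, and the relationship value is the sum of that vector's components. The function names, the dictionary layout and the sample vectors are assumptions made for illustration only; they do not come from the patent.

```python
from itertools import combinations


def relationship_value(vec_a, vec_b):
    """Sum of the component-wise minimum of two term-to-concept vectors."""
    relation_vector = [min(a, b) for a, b in zip(vec_a, vec_b)]
    return sum(relation_vector)


def term_term_matrix(term_to_concept):
    """Map every unordered pair of distinct terms to its relationship value."""
    matrix = {}
    for a, b in combinations(sorted(term_to_concept), 2):
        matrix[(a, b)] = relationship_value(term_to_concept[a], term_to_concept[b])
    return matrix


# Hypothetical term-to-concept vectors over three concepts (invented values).
vectors = {
    "engine": [3, 0, 1],
    "piston": [2, 0, 0],
    "recipe": [0, 4, 1],
}
matrix = term_term_matrix(vectors)
print(matrix)
```

Storing only one entry per unordered pair reflects the symmetry of the minimum operation; a full square matrix would carry the same information twice.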
43 Citations
18 Claims
-
1. A method of processing digitized textual information in a computerized database system, the information being organized in terms, documents and document corpora, where each document contains at least one term and each document corpus contains at least one document, the method comprising:
-
generating, by using a computer, a concept vector for each document in a document corpus wherein the concept vector conceptually classifying contents of the document on a relatively compact format, generating, for each term in the document corpus, a term-to-concept vector describing a relationship between the term and each of the concept vectors wherein the term-to-concept vectors being generated on basis of the concept vectors, comprises; receiving the term-to-concept vectors for the document corpus and on basis thereof generating a term-term matrix describing a term-to-term relationship between the terms in the document corpus, wherein the generation of the term-term matrix comprises;
retrieving, for each term in each combination of two unique terms in the document corpus, a respective term-to-concept vector, generating a relation vector describing the relationship between the terms in the each combination of two unique terms, each component in the relation vector being equal to a lowest component value of corresponding component values in the term-to-concept vectors, generating a relationship value for the each combination of two unique terms as the sum of all component values in the corresponding relation vector, and generating a matrix containing the relationship values of all combinations of two unique terms in the document corpus, processing the term-term matrix into processed textual information and displaying the processed textual information via a user output interface, and displaying the processed textual information as a distance graph in which each term constitutes a node wherein the node representing a first term is connected to one or more other nodes representing secondary terms to which the first term has a conceptual relationship of at least a specific strength, and a relevance measure between the first term and at least one second term is represented by a minimum number of node hops between the first term and the at least one second term. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
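The distance graph in claim 1 can be sketched as follows, assuming the relationship values are already available. Terms become nodes, two nodes are connected when their relationship value reaches a chosen minimum strength, and relevance is read off as the smallest number of hops between nodes. The breadth-first search, the `min_strength` parameter and the sample values are illustrative assumptions; the claim does not prescribe a particular search procedure or threshold.

```python
from collections import deque


def build_graph(relationship_values, min_strength):
    """Connect two terms when their relationship value is at least min_strength."""
    graph = {}
    for (a, b), value in relationship_values.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if value >= min_strength:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def hop_distances(graph, start):
    """Breadth-first search: minimum number of hops from start to each reachable term."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


# Hypothetical relationship values (invented for illustration).
values = {
    ("engine", "piston"): 5,
    ("piston", "valve"): 4,
    ("engine", "valve"): 1,
    ("engine", "recipe"): 0,
    ("piston", "recipe"): 0,
    ("recipe", "valve"): 0,
}
graph = build_graph(values, min_strength=3)
hops = hop_distances(graph, "engine")
print(hops)
```

In this sample, "valve" is not directly connected to "engine" because their value falls below the threshold, but it is still reachable in two hops through "piston". Terms with no path to the start node, such as "recipe" here, are simply absent from the result.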
-
-
13. A computer-implemented search engine, embedded on a computer readable storage medium, for processing an amount of digitized textual information and extracting data there from, the information being organized in terms, documents and document corpora, where each document contains at least one term and each document corpus contains at least one document, comprising:
-
an interface configured to receive a query from a user, and a processing unit configured to process a document corpus on basis of the query and return processed textual information being relevant to the query said process involving generating a concept vector for each document in the document corpus, the concept vector conceptually classifying contents of the document on a relatively compact format, and generating, for each term in the document corpus, a term-to-concept vector describing a relationship between the term and each of the concept vectors, wherein the processing unit in turn comprises; a processing module configured to receive the term-to-concept vectors for the document corpus and on basis thereof generate a term-term matrix describing a term-to-term relationship between the terms in the document corpus, wherein the generation of the term-term matrix comprises;
retrieving, for each term in each combination of two unique terms in the document corpus, a respective term-to-concept vector, generating a relation vector describing the relationship between the terms in the each combination of two unique terms, each component in the relation vector being equal to a lowest component value of corresponding component values in the term-to-concept vectors, generating a relationship value for the each combination of two unique terms as the sum of all component values in the corresponding relation vector, and generating a matrix containing the relationship values of all combinations of two unique terms in the document corpus, an exploring module configured to receive the query and the term-term matrix, and on basis of the query process the term-term matrix into the processed textual information, and a display module configured to display the processed textual information as a distance graph in which each term constitutes a node wherein the node representing a first term is connected to one or more other nodes representing secondary terms to which the first term has a conceptual relationship of at least a specific strength, and a relevance measure between the first term and at least one second term is represented by a minimum number of node hops between the first term and the at least one second term.
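One possible reading of the exploring module in claim 13 is sketched below: given a query term and the relationship values, it returns the terms directly related to the query, strongest first. The claim does not say how the query is applied to the term-term matrix, so the single-term query, the `min_strength` cut-off, the ordering and the `explore` function itself are all assumptions made for illustration.

```python
def explore(query_term, relationship_values, min_strength):
    """Return (term, value) pairs related to query_term, strongest first."""
    related = []
    for (a, b), value in relationship_values.items():
        if value < min_strength:
            continue  # relationship too weak to report
        if a == query_term:
            related.append((b, value))
        elif b == query_term:
            related.append((a, value))
    # Sort by descending value, then alphabetically for a stable order.
    return sorted(related, key=lambda item: (-item[1], item[0]))


# Hypothetical relationship values (invented for illustration).
values = {
    ("engine", "piston"): 5,
    ("engine", "valve"): 3,
    ("piston", "valve"): 4,
    ("engine", "recipe"): 0,
}
result = explore("valve", values, min_strength=3)
print(result)
```

The output of such a step could then be handed to a display module that lays the terms out as a graph.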
-
-
14. A computer-implemented database system comprising:
-
a processor; memory holding an amount of digitized textual information being organized in terms, documents and document corpora, wherein each document contains at least one term and each document corpus contains at least one document, wherein each document in a document corpus being associated with concept vector which conceptually classifies contents of the document on a relatively compact format, and wherein each term in the document corpus being associated with a term-to-concept vector describing a relationship between the term and each of the concept vectors, delivering the term-to-concept vectors to a search engine for processing an amount of digitized textual information and extracting data there from, the information being organized in terms, documents and document corpora, where each document contains at least one term and each document corpus contains at least one document, and computer program instructions implementing; an interface configured to receive a query from a user, and a processing unit configured to process a document corpus on basis of the query and return processed textual information being relevant to the query said process involving generating a concept vector for each document in the document corpus, the concept vector conceptually classifying the contents of the document on a relatively compact format, and generating, for each term in the document corpus, a term-to-concept vector describing a relationship between the term and each of the concept vectors, wherein the processing unit in turn comprises; a processing module configured to receive the term-to-concept vectors for the document corpus and on basis thereof generate a term-term matrix describing a term-to-term relationship between the terms in the document corpus, wherein the generation of the term-term matrix comprises;
retrieving, for each term in each combination of two unique terms in the document corpus, a respective term-to-concept vector, generating a relation vector describing the relationship between the terms in the each combination of two unique terms, each component in the relation vector being equal to a lowest component value of corresponding component values in the term-to-concept vectors, generating a relationship value for the each combination of two unique terms as the sum of all component values in the corresponding relation vector, and generating a matrix containing the relationship values of all combinations of two unique terms in the document corpus, an exploring module configured to receive the query and the term-term matrix, and on basis of the query process the term-term matrix into the processed textual information, and a display module configured to display the processed textual information as a distance graph in which each term constitutes a node wherein the node representing a first term is connected to one or more other nodes representing secondary terms to which the first term has a conceptual relationship of at least a specific strength, and a relevance measure between the first term and at least one second term is represented by a minimum number of node hops between the first term and the at least one second term. - View Dependent Claims (15)
-
-
16. A server computer system for providing data processing services in respect of digitized textual information, wherein the server comprises:
-
a processor; memory for storing computer program instructions and data; and computer program instructions stored in the memory for implementing; a search engine for processing an amount of digitized textual information and extracting data there from, the information being organized in terms, documents and document corpora, where each document contains at least one term and each document corpus contains at least one document, comprising an interface configured to receive a query from a user, and a processing unit configured to process a document corpus on basis of the query and return processed textual information being relevant to the query said process involving generating a concept vector for each document in the document corpus, the concept vector conceptually classifying contents of the document on a relatively compact format, and generating, for each term in the document corpus, a term-to-concept vector describing a relationship between the term and each of the concept vectors, wherein the processing unit in turn comprises a processing module configured to receive the term-to-concept vectors for the document corpus and on basis thereof generate a term-term matrix describing a term-to-term relationship between the terms in the document corpus, wherein the generation of the term-term matrix comprises;
retrieving, for each term in each combination of two unique terms in the document corpus, a respective term-to-concept vector, generating a relation vector describing the relationship between the terms in the each combination of two unique terms, each component in the relation vector being equal to a lowest component value of corresponding component values in the term-to-concept vectors, generating a relationship value for the each combination of two unique terms as the sum of all component values in the corresponding relation vector, and generating a matrix containing the relationship values of all combinations of two unique terms in the document corpus, an exploring module configured to receive the query and the term-term matrix, and on basis of the query process the term-term matrix into the processed textual information, a display module configured to display the processed textual information as a distance graph in which each term constitutes a node wherein the node representing a first term is connected to one or more other nodes representing secondary terms to which the first term has a conceptual relationship of at least a specific strength, and a relevance measure between the first term and at least one second term is represented by a minimum number of node hops between the first term and the at least one second term, and a communication interface towards a database system holding an amount of digitized textual information and configured to deliver the term-to-concept vectors to the search engine.
-
-
17. A computer system comprising:
- a processor for executing computer program instructions, a memory for storing computer program instructions and computer program instructions comprising software for processing digitized textual information, the information being organized in terms, documents and document corpora, where each document contains at least one term and each document corpus contains at least one document, the digitized textual information processed by;
generating a concept vector for each document in a document corpus wherein the concept vector conceptually classifying the contents of the document on a relatively compact format, generating, for each term in the document corpus, a term-to-concept vector describing a relationship between the term and each of the concept vectors wherein the term-to-concept vectors being generated on basis of the concept vectors, receiving the term-to-concept vectors for the document corpus and on basis thereof generating a term-term matrix describing a term-to-term relationship between the terms in the document corpus, wherein the generation of the term-term matrix comprises;
retrieving, for each term in each combination of two unique terms in the document corpus, a respective term-to-concept vector, generating a relation vector describing the relationship between the terms in the each combination of two unique terms, each component in the relation vector being equal to a lowest component value of corresponding component values in the term-to-concept vectors, generating a relationship value for the each combination of two unique terms as the sum of all component values in the corresponding relation vector, and generating a matrix containing the relationship values of all combinations of two unique terms in the document corpus, processing the term-term matrix into processed textual information and displaying the processed textual information via a user output interface, and displaying the processed textual information as a distance graph in which each term constitutes a node wherein the node representing a first term is connected to one or more other nodes representing secondary terms to which the first term has a conceptual relationship of at least a specific strength, and a relevance measure between the first term and at least one second term is represented by a minimum number of node hops between the first term and the at least one second term.
-
18. A computer program product stored in a computer readable storage medium, the computer program product comprising:
-
computer program instructions recorded thereon for causing a computer to process digitized textual information, the information being organized in terms, documents and document corpora, where each document contains at least one term and each document corpus contains at least one document, the digitized textual information processed by; generating a concept vector for each document in a document corpus wherein the concept vector conceptually classifying contents of the document on a relatively compact format, generating, for each term in the document corpus, a term-to-concept vector describing a relationship between the term and each of the concept vectors wherein the term-to-concept vectors being generated on basis of the concept vectors, receiving the term-to-concept vectors for the document corpus and on basis thereof generating a term-term matrix describing a term-to-term relationship between the terms in the document corpus, wherein the generation of the term-term matrix comprises;
retrieving, for each term in each combination of two unique terms in the document corpus, a respective term-to-concept vector, generating a relation vector describing the relationship between the terms in the each combination of two unique terms, each component in the relation vector being equal to a lowest component value of corresponding component values in the term-to-concept vectors, generating a relationship value for the each combination of two unique terms as the sum of all component values in the corresponding relation vector, and generating a matrix containing the relationship values of all combinations of two unique terms in the document corpus, processing the term-term matrix into processed textual information and displaying the processed textual information via a user output interface, and displaying the processed textual information as a distance graph in which each term constitutes a node wherein the node representing a first term is connected to one or more other nodes representing secondary terms to which the first term has a conceptual relationship of at least a specific strength, and a relevance measure between the first term and at least one second term is represented by a minimum number of node hops between the first term and the at least one second term.
-
Specification