Systems and methods for determining relevant information based on document structure
First Claim
1. A method of determining information relevant to a location within a first document, the method comprising:
receiving a selection of the first document, the first document being received through an input and output interface of a computer;
identifying at least two structural elements in the first document having a dominance relationship, the identifying being performed by one or more processors of the computer;
receiving a selection of a first location in the first document from a user through the input and output interface;
determining surrounding structural elements surrounding the first location, the determining comprising selecting from the at least two structural elements;
characterizing the surrounding structural elements by the one or more processors;
characterizing one or more non-surrounding structural elements from among the at least two structural elements not determined to be the surrounding structural elements by the one or more processors;
characterizing surrounding phrase for frequency of occurrence of a plurality of first terms by the one or more processors;
characterizing non-surrounding phrases in the first document for the occurrence of the plurality of the first terms by the one or more processors, the non-surrounding phrases being phrases in the first document other than the surrounding phrase;
associating one or more second documents with the surrounding structural elements based on the characterization of the surrounding structural elements and the one or more non-surrounding structural elements by the one or more processors, wherein the one or more second documents are determined as being similar to the surrounding structural elements and being dissimilar to the one or more non-surrounding structural elements;
creating representative vectors based on the frequency of occurrence of the first terms in the surrounding structural elements, performing latent semantic analysis (LSA) on the surrounding structural elements, the surrounding structural elements are determined based on explicit or implicit information, the implicit information is determined based on theory of analysis, the theory of analysis is at least one of;
Linguistic Discourse Model (LDM), Universal Linguistic Discourse Model (ULDM), Discourse Structures Theory (DST), Rhetorical Structures Theory (RST), and Structure Discourse Representation Theory (SDRT), the characterizing of the surrounding structural elements is based on similarity of the representative vectors, the representative vectors are used to select additional documents that are similar in meaning to the surrounding structure elements but are dissimilar to the non-surrounding structure elements, wherein the additional documents are in association with the first location; and
removing a second group of the one or more second documents from among first groups of the one or more second documents to obtain a third group of the one or more documents, wherein the removing is based on the characterizing the surrounding structure elements.
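As an illustration of the "determining surrounding structural elements" step recited above, the following minimal sketch models the dominance relationship as a tree in which a parent element dominates its children. The `Element` class, the character-offset spans, and the function names are assumptions made for this example; the claim does not prescribe any particular data structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    """A structural element covering characters [start, end) of the document."""
    label: str
    start: int
    end: int
    children: List["Element"] = field(default_factory=list)


def surrounding_chain(root: Element, location: int) -> List[Element]:
    """Return the elements that contain `location`, outermost first."""
    chain = []
    node: Optional[Element] = root
    while node is not None and node.start <= location < node.end:
        chain.append(node)
        # Descend into the child (if any) that contains the location.
        node = next((c for c in node.children if c.start <= location < c.end), None)
    return chain


def split_elements(root: Element, location: int):
    """Partition all elements into surrounding and non-surrounding lists."""
    surrounding = surrounding_chain(root, location)
    surrounding_ids = {id(e) for e in surrounding}
    non_surrounding = []
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) not in surrounding_ids:
            non_surrounding.append(node)
        stack.extend(node.children)
    return surrounding, non_surrounding


# Hypothetical document: two sections, the first with two paragraphs.
doc = Element("document", 0, 100, [
    Element("section 1", 0, 40, [Element("para 1.1", 0, 20), Element("para 1.2", 20, 40)]),
    Element("section 2", 40, 100, [Element("para 2.1", 40, 100)]),
])
inside, outside = split_elements(doc, 25)
print([e.label for e in inside])  # ['document', 'section 1', 'para 1.2']
```

Here the selected location (offset 25) falls in "para 1.2", so that paragraph and the elements dominating it are treated as surrounding, and every other element is non-surrounding.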
Abstract
Techniques are provided for determining relevant information from a document based on document structure. A document is selected and structural elements within the document having a dominance relationship are determined. A first location within the document is selected. The structural element surrounding the first location is determined and the surrounding and non-surrounding structural elements are characterized. Additional documents are associated with the first location in the surrounding structural element based on the surrounding structural element characterization and the non-surrounding structural element characterization. Techniques for dynamically determining annotations for images based on document structure are also provided.
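The association step summarized in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses plain term-frequency vectors and cosine similarity, omits the latent semantic analysis step named in the claims, and the scoring rule (similarity to the surrounding text minus similarity to the non-surrounding text) and all names are hypothetical choices for this example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Map each lowercase word in `text` to its frequency of occurrence."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_candidates(surrounding, non_surrounding, candidates):
    """Rank candidates by similarity to the surrounding text minus similarity to the rest."""
    s_vec = term_vector(surrounding)
    n_vec = term_vector(non_surrounding)
    scored = []
    for name, text in candidates.items():
        vec = term_vector(text)
        scored.append((cosine(vec, s_vec) - cosine(vec, n_vec), name))
    return [name for score, name in sorted(scored, reverse=True)]


# Hypothetical example: the selected location sits in a passage about cars,
# while the rest of the document is about wildlife.
ranked = rank_candidates(
    surrounding="jaguar engine torque and engine tuning",
    non_surrounding="jaguar habitat in the rainforest",
    candidates={
        "car_review": "engine torque figures for the new jaguar",
        "wildlife_guide": "jaguar habitat and rainforest prey",
    },
)
print(ranked)  # ['car_review', 'wildlife_guide']
```

Both candidates mention "jaguar", but only the car review resembles the surrounding element while differing from the non-surrounding one, so it ranks first.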
14 Claims
1. A method of determining information relevant to a location within a first document, as recited in full under First Claim above. Dependent claims: 2, 3, 4, 5, 6, 7.
8. An apparatus for determining relevant information comprising:
one or more processors;
an input/output circuit that retrieves a first document from a document repository responsive to a user selection;
a document structure manager that identifies at least two structural elements in the first document having a dominance relationship;
input and output interface that receives a selection of a first location in the first document;
a structural element manager identifies surrounding structural elements surrounding the selected first location and one or more non-surrounding structural elements from among the at least two structural elements;
a characterization manager characterizes the surrounding structural elements and the one or more non-surrounding structural elements from among the at least two structural elements that is not determined to be the surrounding structural elements;
the characterization manager characterizes surrounding phrase for frequency of occurrence of a plurality of first terms;
the characterization manager further characterizes non-surrounding phrases in the first document for the occurrence of the plurality of the first terms, the non-surrounding phrases being phrases in the first document other than the surrounding phrase; and
a readable program code for;
associating one or more second documents with the surrounding structural elements based on the characterization of the surrounding structural elements and the one or more non-surrounding structural elements, wherein the one or more second documents are determined as being similar to the surrounding structural elements and being dissimilar to the one or more non-surrounding structural elements;
creating representative vectors based on the frequency of occurrence of the first terms in the surrounding structural elements, performing latent semantic analysis (LSA) on the surrounding structural elements, the surrounding structural elements are determined based on explicit or implicit information, the implicit information is determined based on theory of analysis, the theory of analysis is at least one of;
Linguistic Discourse Model (LDM), Universal Linguistic Discourse Model (ULDM), Discourse Structures Theory (DST), Rhetorical Structures Theory (RST), and Structure Discourse Representation Theory (SDRT), the characterizing of the surrounding structural elements is based on similarity of the representative vectors, the representative vectors are used to select additional documents that are similar in meaning to the surrounding structure elements but are dissimilar to the non-surrounding structure elements, wherein the additional documents are in association with the first location; and
removing a second group of the one or more second documents from among first groups of the one or more second documents to obtain a third group of the one or more documents, wherein the removing is based on the characterizing the surrounding structure elements.
Dependent claims: 9, 10, 11, 12
13. A computer readable storage medium comprising computer readable program code embodied on the computer readable storage medium, the computer readable program code useable to program a computer for performing the steps of:
receiving a selection of a first document, the first document being received through an input and output interface of a computer;
identifying at least two structural elements in the first document having a dominance relationship, the identifying being performed by one or more processors of the computer;
receiving a selection of a first location in the first document from a user through the input and output interface;
determining surrounding structural elements surrounding the first location, the determining comprising selecting from the at least two structural elements;
characterizing the surrounding structural elements by the one or more processors;
characterizing one or more non-surrounding structural elements from among the at least two structural elements not determined to be the surrounding structural elements by the one or more processors;
characterizing surrounding phrase for frequency of occurrence of a plurality of first terms by the one or more processors;
characterizing non-surrounding phrases in the first document for the occurrence of the plurality of the first terms by the one or more processors, the non-surrounding phrases being phrases in the first document other than the surrounding phrase;
associating one or more second documents with the surrounding structural elements based on the characterization of the surrounding structural elements and the one or more non-surrounding structural elements by the one or more processors, wherein the one or more second documents are determined as being similar to the surrounding structural elements and being dissimilar to the one or more non-surrounding structural elements;
creating representative vectors based on the frequency of occurrence of the first terms in the surrounding structural elements, performing latent semantic analysis (LSA) on the surrounding structural elements, the surrounding structural elements are determined based on explicit or implicit information, the implicit information is determined based on theory of analysis, the theory of analysis is at least one of;
Linguistic Discourse Model (LDM), Universal Linguistic Discourse Model (ULDM), Discourse Structures Theory (DST), Rhetorical Structures Theory (RST), and Structure Discourse Representation Theory (SDRT), the characterizing of the surrounding structural elements is based on similarity of the representative vectors, the representative vectors are used to select additional documents that are similar in meaning to the surrounding structure elements but are dissimilar to the non-surrounding structure elements, wherein the additional documents are in association with the first location; and
removing a second group of the one or more second documents from among first groups of the one or more second documents to obtain a third group of the one or more documents, wherein the removing is based on the characterizing the surrounding structure elements.
14. A method for retrieving information relevant to a word within a document, the method comprising:
retrieving, responsive to a first user input, a first document from a document repository saved on a database coupled to a computer, the first document including a plurality of phrases;
determining the plurality of phrases in the first document by one or more processors of the computer, the determining comprising selecting from at least two structural documents;
selecting, responsive to a second user input, a first word within the first document by the one or more processors, the first user input and the second user input being received through an input and output interface of the computer;
determining a first phrase that includes the first word as a surrounding phrase by the one or more processors;
characterizing the surrounding phrase for frequency of occurrence of a plurality of first terms by the one or more processors;
characterizing non-surrounding phrases in the first document for the occurrence of the plurality of the first terms by the one or more processors, the non-surrounding phrases being phrases in the first document other than the surrounding phrase;
finding a first group of one or more documents being similar to the surrounding phrase based on the characterization of the surrounding phrase;
finding within the first group of the one or more documents, a second group of the one or more documents being similar to the non-surrounding phrases, the second group of one or more documents being similar to both the surrounding phrase and the non-surrounding phrases;
associating the one or more documents with surrounding structural elements based on characterization of the surrounding structural elements and one or more non-surrounding structural elements by the one or more processors, wherein the one or more documents are determined as being similar to the surrounding structural elements and being dissimilar to the one or more non-surrounding structural elements;
creating representative vectors based on the frequency of occurrence of the first terms in the surrounding structural elements, performing latent semantic analysis (LSA) on the surrounding structural elements, the surrounding structural elements are determined based on explicit or implicit information, the implicit information is determined based on theory of analysis, the theory of analysis is at least one of;
Linguistic Discourse Model (LDM), Universal Linguistic Discourse Model (ULDM), Discourse Structures Theory (DST), Rhetorical Structures Theory (RST), and Structure Discourse Representation Theory (SDRT), the characterization of the surrounding structural elements is based on similarity of the representative vectors, the representative vectors are used to select additional documents that are similar in meaning to the surrounding structure elements but are dissimilar to the non-surrounding structure elements, wherein the additional documents are in association with the first location;
removing a second group of the one or more documents from among first groups of the one or more documents to obtain a third group of the one or more documents, wherein the removing is based on the characterization of the surrounding structure elements; and
outputting the third group of the one or more documents to the user on the input and output interface.
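The first-group, second-group, and third-group filtering recited in claim 14 might look like the sketch below. It is an assumption-laden illustration: Jaccard overlap of term sets stands in for the frequency-based characterization and LSA of the claim, and the `threshold` value, function names, and sample corpus are all hypothetical.

```python
def terms(text):
    """Lowercase set of whitespace-separated terms in `text`."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard overlap of two term sets (a stand-in similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_groups(surrounding_phrase, non_surrounding_phrases, corpus, threshold=0.2):
    """Return (first, second, third) groups of document names.

    first:  documents similar to the surrounding phrase
    second: members of `first` that are also similar to a non-surrounding phrase
    third:  `first` with `second` removed
    """
    s_terms = terms(surrounding_phrase)
    first = [name for name, text in corpus.items()
             if jaccard(terms(text), s_terms) >= threshold]
    second = [name for name in first
              if any(jaccard(terms(corpus[name]), terms(p)) >= threshold
                     for p in non_surrounding_phrases)]
    third = [name for name in first if name not in second]
    return first, second, third


# Hypothetical corpus: the selected word "bank" sits in a finance phrase,
# while another phrase in the same document is about rivers.
corpus = {
    "finance": "loan interest rate rises at the bank",
    "mixed": "bank loan for river flood erosion repairs",
    "geology": "river erosion shapes the valley",
}
first, second, third = filter_groups(
    "bank interest rate loan", ["river bank erosion flood"], corpus)
print(third)  # ['finance']
```

The "mixed" document resembles both the surrounding and the non-surrounding phrase, so it lands in the second group and is removed, leaving only "finance" in the third group that would be output to the user.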
Specification