Systems and methods for determining relevant information based on document structure

US 7,739,279 B2
Filed: 12/12/2005
Issued: 06/15/2010
Est. Priority Date: 12/12/2005
Status: Expired due to Fees

Abstract

Techniques are provided for determining relevant information from a document based on document structure. A document is selected and structural elements within the document having a dominance relationship are determined. A first location within the document is selected. The structural element surrounding the first location is determined and the surrounding and non-surrounding structural elements are characterized. Additional documents are associated with the first location in the surrounding structural element based on the surrounding structural element characterization and the non-surrounding structural element characterization. Techniques for dynamically determining annotations for images based on document structure are also provided.
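The abstract's first steps (structural elements with a dominance relationship, a selected location, and the split into surrounding and non-surrounding elements) can be sketched as a small tree walk. This is a minimal illustration only: the `Element` class, the character-offset spans, and both function names are assumptions made for the example and are not taken from the patent, which does not prescribe a data structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical structural element covering the half-open span [start, end).

    A parent element "dominates" the elements listed in `children`.
    """
    label: str
    start: int
    end: int
    children: List["Element"] = field(default_factory=list)


def contains(node: Element, location: int) -> bool:
    return node.start <= location < node.end


def surrounding_elements(root: Element, location: int) -> List[Element]:
    """Return the chain of elements containing `location`, outermost first."""
    chain: List[Element] = []
    node: Optional[Element] = root if contains(root, location) else None
    while node is not None:
        chain.append(node)
        # Descend into the child (if any) whose span contains the location.
        node = next((c for c in node.children if contains(c, location)), None)
    return chain


def non_surrounding_elements(root: Element, location: int) -> List[Element]:
    """Return every element in the tree that does not contain `location`."""
    result: List[Element] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if not contains(node, location):
            result.append(node)
        stack.extend(node.children)
    return result


# Assumed toy document: two sections, three paragraphs, offsets in characters.
doc = Element("document", 0, 300, [
    Element("section-1", 0, 120, [
        Element("paragraph-1a", 0, 60),
        Element("paragraph-1b", 60, 120),
    ]),
    Element("section-2", 120, 300, [
        Element("paragraph-2a", 120, 300),
    ]),
])

surrounding = [e.label for e in surrounding_elements(doc, 75)]
others = sorted(e.label for e in non_surrounding_elements(doc, 75))
print(surrounding)  # ['document', 'section-1', 'paragraph-1b']
print(others)       # ['paragraph-1a', 'paragraph-2a', 'section-2']
```

In this sketch the structure is explicit (nested spans). The claims also allow structure inferred from a discourse theory, which this example does not attempt.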

14 Claims

1. A method of determining information relevant to a location within a first document, the method comprising:
- receiving a selection of the first document, the first document being received through an input and output interface of a computer;
  
  identifying at least two structural elements in the first document having a dominance relationship, the identifying being performed by one or more processors of the computer;
  
  receiving a selection of a first location in the first document from a user through the input and output interface;
  
  determining surrounding structural elements surrounding the first location, the determining comprising selecting from the at least two structural elements;
  
  characterizing the surrounding structural elements by the one or more processors;
  
  characterizing one or more non-surrounding structural elements from among the at least two structural elements not determined to be the surrounding structural elements by the one or more processors;
  
  characterizing surrounding phrase for frequency of occurrence of a plurality of first terms by the one or more processors;
  
  characterizing non-surrounding phrases in the first document for the occurrence of the plurality of the first terms by the one or more processors, the non-surrounding phrases being phrases in the first document other than the surrounding phrase;
  
  associating one or more second documents with the surrounding structural elements based on the characterization of the surrounding structural elements and the one or more non-surrounding structural elements by the one or more processors, wherein the one or more second documents are determined as being similar to the surrounding structural elements and being dissimilar to the one or more non-surrounding structural elements;
  
  creating representative vectors based on the frequency of occurrence of the first terms in the surrounding structural elements, performing latent semantic analysis (LSA) on the surrounding structural elements, the surrounding structural elements are determined based on explicit or implicit information, the implicit information is determined based on theory of analysis, the theory of analysis is at least one of: Linguistic Discourse Model (LDM), Universal Linguistic Discourse Model (ULDM), Discourse Structures Theory (DST), Rhetorical Structures Theory (RST), and Structure Discourse Representation Theory (SDRT), the characterizing of the surrounding structural elements is based on similarity of the representative vectors, the representative vectors are used to select additional documents that are similar in meaning to the surrounding structure elements but are dissimilar to the non-surrounding structure elements, wherein the additional documents are in association with the first location; and
  
  removing a second group of the one or more second documents from among first groups of the one or more second documents to obtain a third group of the one or more documents, wherein the removing is based on the characterizing the surrounding structure elements.
  - 2. The method of claim 1, in which the first location is selected based on at least one of: manually and programmatic control.
  - 3. The method of claim 2, in which the manual selection of the first location is based on at least one of: implicit and explicit user input.
  - 4. The method of claim 1, in which the second documents comprise human sensible information.
  - 5. The method of claim 4, in which the human sensible information is at least one of textual, audio and video information.
  - 6. The method of claim 1, in which the first document comprises at least one of textual, audio and video information.
  - 7. The method of claim 1, wherein the associating second documents with the surrounding structural element comprises:
    - determining third documents being similar to the surrounding structural element; and
    - removing, from among the third documents, fourth documents being similar to the non-surrounding structural elements to obtain the second documents.
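The two-stage association recited in claim 7 (find third documents similar to the surrounding element, then remove fourth documents that are also similar to the non-surrounding elements) can be sketched with plain term-frequency vectors and cosine similarity. This is a hedged illustration: the function names, the whitespace tokenizer, the single `threshold` value of 0.3, and the toy texts are all assumptions for the example. The claims do not fix a similarity measure or a cutoff.

```python
import math
from collections import Counter
from typing import Dict, List


def term_vector(text: str) -> Counter:
    """Term-frequency vector over lowercased, whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def associate(
    surrounding: str,
    non_surrounding: str,
    candidates: Dict[str, str],
    threshold: float = 0.3,  # assumed cutoff, not specified by the claims
) -> List[str]:
    """Keep candidates similar to the surrounding text and not to the rest."""
    sur_vec = term_vector(surrounding)
    non_vec = term_vector(non_surrounding)
    # "Third documents": candidates similar to the surrounding element.
    third = {
        name: text
        for name, text in candidates.items()
        if cosine(term_vector(text), sur_vec) >= threshold
    }
    # "Fourth documents": third documents also similar to the non-surrounding text.
    fourth = {
        name
        for name, text in third.items()
        if cosine(term_vector(text), non_vec) >= threshold
    }
    # "Second documents": what remains after removing the fourth documents.
    return [name for name in third if name not in fourth]


candidates = {
    "zoology": "jaguar cat prey rainforest hunting",
    "autos": "jaguar car engine dealer price",
    "mixed": "jaguar cat car engine habitat",
    "cooking": "pasta sauce recipe tomato basil",
}
result = associate(
    surrounding="jaguar cat habitat rainforest prey",
    non_surrounding="jaguar car engine speed dealer",
    candidates=candidates,
)
print(result)  # ['zoology']
```

Here "mixed" passes the first stage but is removed in the second because it overlaps with the non-surrounding text as much as with the surrounding text. The effect is that the ambiguous term "jaguar" is resolved toward the sense used around the selected location.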

8. An apparatus for determining relevant information comprising:
- one or more processors;
  
  an input/output circuit that retrieves a first document from a document repository responsive to a user selection;
  
  a document structure manager that identifies at least two structural elements in the first document having a dominance relationship;
  
  input and output interface that receives a selection of a first location in the first document;
  
  a structural element manager identifies surrounding structural elements surrounding the selected first location and one or more non-surrounding structural elements from among the at least two structural elements;
  
  a characterization manager characterizes the surrounding structural elements and the one or more non-surrounding structural elements from among the at least two structural elements that is not determined to be the surrounding structural elements;
  
  the characterization manager characterizes surrounding phrase for frequency of occurrence of a plurality of first terms;
  
  the characterization manager further characterizes non-surrounding phrases in the first document for the occurrence of the plurality of the first terms, the non-surrounding phrases being phrases in the first document other than the surrounding phrase; and
  
  a readable program code for:
  
  associating one or more second documents with the surrounding structural elements based on the characterization of the surrounding structural elements and the one or more non-surrounding structural elements, wherein the one or more second documents are determined as being similar to the surrounding structural elements and being dissimilar to the one or more non-surrounding structural elements;
  
  creating representative vectors based on the frequency of occurrence of the first terms in the surrounding structural elements, performing latent semantic analysis (LSA) on the surrounding structural elements, the surrounding structural elements are determined based on explicit or implicit information, the implicit information is determined based on theory of analysis, the theory of analysis is at least one of: Linguistic Discourse Model (LDM), Universal Linguistic Discourse Model (ULDM), Discourse Structures Theory (DST), Rhetorical Structures Theory (RST), and Structure Discourse Representation Theory (SDRT), the characterizing of the surrounding structural elements is based on similarity of the representative vectors, the representative vectors are used to select additional documents that are similar in meaning to the surrounding structure elements but are dissimilar to the non-surrounding structure elements, wherein the additional documents are in association with the first location; and
  
  removing a second group of the one or more second documents from among first groups of the one or more second documents to obtain a third group of the one or more documents, wherein the removing is based on the characterizing the surrounding structure elements.
  - 9. The apparatus of claim 8, in which the first location is selected based on at least one of: manually and programmatically.
  - 10. The apparatus of claim 8, in which additional documents comprise human sensible information.
  - 11. The apparatus of claim 10, in which the human sensible information is at least one of textual, audio and video information.
  - 12. The apparatus of claim 8, in which the determined document comprises at least one of textual, audio and video information.

13. A computer readable storage medium comprising computer readable program code embodied on the computer readable storage medium, the computer readable program code useable to program a computer for performing the steps of:
- receiving a selection of a first document, the first document being received through an input and output interface of a computer;
  
  identifying at least two structural elements in the first document having a dominance relationship, the identifying being performed by one or more processors of the computer;
  
  receiving a selection of a first location in the first document from a user through the input and output interface;
  
  determining surrounding structural elements surrounding the first location, the determining comprising selecting from the at least two structural elements;
  
  characterizing the surrounding structural elements by the one or more processors;
  
  characterizing one or more non-surrounding structural elements from among the at least two structural elements not determined to be the surrounding structural elements by the one or more processors;
  
  characterizing surrounding phrase for frequency of occurrence of a plurality of first terms by the one or more processors;
  
  characterizing non-surrounding phrases in the first document for the occurrence of the plurality of the first terms by the one or more processors, the non-surrounding phrases being phrases in the first document other than the surrounding phrase;
  
  associating one or more second documents with the surrounding structural elements based on the characterization of the surrounding structural elements and the one or more non-surrounding structural elements by the one or more processors, wherein the one or more second documents are determined as being similar to the surrounding structural elements and being dissimilar to the one or more non-surrounding structural elements;
  
  creating representative vectors based on the frequency of occurrence of the first terms in the surrounding structural elements, performing latent semantic analysis (LSA) on the surrounding structural elements, the surrounding structural elements are determined based on explicit or implicit information, the implicit information is determined based on theory of analysis, the theory of analysis is at least one of: Linguistic Discourse Model (LDM), Universal Linguistic Discourse Model (ULDM), Discourse Structures Theory (DST), Rhetorical Structures Theory (RST), and Structure Discourse Representation Theory (SDRT), the characterizing of the surrounding structural elements is based on similarity of the representative vectors, the representative vectors are used to select additional documents that are similar in meaning to the surrounding structure elements but are dissimilar to the non-surrounding structure elements, wherein the additional documents are in association with the first location; and
  
  removing a second group of the one or more second documents from among first groups of the one or more second documents to obtain a third group of the one or more documents, wherein the removing is based on the characterizing the surrounding structure elements.

14. A method for retrieving information relevant to a word within a document, the method comprising:
- retrieving, responsive to a first user input, a first document from a document repository saved on a database coupled to a computer, the first document including a plurality of phrases;
  
  determining the plurality of phrases in the first document by one or more processors of the computer, the determining comprising selecting from at least two structural documents;
  
  selecting, responsive to a second user input, a first word within the first document by the one or more processors, the first user input and the second user input being received through an input and output interface of the computer;
  
  determining a first phrase that includes the first word as a surrounding phrase by the one or more processors;
  
  characterizing the surrounding phrase for frequency of occurrence of a plurality of first terms by the one or more processors;
  
  characterizing non-surrounding phrases in the first document for the occurrence of the plurality of the first terms by the one or more processors, the non-surrounding phrases being phrases in the first document other than the surrounding phrase;
  
  finding a first group of one or more documents being similar to the surrounding phrase based on the characterization of the surrounding phrase;
  
  finding within the first group of the one or more documents, a second group of the one or more documents being similar to the non-surrounding phrases, the second group of one or more documents being similar to both the surrounding phrase and the non-surrounding phrases;
  
  associating the one or more documents with surrounding structural elements based on characterization of the surrounding structural elements and one or more non-surrounding structural elements by the one or more processors, wherein the one or more documents are determined as being similar to the surrounding structural elements and being dissimilar to the one or more non-surrounding structural elements;
  
  creating representative vectors based on the frequency of occurrence of the first terms in the surrounding structural elements, performing latent semantic analysis (LSA) on the surrounding structural elements, the surrounding structural elements are determined based on explicit or implicit information, the implicit information is determined based on theory of analysis, the theory of analysis is at least one of: Linguistic Discourse Model (LDM), Universal Linguistic Discourse Model (ULDM), Discourse Structures Theory (DST), Rhetorical Structures Theory (RST), and Structure Discourse Representation Theory (SDRT), the characterization of the surrounding structural elements is based on similarity of the representative vectors, the representative vectors are used to select additional documents that are similar in meaning to the surrounding structure elements but are dissimilar to the non-surrounding structure elements, wherein the additional documents are in association with the first location;
  
  removing a second group of the one or more documents from among first groups of the one or more documents to obtain a third group of the one or more documents, wherein the removing is based on the characterization of the surrounding structure elements; and
  
  outputting the third group of the one or more documents to the user on the input and output interface.
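The independent claims recite creating representative vectors from term frequencies and performing latent semantic analysis (LSA) so that documents can be compared by meaning rather than by exact shared terms. The following is a toy, standard-library sketch of that idea and not the patented implementation: it builds a term-document count matrix, takes the top-k eigenpairs of its Gram matrix by power iteration with deflation (equivalent to a truncated SVD for this purpose), and compares documents in the resulting latent space. All function names, the fixed start vector, the iteration count, and the example texts are assumptions. A real system would more likely use a linear algebra library.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def term_document_matrix(docs: List[str]) -> Matrix:
    """Rows are terms, columns are documents, entries are raw term counts."""
    tokens = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokens for t in doc})
    return [[float(doc.count(t)) for doc in tokens] for t in vocab]


def gram(A: Matrix) -> Matrix:
    """G = A^T A: dot products between every pair of document columns."""
    n = len(A[0])
    return [[sum(r[i] * r[j] for r in A) for j in range(n)] for i in range(n)]


def top_eigenpairs(G: Matrix, k: int, iters: int = 300) -> List[Tuple[float, List[float]]]:
    """Top-k eigenpairs of a symmetric PSD matrix via power iteration + deflation.

    Toy method: a fixed start vector can miss an eigenvector it is orthogonal to.
    """
    n = len(G)
    G = [row[:] for row in G]  # work on a copy
    pairs = []
    for _component in range(k):
        v = [1.0 / math.sqrt(n)] * n
        for _step in range(iters):
            w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue estimate for unit vector v.
        lam = sum(v[i] * sum(G[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                G[i][j] -= lam * v[i] * v[j]
    return pairs


def latent_vectors(docs: List[str], k: int) -> List[List[float]]:
    """Coordinates of each document in the k-dimensional latent space."""
    pairs = top_eigenpairs(gram(term_document_matrix(docs)), k)
    return [
        [math.sqrt(max(lam, 0.0)) * v[i] for lam, v in pairs]
        for i in range(len(docs))
    ]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


docs = [
    "car engine",         # 0
    "engine automobile",  # 1: links documents 0 and 2
    "automobile dealer",  # 2: shares no term with document 0
    "cat whiskers",       # 3
    "whiskers feline",    # 4
]
A = term_document_matrix(docs)
raw = cosine([r[0] for r in A], [r[2] for r in A])  # raw term overlap, docs 0 and 2
latent = latent_vectors(docs, k=2)
sim_linked = cosine(latent[0], latent[2])     # close to 1 in latent space
sim_unrelated = cosine(latent[0], latent[3])  # close to 0 in latent space
print(raw, round(sim_linked, 3), round(abs(sim_unrelated), 3))
```

Documents 0 and 2 share no terms, so their raw cosine is zero, yet they land on the same latent dimension because document 1 connects their vocabularies. That is the sense in which the claims speak of documents "similar in meaning" to the surrounding structural elements.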

Current Assignee: Fuji Xerox Company Limited (Fujifilm Holdings Corporation)
Original Assignee: Fuji Xerox Company Limited (Fujifilm Holdings Corporation)
Inventors: Van Den Berg, Martin H.; Liew, Bee Yian; Chiu, Patrick; Polanyi, Livia; Rieffel, Eleanor G.; Thione, Giovanni L.
Primary Examiner(s): Truong, Cam Y T
Application Number: US11/301,853
Publication Number: US 20070143098A1
Time in Patent Office: 1,646 Days
Field of Search: 707/1, 707/10
US Class Current: 707/730
CPC Class Codes: G06F 40/35 Discourse or dialogue repre...
