Document analyzer and metadata generation

US 8,060,506 B1
Filed: 11/02/2010
Issued: 11/15/2011
Est. Priority Date: 11/28/2007
Status: Active Grant

2 Assignments

0 Petitions

Abstract

A document analyzer receives a collection of text-based terms associated with a document. The document analyzer performs a statistical analysis on the text-based terms to identify a distribution of where the text-based terms appear in the document and a relative frequency indicating how often the text-based terms appear in the document. The document analyzer utilizes the distribution and relative frequency information derived from the statistical analysis to rank multiple themes associated with the document. For example, a received listing of multiple themes may not be presented in any useful order, although it can be assumed that the themes in the listing are present in the document. Based on application of distribution and relative frequency information derived from the analysis, the document analyzer can identify which themes are most relevant to the document as a whole and/or which of the themes correspond to different portions (e.g., pages or sections) of the document.
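As a rough illustration of the ranking described in the abstract, the Python sketch below scores each theme by combining its relative frequency with how widely its occurrences are spread across the document. The scoring formula (relative frequency multiplied by normalized span), the whitespace tokenizer, the restriction to single-word themes, and the names `theme_score` and `rank_themes` are all assumptions made for illustration; the abstract does not say how the two signals are combined.

```python
from typing import Dict, List


def theme_score(tokens: List[str], theme: str) -> float:
    """Hypothetical score: relative frequency times normalized span."""
    positions = [i for i, tok in enumerate(tokens) if tok == theme]
    if not positions or len(tokens) < 2:
        return 0.0
    # How often the theme appears, relative to document length.
    relative_frequency = len(positions) / len(tokens)
    # Distance between first and last occurrence, as a fraction of the document.
    span = (positions[-1] - positions[0]) / (len(tokens) - 1)
    return relative_frequency * span


def rank_themes(text: str, themes: List[str]) -> List[str]:
    """Order themes from most to least representative under the assumed score."""
    tokens = text.lower().split()
    scores: Dict[str, float] = {t: theme_score(tokens, t.lower()) for t in themes}
    return sorted(themes, key=lambda t: scores[t], reverse=True)
```

Under this assumed score, a theme that appears once scores zero because its span is zero, while a theme mentioned at both ends of the document outranks one clustered in a single passage with the same count.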

36 Citations

20 Claims

1. A computer-implemented method comprising:
- receiving, at a computer system, a collection of text-based terms associated with a document;
  
  performing, via the computer system, a statistical analysis on the text-based terms to identify a distribution of the text-based terms in the document, wherein the statistical analysis uses one or more locations at which the text-based terms appear in the document; and
  
  providing, via the computer system, representative terms for association with the document, wherein the representative terms are identified by identifying which of the text-based terms are most representative of the document based on the distribution of the text-based terms in the document.
- Dependent claims (2, 3, 4, 5, 6, 7, 8)
  - 2. The method of claim 1 wherein performing the statistical analysis on the text-based terms to identify the distribution comprises, for a given text-based term, identifying a location range of where the given text-based term can be found in the document based on detecting a first occurrence of the given text-based term and a last occurrence of the given text-based term in the document.
  - 3. The method of claim 1 wherein performing the statistical analysis on the text-based terms to identify the distribution comprises:
    - for a given text-based term of the text-based terms, detecting relative locations where the given text-based term can be found in the document; and
      
      based on the relative locations where the given text-based term can be found in the document, generating a weighted average location value specifying a centroid associated with occurrences of the given text-based term in the document.
  - 4. The method of claim 3 wherein performing the statistical analysis on the text-based terms to identify the distribution comprises, for a given text-based term, identifying a standard deviation of different locations of the given text-based term in the document relative to the centroid.
  - 5. The method of claim 1 further comprising identifying relevant advertisements for displaying upon retrieval of the document.
  - 6. The method of claim 1 further comprising providing the keywords associated with the document for inclusion in metadata associated with the document.
  - 7. The method of claim 1 wherein providing representative terms for association with the document comprises associating representative terms with different portions of the document.
  - 8. The method of claim 1 wherein providing representative terms for association with the document comprises providing keywords or categories for association with the document.
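The location statistics recited in claims 2 through 4 (location range, centroid, and standard deviation around the centroid) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the document is already split into tokens, a location is a token index scaled to the range 0 to 1, and every occurrence carries equal weight, whereas the claims leave the weighting unspecified. The names `LocationStats` and `location_stats` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LocationStats:
    first: float     # relative location of the first occurrence (claim 2)
    last: float      # relative location of the last occurrence (claim 2)
    centroid: float  # average relative location (claim 3, equal weights assumed)
    std_dev: float   # spread of locations around the centroid (claim 4)


def location_stats(tokens: List[str], term: str) -> Optional[LocationStats]:
    """Return location statistics for `term`, or None if it never appears."""
    n = len(tokens)
    # Relative location in [0, 1]: 0 is the first token, 1 is the last.
    locations = [i / (n - 1) if n > 1 else 0.0
                 for i, tok in enumerate(tokens) if tok == term]
    if not locations:
        return None
    centroid = sum(locations) / len(locations)
    # Population variance of the locations relative to the centroid.
    variance = sum((x - centroid) ** 2 for x in locations) / len(locations)
    return LocationStats(locations[0], locations[-1], centroid, math.sqrt(variance))
```

For the token list `["a", "b", "a", "c", "a"]`, the term `a` occurs at relative locations 0, 0.5, and 1, so the range is 0 to 1 and the centroid is 0.5.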

9. A non-transitory computer readable medium having computer code thereon, the medium comprising:
- instructions for receiving a collection of text-based terms associated with a document;
  
  instructions for performing a statistical analysis on the text-based terms to identify a distribution of the text-based terms in the document, wherein the statistical analysis uses one or more locations at which the text-based terms appear in the document; and
  
  instructions for providing representative terms for association with the document, wherein the representative terms are identified by identifying which of the text-based terms are most representative of the document based on the distribution of the text-based terms in the document.
- Dependent claims (10, 11, 12, 13, 14, 15, 16)
  - 10. The computer readable medium of claim 9 wherein performing the statistical analysis on the text-based terms to identify the distribution comprises, for a given text-based term, identifying a location range of where the given text-based term can be found in the document based on detecting a first occurrence of the given text-based term and a last occurrence of the given text-based term in the document.
  - 11. The computer readable medium of claim 9 wherein performing the statistical analysis on the text-based terms to identify the distribution comprises:
    - for a given text-based term of the text-based terms, detecting relative locations where the given text-based term can be found in the document; and
      
      based on the relative locations where the given text-based term can be found in the document, generating a weighted average location value specifying a centroid associated with occurrences of the given text-based term in the document.
  - 12. The computer readable medium of claim 11 wherein performing the statistical analysis on the text-based terms to identify the distribution comprises, for a given text-based term, identifying a standard deviation of different locations of the given text-based term in the document relative to the centroid.
  - 13. The computer readable medium of claim 9 further comprising instructions for identifying relevant advertisements for displaying upon retrieval of the document.
  - 14. The computer readable medium of claim 9 wherein providing representative terms for association with the document comprises providing the representative terms associated with the document for inclusion in metadata of the document.
  - 15. The computer readable medium of claim 9 wherein providing representative terms for association with the document comprises associating representative terms with different portions of the document.
  - 16. The computer readable medium of claim 9 wherein providing representative terms for association with the document comprises providing keywords or categories for association with the document.
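Claims 14 and 15 describe placing representative terms in document metadata and associating them with different portions of the document. One way this might look is sketched below in Python. The selection rule (the candidate terms that occur most often within each portion), the metadata layout, and the names `terms_by_portion` and `build_metadata` are assumptions for illustration only; the claims do not prescribe a particular rule or format.

```python
from collections import Counter
from typing import Any, Dict, List


def terms_by_portion(portions: List[str], terms: List[str],
                     top_k: int = 2) -> Dict[int, List[str]]:
    """Map each portion index to the candidate terms occurring most often in it."""
    candidates = {t.lower() for t in terms}
    result: Dict[int, List[str]] = {}
    for index, portion in enumerate(portions):
        # Count only tokens that belong to the received collection of terms.
        counts = Counter(tok for tok in portion.lower().split() if tok in candidates)
        # most_common orders by count; equal counts keep first-seen order.
        result[index] = [term for term, _ in counts.most_common(top_k)]
    return result


def build_metadata(portions: List[str], terms: List[str],
                   top_k: int = 2) -> Dict[str, Any]:
    """Package per-portion keywords in a hypothetical metadata structure."""
    per_portion = terms_by_portion(portions, terms, top_k)
    return {"portions": [{"index": i, "keywords": kws}
                         for i, kws in per_portion.items()]}
```

Here a portion is any unit the caller chooses, such as a page or a section, passed in as a separate string.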

17. A computer system comprising:
- a processor;
  
  a memory unit that stores instructions associated with an application executed by the processor; and
  
  an interconnect coupling the processor and the memory unit, enabling the computer system to execute the application and perform operations comprising:
  
  receiving a collection of text-based terms associated with a document;
  
  performing a statistical analysis on the text-based terms to identify a distribution of the text-based terms in the document, wherein the statistical analysis uses one or more locations at which the text-based terms appear in the document; and
  
  providing representative terms for association with the document, wherein the representative terms are identified by identifying which of the text-based terms are most representative of the document based on the distribution of the text-based terms in the document.
- Dependent claims (18, 19, 20)
  - 18. The system of claim 17 wherein the system is enabled to perform the operation of performing the statistical analysis on the text-based terms to identify the distribution by performing an operation comprising, for a given text-based term, identifying a location range of where the given text-based term can be found in the document based on detecting a first occurrence of the given text-based term and a last occurrence of the given text-based term in the document.
  - 19. The system of claim 17 wherein the system is enabled to perform the operation of performing the statistical analysis on the text-based terms to identify the distribution by performing operations comprising:
    - for a given text-based term of the text-based terms, detecting relative locations where the given text-based term can be found in the document; and
      
      based on the relative locations where the given text-based term can be found in the document, generating a weighted average location value specifying a centroid associated with occurrences of the given text-based term in the document.
  - 20. The system of claim 19 wherein the system is enabled to perform the operation of performing the statistical analysis on the text-based terms to identify the distribution by performing an operation comprising, for a given text-based term, identifying a standard deviation of different locations of the given text-based term in the document relative to the centroid.
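The weighted average location value (centroid) and the standard deviation relative to it, as recited in claims 19 and 20, can be written out as below. The notation is an assumption for illustration: $p_1, \dots, p_n$ are the relative locations of a term's occurrences and $w_i > 0$ are per-occurrence weights, which the claims leave unspecified.

```latex
c = \frac{\sum_{i=1}^{n} w_i \, p_i}{\sum_{i=1}^{n} w_i},
\qquad
\sigma = \sqrt{\frac{\sum_{i=1}^{n} w_i \, (p_i - c)^2}{\sum_{i=1}^{n} w_i}}
```

With equal weights $w_i = 1$ and occurrences at $p = (0, 0.5, 1)$, this gives $c = 0.5$ and $\sigma = \sqrt{1/6} \approx 0.408$. A small $\sigma$ indicates a term concentrated near its centroid, while a large $\sigma$ indicates a term spread across the document.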

Current Assignee: Adobe Inc.
Original Assignee: Adobe Systems Incorporated (Adobe Inc.)
Inventors: Welch, Michael J.; Chang, Walter
Primary Examiner(s): LEWIS, CHERYL RENEA
Application Number: US12/917,505
Time in Patent Office: 378 Days
Field of Search: None
US Class Current: 707/729
CPC Class Codes:
G06F 16/345 Summarisation for human users
G06F 16/355 Class or cluster creation o...
