Document analyzer and metadata generation
Abstract
A document analyzer receives a collection of text-based terms associated with a document. The document analyzer performs a statistical analysis on the text-based terms to identify a distribution of where the text-based terms appear in the document and a relative frequency indicating how often the text-based terms appear in the document. The document analyzer utilizes the distribution and relative frequency information derived from the statistical analysis to rank multiple themes associated with the document. For example, a received listing of multiple themes may not be presented in any useful order, although it can be assumed that the themes in the listing are present in the document. Based on application of distribution and relative frequency information derived from the analysis, the document analyzer can identify which themes are most relevant to the document as a whole and/or which of the themes correspond to different portions (e.g., pages or sections) of the document.
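The ranking idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the function name `rank_terms`, the restriction to single-word terms, the division of the document into `num_bins` equal portions, and the scoring rule (relative frequency multiplied by the fraction of portions in which the term appears). The patent text quoted here does not state a particular formula.

```python
import re
from collections import defaultdict


def rank_terms(text, terms, num_bins=4):
    """Rank terms by how often and how widely they appear in the text.

    Hypothetical scoring, assumed for illustration only:
    score = relative frequency * share of document portions containing the term.
    """
    # Tokenize into lowercase words; a token's index is its location.
    words = re.findall(r"[a-z0-9']+", text.lower())
    total = len(words)
    wanted = {t.lower() for t in terms}

    # Record every location at which each wanted term appears.
    positions = defaultdict(list)
    for index, word in enumerate(words):
        if word in wanted:
            positions[word].append(index)

    scores = {}
    for term in wanted:
        locs = positions.get(term, [])
        if not locs:
            scores[term] = 0.0
            continue
        # Relative frequency: occurrences divided by document length.
        rel_freq = len(locs) / total
        # Distribution: which of the equal-sized portions contain the term.
        bins_hit = {loc * num_bins // total for loc in locs}
        coverage = len(bins_hit) / num_bins
        scores[term] = rel_freq * coverage

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = "cat dog cat bird cat dog fish cat"
    for term, score in rank_terms(sample, ["cat", "dog", "fish", "zebra"]):
        print(term, score)
```

Under this assumed rule, a term that is both frequent and spread across the whole document outranks one that is equally frequent but confined to a single portion. The same per-portion bookkeeping could be used to report which themes belong to which pages or sections.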
20 Claims
1. A computer-implemented method comprising:
- receiving, at a computer system, a collection of text-based terms associated with a document;
- performing, via the computer system, a statistical analysis on the text-based terms to identify a distribution of the text-based terms in the document, wherein the statistical analysis uses one or more locations at which the text-based terms appear in the document; and
- providing, via the computer system, representative terms for association with the document, wherein the representative terms are identified by identifying which of the text-based terms are most representative of the document based on the distribution of the text-based terms in the document.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A non-transitory computer readable medium having computer code thereon, the medium comprising:
- instructions for receiving a collection of text-based terms associated with a document;
- instructions for performing a statistical analysis on the text-based terms to identify a distribution of the text-based terms in the document, wherein the statistical analysis uses one or more locations at which the text-based terms appear in the document; and
- instructions for providing representative terms for association with the document, wherein the representative terms are identified by identifying which of the text-based terms are most representative of the document based on the distribution of the text-based terms in the document.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
17. A computer system comprising:
- a processor;
- a memory unit that stores instructions associated with an application executed by the processor; and
- an interconnect coupling the processor and the memory unit, enabling the computer system to execute the application and perform operations comprising:
- receiving a collection of text-based terms associated with a document;
- performing a statistical analysis on the text-based terms to identify a distribution of the text-based terms in the document, wherein the statistical analysis uses one or more locations at which the text-based terms appear in the document; and
- providing representative terms for association with the document, wherein the representative terms are identified by identifying which of the text-based terms are most representative of the document based on the distribution of the text-based terms in the document.
Dependent claims: 18, 19, 20
Specification