System and methods for automated document topic discovery, browsable search and document categorization
First Claim
1. A computer-assisted method for automatically discovering topics in a document collection comprising one or more documents that comprise sentences and terms, comprising:
automatically dividing each of the sentences into the grammatical components of a subject and a predicate by a computer system, wherein the predicate is defined as the portion of a sentence other than the subject;
assigning a first weighting coefficient to subjects in the sentences;
assigning a second weighting coefficient to predicates in the sentences;
tokenizing the sentences to produce a plurality of tokens, wherein each of the terms is associated with one or more of the plurality of tokens;
for each one of the terms, calculating a first token count in the plurality of tokens in which the one of the terms matches a subject in the sentences;
multiplying the first weighting coefficient with the first token count to produce a first weighted token count;
calculating a second token count in the plurality of tokens in which the one of the terms matches a predicate in the sentences;
multiplying the second weighting coefficient with the second token count to produce a second weighted token count;
producing a first score value for the one of the terms based on the first weighted token count and the second weighted token count, wherein the first score can be referred to as an internal term prominence (ITP) value in comparison to a second value, wherein the second value can be referred to as an external term prominence (ETP) value, wherein the ETP value represents the prominence of the term outside the document collection;
selecting one or more terms from the terms at least in part based on the first score values of the terms; and
outputting the one or more terms representing one or more topics associated with the document collection.
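The scoring steps recited in this claim can be illustrated with a minimal Python sketch. Several things here are assumptions, not taken from the claim: the sentences are assumed to be already split into subject and predicate upstream, the weight values `SUBJECT_WEIGHT` and `PREDICATE_WEIGHT` are arbitrary, the two weighted counts are combined by simple addition (the claim only says the score is "based on" them), and the function names `tokenize`, `itp_scores`, and `select_topics` are hypothetical.

```python
from collections import Counter

SUBJECT_WEIGHT = 2.0    # hypothetical first weighting coefficient
PREDICATE_WEIGHT = 1.0  # hypothetical second weighting coefficient


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(".,;:!?\"'()") for word in text.lower().split())
    return [t for t in tokens if t]


def itp_scores(sentences):
    """Score each term from (subject, predicate) pairs.

    The subject/predicate split is assumed to have been done upstream.
    """
    subject_counts = Counter()
    predicate_counts = Counter()
    for subject, predicate in sentences:
        subject_counts.update(tokenize(subject))
        predicate_counts.update(tokenize(predicate))

    scores = {}
    for term in set(subject_counts) | set(predicate_counts):
        weighted_subject = SUBJECT_WEIGHT * subject_counts[term]
        weighted_predicate = PREDICATE_WEIGHT * predicate_counts[term]
        # Assumption: the score is the sum of the two weighted counts.
        scores[term] = weighted_subject + weighted_predicate
    return scores


def select_topics(scores, top_n=3):
    """Return the top_n terms by score, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]


# Toy collection of pre-split (subject, predicate) pairs.
sentences = [
    ("camera", "takes sharp pictures"),
    ("camera", "has long battery life"),
    ("reviewers", "praise camera"),
]
scores = itp_scores(sentences)
topics = select_topics(scores, top_n=1)
```

In this toy collection "camera" appears twice as a subject and once in a predicate, so under the assumed weights it outranks every other term and is selected as the topic.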
Abstract
A computer-assisted method for discovering topics in a document collection is disclosed. The method includes obtaining a group of text units in the document collection, tokenizing the words in the group of text units to produce a plurality of tokens that include a jth token, and adding a weighting coefficient to a parameter token_j_count for each text unit in the first group that includes the jth token. The weighting coefficient is dependent on the grammatical role of the jth token. The method includes calculating an internal term prominence value (ITP) using token_j_count, selecting one or more tokens from the tokens based on the ITP values of the respective tokens, and outputting the one or more selected tokens as topic terms associated with the document collection.
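The accumulation of token_j_count described in the abstract might be sketched as follows. This is an illustration under stated assumptions: each text unit is assumed to arrive as a list of (token, grammatical role) pairs, the `ROLE_WEIGHTS` values are arbitrary, a token that occurs in several roles within one text unit contributes only its largest weight, and the ITP value is taken to be token_j_count divided by the number of text units. The abstract itself says only that ITP is calculated "using token_j_count", so the normalization is a guess, and the function name `itp_from_text_units` is hypothetical.

```python
from collections import defaultdict

# Hypothetical role-dependent weighting coefficients.
ROLE_WEIGHTS = {"subject": 2.0, "predicate": 1.0}


def itp_from_text_units(text_units):
    """Accumulate a weighted count per token, then normalize.

    text_units: list of text units, each a list of (token, role) pairs.
    """
    if not text_units:
        return {}

    token_counts = defaultdict(float)  # plays the role of token_j_count
    for unit in text_units:
        # Assumption: one contribution per token per text unit,
        # using the largest weight among the roles it occurs in.
        best_weight = {}
        for token, role in unit:
            weight = ROLE_WEIGHTS[role]
            best_weight[token] = max(best_weight.get(token, 0.0), weight)
        for token, weight in best_weight.items():
            token_counts[token] += weight

    # Assumption: ITP is the weighted count divided by the unit count.
    total_units = len(text_units)
    return {token: count / total_units for token, count in token_counts.items()}


units = [
    [("camera", "subject"), ("takes", "predicate"), ("pictures", "predicate")],
    [("battery", "subject"), ("powers", "predicate"), ("camera", "predicate")],
]
itp = itp_from_text_units(units)
```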
28 Citations
20 Claims
12. A system for automatically discovering topics in a document collection comprising one or more documents that comprise sentences and terms, comprising:
a computer processor configured to automatically divide each of the sentences into the grammatical components of a subject and a predicate by a computer system, wherein the predicate is defined as the portion of a sentence other than the subject;
assign a first weighting coefficient to subjects in the sentences;
assign a second weighting coefficient to predicates in the sentences;
tokenize the sentences to produce a plurality of tokens, wherein each of the terms is associated with one or more of the plurality of tokens;
for each one of the terms, calculate a first token count in the plurality of tokens in which the one of the terms matches a subject in the sentences;
multiply the first weighting coefficient with the first token count to produce a first weighted token count;
calculate a second token count in the plurality of tokens in which the one of the terms matches a predicate in the sentences;
multiply the second weighting coefficient with the second token count to produce a second weighted token count;
produce a first value for the one of the terms based on the first weighted token count and the second weighted token count, wherein the first value can be referred to as an internal term prominence (ITP) value in comparison to a second value, wherein the second value can be referred to as an external term prominence (ETP) value, wherein the ETP value represents the prominence of the term outside the document collection;
select one or more terms from the terms at least in part based on the score values of the terms; and
output the one or more terms to represent the topics associated with the document collection.
13. A computer-assisted method for determining a term score and displaying terms extracted from a document collection, comprising:
obtaining, by a computer system, a document collection comprising one or more documents each comprising sentences and terms, wherein each term comprises a word or a multi-word phrase;
wherein a multi-word phrase comprises two or more sub-phrases, wherein each sub-phrase comprises a word or another multi-word phrase;
assigning a sub-phrase weighting co-efficient to the sub-phrase of a multi-word phrase;
tokenizing the sentences to produce a plurality of words or multi-word phrases as tokens, wherein a term matches with one or more of the plurality of tokens;
for one or more terms in the document collection, calculating a sub-phrase token count of the term for each token of the term that is a sub-phrase of a multi-word phrase;
multiplying the sub-phrase weighting co-efficient with the sub-phrase token count to produce a weighted sub-phrase token count for the term;
calculating a non-sub-phrase token count of the term for each token of the term that is not a sub-phrase of a multi-word phrase;
producing a score value for the term based on the weighted sub-phrase token count and the non-sub-phrase token count, wherein the score value can be referred to as an internal term prominence value in comparison to a second value, wherein the second value can be referred to as an external term prominence value, wherein the second value represents the prominence of the term outside the document collection;
selecting one or more terms at least in part based on the score values of the terms; and
outputting, for storage or display, the selected terms.
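The sub-phrase weighting in claim 13 can be sketched in Python as follows. The assumptions are labeled in the code: tokens are assumed to be already extracted as words or multi-word phrases, every contiguous proper sub-sequence of a multi-word token is treated as one of its sub-phrases, the `SUB_PHRASE_WEIGHT` value is arbitrary, and the score is the weighted sub-phrase count plus the non-sub-phrase count (the claim only says the score is "based on" both). The names `sub_phrases` and `score_terms` are hypothetical.

```python
from collections import Counter

SUB_PHRASE_WEIGHT = 0.5  # hypothetical sub-phrase weighting coefficient


def sub_phrases(words):
    """Yield every contiguous proper sub-sequence of a tuple of words."""
    n = len(words)
    for length in range(1, n):
        for start in range(n - length + 1):
            yield words[start:start + length]


def score_terms(tokens):
    """Score terms from tokens that are words or multi-word phrases."""
    standalone = Counter()  # non-sub-phrase token counts
    nested = Counter()      # sub-phrase token counts
    for token in tokens:
        words = tuple(token.split())
        standalone[" ".join(words)] += 1
        # Assumption: each contiguous sub-sequence counts as a sub-phrase.
        for sub in sub_phrases(words):
            nested[" ".join(sub)] += 1

    # Assumption: weighted sub-phrase count and plain count are summed.
    return {
        term: SUB_PHRASE_WEIGHT * nested[term] + standalone[term]
        for term in set(standalone) | set(nested)
    }


tokens = ["digital camera", "camera", "digital camera lens"]
scores = score_terms(tokens)
```

With these tokens, "camera" occurs once on its own and twice inside longer phrases, so under the assumed weight it scores higher than "digital camera", which occurs once on its own and once inside "digital camera lens".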
Specification