
System and methods for automated document topic discovery, browsable search and document categorization

  • US 8,843,476 B1
  • Filed: 05/18/2010
  • Issued: 09/23/2014
  • Est. Priority Date: 03/16/2009
  • Status: Expired due to Fees
First Claim

1. A computer-assisted method for automatically discovering topics in a document collection comprising one or more documents that comprise sentences and terms, comprising:

  • automatically dividing each of the sentences into the grammatical components of a subject and a predicate by a computer system, wherein the predicate is defined as the portion of a sentence other than the subject;

    assigning a first weighting coefficient to subjects in the sentences;

    assigning a second weighting coefficient to predicates in the sentences;

    tokenizing the sentences to produce a plurality of tokens, wherein each of the terms is associated with one or more of the plurality of tokens;

    for each one of the terms, calculating a first token count in the plurality of tokens in which the one of the terms matches a subject in the sentences;

    multiplying the first weighting coefficient with the first token count to produce a first weighted token count;

    calculating a second token count in the plurality of tokens in which the one of the terms matches a predicate in the sentences;

    multiplying the second weighting coefficient with the second token count to produce a second weighted token count;

    producing a first score value for the one of the terms based on the first weighted token count and the second weighted token count, wherein the first score can be referred to as an internal term prominence (ITP) value in comparison to a second value, wherein the second value can be referred to as an external term prominence (ETP) value, wherein the ETP value represents the prominence of the term outside the document collection;

    selecting one or more terms from the terms at least in part based on the first score values of the terms; and

    outputting the one or more terms representing one or more topics associated with the document collection.
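
The scoring steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. It makes several assumptions that the claim does not specify: sentences arrive already split into (subject, predicate) pairs, the two weighted counts are combined by simple addition, the weights 2.0 and 1.0 are arbitrary, and the names `tokenize`, `score_terms`, and `select_topics` are hypothetical. The comparison against an external term prominence (ETP) value is omitted.

```python
import re
from collections import Counter

# Hypothetical weighting coefficients; the claim does not give values.
SUBJECT_WEIGHT = 2.0
PREDICATE_WEIGHT = 1.0


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_terms(sentences, subject_weight=SUBJECT_WEIGHT,
                predicate_weight=PREDICATE_WEIGHT):
    """Return a score per term from weighted subject and predicate counts.

    `sentences` is a list of (subject, predicate) string pairs. Splitting a
    raw sentence into these parts is assumed to happen upstream.
    """
    subject_counts = Counter()
    predicate_counts = Counter()
    for subject, predicate in sentences:
        subject_counts.update(tokenize(subject))
        predicate_counts.update(tokenize(predicate))

    terms = set(subject_counts) | set(predicate_counts)
    # Assumption: the score is the sum of the two weighted token counts.
    return {
        term: subject_weight * subject_counts[term]
        + predicate_weight * predicate_counts[term]
        for term in terms
    }


def select_topics(scores, top_k=3):
    """Pick the top_k terms by score, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_k]]


sentences = [
    ("The camera", "takes sharp pictures"),
    ("The camera", "has a long battery life"),
    ("Battery life", "matters for a camera"),
]
scores = score_terms(sentences)
topics = select_topics(scores, top_k=1)
```

In this toy collection "camera" appears twice as a subject token and once as a predicate token, so its score is 2.0 × 2 + 1.0 × 1 = 5.0, and it is the single selected topic. Function words such as "the" also score highly under this simplified scheme, since the sketch does not filter them or compare against any measure of prominence outside the collection.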
