
Bag-of-repeats representation of documents

  • US 9,183,193 B2
  • Filed: 02/12/2013
  • Issued: 11/10/2015
  • Est. Priority Date: 02/12/2013
  • Status: Expired due to Fees
First Claim

1. A system for representing a textual document based on the occurrence of repeats, comprising:

    a sequence generator which defines a sequence representing words forming a collection of documents;

    a repeat calculator which identifies a set of repeats within the sequence, the set of repeats comprising subsequences of the sequence which each have more than one occurrence in the sequence;

    a context calculator which identifies at least one of a left context and a right context for each occurrence of a repeat in the set of repeats, the left context of an occurrence of the repeat being a word which immediately precedes the occurrence of the repeat in the document collection sequence, the right context of an occurrence of the repeat being a word which immediately follows the occurrence of the repeat in the document collection sequence, the context calculator identifying repeats in the set of repeats which are at least one of left context diverse, right context diverse, left context unique, and right context unique based on the identified at least one of the left context and the right context for each occurrence of the repeat,

      a repeat being identified as left context diverse if it appears in at least two different left contexts,

      a repeat being identified as right context diverse if it appears in at least two different right contexts,

      an occurrence of a repeat being identified as left context unique if it is the sole occurrence of the repeat in that left context, and

      an occurrence of a repeat being identified as right context unique if it is the sole occurrence of the repeat in that right context;

    a representation generator which generates a representation for at least one document in the collection of documents based on occurrence, in the document, of repeats from the set of repeats, the representation accounting for the context of at least some of the repeats in the set of repeats; and

    a processor which implements the sequence generator, repeat calculator, and representation generator.
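
The elements of claim 1 can be illustrated with a small Python sketch. It is not the patented implementation, only one possible reading under stated assumptions: documents are tokenized on whitespace, repeats are found by a brute-force n-gram scan up to a hypothetical `max_len` (the claim does not say how repeats are computed), hypothetical boundary markers (`<start>`, `<end-0>`, ...) stand in as the context at document edges, and the representation counts only repeats that are both left and right context diverse (the claim requires only "at least one of" the four context properties). All function names are illustrative.

```python
from collections import Counter, defaultdict


def build_sequence(docs):
    """Concatenate whitespace-tokenized documents into one sequence.

    owner[i] is the index of the document that token i belongs to, or None
    for the (hypothetical) boundary markers placed around documents.
    """
    seq, owner = ["<start>"], [None]
    for d, text in enumerate(docs):
        for word in text.lower().split():
            seq.append(word)
            owner.append(d)
        seq.append(f"<end-{d}>")  # assumption: one unique marker per document
        owner.append(None)
    return seq, owner


def find_repeats(seq, owner, max_len=4):
    """Return {n-gram: [start positions]} for n-grams occurring more than once."""
    positions = defaultdict(list)
    for n in range(1, max_len + 1):
        for i in range(len(seq) - n + 1):
            # Skip windows that touch a boundary marker.
            if all(owner[j] is not None for j in range(i, i + n)):
                positions[tuple(seq[i:i + n])].append(i)
    return {gram: pos for gram, pos in positions.items() if len(pos) > 1}


def contexts(seq, gram, starts):
    """Left and right neighbor token for each occurrence of a repeat."""
    left = [seq[s - 1] for s in starts]
    right = [seq[s + len(gram)] for s in starts]
    return left, right


def is_diverse(ctx):
    """Context diverse: the repeat appears in at least two different contexts."""
    return len(set(ctx)) >= 2


def unique_flags(ctx):
    """Per occurrence: True if it is the sole occurrence in that context."""
    counts = Counter(ctx)
    return [counts[c] == 1 for c in ctx]


def bag_of_repeats(docs, max_len=4):
    """Count, per document, repeats that are left AND right context diverse."""
    seq, owner = build_sequence(docs)
    bags = [Counter() for _ in docs]
    for gram, starts in find_repeats(seq, owner, max_len).items():
        left, right = contexts(seq, gram, starts)
        if is_diverse(left) and is_diverse(right):  # assumed filter
            for s in starts:
                bags[owner[s]][" ".join(gram)] += 1
    return bags


docs = ["the cat sat on the mat", "the cat ate the fish", "a dog sat on the mat"]
bags = bag_of_repeats(docs)
```

For these three toy documents, the first document's bag counts "the" twice and "the cat" and "sat on the mat" once each. Shorter repeats such as "cat" or "sat on" are dropped by the assumed filter because every occurrence shares the same left or right neighbor, which is how accounting for context avoids counting fragments of a longer repeat.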

  • 1 Assignment