Bag-of-repeats representation of documents
Abstract
A system and method for representing a textual document based on the occurrence of repeats are disclosed. The system includes a sequence generator which defines a sequence representing words forming a collection of documents. A repeat calculator identifies a set of repeats within the sequence, the set of repeats comprising subsequences of the sequence which each occur more than once. A representation generator generates a representation for at least one document in the collection of documents based on occurrence, in the document, of repeats from the set of repeats.
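As a rough illustration of the pipeline the abstract describes, the following Python sketch finds repeated word subsequences in a small collection and counts them per document. The tokenizer, the function names `tokenize`, `find_repeats`, and `bag_of_repeats`, the sample texts, and the brute-force enumeration are all assumptions made for illustration; the source does not specify an algorithm. The sketch also enumerates each document separately, so that no repeat spans a document boundary, which is one possible reading of the "sequence representing words forming a collection of documents".

```python
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def find_repeats(docs_tokens, max_len=None):
    """Return every word n-gram (as a tuple) that occurs more than once
    anywhere in the collection. Brute force, for illustration only."""
    counts = Counter()
    for tokens in docs_tokens:
        n = len(tokens)
        limit = n if max_len is None else min(n, max_len)
        for length in range(1, limit + 1):
            for start in range(n - length + 1):
                counts[tuple(tokens[start:start + length])] += 1
    return {gram for gram, c in counts.items() if c > 1}


def bag_of_repeats(tokens, repeats):
    """Count the occurrences, in one document, of each repeat."""
    bag = Counter()
    n = len(tokens)
    for start in range(n):
        for end in range(start + 1, n + 1):
            gram = tuple(tokens[start:end])
            if gram in repeats:
                bag[gram] += 1
    return bag


docs = ["the cat sat on the mat", "the cat ate the fish"]
docs_tokens = [tokenize(d) for d in docs]
repeats = find_repeats(docs_tokens)
bags = [bag_of_repeats(t, repeats) for t in docs_tokens]
```

Here `repeats` contains the single words "the" and "cat" and the two-word subsequence "the cat", and each entry of `bags` is a sparse count of those repeats in one document.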
6 Citations
24 Claims
1. A system for representing a textual document based on the occurrence of repeats, comprising:
- a sequence generator which defines a sequence representing words forming a collection of documents;
- a repeat calculator which identifies a set of repeats within the sequence, the set of repeats comprising subsequences of the sequence which each have more than one occurrence in the sequence;
- a context calculator which identifies at least one of a left context and a right context for each occurrence of a repeat in the set of repeats, the left context of an occurrence of the repeat being a word which immediately precedes the occurrence of the repeat in the document collection sequence, the right context of an occurrence of the repeat being a word which immediately follows the occurrence of the repeat in the document collection sequence, the context calculator identifying repeats in the set of repeats which are at least one of left context diverse, right context diverse, left context unique, and right context unique based on the identified at least one of the left context and the right context for each occurrence of the repeat,
  - a repeat being identified as left context diverse if it appears in at least two different left contexts,
  - a repeat being identified as right context diverse if it appears in at least two different right contexts,
  - an occurrence of a repeat being identified as left context unique if it is the sole occurrence of the repeat in that left context, and
  - an occurrence of a repeat being identified as right context unique if it is the sole occurrence of the repeat in that right context;
- a representation generator which generates a representation for at least one document in the collection of documents based on occurrence, in the document, of repeats from the set of repeats, the representation accounting for the context of at least some of the repeats in the set of repeats; and
- a processor which implements the sequence generator, repeat calculator, and representation generator.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
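The four context properties defined in claim 1 can be made concrete with a short Python sketch. The function names `contexts` and `classify`, the sample sequence, and the use of `None` as the context at either end of the sequence are assumptions introduced here, not details taken from the claim.

```python
from collections import Counter


def contexts(sequence, repeat):
    """List the (left, right) context words of every occurrence of
    `repeat` in `sequence`. Assumption: an occurrence at the very start
    or end of the sequence gets None as its missing context."""
    repeat = tuple(repeat)
    k = len(repeat)
    found = []
    for i in range(len(sequence) - k + 1):
        if tuple(sequence[i:i + k]) == repeat:
            left = sequence[i - 1] if i > 0 else None
            right = sequence[i + k] if i + k < len(sequence) else None
            found.append((left, right))
    return found


def classify(sequence, repeat):
    """Apply the claim's definitions to one repeat.

    Diversity is a property of the repeat as a whole; uniqueness is
    reported per occurrence, in order of appearance."""
    occ = contexts(sequence, repeat)
    lefts = Counter(left for left, _ in occ)
    rights = Counter(right for _, right in occ)
    return {
        "left_diverse": len(lefts) >= 2,
        "right_diverse": len(rights) >= 2,
        "left_unique": [lefts[left] == 1 for left, _ in occ],
        "right_unique": [rights[right] == 1 for _, right in occ],
    }


seq = "x a b c y a b c z".split()
ab = classify(seq, ("a", "b"))
abc = classify(seq, ("a", "b", "c"))
```

In this sequence, "a b" is preceded by "x" and "y" but always followed by "c", so it is left context diverse but not right context diverse. The longer repeat "a b c" has distinct words on both sides of both occurrences, so it is diverse and unique in both directions.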
15. A method for representing a textual document based on the occurrence of repeats, comprising:
- receiving a collection of text documents;
- defining a sequence representing words forming the collection of documents;
- identifying a set of repeats within the sequence, the set of repeats comprising subsequences of the sequence which each have more than one occurrence in the sequence;
- for each repeat in the set of repeats, identifying at least one of a left context and a right context for each occurrence of the repeat, the left context of the occurrence of the repeat being a word which immediately precedes the occurrence of the repeat in the document collection sequence, the right context of the occurrence of the repeat being a word which immediately follows the occurrence of the repeat in the document collection sequence;
- generating a representation for at least one document in the collection of documents based on occurrence, in the document, of repeats from the set of repeats and their identified at least one of left and right contexts; and
- wherein at least one of the defining a sequence, identifying a set of repeats, and generating a representation is performed by a computer processor.
Dependent claims: 16, 17, 18, 19, 20, 21, 22
23. A method for representing a textual document based on the occurrence of repeats, comprising:
- receiving a collection of documents;
- defining a sequence representing words forming the collection of documents;
- identifying a set of repeats within the sequence, the set of repeats comprising all subsequences of the sequence which each occur more than once, the repeats being identified regardless of length;
- from the set of repeats, identifying a subset of the repeats that are at least one of:
  - both left and right context diverse, and
  - both left and right context unique;
- generating a vectorial representation for at least one document in the collection of documents based on occurrence, in the document, of repeats identified as being in the subset of repeats; and
- wherein at least one of the defining a sequence, identifying a set of repeats, identifying a subset of the repeats, and generating a representation is performed by a computer processor.
Dependent claims: 24
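The steps of claim 23 can be sketched end to end in Python for the first of the two subset options only: repeats that are both left and right context diverse. Everything named here is hypothetical, including the functions `ngram_occurrences`, `diverse_repeats`, and `vectorize`, the sample texts, and the per-document boundary markers `<start0>`, `<end0>`, and so on, which are assumed so that every occurrence has a left and right context and no repeat crosses a document boundary. The quadratic enumeration is for readability, not efficiency.

```python
from collections import defaultdict


def ngram_occurrences(docs_tokens):
    """Map each word n-gram to a list of (doc_index, left_word, right_word)."""
    occ = defaultdict(list)
    for d, tokens in enumerate(docs_tokens):
        # Assumption: unique boundary markers around each document.
        padded = [f"<start{d}>"] + tokens + [f"<end{d}>"]
        n = len(padded)
        for i in range(1, n - 1):          # first word of the n-gram
            for j in range(i + 1, n):      # index just past its last word
                occ[tuple(padded[i:j])].append((d, padded[i - 1], padded[j]))
    return occ


def diverse_repeats(occ):
    """Keep n-grams that occur more than once and appear in at least two
    different left contexts and at least two different right contexts."""
    selected = []
    for gram, hits in occ.items():
        if len(hits) < 2:
            continue
        lefts = {left for _, left, _ in hits}
        rights = {right for _, _, right in hits}
        if len(lefts) >= 2 and len(rights) >= 2:
            selected.append(gram)
    return sorted(selected)


def vectorize(docs_tokens):
    """Return (features, vectors): one count vector per document, with one
    dimension per selected repeat."""
    occ = ngram_occurrences(docs_tokens)
    features = diverse_repeats(occ)
    index = {gram: k for k, gram in enumerate(features)}
    vectors = [[0] * len(features) for _ in docs_tokens]
    for gram in features:
        for d, _, _ in occ[gram]:
            vectors[d][index[gram]] += 1
    return features, vectors


docs = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "a dog sat on a log",
]
features, vectors = vectorize([d.split() for d in docs])
```

With these three texts, the single word "cat" is dropped because it is always preceded by "the", while "the cat" and "sat on" are kept because their neighbouring words differ on both sides.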
Specification