BAG-OF-REPEATS REPRESENTATION OF DOCUMENTS
Abstract
A system and method for representing a textual document based on the occurrence of repeats are disclosed. The system includes a sequence generator which defines a sequence representing words forming a collection of documents. A repeat calculator identifies a set of repeats within the sequence, the set of repeats comprising subsequences of the sequence which each occur more than once. A representation generator generates a representation for at least one document in the collection of documents based on occurrence, in the document, of repeats from the set of repeats.
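The three components named in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`build_sequence`, `find_repeats`, `represent`), the whitespace tokenization, the per-document separator symbols, the `max_len` cap, and the brute-force n-gram scan are all assumptions made for illustration. Overlapping occurrences are counted as separate occurrences here, which is also an assumption.

```python
def build_sequence(docs):
    """Sequence generator (sketch): concatenate the words of all documents.

    A unique separator tuple is appended after each document so that no
    repeat can span a document boundary. `owner[i]` records which document
    the symbol at position i came from (None for separators).
    """
    seq, owner = [], []
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():  # assumed tokenization
            seq.append(word)
            owner.append(doc_id)
        seq.append(("<sep>", doc_id))  # hypothetical separator symbol
        owner.append(None)
    return seq, owner


def find_repeats(seq, max_len=5):
    """Repeat calculator (sketch): subsequences occurring more than once.

    Returns {subsequence tuple: [start positions]}. Brute force over
    n-grams up to `max_len`, chosen for clarity rather than efficiency.
    """
    repeats = {}
    for n in range(1, max_len + 1):
        positions = {}
        for i in range(len(seq) - n + 1):
            gram = tuple(seq[i:i + n])
            if any(isinstance(sym, tuple) for sym in gram):
                continue  # skip anything containing a separator
            positions.setdefault(gram, []).append(i)
        found = {g: p for g, p in positions.items() if len(p) > 1}
        if not found:
            break  # no repeat of length n, so none of length n + 1 either
        repeats.update(found)
    return repeats


def represent(docs, max_len=5):
    """Representation generator (sketch): per-document repeat counts."""
    seq, owner = build_sequence(docs)
    repeats = find_repeats(seq, max_len)
    vocab = sorted(repeats)
    vectors = [[0] * len(vocab) for _ in docs]
    for j, rep in enumerate(vocab):
        for pos in repeats[rep]:
            vectors[owner[pos]][j] += 1
    return vocab, vectors


vocab, vectors = represent(["the cat sat", "the cat ran", "a dog ran"])
# vocab   -> [('cat',), ('ran',), ('the',), ('the', 'cat')]
# vectors -> [[1, 0, 1, 1], [1, 1, 1, 1], [0, 1, 0, 0]]
```

Each document is thus described by how often each repeat occurs in it, with multi-word repeats such as `('the', 'cat')` appearing alongside single words.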
24 Claims
1. A system for representing a textual document based on the occurrence of repeats, comprising:
a sequence generator which defines a sequence representing words forming a collection of documents;
a repeat calculator which identifies a set of repeats within the sequence, the set of repeats comprising subsequences of the sequence which each occur more than once;
a representation generator which generates a representation for at least one document in the collection of documents based on occurrence, in the document, of repeats from the set of repeats; and
a processor which implements the sequence generator, repeat calculator, and representation generator.
Dependent claims: 2-17.

18. A method for representing a textual document based on the occurrence of repeats, comprising:
receiving a collection of text documents;
defining a sequence representing words forming the collection of documents;
identifying a set of repeats within the sequence, the set of repeats comprising subsequences of the sequence which each occur more than once;
generating a representation for at least one document in the collection of documents based on occurrence, in the document, of repeats from the set of repeats; and
wherein at least one of the defining a sequence, identifying a set of repeats, and generating a representation is performed by a computer processor.
Dependent claims: 19-23.

24. A method for representing a textual document based on the occurrence of repeats, comprising:
receiving a collection of documents;
defining a sequence representing words forming the collection of documents;
identifying a set of repeats within the sequence, the set of repeats comprising all subsequences of the sequence which each occur more than once;
from the set of repeats, identifying a subset of the repeats that are at least one of: both left and right context diverse and both left and right context unique;
generating a vectorial representation for at least one document in the collection of documents based on occurrence, in the document, of repeats identified as being in the subset of repeats; and
wherein at least one of the defining a sequence, identifying a set of repeats, identifying a subset of the repeats, and generating a representation is performed by a computer processor.
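The subset selection in claim 24 can be sketched as a filter over the repeats. The claims listed here do not define "context diverse" or "context unique", so the sketch below assumes working definitions: a repeat is left (right) context diverse if its occurrences are preceded (followed) by at least two different symbols, and left (right) context unique if every occurrence is preceded (followed) by a different symbol. The function name `classify_repeats` and the boundary markers are hypothetical.

```python
def classify_repeats(seq, repeats):
    """Split repeats into context-diverse and context-unique sets (sketch).

    `seq` is the symbol sequence; `repeats` maps each repeat (a tuple of
    symbols) to the list of its start positions in `seq`.
    ASSUMED definitions, not taken from the claims:
      diverse: at least two distinct neighbouring symbols on that side
      unique:  all neighbouring symbols on that side are distinct
    """
    diverse, unique = set(), set()
    for rep, positions in repeats.items():
        n = len(rep)
        # Hypothetical boundary markers stand in for a missing neighbour.
        left = [seq[p - 1] if p > 0 else ("<start>",) for p in positions]
        right = [seq[p + n] if p + n < len(seq) else ("<end>",)
                 for p in positions]
        if len(set(left)) > 1 and len(set(right)) > 1:
            diverse.add(rep)
        if len(set(left)) == len(left) and len(set(right)) == len(right):
            unique.add(rep)
    return diverse, unique


seq = ["x", "a", "b", "y", "z", "a", "b", "w"]
repeats = {("a",): [1, 5], ("b",): [2, 6], ("a", "b"): [1, 5]}
diverse, unique = classify_repeats(seq, repeats)
# diverse -> {('a', 'b')}
# unique  -> {('a', 'b')}
```

In this example `('a',)` is always followed by `b` and `('b',)` is always preceded by `a`, so only the longer repeat `('a', 'b')` survives the filter. Under the assumed definitions, any context-unique repeat is also context diverse, because a repeat has at least two occurrences.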
Specification