Header-token driven automatic text segmentation
First Claim
1. A method of automatic text segmentation, the method comprising:
- estimating, for each token in a set of tokens in a description, through use of a machine having one or more processors, a probability that the token is irrelevant;
- associating, with each token in the set of tokens in the description, one of a first value, a second value, or a third value, based on whether, respectively, the token occurs in a header of the description, a lexical association exists between the token and a token in the header, or the lexical association is absent and the token does not occur in the header;
- iterating, through use of the machine, over a plurality of groups of sequential tokens in the description, in each iteration, selecting a group, computing a relevance value of the selected group based, at least in part, on at least one estimated probability of one or more tokens outside the selected group and on values associated with one or more tokens in the selected group; and
- indicating, through use of the machine, one of the plurality of groups as having a greatest relevance value.
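The steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. The three constants, the `associations` lookup, the function names, and the choice to combine values as a sum of logarithms are all assumptions made for illustration; the claim only requires that the relevance value depend on irrelevance probabilities of tokens outside the group and on the values of tokens inside it.

```python
import math

# Hypothetical first, second, and third values (must be strictly positive).
IN_HEADER, ASSOCIATED, UNRELATED = 0.9, 0.6, 0.1


def token_value(token, header_tokens, associations):
    """Pick one of the three values for a token, based on the header."""
    if token in header_tokens:
        return IN_HEADER
    # `associations` (hypothetical) maps a token to tokens it is lexically tied to.
    if associations.get(token, set()) & header_tokens:
        return ASSOCIATED
    return UNRELATED


def best_segment(tokens, header_tokens, p_irrelevant, associations):
    """Score every group of sequential tokens; return (start, end, score) of the best."""
    values = [token_value(t, header_tokens, associations) for t in tokens]
    best = None
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            # Tokens inside the group contribute their header-based value.
            inside = sum(math.log(values[i]) for i in range(start, end))
            # Tokens outside the group contribute their irrelevance probability.
            outside = sum(
                math.log(p_irrelevant[tokens[i]])
                for i in range(len(tokens))
                if not start <= i < end
            )
            score = inside + outside
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


tokens = ["free", "shipping", "red", "leather", "wallet", "thanks"]
header = {"leather", "wallet"}
p_irr = {"free": 0.9, "shipping": 0.9, "red": 0.3,
         "leather": 0.2, "wallet": 0.2, "thanks": 0.9}
start, end, _ = best_segment(tokens, header, p_irr, {"red": {"leather"}})
segment = tokens[start:end]
```

With these made-up inputs the highest-scoring group is the span "red leather wallet": each of those tokens has a header-based value above its irrelevance probability, while the surrounding tokens do not. The exhaustive double loop is quadratic in the number of tokens, which is acceptable for a sketch but not a claim about how the patent enumerates groups.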
Abstract
A method and a system to automatically segment text based on header tokens are described. A relevance value and an irrelevance value are determined for each token in a description, with no token left out of the computation. The irrelevance value is based on occurrences of a token in a sample set of descriptions. The relevance value is an estimated probability of relevance based on the header of the description being segmented.
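One way to read "based on occurrences of a token in a sample set of descriptions" is as a smoothed document frequency: a token that appears in most sample descriptions is likely boilerplate. The sketch below assumes that reading. The function name `estimate_irrelevance`, the whitespace tokenization, the lowercasing, and the additive smoothing are illustrative assumptions, not details taken from the abstract.

```python
from collections import Counter


def estimate_irrelevance(sample_descriptions, smoothing=1.0):
    """Return a function mapping a token to a smoothed share of samples containing it."""
    doc_freq = Counter()
    for description in sample_descriptions:
        # Count each token once per description (document frequency).
        doc_freq.update(set(description.lower().split()))
    total = len(sample_descriptions)

    def p_irrelevant(token):
        # Additive smoothing keeps the estimate strictly between 0 and 1.
        return (doc_freq[token.lower()] + smoothing) / (total + 2 * smoothing)

    return p_irrelevant


samples = [
    "free shipping on all orders",
    "free returns",
    "leather wallet free shipping",
]
p_irrelevant = estimate_irrelevance(samples)
```

Here "free" occurs in all three samples and gets a high estimate, while a token never seen in the samples gets a low but non-zero one, so the estimate can be used safely inside a logarithm or a product.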
50 Citations
7 Claims
1. A method of automatic text segmentation, the method comprising:
- estimating, for each token in a set of tokens in a description, through use of a machine having one or more processors, a probability that the token is irrelevant;
- associating, with each token in the set of tokens in the description, one of a first value, a second value, or a third value, based on whether, respectively, the token occurs in a header of the description, a lexical association exists between the token and a token in the header, or the lexical association is absent and the token does not occur in the header;
- iterating, through use of the machine, over a plurality of groups of sequential tokens in the description, in each iteration, selecting a group, computing a relevance value of the selected group based, at least in part, on at least one estimated probability of one or more tokens outside the selected group and on values associated with one or more tokens in the selected group; and
- indicating, through use of the machine, one of the plurality of groups as having a greatest relevance value.
Dependent claims: 2, 3, 4
5. A non-transitory machine-readable storage medium comprising a set of instructions which, when executed by one or more processors of a machine, cause the machine to perform operations comprising:
- estimating, for each token in a set of tokens in a description, a probability that the token is irrelevant;
- associating, with each token in the set of tokens in the description, one of a first value, a second value, or a third value, based on whether, respectively, the token occurs in a header of the description, a lexical association exists between the token and a token in the header, or the lexical association is absent and the token does not occur in the header;
- iterating over a plurality of groups of sequential tokens in the description, in each iteration, selecting a group, computing a relevance value of the selected group based, at least in part, on at least one estimated probability of one or more tokens outside the selected group and on values associated with one or more tokens in the selected group; and
- indicating one of the plurality of groups as having a greatest relevance value.
Dependent claims: 6, 7
Specification