Header-token driven automatic text segmentation

US 8,631,005 B2
Filed: 12/28/2006
Issued: 01/14/2014
Est. Priority Date: 12/28/2006
Status: Active Grant

Abstract

A method and a system to automatically segment text based on header tokens are described. A relevance value and an irrelevance value are determined for each token in a description, assuming no tokens are left out of computations. The irrelevance value is based on occurrences of a token in a sample set of descriptions. The relevance value is an estimated probability of relevance based on the header of the description being segmented.
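As a rough illustration of how an irrelevance value might be derived from token occurrences in a sample set of descriptions, the following Python sketch counts the fraction of sample descriptions that contain each token. The function name `estimate_irrelevance`, the whitespace tokenization, the add-one smoothing, and the sample strings are all assumptions made for this example; the listing does not state the actual formula.

```python
from collections import Counter


def estimate_irrelevance(sample_descriptions):
    """Map each token to a smoothed fraction of sample descriptions containing it.

    Hypothetical helper: tokens that show up in most descriptions (for example
    shipping boilerplate) receive a higher irrelevance estimate.
    """
    doc_freq = Counter()
    for description in sample_descriptions:
        # Count each token at most once per description (document frequency).
        doc_freq.update(set(description.lower().split()))
    n = len(sample_descriptions)
    # Add-one smoothing (an assumption) keeps estimates strictly between 0 and 1.
    return {tok: (count + 1) / (n + 2) for tok, count in doc_freq.items()}


# Illustrative sample descriptions, invented for this sketch.
samples = [
    "free shipping on this red bicycle",
    "free shipping vintage camera",
    "free shipping and returns leather wallet",
]
p_irrelevant = estimate_irrelevance(samples)
```

Under these assumptions, a token such as "free" that appears in every sample receives a higher irrelevance estimate than a token such as "bicycle" that appears in only one.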

7 Claims

1. A method of automatic text segmentation, the method comprising:
- estimating, for each token in a set of tokens in a description, through use of a machine having one or more processors, a probability that the token is irrelevant;
- associating, with each token in the set of tokens in the description, one of a first value, a second value, or a third value, based on whether, respectively,
  - the token occurs in a header of the description,
  - a lexical association exists between the token and a token in the header, or
  - the lexical association is absent and the token does not occur in the header;
- iterating, through use of the machine, over a plurality of groups of sequential tokens in the description, in each iteration,
  - selecting a group,
  - computing a relevance value of the selected group based, at least in part, on at least one estimated probability of one or more tokens outside the selected group and on values associated with one or more tokens in the selected group; and
- indicating, through use of the machine, one of the plurality of groups as having a greatest relevance value.

2. The method of claim 1, wherein estimating the probability that the token is irrelevant includes examining a plurality of descriptions and determining a frequency of occurrence of the token throughout the plurality of descriptions.

3. The method of claim 1, wherein the first value represents a first estimated probability that the token is relevant given occurrence of the token in the header, wherein the second value represents a second estimated probability that the token is relevant given the occurrence of a lexically associated token in the header, and wherein the third value represents a third estimated probability that the token is relevant given that the token does not occur in the header and is not lexically associated with a token in the header.

4. The method of claim 1, wherein indicating the one of the plurality of groups includes tagging the description.
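
The steps recited in claim 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the three constants, the `associations` lookup, the default irrelevance of 0.5, and the log-sum scoring rule are all hypothetical choices made for illustration. The claim only requires that a group's relevance value depend on values of tokens inside the group and on irrelevance probabilities of tokens outside it.

```python
import math

# Assumed relevance values for the three cases of claim 1 (illustrative only).
IN_HEADER, ASSOCIATED, NEITHER = 0.9, 0.6, 0.2


def token_value(token, header_tokens, associations):
    """Pick the first, second, or third value for a token."""
    if token in header_tokens:
        return IN_HEADER
    # `associations` is a hypothetical map from a token to lexically related tokens.
    if associations.get(token, set()) & header_tokens:
        return ASSOCIATED
    return NEITHER


def best_segment(tokens, header_tokens, associations, p_irrelevant, default=0.5):
    """Return (score, start, end) of the highest-scoring run tokens[start:end]."""
    values = [token_value(t, header_tokens, associations) for t in tokens]
    irrel = [p_irrelevant.get(t, default) for t in tokens]
    best = (-math.inf, 0, 0)
    # Iterate over every non-empty group of sequential tokens.
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            # Assumed scoring rule: log relevance values inside the group
            # plus log irrelevance probabilities outside it.
            inside = sum(math.log(v) for v in values[start:end])
            outside = sum(
                math.log(p) for i, p in enumerate(irrel) if not start <= i < end
            )
            score = inside + outside
            if score > best[0]:
                best = (score, start, end)
    return best


# Illustrative inputs, invented for this sketch.
header = {"red", "bicycle"}
associations = {"bike": {"bicycle"}}
p_irrelevant = {"free": 0.8, "shipping": 0.8, "with": 0.9, "fast": 0.7, "delivery": 0.7}
tokens = "free shipping red bike with basket fast delivery".split()

score, start, end = best_segment(tokens, header, associations, p_irrelevant)
segment = tokens[start:end]
```

With these inputs the indicated group is `["red", "bike"]`: "red" occurs in the header, "bike" is lexically associated with the header token "bicycle", and every other token scores better when left outside the group. The exhaustive double loop is quadratic in the number of tokens, which is acceptable for short descriptions.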

5. A non-transitory machine-readable storage medium comprising a set of instructions which, when executed by one or more processors of a machine, cause the machine to perform operations comprising:
- estimating, for each token in a set of tokens in a description, a probability that the token is irrelevant;
- associating, with each token in the set of tokens in the description, one of a first value, a second value, or a third value, based on whether, respectively,
  - the token occurs in a header of the description,
  - a lexical association exists between the token and a token in the header, or
  - the lexical association is absent and the token does not occur in the header;
- iterating over a plurality of groups of sequential tokens in the description, in each iteration,
  - selecting a group,
  - computing a relevance value of the selected group based, at least in part, on at least one estimated probability of one or more tokens outside the selected group and on values associated with one or more tokens in the selected group; and
- indicating one of the plurality of groups as having a greatest relevance value.

6. The machine-readable storage medium of claim 5, wherein the description corresponds to one of an abstract item and a concrete item.

7. The machine-readable storage medium of claim 5, wherein the set of tokens are encoded according to at least one of ASCII, Universal Character Set, ANSI, Double Byte Character Sets, or Unicode.

Current Assignee: PayPal, Inc. (PayPal Holdings, Inc.)
Original Assignee: eBay Inc.
Inventors: Sarwar, Badrul M.; Mount, John A.
Primary Examiner(s): Fleurantin, Jean B
Assistant Examiner(s): Ly, Anh
Application Number: US11/646,900
Publication Number: US 20080162520A1
Time in Patent Office: 2,574 Days
Field of Search: 707/5, 707/6, 707/10, 707/728, 707/749, 707/917, 707/750, 707/E17.064, 707/999.101, 715/513, 715/531, 715/233, 715/254, 715/237, 715/256, 715/FOR.239, 715/FOR.241, 704/1, 704/9, 704/4, 704/8, 704/258, 704/E15.005, 704/256.4
US Class Current: 707/728
CPC Class Codes:
- G06F 16/22   Indexing; Data structures t...
- G06F 16/24578   using ranking
- G06F 16/285   Clustering or classification
- G06F 40/284   Lexical analysis, e.g. toke...
- G06F 40/289   Phrasal analysis, e.g. fini...
