Header-token driven automatic text segmentation

US 20080162520A1
Filed: 12/28/2006
Published: 07/03/2008
Est. Priority Date: 12/28/2006
Status: Active Grant



Abstract

A method and a system to automatically segment text based on header tokens are described. A relevance value and an irrelevance value are determined for each token in a description, assuming no tokens are left out of computations. The irrelevance value is based on occurrences of a token in a sample set of descriptions. The relevance value is an estimated probability of relevance based on the header of the description being segmented.

30 Claims

1. A system comprising:
- a first machine configured to transmit a header and a corresponding unstructured description; and
- a second machine in at least selective communication with the first machine, the second machine configured to,
  - receive the header and the unstructured description,
  - determine relevant state values and irrelevant state values for a set of tokens in the unstructured description based on the header,
  - indicate as most relevant a sequence of tokens based, at least in part, on the relevant state values of the sequence of tokens and the irrelevant state values of those of the set of tokens outside of the sequence of tokens.
- Dependent claims (2, 3, 4):
  - 2. The system of claim 1, further comprising a database configured to host a plurality of descriptions and respective headers, wherein the second machine is configured to examine the received description and header and the plurality of descriptions and respective headers to determine the values.
  - 3. The system of claim 2, wherein the irrelevant state value for a particular one of the set of tokens in the unstructured description is based on frequency of the particular token in the plurality of descriptions, wherein the relevant state value for the particular token is based on frequency that the particular token occurs in both descriptions and respective headers in the database and frequency that the particular token occurs in descriptions in the database with lexically associated header tokens.
  - 4. The system of claim 1, wherein the first and the second machine are communicatively coupled as peers or as client-server.
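
The frequency-based estimates recited in claims 2 and 3 can be illustrated with a small sketch. This is not the patented implementation: the function names `tokenize` and `estimate_values`, the whitespace tokenizer, and the additive smoothing are all assumptions made for illustration, and the lexical-association term of claim 3 is left out.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the claims do not fix a tokenizer.
    return text.lower().split()


def estimate_values(corpus, smoothing=1.0):
    """Estimate per-token values from a list of (header, description) pairs.

    Returns (irrelevant, relevant):
      irrelevant[t]: smoothed share of all description tokens that are t
      relevant[t]:   smoothed fraction of descriptions containing t whose
                     header also contains t
    """
    token_counts = Counter()   # total occurrences across all descriptions
    in_desc = Counter()        # number of descriptions containing the token
    in_both = Counter()        # ... whose header also contains the token
    total_tokens = 0

    for header, description in corpus:
        header_tokens = set(tokenize(header))
        desc_tokens = tokenize(description)
        token_counts.update(desc_tokens)
        total_tokens += len(desc_tokens)
        for tok in set(desc_tokens):
            in_desc[tok] += 1
            if tok in header_tokens:
                in_both[tok] += 1

    vocab = len(token_counts)
    irrelevant = {
        t: (c + smoothing) / (total_tokens + smoothing * vocab)
        for t, c in token_counts.items()
    }
    relevant = {
        t: (in_both[t] + smoothing) / (in_desc[t] + 2 * smoothing)
        for t in token_counts
    }
    return irrelevant, relevant


corpus = [
    ("red bike", "red bike with free shipping"),
    ("blue bike", "blue bike fast shipping"),
]
irrelevant, relevant = estimate_values(corpus)
```

In this toy corpus, "bike" appears in both descriptions and both headers, so its relevant value is high, while "shipping" never appears in a header and scores low.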

5. A method of automatic text segmentation, the method comprising the acts of:
- estimating for each of a set of tokens in a description,
  - a first probability that the token occurs as an irrelevant token in the description,
  - a second probability that the token occurs as a relevant token in the description based, at least in part, on a header for the description; and
- identifying a group of sequential tokens in the description with a maximum probability of relevance based, at least in part, on the computed first probabilities of those of the set of tokens outside of the group of sequential tokens and the computed second probabilities of those of the set of tokens in the group of sequential tokens.
- Dependent claims (6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19):
  - 6. The method of claim 5, wherein said act of estimating the second probability comprises the act of determining at least one of if the token occurs in the header and if a lexical association exists between the token and a token in the header.
  - 7. The method of claim 6, wherein the lexical association represents a probability that the token occurs in the description given the observance of the header token.
  - 8. The method of claim 5, wherein the first probability for the token is estimated based on frequency of occurrence of the token in a plurality of descriptions, wherein the second probability for the token is estimated based on occurrence of the token in the plurality of descriptions and occurrence of the token or a lexically associated token in headers of the plurality of descriptions.
  - 9. The method of claim 5, wherein said identifying comprises tagging the identified sequence of tokens as a relevant segment of the description.
  - 10. The method of claim 9, further comprising tagging a remainder of the description as one or more irrelevant segments.
  - 11. The method of claim 10, further comprising revising the estimated probabilities based, at least in part, on said tagging.
  - 12. The method of claim 15, further comprising assigning pre-defined probabilities to a pre-defined set of tokens.
  - 13. The method of claim 5, further comprising ignoring one or more pre-defined tokens that occur in the description.
  - 14. The method of claim 5, further comprising:
    - receiving the description and the header; and
    - adding the header and the description to a plurality of descriptions.
  - 15. The method of claim 14, further comprising maintaining first and second probabilities for at least some tokens that occur in the plurality of descriptions.
  - 16. The method of claim 5, wherein said identifying comprises:
    - iterating over different groups of sequential tokens in the description; and
    - for each iteration, selecting a group of sequential tokens and computing the probability of relevance for the selected groups based on the first probabilities of those tokens outside of the selected group and on the second probabilities of those tokens in the selected group.
  - 17. The method of claim 16, wherein said computing the probability of relevance for the selected group further comprises factoring in probability of transitioning between relevance to irrelevance with respect to a previously selected group.
  - 18. The method of claim 16, wherein said computing the probability of relevance for the selected group comprises computing the sum of logarithms of the probabilities.
  - 19. The method of claim 5, embodied as a set of instructions encoded in one or more machine-readable media.
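
The search recited in claims 5, 16, and 18 can be sketched as an exhaustive scan over contiguous token groups, scoring each group as a sum of logarithms. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `best_segment` and the prefix-sum bookkeeping are choices made here, all probabilities are assumed to be strictly positive, and the transition probability of claim 17 is omitted.

```python
import math


def best_segment(p_irrelevant, p_relevant):
    """Return (start, end, score) for the best contiguous group, end exclusive.

    p_irrelevant[k]: first probability (token k is irrelevant)
    p_relevant[k]:   second probability (token k is relevant)
    The score is the sum of log p_relevant inside the group plus the sum of
    log p_irrelevant outside it. Returns None for an empty description.
    """
    n = len(p_irrelevant)
    # Prefix sums of log-probabilities so each group is scored in O(1).
    pre_irr = [0.0]
    pre_rel = [0.0]
    for irr, rel in zip(p_irrelevant, p_relevant):
        pre_irr.append(pre_irr[-1] + math.log(irr))
        pre_rel.append(pre_rel[-1] + math.log(rel))
    total_irr = pre_irr[n]

    best = None
    for start in range(n):
        for end in range(start + 1, n + 1):
            inside = pre_rel[end] - pre_rel[start]
            outside = total_irr - (pre_irr[end] - pre_irr[start])
            score = inside + outside
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


# Four tokens; the middle two look relevant, the outer two look irrelevant.
start, end, score = best_segment([0.9, 0.1, 0.1, 0.9], [0.1, 0.9, 0.9, 0.1])
```

Working in log space avoids numeric underflow when many small probabilities would otherwise be multiplied together, which is one plausible motivation for the sum of logarithms in claim 18.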

20. A method of automatic text segmentation, the method comprising the acts of:
- estimating for each token in set of tokens in a description, a probability that the token is irrelevant for the description;
- associating with each token in the set of tokens in the description, one of a first, second or third values dependent on whether respectively,
  - the token occurs in a header for the description,
  - a lexical association exists between the token and a token in the header,
  - the lexical association is absent and the token does not occur in the header;
- iterating over a plurality of groups of sequential tokens in the description, in each iteration,
  - selecting a group,
  - computing a relevance value for the selected group based, at least in part, on the estimated probability of one or more tokens out of the selected group and values associated with one or more tokens in the selected group; and
- indicating the one of the plural groups having a greatest relevance value.
- Dependent claims (21, 22, 23, 24, 25, 26):
  - 21. The method of claim 20, wherein said act of estimating the probability for the token comprises examining a plurality of descriptions and determining a frequency of occurrence of the token throughout the plurality of descriptions.
  - 22. The method of claim 20, wherein the first value represents an estimated probability that the token is relevant given occurrence of the token in the header, wherein the second value represents an estimated probability that the token is relevant given the occurrence of a lexically associated token in the header, wherein the third value represents an estimated probability that the token is relevant given the token does not occur in the header and is not lexically associated with a header token.
  - 23. The method of claim 20, wherein said indicating comprises tagging the description.
  - 24. The method of claim 20, embodied as a set of instructions encoded in one or more machine-readable media.
  - 25. The method of claim 20, wherein the description corresponds to one of an abstract item and a concrete item.
  - 26. The method of claim 20, wherein the set of tokens are encoded as one of a set consisting essentially of ASCII, Universal Character Set, ANSI, Double Byte Character Sets, and Unicode.
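
The three-way association of claims 20 and 22 can be illustrated as a lookup that assigns each description token one of three values. The function name `relevance_values`, the `associations` mapping from header tokens to lexically associated tokens, and the numeric defaults (0.8, 0.5, 0.1) are hypothetical placeholders; the claims do not specify how the associations or the values are obtained.

```python
def relevance_values(description_tokens, header_tokens, associations,
                     in_header=0.8, associated=0.5, neither=0.1):
    """Assign the first, second, or third value to each description token.

    associations: hypothetical dict mapping a header token to the set of
    description tokens lexically associated with it.
    """
    header = set(header_tokens)
    # Collect every token lexically associated with any header token.
    associated_tokens = set()
    for header_token in header:
        associated_tokens |= associations.get(header_token, set())

    values = []
    for tok in description_tokens:
        if tok in header:
            values.append(in_header)       # first value: token occurs in header
        elif tok in associated_tokens:
            values.append(associated)      # second value: lexical association
        else:
            values.append(neither)         # third value: neither applies
    return values


values = relevance_values(
    ["free", "shipping", "bike", "helmet"],
    ["bike"],
    {"bike": {"helmet"}},
)
```

The resulting list could serve as the per-token relevance input to a group search, with separately estimated irrelevance probabilities covering the tokens outside each candidate group.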

27. A set of instructions encoded in one or more machine-readable media, the set of instructions comprising:
- a first sequence of instructions executable to, for each of a plurality of tokens of a description, associate an estimated probability of irrelevance and an estimated probability of relevance based on a set of one or more tokens in a header for the description; and
- a second sequence of instructions executable to indicate a group of sequential tokens of the plurality of tokens based, at least in part, on the estimated probabilities of relevance associated by the first sequence of instructions with those of the plurality of tokens in the group and on the estimated probabilities of irrelevance associated by the first sequence of instructions with those of the plurality of tokens outside of the group.
- Dependent claims (28):
  - 28. The set of instructions of claim 27, further comprising a third sequence of instructions executable to compute, for each of the plurality of tokens, the estimated probability of irrelevance for the token based on occurrence of the token in a plurality of descriptions, and executable to compute an estimated probability of relevance based on one of occurrence of the token in the header and lexical association between the token and a token in the header.

29. An apparatus comprising:
- memory operable to host a description of a product or service and a header for the description, wherein the description is represented with a plurality of tokens; and
- means for automatically identifying a group of sequential tokens of the plurality of tokens as most relevant to the description based on estimated probabilities of relevance of the plurality of tokens based, at least in part, on the header and estimated probabilities of irrelevance of the plurality of tokens.
- Dependent claims (30):
  - 30. The apparatus of claim 29, further comprising means for computing the estimated probabilities.

Current Assignee: PayPal, Inc. (PayPal Holdings, Inc.)
Original Assignee: eBay Inc.
Inventors: Sarwar, Badrul M.; Mount, John A.
Granted Patent: US 8,631,005 B2
US Class Current: 707/101
CPC Class Codes:
G06F 16/22   Indexing; Data structures t...
G06F 16/24578   using ranking
G06F 16/285   Clustering or classification
G06F 40/284   Lexical analysis, e.g. toke...
G06F 40/289   Phrasal analysis, e.g. fini...
