Methods and systems for improving text segmentation

US 7,680,648 B2
Filed: 09/30/2004
Issued: 03/16/2010
Est. Priority Date: 09/30/2004
Status: Active Grant

Abstract

Methods and systems for improving text segmentation are disclosed. In one embodiment, at least a first segmented result and a second segmented result are determined from a string of characters, a first frequency of occurrence for the first segmented result and a second frequency of occurrence for the second segmented result are determined, and an operable segmented result is identified from the first segmented result and the second segmented result based at least in part on the first frequency of occurrence and the second frequency of occurrence.

87 Citations

31 Claims

1. A computer-implemented method, comprising:
- receiving a string of characters that comprises a plurality of characters with no token-delineating breaks;
  
  segmenting the string of characters into a first segmented result that comprises a first plurality of tokens and at least one break, wherein the first plurality of tokens includes all of the plurality of characters;
  
  segmenting the string of characters into a second segmented result that comprises a second plurality of tokens and at least one break, wherein the second plurality of tokens includes all the plurality of characters, and wherein the second plurality of tokens is different than the first plurality of tokens;
  
  determining a first frequency of occurrence for the first segmented result in a corpus and a second frequency of occurrence for the second segmented result in the corpus by providing the first segmented result and second segmented result to a search engine and receiving in response from the search engine the first frequency of occurrence for the first segmented result and the second frequency of occurrence for the second segmented result;
  
  comparing the first frequency of occurrence for the first result to the second frequency of occurrence for the second segmented result;
  
  selecting the first segmented result as an operable segmented result for the received string of characters when the first frequency of occurrence for the first request is determined to exceed a determined value relative to the second frequency of occurrence for the second result; and
  
  providing the operable segmented result for further processing.
- - 2. The computer-implemented method of claim 1, wherein determining the first frequency of occurrence comprises determining a number of articles in the corpus containing the first segmented result that are identified by the search engine.
  - 3. The computer-implemented method of claim 2, wherein determining the number of articles in the corpus containing the first segmented result comprises determining a number of article identifiers in a search result set for the corpus that is generated in response to a search query comprising the first segmented result.
  - 4. The computer-implemented method of claim 2, wherein determining the number of articles in the corpus containing the first segmented result comprises accessing an index of articles for the corpus that is maintained by the search engine.
  - 5. The computer-implemented method of claim 1, wherein determining the first frequency of occurrence comprises determining a number of occurrences of the first segmented result in a plurality of search queries for the corpus that were previously received by the search engine.
  - 6. The computer-implemented method of claim 1, wherein the string of characters comprises a domain name.
  - 7. The computer-implemented method of claim 1, wherein the further processing comprises selecting an article based at least in part on the operable segmented result.
  - 8. The computer-implemented method of claim 7, wherein the article comprises an advertisement.
  - 9. The computer-implemented method of claim 1, wherein the further processing comprises determining whether to filter a domain name comprising the string of characters based at least in part on the operable segmented result.
  - 10. The computer-implemented method of claim 1, wherein determining the first segmented result and the second segmented result comprises:
    - segmenting the string of characters into a plurality of segmented results; and
      
      selecting the first segmented result and the second segmented result from the plurality of segmented results.
  - 11. The computer-implemented method of claim 10, wherein selecting the first segmented result and the second segmented result comprises calculating a probability value for each of the plurality of segmented results.
  - 12. The computer-implemented method of claim 11, wherein a first probability value associated with the first segmented result is based at least in part on a frequency of occurrence in the corpus for each token within the first segmented result.
  - 13. The computer-implemented method of claim 1, wherein the second segmented result is a spelling corrected version of the first segmented result.
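
The selection step of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: `TOY_HIT_COUNTS`, `toy_frequency`, and `select_operable` are hypothetical names, the hit counts are invented stand-ins for a search engine response, and reading "a determined value relative to the second frequency" as a ratio test is an assumption.

```python
from typing import Callable, Optional

# Hypothetical stand-in for search engine hit counts; a real system
# would query an index or a search result set instead.
TOY_HIT_COUNTS = {
    "used cars": 1200,
    "use dcars": 3,
}


def toy_frequency(segmented: str) -> int:
    """Return the (invented) number of articles containing the phrase."""
    return TOY_HIT_COUNTS.get(segmented, 0)


def select_operable(
    first: str,
    second: str,
    frequency: Callable[[str], int],
    ratio: float = 2.0,
) -> Optional[str]:
    """Select `first` when its frequency exceeds `ratio` times that of `second`.

    The ratio test is an assumed interpretation of the claim language.
    Returns None when the margin is not met, leaving the decision open.
    """
    first_freq = frequency(first)
    second_freq = frequency(second)
    if first_freq > ratio * second_freq:
        return first
    return None


operable = select_operable("used cars", "use dcars", toy_frequency)
```

With the toy counts, "used cars" is selected as the operable segmented result; swapping the arguments returns `None` because the margin is not met.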

14. A computer-implemented method, comprising:
- segmenting a string of characters into a plurality of segmented results, where the string of characters comprises a plurality of characters with no token-delineating breaks;
  
  selecting at least a first segmented result and a second segmented result from the plurality of segmented results, wherein the first segmented result comprises a first plurality of tokens, at least one break, and all of the plurality of characters, wherein the second segmented result comprises a second plurality of tokens, at least one break, and all of the plurality of characters, and wherein the first plurality of tokens is different than the second plurality of tokens;
  
  applying the first segmented result in a first search query to a search corpus, and receiving in response a first search results set comprising a first number of article identifiers, each article identifier in the first number of article identifiers corresponding to an article that includes the first segmented result and that is referenced in the search corpus; and
  
  applying the second segmented result in a second search query to the search corpus, and receiving in response a second search results set comprising a second number of article identifiers, each article identifier in the second number of article identifiers corresponding to an article that includes the second segmented result and that is referenced in the search corpus;
  
  comparing the first number of article identifiers for the first result to the second number of article identifiers for the second segmented result to generate a difference indicator;
  
  selecting the first segmented result as an operable segmented result for the string of characters when the difference indicator exceeds a predetermined value; and
  
  providing the operable segmented result for further processing.
- - 15. The computer-implemented method of claim 14, wherein applying the first segmented result in the first search query to a search corpus comprises generating the first search query comprising the first segmented result and providing the first search query to a search engine that processes the first search query using the search corpus, and wherein applying the second segmented result in the second search query to the search corpus comprises generating the second search query comprising the second segmented result and providing the second search query to the search engine.
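
The comparison in claim 14 counts article identifiers returned for each segmented result and derives a difference indicator. The sketch below is one possible reading, under stated assumptions: the corpus is a toy dictionary, `search` uses plain substring matching in place of a real search engine, and the difference indicator is taken to be a simple subtraction. All names are hypothetical.

```python
from typing import Dict, Optional, Set

# Toy search corpus: article identifier -> article text (invented data).
TOY_CORPUS: Dict[str, str] = {
    "doc1": "cheap used cars for sale",
    "doc2": "used cars and trucks",
    "doc3": "a use dcars typo page",
}


def search(query: str, corpus: Dict[str, str]) -> Set[str]:
    """Return identifiers of articles whose text contains the query phrase."""
    return {doc_id for doc_id, text in corpus.items() if query in text}


def difference_indicator(first: str, second: str, corpus: Dict[str, str]) -> int:
    """Assumed form of the indicator: first count minus second count."""
    return len(search(first, corpus)) - len(search(second, corpus))


def select_by_difference(
    first: str, second: str, corpus: Dict[str, str], threshold: int = 0
) -> Optional[str]:
    """Select `first` when the difference indicator exceeds `threshold`."""
    if difference_indicator(first, second, corpus) > threshold:
        return first
    return None
```

Here "used cars" matches two articles and "use dcars" matches one, so the indicator is 1 and the first result is selected for any threshold below 1.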

16. A system comprising a computer-readable storage device that stores program code, which, when executed by a processor, performs operations comprising:
- receiving a string of characters that comprises a plurality of characters with no token-delineating breaks;
  
  segmenting the string of characters into a first segmented result that comprises a first plurality of tokens and at least one break, wherein the first plurality of tokens includes all of the plurality of characters;
  
  segmenting the string of characters into a second segmented result that comprises a second plurality of tokens and at least one break, wherein the second plurality of tokens includes all the plurality of characters, wherein the second plurality of tokens is different than the first plurality of tokens;
  
  determining a first frequency of occurrence for the first segmented result in a corpus and a second frequency of occurrence for the second segmented result in the corpus by providing the first segmented result and second segmented result to a search engine and receiving in response from the search engine the first frequency of occurrence for the first segmented result and the second frequency of occurrence for the second segmented result;
  
  comparing the first frequency of occurrence for the first result to the second frequency of occurrence for the second segmented result;
  
  selecting the first segmented result as an operable segmented result for the received string of characters when the first frequency of occurrence for the first request is determined to exceed a determined value relative to the second frequency of occurrence for the second result; and
  
  providing the operable segmented result for further processing.
- - 17. The system of claim 16, wherein determining the first frequency of occurrence comprises determining a number of articles in the corpus containing the first segmented result that are identified by the search engine.
  - 18. The system of claim 17, wherein determining the number of articles in the corpus containing the first segmented result comprises determining a number of article identifiers in a search result set for the corpus that is generated in response to a search query comprising the first segmented result.
  - 19. The system of claim 17, wherein determining the number of articles in the corpus containing the first segmented result comprises accessing an index of articles for the corpus that is maintained by the search engine.
  - 20. The system of claim 16, wherein determining the first frequency of occurrence comprises determining a number of occurrences of the first segmented result in a plurality of search queries for the corpus that were previously received by the search engine.
  - 21. The system of claim 16, wherein the string of characters comprises a domain name.
  - 22. The system of claim 16, wherein the further processing comprises selecting an article based at least in part on the operable segmented result.
  - 23. The system of claim 22, wherein the article comprises an advertisement.
  - 24. The system of claim 16, wherein the further processing comprises determining whether to filter a domain name comprising the string of characters based at least in part on the operable segmented result.
  - 25. The system of claim 24, wherein determining the first segmented result and the second segmented result comprises:
    - segmenting the string of characters into a plurality of segmented results; and
      
      selecting the first segmented result and the second segmented result from the plurality of segmented results.
  - 26. The system of claim 25, wherein selecting the first segmented result and the second segmented result comprises calculating a probability value for each of the plurality of segmented results.
  - 27. The system of claim 26, wherein a first probability value associated with the first segmented result is based at least in part on a frequency of occurrence in the corpus for each token within the first segmented result.
  - 28. The system of claim 16, wherein the second segmented result is a spelling corrected version of the first segmented result.
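
Claims 25 to 27 (like claims 10 to 12) describe generating many segmented results, calculating a probability value for each from per-token corpus frequencies, and selecting two of them. A hedged sketch follows. The claims do not specify the probability model, so the product of per-token relative frequencies used here is an assumption, and `TOY_TOKEN_COUNTS` and the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Invented per-token corpus frequencies for illustration only.
TOY_TOKEN_COUNTS: Dict[str, int] = {
    "used": 50, "cars": 40, "use": 60, "d": 1, "us": 30, "ed": 3,
}


def all_segmentations(s: str) -> List[List[str]]:
    """Enumerate every segmentation of `s` that has at least one break."""
    n = len(s)
    results = []
    for k in range(1, n):  # k = number of breaks
        for breaks in combinations(range(1, n), k):
            bounds = (0, *breaks, n)
            results.append([s[a:b] for a, b in zip(bounds, bounds[1:])])
    return results


def probability(tokens: List[str], counts: Dict[str, int], total: int) -> float:
    """Assumed model: product of each token's relative frequency."""
    p = 1.0
    for token in tokens:
        p *= counts.get(token, 0) / total  # unseen tokens give probability 0
    return p


def top_two(s: str, counts: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Return the two highest-probability segmentations of `s`."""
    total = sum(counts.values())
    ranked = sorted(
        all_segmentations(s),
        key=lambda toks: probability(toks, counts, total),
        reverse=True,
    )
    return ranked[0], ranked[1]


first, second = top_two("usedcars", TOY_TOKEN_COUNTS)
```

Exhaustive enumeration grows as 2^(n-1) and is only practical for short strings such as domain names; a production system would more plausibly prune candidates with dynamic programming.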

29. A system comprising a computer-readable storage device that stores instructions, which, when executed by a processor, perform operations comprising:
- segmenting a string of characters into a plurality of segmented results, where the string of characters comprises a plurality of characters with no token-delineating breaks;
  
  selecting at least a first segmented result and a second segmented result from the plurality of segmented results, wherein the first segmented result comprises a first plurality of tokens and at least one break, and wherein the first segmented result includes all of the plurality of characters, and wherein the second segmented result comprises a second plurality of tokens and at least one break, and wherein the second segmented result includes all of the plurality of characters, and wherein the first plurality of tokens is different than the second plurality of tokens;
  
  applying the first segmented result in a query to a search corpus, and receiving in response a first number of article identifiers, each article identifier in the first number of article identifiers corresponding to an article that includes the first segmented result and that is referenced in the search corpus; and
  
  applying the second segmented result in a query to the search corpus, and receiving in response a second search results set comprising a second number of article identifiers, each article identifier in the second number of article identifiers corresponding to an article that includes the second segmented result and that is referenced in the search corpus;
  
  comparing the first number of article identifiers for the first result to the second number of article identifiers for the second segmented result to generate a difference indicator;
  
  selecting the first segmented result as an operable segmented result for the string of characters when the difference indicator exceeds a predetermined value; and
  
  providing the operable segmented result for further processing.
- - 30. The system of claim 29, wherein applying the first segmented result in the first search query to a search corpus comprises generating the first search query comprising the first segmented result and providing the first search query to a search engine that processes the first search query using the search corpus, and wherein applying the second segmented result in the second search query to the search corpus comprises generating the second search query comprising the second segmented result and providing the second search query to the search engine.

31. A method, comprising:
- receiving a domain name that comprises a plurality of characters with no token-delineating breaks;
  
  segmenting the domain name into a first segmented result that comprises a first plurality of tokens and at least one break, wherein the first plurality of tokens includes all of the plurality of characters;
  
  segmenting the domain name into a second segmented result that comprises a second plurality of tokens and at least one break, wherein the second plurality of tokens includes all of the plurality of characters;
  
  determining a first frequency of occurrence for the first segmented result in at least one of an article index, a text index, and a search result set;
  
  determining a second frequency of occurrence for the second segmented result in at least one of the article index, the text index, and the search result set;
  
  determining whether a difference between the first frequency of occurrence for the first segmented result and the second frequency of occurrence for the second segmented result exceeds an identified value;
  
  selecting the first segmented result as an operable segmented result when the difference exceeds the identified value;
  
  selecting an advertisement based at least in part on the operable segmented result, wherein the advertisement includes text associated with the operable segmented result; and
  
  causing the selected advertisement to be displayed in association with a web page that is associated with the domain name.
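
The advertisement selection step of claim 31 could be sketched as below. The claim only requires that the advertisement include text associated with the operable segmented result; the token-overlap scoring, the `TOY_ADS` list, and `select_advertisement` are assumptions made for illustration.

```python
from typing import List, Optional

# Invented advertisement texts for illustration only.
TOY_ADS: List[str] = [
    "Certified used cars near you",
    "Discount hotel rooms",
    "Car insurance quotes",
]


def select_advertisement(operable: str, ads: List[str]) -> Optional[str]:
    """Pick the ad sharing the most whole-word tokens with the operable result.

    Returns None when no ad shares any token with the segmented result.
    """
    tokens = set(operable.lower().split())
    best: Optional[str] = None
    best_overlap = 0
    for ad in ads:
        overlap = len(tokens & set(ad.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = ad, overlap
    return best


# e.g. the domain "usedcars" segmented into the operable result "used cars"
chosen = select_advertisement("used cars", TOY_ADS)
```

Matching is on whole words, so "Car insurance quotes" does not match the token "cars"; a real system would likely apply stemming or a keyword-targeting service instead.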

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Elbaz, Gilad Israel; Mandelson, Jacob Leon
Primary Examiner(s): Mofiz; Apu M
Assistant Examiner(s): Le; Hung D
Application Number: US10/955,281
Publication Number: US 20070124301A1
Time in Patent Office: 1,993 Days
Field of Search: 704/1, 704/231, 704/257, 707/2-3, 707/101-102, 382/176-177
US Class Current: 704/9
CPC Class Codes: G06F 40/284 Lexical analysis, e.g. toke...
