Methods and systems for improving text segmentation
Abstract
Methods and systems for improving text segmentation are disclosed. In one embodiment, at least a first segmented result and a second segmented result are determined from a string of characters, a first frequency of occurrence for the first segmented result and a second frequency of occurrence for the second segmented result are determined, and an operable segmented result is identified from the first segmented result and the second segmented result based at least in part on the first frequency of occurrence and the second frequency of occurrence.
87 Citations
31 Claims
1. A computer-implemented method, comprising:
receiving a string of characters that comprises a plurality of characters with no token-delineating breaks;
segmenting the string of characters into a first segmented result that comprises a first plurality of tokens and at least one break, wherein the first plurality of tokens includes all of the plurality of characters;
segmenting the string of characters into a second segmented result that comprises a second plurality of tokens and at least one break, wherein the second plurality of tokens includes all the plurality of characters, and wherein the second plurality of tokens is different than the first plurality of tokens;
determining a first frequency of occurrence for the first segmented result in a corpus and a second frequency of occurrence for the second segmented result in the corpus by providing the first segmented result and second segmented result to a search engine and receiving in response from the search engine the first frequency of occurrence for the first segmented result and the second frequency of occurrence for the second segmented result;
comparing the first frequency of occurrence for the first result to the second frequency of occurrence for the second segmented result;
selecting the first segmented result as an operable segmented result for the received string of characters when the first frequency of occurrence for the first request is determined to exceed a determined value relative to the second frequency of occurrence for the second result; and
providing the operable segmented result for further processing.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
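The selection step of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the lookup table `TOY_COUNTS` stands in for a search engine, its counts are invented, and reading "exceed a determined value relative to" as a ratio test (`min_ratio`) is an assumption made only for this example. All function and parameter names are hypothetical.

```python
from typing import Callable, Optional, Sequence

# Hypothetical stand-in for a search engine: maps a segmented phrase to
# how often it occurs in a corpus. The counts are invented for illustration.
TOY_COUNTS = {
    "used cars": 12000,
    "use dcars": 3,
}


def toy_frequency(tokens: Sequence[str]) -> int:
    """Return the (made-up) corpus frequency of the tokens joined by spaces."""
    return TOY_COUNTS.get(" ".join(tokens), 0)


def select_operable(
    first: Sequence[str],
    second: Sequence[str],
    frequency: Callable[[Sequence[str]], int] = toy_frequency,
    min_ratio: float = 2.0,  # assumed interpretation of the "determined value"
) -> Optional[Sequence[str]]:
    """Return `first` when its frequency exceeds `min_ratio` times the
    frequency of `second`; otherwise return None."""
    f1 = frequency(first)
    f2 = frequency(second)
    if f1 > min_ratio * f2:
        return first
    return None


text = "usedcars"
first = ["used", "cars"]
second = ["use", "dcars"]
# Both candidates must contain every character of the input string.
assert "".join(first) == text and "".join(second) == text
print(select_operable(first, second))
```

Passing the frequency source as a callable keeps the comparison logic separate from whatever corpus or search backend supplies the counts.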
14. A computer-implemented method, comprising:
segmenting a string of characters into a plurality of segmented results, where the string of characters comprises a plurality of characters with no token-delineating breaks;
selecting at least a first segmented result and a second segmented result from the plurality of segmented results, wherein the first segmented result comprises a first plurality of tokens, at least one break, and all of the plurality of characters, wherein the second segmented result comprises a second plurality of tokens, at least one break, and all of the plurality of characters, and wherein the first plurality of tokens is different than the second plurality of tokens;
applying the first segmented result in a first search query to a search corpus, and receiving in response a first search results set comprising a first number of article identifiers, each article identifier in the first number of article identifiers corresponding to an article that includes the first segmented result and that is referenced in the search corpus; and
applying the second segmented result in a second search query to the search corpus, and receiving in response a second search results set comprising a second number of article identifiers, each article identifier in the second number of article identifiers corresponding to an article that includes the second segmented result and that is referenced in the search corpus;
comparing the first number of article identifiers for the first result to the second number of article identifiers for the second segmented result to generate a difference indicator;
selecting the first segmented result as an operable segmented result for the string of characters when the difference indicator exceeds a predetermined value; and
providing the operable segmented result for further processing.
Dependent claims: 15
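Claim 14 starts from a plurality of segmented results and compares article counts. The sketch below shows one way this could look, under stated assumptions: candidates are enumerated against a small hypothetical vocabulary (`VOCAB`), the article counts in `HIT_COUNTS` are invented, and the difference indicator is taken to be a simple subtraction. The claim itself does not prescribe any of these choices.

```python
from typing import List, Optional, Set, Tuple

Segmentation = Tuple[str, ...]


def segmentations(text: str, vocabulary: Set[str]) -> List[Segmentation]:
    """Enumerate every split of `text` into two or more vocabulary words."""
    results: List[Segmentation] = []

    def extend(start: int, prefix: Segmentation) -> None:
        if start == len(text):
            if len(prefix) >= 2:  # at least one break
                results.append(prefix)
            return
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in vocabulary:
                extend(end, prefix + (word,))

    extend(0, ())
    return results


# Hypothetical vocabulary and invented article counts, for illustration only.
VOCAB = {"no", "now", "here", "where"}
HIT_COUNTS = {"now here": 400, "no where": 50}


def article_count(tokens: Segmentation) -> int:
    """Stand-in for the number of article identifiers a query returns."""
    return HIT_COUNTS.get(" ".join(tokens), 0)


def choose(text: str, vocabulary: Set[str], threshold: int) -> Optional[Segmentation]:
    """Pick the two most frequent candidates and return the top one when
    the difference in article counts exceeds `threshold`."""
    ranked = sorted(segmentations(text, vocabulary), key=article_count, reverse=True)
    if len(ranked) < 2:
        return None
    first, second = ranked[0], ranked[1]
    difference = article_count(first) - article_count(second)  # assumed indicator
    return first if difference > threshold else None


print(choose("nowhere", VOCAB, threshold=100))
```

Exhaustive enumeration grows quickly with string length, so a practical system would likely prune candidates; the sketch only aims to make the claimed steps concrete.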
16. A system comprising a computer-readable storage device that stores program code, which, when executed by a processor, performs operations comprising:
receiving a string of characters that comprises a plurality of characters with no token-delineating breaks;
segmenting the string of characters into a first segmented result that comprises a first plurality of tokens and at least one break, wherein the first plurality of tokens includes all of the plurality of characters;
segmenting the string of characters into a second segmented result that comprises a second plurality of tokens and at least one break, wherein the second plurality of tokens includes all the plurality of characters, wherein the second plurality of tokens is different than the first plurality of tokens;
determining a first frequency of occurrence for the first segmented result in a corpus and a second frequency of occurrence for the second segmented result in the corpus by providing the first segmented result and second segmented result to a search engine and receiving in response from the search engine the first frequency of occurrence for the first segmented result and the second frequency of occurrence for the second segmented result;
comparing the first frequency of occurrence for the first result to the second frequency of occurrence for the second segmented result;
selecting the first segmented result as an operable segmented result for the received string of characters when the first frequency of occurrence for the first request is determined to exceed a determined value relative to the second frequency of occurrence for the second result; and
providing the operable segmented result for further processing.
Dependent claims: 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28
29. A system comprising a computer-readable storage device that stores instructions, which, when executed by a processor, perform operations comprising:
segmenting a string of characters into a plurality of segmented results, where the string of characters comprises a plurality of characters with no token-delineating breaks;
selecting at least a first segmented result and a second segmented result from the plurality of segmented results, wherein the first segmented result comprises a first plurality of tokens and at least one break, and wherein the first segmented result includes all of the plurality of characters, and wherein the second segmented result comprises a second plurality of tokens and at least one break, and wherein the second segmented result includes all of the plurality of characters, and wherein the first plurality of tokens is different than the second plurality of tokens;
applying the first segmented result in a query to a search corpus, and receiving in response a first number of article identifiers, each article identifier in the first number of article identifiers corresponding to an article that includes the first segmented result and that is referenced in the search corpus; and
applying the second segmented result in a query to the search corpus, and receiving in response a second search results set comprising a second number of article identifiers, each article identifier in the second number of article identifiers corresponding to an article that includes the second segmented result and that is referenced in the search corpus;
comparing the first number of article identifiers for the first result to the second number of article identifiers for the second segmented result to generate a difference indicator;
selecting the first segmented result as an operable segmented result for the string of characters when the difference indicator exceeds a predetermined value; and
providing the operable segmented result for further processing.
Dependent claims: 30
31. A method, comprising:
receiving a domain name that comprises a plurality of characters with no token-delineating breaks;
segmenting the domain name into a first segmented result that comprises a first plurality of tokens and at least one break, wherein the first plurality of tokens includes all of the plurality of characters;
segmenting the domain name into a second segmented result that comprises a second plurality of tokens and at least one break, wherein the second plurality of tokens includes all of the plurality of characters;
determining a first frequency of occurrence for the first segmented result in at least one of an article index, a text index, and a search result set;
determining a second frequency of occurrence for the second segmented result in at least one of the article index, the text index, and the search result set;
determining whether a difference between the first frequency of occurrence for the first segmented result and the second frequency of occurrence for the second segmented result exceeds an identified value;
selecting the first segmented result as an operable segmented result when the difference exceeds the identified value;
selecting an advertisement based at least in part on the operable segmented result, wherein the advertisement includes text associated with the operable segmented result; and
causing the selected advertisement to be displayed in association with a web page that is associated with the domain name.
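The domain-name variant in claim 31 adds an advertisement-selection step. The following Python sketch illustrates the flow under several assumptions that are not in the claim: the domain label is obtained by dropping a leading "www." and the final dot-suffix, index frequencies come from an invented table (`INDEX_FREQUENCY`), and "text associated with the operable segmented result" is approximated by simple word overlap against a hypothetical ad inventory (`ADS`).

```python
from typing import List, Optional, Sequence

# Invented index frequencies and a hypothetical ad inventory.
INDEX_FREQUENCY = {"choose spain": 5000, "chooses pain": 20}
ADS = {
    "ad-1": "Cheap flights to Spain",
    "ad-2": "Fast pain relief",
}


def label(domain: str) -> str:
    """Drop a leading 'www.' and the suffix after the last dot, if any."""
    host = domain.lower()
    if host.startswith("www."):
        host = host[4:]
    return host.rsplit(".", 1)[0]


def pick_segmentation(
    first: Sequence[str], second: Sequence[str], min_difference: int
) -> Optional[List[str]]:
    """Return `first` when its frequency exceeds that of `second` by more
    than `min_difference`; otherwise return None."""
    f1 = INDEX_FREQUENCY.get(" ".join(first), 0)
    f2 = INDEX_FREQUENCY.get(" ".join(second), 0)
    return list(first) if f1 - f2 > min_difference else None


def pick_ad(tokens: Sequence[str]) -> Optional[str]:
    """Return the id of the first ad whose text shares a word with `tokens`."""
    wanted = set(tokens)
    for ad_id, text in ADS.items():
        if wanted & set(text.lower().split()):
            return ad_id
    return None


name = label("www.choosespain.com")
first, second = ["choose", "spain"], ["chooses", "pain"]
# Both candidates must contain every character of the domain label.
assert "".join(first) == name == "".join(second)
operable = pick_segmentation(first, second, min_difference=1000)
print(operable, pick_ad(operable) if operable else None)
```

With the invented counts above, the ad mentioning Spain is chosen; with the frequencies reversed, the same domain would instead match the pain-relief ad, which is the ambiguity the claim addresses.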
Specification