Search-based word segmentation method and device for language without word boundary tag
First Claim
1. A search-based word segmentation method for a language without a word boundary tag, comprising the steps of:
a. providing at least one search engine with a segment of a text comprising at least one segment;
b. searching for the segment through the at least one search engine, and returning search results each including candidate word segmentation units; and
c. determining a word segmentation approach for the segment in accordance with at least part of the returned search results by performing steps of:
extracting, from the at least part of the returned search results, all candidate word segmentation units appearing in the segment;
scoring the extracted candidate word segmentation units;
ranking subsets of extracted candidate word segmentation units in accordance with the scoring, wherein the candidate word segmentation units in each subset sequentially form the segment; and
selecting a highest-ranked subset as the word segmentation approach for the segment.
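The sub-steps of step c (extract, score, rank, select) can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim does not specify a scoring function, so the relative-frequency score and the mean-score ranking of each subset are assumptions, as are the function names and the representation of each search result as a list of candidate unit strings.

```python
from collections import Counter
from typing import Dict, List, Sequence


def extract_candidates(segment: str, results: Sequence[Sequence[str]]) -> Counter:
    """Count every candidate unit in the results that also appears in the segment."""
    counts: Counter = Counter()
    for units in results:
        for unit in units:
            if unit and unit in segment:
                counts[unit] += 1
    return counts


def score_candidates(counts: Counter) -> Dict[str, float]:
    """Score each unit by relative frequency (assumed scoring, not from the claim)."""
    total = sum(counts.values())
    return {unit: n / total for unit, n in counts.items()} if total else {}


def enumerate_subsets(segment: str, scores: Dict[str, float]) -> List[List[str]]:
    """List every sequence of candidate units that concatenates to the segment."""
    if not segment:
        return [[]]
    found: List[List[str]] = []
    for unit in scores:
        if segment.startswith(unit):
            for rest in enumerate_subsets(segment[len(unit):], scores):
                found.append([unit] + rest)
    return found


def rank_subsets(segment: str, results: Sequence[Sequence[str]]) -> List[List[str]]:
    """Return all valid subsets, best first, ranked by mean unit score (assumed)."""
    if not segment:
        return []
    scores = score_candidates(extract_candidates(segment, results))
    subsets = enumerate_subsets(segment, scores)
    return sorted(
        subsets,
        key=lambda subset: sum(scores[u] for u in subset) / len(subset),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical search results for an ambiguous Chinese segment.
    hypothetical_results = [
        ["研究", "生命", "起源"],
        ["生命", "起源"],
        ["研究生", "命"],
        ["研究", "生命"],
    ]
    ranked = rank_subsets("研究生命起源", hypothetical_results)
    print(ranked[0] if ranked else "no segmentation found")
```

The exhaustive enumeration is exponential in the worst case, so it suits only short segments. A dynamic-programming search over the same scores would be the natural replacement for longer input.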
Abstract
The present invention discloses a search-based segmentation method and device for a language without a word boundary tag. The inventive method includes the steps of: a. providing at least one search engine with a segment of a text including at least one segment; b. searching for the segment through the at least one search engine, and returning search results; and c. selecting a word segmentation approach for the segment in accordance with at least part of the returned search results. The invention solves the problems of word segmentation for a language without a word boundary tag, and thus overcomes the limitations of the prior art in terms of flexibility, dependence on dictionary coverage, availability of training corpora, processing of new words, etc.
19 Claims
10. A search-based word segmentation device for a language without a word boundary tag, comprising:
at least one search engine, adapted to receive a segment of a text comprising at least one segment, to search in a search network for the segment, and to return search results each including candidate word segmentation units; and
a word segmentation result generating means, adapted to select a word segmentation approach for the segment in accordance with at least part of the returned search results by:
extracting, from the at least part of the returned search results, all candidate word segmentation units appearing in the segment;
scoring the extracted candidate word segmentation units;
ranking subsets of extracted candidate word segmentation units in accordance with the scoring, wherein the candidate word segmentation units in each subset sequentially form the segment; and
selecting a highest-ranked subset as the word segmentation approach for the segment.
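The first component of the device, "at least one search engine" whose results are used "at least in part", can be sketched as a small interface plus a pooling helper. This is only an illustration under stated assumptions. The `SearchEngine` protocol, the `StaticEngine` stand-in, the `collect_results` helper, and the `top_k` cutoff are all hypothetical names, and each search result is assumed to arrive as a list of candidate unit strings.

```python
from typing import Dict, List, Protocol, Sequence


class SearchEngine(Protocol):
    """Hypothetical interface: return search results for a segment,
    each result being a list of candidate word segmentation units."""

    def search(self, segment: str) -> List[List[str]]:
        ...


class StaticEngine:
    """Stand-in engine that serves canned results (for illustration only)."""

    def __init__(self, canned: Dict[str, List[List[str]]]) -> None:
        self._canned = canned

    def search(self, segment: str) -> List[List[str]]:
        return self._canned.get(segment, [])


def collect_results(
    engines: Sequence[SearchEngine], segment: str, top_k: int = 10
) -> List[List[str]]:
    """Query every engine and pool only the first top_k results from each,
    modelling the use of "at least part of the returned search results"."""
    pooled: List[List[str]] = []
    for engine in engines:
        pooled.extend(engine.search(segment)[:top_k])
    return pooled


if __name__ == "__main__":
    engine_a = StaticEngine({"abc": [["a", "bc"], ["ab", "c"], ["abc"]]})
    engine_b = StaticEngine({"abc": [["a", "b", "c"]]})
    print(collect_results([engine_a, engine_b], "abc", top_k=2))
```

The pooled list would then be handed to the result generating means, which extracts, scores, and ranks the candidate units.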
19. A computer program product stored on a non-transitory computer readable storage medium and executed by a computer to perform a search-based word segmentation method for a language without a word boundary tag, wherein said method comprises the steps of:
a. providing at least one search engine with a segment of a text comprising at least one segment;
b. searching for the segment through the at least one search engine, and returning search results each including candidate word segmentation units; and
c. determining a word segmentation approach for the segment in accordance with at least part of the returned search results by:
extracting, from the at least part of the returned search results, all candidate word segmentation units appearing in the segment;
scoring the extracted candidate word segmentation units;
ranking subsets of extracted candidate word segmentation units in accordance with the scoring, wherein the candidate word segmentation units in each subset sequentially form the segment; and
selecting a highest-ranked subset as the word segmentation approach for the segment.
Specification