Character string dividing or separating method and related system for segmenting agglutinative text or document into words
Abstract
A joint probability of two neighboring characters appearing in a given Japanese document database is statistically calculated. The calculated joint probabilities are stored in a table. An objective Japanese sentence is segmented into a plurality of words with reference to the calculated joint probabilities so that each division point of the objective Japanese sentence is present between two neighboring characters having a smaller joint probability.
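The procedure in the abstract can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the patented implementation: the joint probability of a pair is assumed to be estimated as count(ab) / count(a), "smaller" is assumed to mean "below a fixed threshold", and the names `joint_probabilities`, `segment`, `threshold`, and `unseen` are hypothetical. A toy Latin-letter corpus stands in for a Japanese document database.

```python
from collections import Counter


def joint_probabilities(corpus):
    """Build a table mapping each adjacent pair (a, b) to count(ab) / count(a).

    Assumption: this relative-frequency estimate stands in for the
    "joint probability of two neighboring characters" named in the text.
    """
    pair_counts = Counter()
    first_counts = Counter()
    for sentence in corpus:
        for a, b in zip(sentence, sentence[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


def segment(text, table, threshold=0.5, unseen=0.0):
    """Place a division point wherever the pair's probability is below threshold.

    Assumption: pairs missing from the table get the fixed value `unseen`.
    """
    if not text:
        return []
    words = [text[0]]
    for a, b in zip(text, text[1:]):
        if table.get((a, b), unseen) < threshold:
            words.append(b)   # low joint probability: start a new word
        else:
            words[-1] += b    # high joint probability: extend the current word
    return words


corpus = ["abab", "cdcd", "abcd", "cdab"]
table = joint_probabilities(corpus)
print(segment("abcdab", table, threshold=0.6))  # ['ab', 'cd', 'ab']
```

In this toy corpus the pairs `ab` and `cd` always occur together (probability 1.0), while `bc` and `da` occur half the time, so a threshold of 0.6 places division points only at the weaker joints.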
20 Claims
1. A character string dividing system for segmenting a character string into a plurality of words, comprising:
input means for receiving a document;
document data storing means serving as a document database for storing a received document;
character joint probability calculating means for calculating a joint probability of two neighboring characters appearing in said document database;
probability table storing means for storing a table of calculated joint probabilities;
character string dividing means for segmenting an objective character string into a plurality of words with reference to said table of calculated joint probabilities; and
output means for outputting a division result of said objective character string.
2. A character string dividing method for segmenting a character string into a plurality of words, said method comprising the steps of:
statistically calculating a joint probability of two neighboring characters appearing in a given document database; and
segmenting an objective character string into a plurality of words with reference to calculated joint probabilities so that each division point of said objective character string is present between two neighboring characters having a smaller joint probability.
Dependent claims: 7, 8, 9
3. A character string dividing method for segmenting a character string into a plurality of words, said method comprising the steps of:
statistically calculating a joint probability of two neighboring characters appearing in a given document database, said joint probability being calculated as an appearance probability of a specific character string appearing immediately before a specific character, said specific character string including a former one of said two neighboring characters as a tail thereof and said specific character being a latter one of said two neighboring characters; and
segmenting an objective character string into a plurality of words with reference to calculated joint probabilities so that each division point of said objective character string is present between two neighboring characters having a smaller joint probability.
4. A character string dividing method for segmenting a character string into a plurality of words, said method comprising the steps of:
statistically calculating a joint probability of two neighboring characters appearing in a given document database, said joint probability being calculated as an appearance probability of a first character string appearing immediately before a second character string, said first character string including a former one of said two neighboring characters as a tail thereof and said second character string including a latter one of said two neighboring characters as a head thereof; and
segmenting an objective character string into a plurality of words with reference to calculated joint probabilities so that each division point of said objective character string is present between two neighboring characters having a smaller joint probability.
Dependent claims: 5
6. A character string dividing method for segmenting a character string into a plurality of words, said method comprising the steps of:
statistically calculating a joint probability of two neighboring characters appearing in a given document database prepared for learning purpose; and
segmenting an objective character string into a plurality of words with reference to calculated joint probabilities so that each division point of said objective character string is present between two neighboring characters having a smaller joint probability, wherein, when said objective character string involves a sequence of characters not involved in said document database, a joint probability of any two neighboring characters not appearing in said database is estimated based on said calculated joint probabilities for the neighboring characters stored in said document database.
10. A character string dividing system for segmenting a character string into a plurality of words, comprising:
input means for receiving a document;
document data storing means serving as a document database for storing a received document;
character joint probability calculating means for calculating a joint probability of two neighboring characters appearing in said document database;
probability table storing means for storing a table of calculated joint probabilities;
word dictionary storing means for storing a word dictionary prepared or produced beforehand;
division pattern producing means for producing a plurality of candidates for a division pattern of an objective character string with reference to information of said word dictionary;
correct pattern selecting means for selecting a correct division pattern from said plurality of candidates with reference to said table of character joint probabilities; and
output means for outputting said selected correct division pattern as a division result of said objective character string.
11. A character string dividing method for segmenting a character string into a plurality of words, said method comprising the steps of:
statistically calculating a joint probability of two neighboring characters appearing in a given document database;
storing calculated joint probabilities; and
segmenting an objective character string into a plurality of words with reference to a word dictionary, wherein, when there are a plurality of candidates for a division pattern of said objective character string, a correct division pattern is selected from said plurality of candidates with reference to calculated joint probabilities so that each division point of said objective character string is present between two neighboring characters having a smaller joint probability.
Dependent claims: 12, 13, 14, 15
16. A character string dividing system for segmenting a character string into a plurality of words, comprising:
input means for receiving a document;
document data storing means serving as a document database for storing a received document;
character joint probability calculating means for calculating a joint probability of two neighboring characters appearing in said document database;
probability table storing means for storing a table of calculated joint probabilities;
word dictionary storing means for storing a word dictionary prepared or produced beforehand;
unknown word estimating means for estimating unknown words not registered in said word dictionary;
division pattern producing means for producing a plurality of candidates for a division pattern of an objective character string with reference to information of said word dictionary and said estimated unknown words;
correct pattern selecting means for selecting a correct division pattern from said plurality of candidates with reference to said table of character joint probabilities; and
output means for outputting said selected correct division pattern as a division result of said objective character string.
17. A character string dividing method for segmenting a character string into a plurality of words, said method comprising the steps of:
statistically calculating a joint probability of two neighboring characters appearing in a given document database;
storing calculated joint probabilities; and
segmenting an objective character string into a plurality of words with reference to dictionary words and estimated unknown words, wherein, when there are a plurality of candidates for a division pattern of said objective character string, a correct division pattern is selected from said plurality of candidates with reference to calculated joint probabilities so that each division point of said objective character string is present between two neighboring characters having a smaller joint probability.
Dependent claims: 18, 19, 20
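The dictionary-based variant recited in claims 10 and 11 (produce candidate division patterns from a word dictionary, then pick one using the joint probability table) might be sketched as follows. This is an illustration only, not the claimed implementation. The scoring rule is an assumption: each candidate is scored by the sum of the joint probabilities at its division points, and the lowest score wins, which also favors candidates with fewer division points. The names `candidates`, `boundary_score`, `select_division`, and `unseen` are hypothetical, the probability table is hand-written for the example, and unknown-word estimation (claims 16 and 17) is not covered.

```python
def candidates(text, dictionary):
    """Enumerate every way to split text into words found in the dictionary."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            for rest in candidates(text[end:], dictionary):
                results.append([head] + rest)
    return results


def boundary_score(words, table, unseen=0.0):
    """Sum the joint probabilities of the character pairs at each division point.

    Assumption: pairs missing from the table get the fixed value `unseen`.
    """
    return sum(
        table.get((left[-1], right[0]), unseen)
        for left, right in zip(words, words[1:])
    )


def select_division(text, dictionary, table):
    """Pick the candidate whose division points have the lowest total probability."""
    options = candidates(text, dictionary)
    if not options:
        return [text]  # no dictionary split exists: return the string undivided
    return min(options, key=lambda words: boundary_score(words, table))


dictionary = {"ab", "abc", "cd", "d"}
table = {("b", "c"): 0.1, ("c", "d"): 0.8}  # hand-written toy values
print(select_division("abcd", dictionary, table))  # ['ab', 'cd']
```

Here the dictionary allows two division patterns, `ab|cd` and `abc|d`. The first divides at the pair `bc` (0.1) and the second at `cd` (0.8), so the first pattern is selected.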
Specification