Identification and Extraction of New Terms in Documents
Abstract
A method and apparatus for extracting new terms from documents for inclusion in a vocabulary collection are disclosed. A document may be parsed to obtain an n-gram phrase indicative of a new term, the phrase including a plurality of words. The n-gram phrase may be decomposed into a series of bi-gram phrases, each including a first and a second phrase part, with each phrase part including at least one word. It may then be determined whether the first or second phrase part is in a vocabulary collection. If not, the probability that the bi-gram phrase should be in the vocabulary collection may be estimated. The bi-gram phrase may be added to the vocabulary collection if that probability exceeds a minimum threshold level.
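The steps in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the estimator the specification discloses: the function names (`split_into_bigrams`, `count_phrase`, `estimate_probability`, `extract_term`), the relative-frequency score, and the default threshold of 0.5 are all hypothetical. The sketch also assumes one reading of "if not", namely that the probability is estimated only when neither phrase part is already in the vocabulary collection.

```python
def split_into_bigrams(words):
    """Return every two-part split of an n-gram; each part has at least one word."""
    return [(" ".join(words[:i]), " ".join(words[i:])) for i in range(1, len(words))]


def count_phrase(tokens, phrase):
    """Count occurrences of a space-separated phrase in a list of tokens."""
    target = phrase.split()
    n = len(target)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target)


def estimate_probability(tokens, first, second):
    """Hypothetical estimator: how often `second` directly follows `first`."""
    first_count = count_phrase(tokens, first)
    if first_count == 0:
        return 0.0
    return count_phrase(tokens, f"{first} {second}") / first_count


def extract_term(tokens, candidate_words, vocabulary, min_threshold=0.5):
    """Add the candidate phrase to `vocabulary` if a split scores above the threshold.

    Returns True if the phrase was added, False otherwise.
    """
    for first, second in split_into_bigrams(candidate_words):
        # Assumed reading: estimate only when neither part is already known.
        if first in vocabulary or second in vocabulary:
            continue
        if estimate_probability(tokens, first, second) > min_threshold:
            vocabulary.add(f"{first} {second}")
            return True
    return False


# Toy document and vocabulary, invented for illustration.
tokens = "the neural network beat the old network and the neural network won".split()
vocabulary = {"the", "and"}
added = extract_term(tokens, ["neural", "network"], vocabulary)
```

In this toy document "network" follows "neural" every time "neural" appears, so the score exceeds the threshold and "neural network" is added to the vocabulary set.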
15 Claims
1. A method comprising:
parsing a document to obtain an n-gram phrase indicative of a new term, the phrase comprised of a plurality of words;
breaking the n-gram phrase into a bi-gram phrase comprised of a first and a second phrase part, the first and second phrase part including at least one word;
determining whether the first or second phrase part is in a vocabulary collection;
estimating the probability that the bi-gram phrase should be in the vocabulary collection if it is not; and
adding the bi-gram phrase to the vocabulary collection if the probability that the bi-gram phrase should be in the vocabulary collection exceeds a minimum threshold level.
8. An apparatus comprising:
a processor circuit;
a memory;
a parsing module stored in the memory and executable by the processor circuit, the parsing module to obtain an n-gram phrase indicative of a new term, the phrase comprised of a plurality of words;
a decomposition module stored in the memory and executable by the processor circuit, the decomposition module to break the n-gram phrase into a bi-gram phrase comprised of a first and a second phrase part, the first and second phrase part including at least one word;
a phrase determination module stored in the memory and executable by the processor circuit, the phrase determination module to determine whether the first or second phrase part is in a vocabulary collection; and
a probability module stored in the memory and executable by the processor circuit, the probability module to estimate the probability that the bi-gram phrase should be in the vocabulary collection if it is not; and
add the bi-gram phrase to the vocabulary collection if the probability that the bi-gram phrase should be in the vocabulary collection exceeds a minimum threshold level.
12. An article of manufacture comprising a non-transitory computer-readable storage medium containing instructions that if executed enable a system to:
parse a document to obtain an n-gram phrase indicative of a new term, the phrase comprised of a plurality of words;
break the n-gram phrase into a bi-gram phrase comprised of a first and a second phrase part, the first and second phrase part including at least one word;
determine whether the first or second phrase part is in a vocabulary collection;
estimate the probability that the bi-gram phrase should be in the vocabulary collection if it is not; and
add the bi-gram phrase to the vocabulary collection if the probability that the bi-gram phrase should be in the vocabulary collection exceeds a minimum threshold level.
Specification