Method, apparatus, and computer storage medium for automatically adding tags to document
Abstract
A method and apparatus for automatically adding a tag to a document are provided. The method comprises: determining a plurality of candidate tag words corresponding to the document; determining a corpus comprising a plurality of texts; selecting commonly-used words from the corpus as characteristic words; determining, for each of the characteristic words and each of the candidate tag words, a probability for co-occurrence of the candidate tag word with the characteristic word; abstracting characteristic words from the document, and calculating a weight for each of the abstracted characteristic words; calculating, in the corpus, a weighted probability for co-occurrence of each of the candidate tag words with all of the characteristic words abstracted from the document; and selecting the candidate tag word with a high weighted co-occurrence probability as a tag word to be added to the document.
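The co-occurrence step in the abstract can be illustrated with a minimal sketch. The text here does not say how the probability is estimated, so the sketch assumes one plausible reading: for a candidate tag word X and a characteristic word Y, the probability is the share of corpus texts containing Y that also contain X. The function name `cooccurrence_probabilities`, the whitespace tokenization, and the toy corpus are all hypothetical.

```python
from collections import defaultdict


def cooccurrence_probabilities(corpus, candidate_tags, characteristic_words):
    """Estimate P(tag | word) for every (tag, word) pair.

    Assumption (not stated in the source): the probability is the fraction
    of texts containing `word` that also contain `tag`.
    """
    # Treat each text as a set of lowercase, whitespace-separated tokens.
    texts = [set(text.lower().split()) for text in corpus]
    probs = defaultdict(dict)
    for word in characteristic_words:
        with_word = [tokens for tokens in texts if word in tokens]
        for tag in candidate_tags:
            if not with_word:
                # The word never occurs, so no co-occurrence is observed.
                probs[tag][word] = 0.0
            else:
                both = sum(1 for tokens in with_word if tag in tokens)
                probs[tag][word] = both / len(with_word)
    return dict(probs)


# Toy corpus, for illustration only.
corpus = [
    "python code tutorial",
    "python snake habitat",
    "java code tutorial",
]
probs = cooccurrence_probabilities(
    corpus,
    candidate_tags=["tutorial", "snake"],
    characteristic_words=["python", "code"],
)
print(probs["tutorial"])  # {'python': 0.5, 'code': 1.0}
print(probs["snake"])     # {'python': 0.5, 'code': 0.0}
```

Because these probabilities depend only on the corpus and the candidate tag words, they could be computed once and reused for every document to be tagged.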
18 Claims
1. A method for automatically adding a tag to a document, comprising:
determining, by an apparatus comprising a processor, a plurality of candidate tag words corresponding to the document;
determining, by the apparatus, a corpus comprising a plurality of texts;
selecting, by the apparatus, commonly-used words from the corpus as characteristic words;
determining, by the apparatus, for each of the characteristic words and each of the candidate tag words, a probability for co-occurrence of the candidate tag word with the characteristic word;
abstracting, by the apparatus, characteristic words from the document;
calculating, by the apparatus, a weight for each of the abstracted characteristic words;
calculating, by the apparatus, in the corpus, a weighted probability for co-occurrence of each of the candidate tag words with all of the characteristic words abstracted from the document; and
selecting, by the apparatus, the candidate tag word with a high weighted co-occurrence probability as a tag word to be added to the document;
wherein the weight for the characteristic word Y abstracted from the document is denoted as Wy, and Wy is equal to a product of the number of times that Y occurs in the document and the number of the texts in the corpus in which Y occurs.
Dependent claims: 2, 3, 4, 5, 6
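The weighting and selection steps of claim 1 can be sketched as follows. The weight follows the "wherein" clause directly: Wy is the number of times Y occurs in the document multiplied by the number of corpus texts in which Y occurs. How the weights and co-occurrence probabilities are combined is not spelled out in the claim, so the weighted sum used in `rank_tags` is an assumption, as are the function names, the `top_n` cut-off standing in for "a high weighted co-occurrence probability", and the hard-coded probabilities.

```python
from collections import Counter


def characteristic_weights(document_tokens, characteristic_words, corpus_token_sets):
    """Wy = (occurrences of Y in the document) * (number of corpus texts containing Y)."""
    counts = Counter(document_tokens)
    weights = {}
    for word in characteristic_words:
        if counts[word] > 0:  # keep only words actually found in the document
            texts_with_word = sum(1 for tokens in corpus_token_sets if word in tokens)
            weights[word] = counts[word] * texts_with_word
    return weights


def rank_tags(weights, cooccurrence, top_n=1):
    """Score each candidate tag and return the top_n tags plus all scores.

    Assumption: the weighted co-occurrence probability of a tag is the sum
    over characteristic words Y of Wy * P(tag | Y).
    """
    scores = {
        tag: sum(w * by_word.get(word, 0.0) for word, w in weights.items())
        for tag, by_word in cooccurrence.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda tag: (-scores[tag], tag))
    return ranked[:top_n], scores


# Toy data, for illustration only.
corpus_token_sets = [
    {"python", "code", "tutorial"},
    {"python", "snake", "habitat"},
    {"java", "code", "tutorial"},
]
# Hypothetical precomputed co-occurrence probabilities P(tag | word).
cooccurrence = {
    "tutorial": {"python": 0.5, "code": 1.0},
    "snake": {"python": 0.5, "code": 0.0},
}
document_tokens = "python code python".split()

weights = characteristic_weights(document_tokens, ["python", "code"], corpus_token_sets)
top, scores = rank_tags(weights, cooccurrence)
print(weights)  # {'python': 4, 'code': 2}
print(scores)   # {'tutorial': 4.0, 'snake': 2.0}
print(top)      # ['tutorial']
```

In this toy run, "python" occurs twice in the document and in two corpus texts, giving a weight of 4, while "code" occurs once in the document and in two corpus texts, giving a weight of 2.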
7. An apparatus for automatically adding a tag to a document, comprising:
a candidate tag word determining module comprising a processor, configured to determine a plurality of candidate tag words corresponding to the document;
a co-occurrence probability determining module comprising a processor, configured to determine a corpus comprising a plurality of texts, select commonly-used words from the corpus as characteristic words, and determine, for each of the characteristic words and each of the candidate tag words, a probability for co-occurrence of the candidate tag word with the characteristic word;
a weight calculating module comprising a processor, configured to abstract characteristic words from the document, and calculate a weight for each of the abstracted characteristic words;
a weighted co-occurrence probability calculating module comprising a processor, configured to calculate, in the corpus, a weighted probability for co-occurrence of each of the candidate tag words with all of the characteristic words abstracted from the document; and
a tag word adding module comprising a processor, configured to select the candidate tag word with a high weighted co-occurrence probability as a tag word to be added to the document;
wherein the weight for the characteristic word Y abstracted from the document is denoted as Wy, and the weight calculating module is configured to calculate Wy as being equal to a product of the number of times that Y occurs in the document and the number of the texts in the corpus in which Y occurs.
Dependent claims: 8, 9, 10, 11, 12, 13, 14, 15, 16
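One task of the co-occurrence probability determining module is selecting commonly-used words from the corpus as characteristic words. The claim does not define "commonly-used", so the sketch below assumes a simple document-frequency threshold: a word qualifies if it appears in at least `min_texts` texts. The function name, the threshold parameter, and the whitespace tokenization are hypothetical.

```python
from collections import Counter


def select_characteristic_words(corpus, min_texts=2):
    """Return words that occur in at least `min_texts` corpus texts.

    Assumption: "commonly-used" is interpreted as a document-frequency
    threshold; the source does not specify the criterion.
    """
    doc_freq = Counter()
    for text in corpus:
        # Count each word at most once per text.
        doc_freq.update(set(text.lower().split()))
    return sorted(word for word, n in doc_freq.items() if n >= min_texts)


# Toy corpus, for illustration only.
corpus = [
    "python code tutorial",
    "python snake habitat",
    "java code tutorial",
]
characteristic = select_characteristic_words(corpus)
print(characteristic)  # ['code', 'python', 'tutorial']
```

A practical system would likely also filter out stop words, since words such as "the" pass any frequency threshold while carrying little topical information; that filtering is omitted here.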
17. A computer storage medium storing computer program codes for implementing a method for automatically adding a tag to a document, executable by a computer, wherein the computer program codes comprise:
instructions for determining a plurality of candidate tag words corresponding to the document;
instructions for determining a corpus comprising a plurality of texts;
instructions for selecting commonly-used words from the corpus as characteristic words;
instructions for determining, for each of the characteristic words and each of the candidate tag words, a probability for co-occurrence of the candidate tag word with the characteristic word;
instructions for abstracting characteristic words from the document;
instructions for calculating a weight for each of the abstracted characteristic words;
instructions for calculating, in the corpus, a weighted probability for co-occurrence of each of the candidate tag words with all of the characteristic words abstracted from the document; and
instructions for selecting the candidate tag word with a high weighted co-occurrence probability as a tag word to be added to the document;
wherein the weight for the characteristic word Y abstracted from the document is denoted as Wy, and a weight calculating module is configured to calculate Wy as being equal to a product of the number of times that Y occurs in the document and the number of the texts in the corpus in which Y occurs.
Dependent claims: 18
Specification