Method, apparatus, and computer storage medium for automatically adding tags to document

US 9,146,915 B2
Filed: 12/17/2012
Issued: 09/29/2015
Est. Priority Date: 01/05/2012
Status: Active Grant

Abstract

A method and apparatus for automatically adding a tag to a document are provided. The method comprises: determining a plurality of candidate tag words corresponding to the document; determining a corpus comprising a plurality of texts; selecting commonly-used words from the corpus as characteristic words; determining, for each of the characteristic words and each of the candidate tag words, a probability for co-occurrence of the candidate tag word with the characteristic word; abstracting characteristic words from the document, and calculating a weight for each of the abstracted characteristic words; calculating, in the corpus, a weighted probability for co-occurrence of each of the candidate tag words with all of the characteristic words abstracted from the document; and selecting the candidate tag word with a high weighted co-occurrence probability as a tag word to be added to the document.
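
The weighting and selection steps of the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the representation of each corpus text as a set of words, and the precomputed `cooccurrence_prob` table are all hypothetical. The weight follows the definition in claim 1 (occurrences of Y in the document times the number of corpus texts containing Y). The way weights and probabilities are combined (a weighted sum) is an assumption, because the formula is not reproduced in this listing.

```python
from collections import Counter


def characteristic_weights(doc_tokens, corpus_texts, characteristic_words):
    """W_Y = (times Y occurs in the document) * (corpus texts containing Y)."""
    doc_counts = Counter(doc_tokens)
    weights = {}
    for y in characteristic_words:
        if doc_counts[y] == 0:
            continue  # Y does not occur in this document, so it is not abstracted
        texts_with_y = sum(1 for text in corpus_texts if y in text)
        weights[y] = doc_counts[y] * texts_with_y
    return weights


def rank_candidates(candidates, weights, cooccurrence_prob):
    """Rank candidates by sum over Y of W_Y * P(X|Y) (assumed combination)."""
    scores = {
        x: sum(w * cooccurrence_prob.get((x, y), 0.0) for y, w in weights.items())
        for x in candidates
    }
    # Highest weighted co-occurrence probability first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: each corpus text is a set of words.
corpus = [{"phone", "battery"}, {"phone", "camera"}, {"battery", "charger"}]
doc = ["phone", "battery", "phone"]
weights = characteristic_weights(doc, corpus, ["phone", "battery", "camera"])
probs = {("mobile", "phone"): 0.5, ("mobile", "battery"): 0.25,
         ("power", "battery"): 0.5}
ranked = rank_candidates(["power", "mobile"], weights, probs)
print(weights, ranked)
```

The candidate at the head of `ranked` would be the one selected as the tag; how many candidates count as having a "high" weighted probability is left open by the abstract.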

18 Claims

1. A method for automatically adding a tag to a document, comprising:
- determining, by an apparatus comprising a processor, a plurality of candidate tag words corresponding to the document;
  
  determining, by the apparatus, a corpus comprising a plurality of texts;
  
  selecting, by the apparatus, commonly-used words from the corpus as characteristic words;
  
  determining, by the apparatus, for each of the characteristic words and each of the candidate tag words, a probability for co-occurrence of the candidate tag word with the characteristic word;
  
  abstracting, by the apparatus, characteristic words from the document;
  
  calculating, by the apparatus, a weight for each of the abstracted characteristic words;
  
  calculating, by the apparatus, in the corpus, a weighted probability for co-occurrence of each of the candidate tag words with all of the characteristic words abstracted from the document; and
  
  selecting, by the apparatus, the candidate tag word with a high weighted co-occurrence probability as a tag word to be added to the document;
  
  wherein the weight for the characteristic word Y abstracted from the document is denoted as W_Y, and W_Y is equal to a product of the number of times that Y occurs in the document and the number of the texts in the corpus in which Y occurs.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The method according to claim 1, wherein the co-occurrence probability is denoted as P(X|Y), wherein X denotes one of the candidate tag words and Y denotes one of the characteristic words which occurs in the corpus;
    - and P(X|Y) is determined as a result of dividing the number of times for the co-occurrence of X and Y in a same text comprised in the corpus by the number of times for the occurrence of Y in the corpus.
  - 3. The method according to claim 1, wherein the co-occurrence probability is denoted as P(X|Y), wherein X denotes one of the candidate tag words and Y denotes one of the characteristic words which occurs in the corpus;
    - and P(X|Y) is determined as
  - 4. The method according to claim 1, wherein the co-occurrence probability is denoted as P(X|Y), wherein X denotes one of the candidate tag words and Y denotes one of the characteristic words which occurs in the corpus;
    - and P(X|Y) is determined by using a lexical database.
  - 5. The method according to claim 1, wherein the weighted co-occurrence probability is denoted as
  - 6. The method according to claim 1, wherein calculating, in the corpus, the weighted probability for co-occurrence of each of the candidate tag words with all of the characteristic words abstracted from the document comprises:
    - calculating, in the corpus, the weighted probability for each of the candidate tag words that co-occur with more than one characteristic word abstracted from the document.
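
The counting rule of claim 2 and the restriction of claim 6 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the function names are hypothetical, each corpus text is modelled as a set of words, and "the number of times for the occurrence of Y in the corpus" is read as the number of texts that contain Y (the claim could also be read as a raw occurrence count).

```python
def cooccurrence_probability(x, y, corpus_texts):
    """P(X|Y) = (texts containing both X and Y) / (texts containing Y).

    Assumption: occurrences are counted once per text.
    """
    texts_with_y = [text for text in corpus_texts if y in text]
    if not texts_with_y:
        return 0.0  # Y never occurs, so no co-occurrence can be observed
    both = sum(1 for text in texts_with_y if x in text)
    return both / len(texts_with_y)


def candidates_with_multiple_support(candidates, doc_characteristic_words, corpus_texts):
    """Claim 6 style filter: keep candidates that co-occur with more than
    one characteristic word abstracted from the document."""
    kept = []
    for x in candidates:
        support = sum(
            1
            for y in doc_characteristic_words
            if cooccurrence_probability(x, y, corpus_texts) > 0
        )
        if support > 1:
            kept.append(x)
    return kept


# Hypothetical toy corpus: each text is a set of words.
corpus = [
    {"phone", "mobile"},
    {"phone", "camera"},
    {"battery", "mobile"},
    {"battery", "power"},
]
print(cooccurrence_probability("mobile", "phone", corpus))
print(candidates_with_multiple_support(["mobile", "power"], ["phone", "battery"], corpus))
```

Filtering before scoring, as in claim 6, limits the weighted-probability computation to candidates supported by at least two characteristic words of the document.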

7. An apparatus for automatically adding a tag to a document, comprising:
- a candidate tag word determining module comprising a processor, configured to determine a plurality of candidate tag words corresponding to the document;
  
  a co-occurrence probability determining module comprising a processor, configured to determine a corpus comprising a plurality of texts, select commonly-used words from the corpus as characteristic words, and determine, for each of the characteristic words and each of the candidate tag words, a probability for co-occurrence of the candidate tag word with the characteristic word;
  
  a weight calculating module comprising a processor, configured to abstract characteristic words from the document, and calculate a weight for each of the abstracted characteristic words;
  
  a weighted co-occurrence probability calculating module comprising a processor, configured to calculate, in the corpus, a weighted probability for co-occurrence of each of the candidate tag words with all of the characteristic words abstracted from the document; and
  
  a tag word adding module comprising a processor, configured to select the candidate tag word with a high weighted co-occurrence probability as a tag word to be added to the document;
  
  wherein the weight for the characteristic word Y abstracted from the document is denoted as W_Y, and the weight calculating module is configured to calculate W_Y as being equal to a product of the number of times that Y occurs in the document and the number of the texts in the corpus in which Y occurs.
- Dependent claims (8, 9, 10, 11, 12, 13, 14, 15, 16):
  - 8. The apparatus according to claim 7, wherein the co-occurrence probability is denoted as P(X|Y), wherein X denotes one of the candidate tag words and Y denotes one of the characteristic words which occurs in the corpus;
    - and the co-occurrence probability determining module is configured to calculate P(X|Y) as a result of dividing the number of times for the co-occurrence of X and Y in a same text comprised in the corpus by the number of times for the occurrence of Y in the corpus.
  - 9. The apparatus according to claim 8, wherein the weighted co-occurrence probability is denoted as
  - 10. The apparatus according to claim 8, wherein the weighted co-occurrence probability calculating module is configured to calculate, in the corpus, the weighted probability for each of the candidate tag words that co-occur with more than one characteristic word abstracted from the document.
  - 11. The apparatus according to claim 7, wherein the co-occurrence probability is denoted as P(X|Y), wherein X denotes one of the candidate tag words and Y denotes one of the characteristic words which occurs in the corpus;
    - and the co-occurrence probability determining module is configured to calculate P(X|Y) as
  - 12. The apparatus according to claim 11, wherein the weighted co-occurrence probability is denoted as
  - 13. The apparatus according to claim 11, wherein the weighted co-occurrence probability calculating module is configured to calculate, in the corpus, the weighted probability for each of the candidate tag words that co-occur with more than one characteristic word abstracted from the document.
  - 14. The apparatus according to claim 7, wherein the co-occurrence probability is denoted as P(X|Y), wherein X denotes one of the candidate tag words and Y denotes one of the characteristic words which occurs in the corpus;
    - and the co-occurrence probability determining module is configured to calculate P(X|Y) by using a lexical database.
  - 15. The apparatus according to claim 7, wherein the weighted co-occurrence probability is denoted as
  - 16. The apparatus according to claim 7, wherein the weighted co-occurrence probability calculating module is configured to calculate, in the corpus, the weighted probability for each of the candidate tag words that co-occur with more than one characteristic word abstracted from the document.

17. A computer storage medium storing computer program codes for implementing a method for automatically adding a tag to a document, executable by a computer, wherein the computer program codes comprise:
- instructions for determining a plurality of candidate tag words corresponding to the document;
  
  instructions for determining a corpus comprising a plurality of texts;
  
  instructions for selecting commonly-used words from the corpus as characteristic words;
  
  instructions for determining, for each of the characteristic words and each of the candidate tag words, a probability for co-occurrence of the candidate tag word with the characteristic word;
  
  instructions for abstracting characteristic words from the document;
  
  instructions for calculating a weight for each of the abstracted characteristic words;
  
  instructions for calculating, in the corpus, a weighted probability for co-occurrence of each of the candidate tag words with all of the characteristic words abstracted from the document; and
  
  instructions for selecting the candidate tag word with a high weighted co-occurrence probability as a tag word to be added to the document;
  
  wherein the weight for the characteristic word Y abstracted from the document is denoted as W_Y, and a weight calculating module is configured to calculate W_Y as being equal to a product of the number of times that Y occurs in the document and the number of the texts in the corpus in which Y occurs.
- Dependent claim (18):
  - 18. The computer storage medium according to claim 17, wherein the co-occurrence probability is denoted as P(X|Y), wherein X denotes one of the candidate tag words and Y denotes one of the characteristic words which occurs in the corpus;
    - and P(X|Y) is determined as a result of dividing the number of times for the co-occurrence of X and Y in a same text comprised in the corpus by the number of times for the occurrence of Y in the corpus.
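
The weight recited in independent claims 1, 7, and 17 can be written compactly. The symbols n_doc and n_corpus are notation introduced here for illustration and do not appear in the patent text; the counts in the example are hypothetical.

```latex
% n_doc(Y): number of times Y occurs in the document
% n_corpus(Y): number of texts in the corpus in which Y occurs
W_Y = n_{\mathrm{doc}}(Y) \times n_{\mathrm{corpus}}(Y)

% Hypothetical example: Y occurs 3 times in the document
% and appears in 40 texts of the corpus.
W_Y = 3 \times 40 = 120
```

As claimed, the weight grows with the number of corpus texts containing Y, so widely used characteristic words receive larger weights.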

Current Assignee: Tencent Technology Shenzhen Company Limited (Tencent Holdings Limited)
Original Assignee: Tencent Technology Shenzhen Company Limited (Tencent Holdings Limited)
Inventors: He, Xiang; Wang, Ye; Jiao, Feng
Primary Examiner(s): Paula, Cesar
Assistant Examiner(s): Blackwell, James H
Application Number: US14/370,418
Publication Number: US 20150019951A1
Time in Patent Office: 1,016 Days
US Class Current: 1/1
CPC Class Codes:
G06F 40/117   Tagging; Marking up details...
G06F 40/169   Annotation, e.g. comment da...
G06F 40/242   Dictionaries
G06F 40/284   Lexical analysis, e.g. toke...
G06F 40/30   Semantic analysis
