Diverse topic phrase extraction

US 8,280,877 B2
Filed: 09/21/2007
Issued: 10/02/2012
Est. Priority Date: 02/22/2007
Status: Active Grant

Abstract

Systems and methods for implementing diverse topic phrase extraction are disclosed. According to one implementation, multi-word candidate phrases are extracted from a corpus and weighted. One or more documents are re-weighted to identify less obvious candidate topics using latent semantic analysis (LSA). Phrase diversification is then used to remove redundancy and select informative and distinct topic phrases.
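
As a minimal sketch of the first step described in the abstract, the snippet below extracts multi-word candidate phrases from a small corpus and assigns each a weight. The abstract does not say how candidates are formed or weighted, so contiguous word n-grams of two to three words and document frequency are assumptions made here for illustration; `candidate_phrases` and its parameters are hypothetical names.

```python
import re
from collections import Counter

def candidate_phrases(docs, min_n=2, max_n=3):
    """Return a Counter mapping each word n-gram to the number of
    documents that contain it.

    Assumption: candidates are contiguous word n-grams and the weight
    is document frequency; the abstract specifies neither choice.
    """
    weights = Counter()
    for text in docs:
        words = re.findall(r"[a-z0-9]+", text.lower())
        seen = set()
        for n in range(min_n, max_n + 1):
            for i in range(len(words) - n + 1):
                seen.add(" ".join(words[i:i + n]))
        weights.update(seen)  # each document counts once per phrase
    return weights

docs = [
    "latent semantic analysis of search results",
    "search results summarized by latent semantic analysis",
]
weights = candidate_phrases(docs)
```

Phrases that occur in both sample documents, such as "latent semantic analysis", receive a higher weight than phrases found in only one.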

18 Claims

1. A method, implemented by one or more computing devices, of summarizing search results from a corpus, the method comprising:
- re-weighting, by the one or more computing devices, documents based at least in part on a latent topic to identify topic phrases associated with candidate phrases occurring within one or more of the documents, the re-weighting comprising strengthening a subset of the documents that are indicated by the latent topic by adjusting document weights to increase a term frequency of one or more terms within the subset of the documents associated with the latent topic;
  
  evaluating, by the one or more computing devices, based in part on the increase of the term frequency of the one or more terms, a modified latent semantic analysis (LSA)-weighted frequency of one or more documents of the subset of the documents to identify the topic phrases; and
  
  filtering, by the one or more computing devices, the documents having similar topic phrases to remove redundancy of the similar topic phrases.
  - 2. The method of claim 1, wherein the re-weighting further comprises:
    - selecting one or more additional latent topics from the corpus, each latent topic being characterized by a singular value, wherein each singular value indicates a relevance of topic phrases associated with the documents;
      
      arranging the one or more additional latent topics based on the singular values;
      
      extracting at least one topic phrase from the one or more additional latent topics.
  - 3. The method of claim 1, wherein concepts associated with the documents are based at least on an association between the documents and constituent phrases of the documents.
  - 4. The method of claim 3, wherein the association between the documents and the constituent phrases is implemented using latent semantic analysis.
  - 5. The method of claim 2, further comprising associating a weighted frequency value based on the singular value of an associated latent topic.
  - 6. The method of claim 1, wherein the filtering comprises:
    - evaluating a dissimilarity score between a pair of the topic phrases in the documents, wherein the filtering is based at least on a basis of the dissimilarity score.
  - 7. The method of claim 6, wherein the dissimilarity score indicates an extent of dissimilarity between the dissimilar topic phrases.
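
The re-weighting and evaluating steps of claim 1 can be illustrated with a small sketch. The claims do not give a formula for the "modified LSA-weighted frequency", so everything numeric here is an assumption: the latent topic is taken to be the leading singular direction of a phrase-by-document count matrix, power iteration stands in for a full singular value decomposition, and the hypothetical `boost` parameter controls how strongly documents indicated by the topic are strengthened.

```python
import math

def top_latent_topic(A, iters=200):
    """Power iteration on A^T A.

    A is a phrase-by-document count matrix (list of rows).
    Returns (singular value, per-document loadings) for the leading topic.
    """
    n_docs = len(A[0])
    v = [1.0 / math.sqrt(n_docs)] * n_docs
    for _ in range(iters):
        u = [sum(a * x for a, x in zip(row, v)) for row in A]       # u = A v
        v = [sum(A[i][j] * u[i] for i in range(len(A)))              # v = A^T u
             for j in range(n_docs)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(a * x for a, x in zip(row, v)) for row in A]
    sigma = math.sqrt(sum(x * x for x in u))                         # ||A v||
    return sigma, v

def reweighted_frequency(A, loadings, boost=1.0):
    """Assumed re-weighting: scale each document's counts by
    1 + boost * normalized |loading|, then sum over documents."""
    peak = max(abs(x) for x in loadings)
    doc_w = [1.0 + boost * abs(x) / peak for x in loadings]
    return [sum(c * w for c, w in zip(row, doc_w)) for row in A]

# Rows are candidate phrases, columns are documents (illustrative counts).
A = [
    [2, 1, 0],
    [1, 2, 0],
    [0, 0, 1],
]
sigma, loadings = top_latent_topic(A)
scores = reweighted_frequency(A, loadings)
```

In this toy matrix the first two documents dominate the leading topic, so the phrases they contain gain weighted frequency while the phrase that appears only in the third document stays close to its raw count. Repeating the procedure for further singular directions, ordered by singular value, would correspond loosely to the additional latent topics of claim 2.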

8. A system comprising:
- one or more processors;
  
  a memory;
  
  a re-weighting module configured to perform acts, the acts comprising;
  
  identifying topic phrases associated with documents;
  
  determining a latent topic by associating the documents with weight factors based on an association of terms within constituent phrases that appear in the documents, at least one concept being characterized by the latent topic; and
  
  re-weighting the documents, the re-weighting comprising strengthening a subset of the documents that are indicated by the latent topic by adjusting document weights to increase a term frequency of one or more terms within the subset of the documents associated with the latent topic;
  
  a diversification module configured to perform acts comprising;
  
  evaluating, based in part on the increase of the term frequency of the one or more terms, a modified latent semantic analysis (LSA)-weighted frequency of one or more documents of the subset of the documents;
  
  removing documents associated with one or more similar concepts to result in a list of documents with unlike concepts.
  - 9. The system of claim 8, wherein the re-weighting module performs acts further comprising:
    - arranging the latent topic among other determined latent topics based on a singular value of the latent topic; and
      
      extracting at least one topic phrase from the latent topic based in part on the evaluating the modified LSA-weighted frequency.
  - 10. The system of claim 8, wherein the at least one concept is based at least on an association between at least one of the constituent phrases and one or more of the documents.
  - 11. The system of claim 10, wherein the association between at least one of the constituent phrases and one or more of the documents is implemented using latent semantic analysis.
  - 12. The system of claim 8, wherein the diversification module:
    - calculates a dissimilarity score between at least a pair of the topic phrases in the documents; and
      
      retains one or more topic phrases based on the dissimilarity score.
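
One way to picture the diversification module of claim 12 is a greedy pass that retains a phrase only if it is sufficiently dissimilar from every phrase already retained. This is a sketch under stated assumptions, not the patented procedure: the claims do not define the dissimilarity score, so a Jaccard distance over word sets is used here, and the `threshold` value and function names are hypothetical.

```python
def jaccard_dissimilarity(p, q):
    """1 minus the Jaccard similarity of the two phrases' word sets (assumed score)."""
    a, b = set(p.split()), set(q.split())
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def diversify(scored_phrases, threshold=0.5):
    """Visit phrases by descending score; retain a phrase only when its
    dissimilarity to every retained phrase is at least `threshold`."""
    kept = []
    for phrase, _score in sorted(scored_phrases, key=lambda item: -item[1]):
        if all(jaccard_dissimilarity(phrase, k) >= threshold for k in kept):
            kept.append(phrase)
    return kept

ranked = [
    ("latent topic", 6.0),
    ("latent topic phrase", 5.0),
    ("search results", 4.0),
    ("topic phrase", 3.0),
]
kept = diversify(ranked)
```

With these sample scores, "latent topic phrase" is dropped because it shares two of its three words with the higher-ranked "latent topic", while the other phrases are retained.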

13. One or more computer-readable storage devices comprising computer executable instructions that, when executed, direct a computing system to perform acts, the acts comprising:
- re-weighting documents in a corpus based at least in part on a latent topic to identify one or more concepts associated with the documents, the re-weighting comprising strengthening a subset of the documents that are indicated by the latent topic by adjusting document weights to increase a term frequency of each of one or more terms within the subset of the documents associated with the latent topic;
  
  evaluating, based in part on the increase of the term frequency of each of the one or more terms, a modified latent semantic analysis (LSA)-weighted frequency of one or more documents of the subset of the documents to identify one or more topic phrases associated with the latent topic; and
  
  filtering the documents associated with similar concepts to result in a document collection with unlike concepts.
  - 14. The one or more computer-readable storage devices of claim 13, wherein the re-weighting further comprises:
    - selecting one or more additional latent topics from the corpus, each latent topic being characterized by a singular value, wherein each singular value indicates a relevance of concepts associated with the documents;
      
      arranging the latent topics based on the singular values; and
      
      extracting at least one phrase from one or more of the latent topics.
  - 15. The one or more computer-readable storage devices of claim 14, wherein the concepts associated with the documents are based at least on an association implemented using latent semantic analysis.
  - 16. The one or more computer-readable storage devices of claim 13, wherein the filtering comprises:
    - evaluating a dissimilarity score between at least a pair of phrases in the documents;
      
      sifting phrases based at least on a basis of the dissimilarity score.
  - 17. The one or more computer-readable storage devices of claim 16, wherein the dissimilarity score indicates an extent of dissimilarity between dissimilar phrases.
  - 18. The one or more computer-readable storage devices of claim 16, wherein the dissimilarity score is dependent on order of appearance of words in the phrases.
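
Claim 18 calls for a dissimilarity score that depends on the order in which words appear. The claims do not name a specific measure; as an illustrative assumption, the sketch below uses a word-level edit distance normalized by the longer phrase's length, which is one common order-sensitive choice. The function names are hypothetical.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete a word from a
                cur[j - 1] + 1,            # insert a word from b
                prev[j - 1] + (wa != wb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

def ordered_dissimilarity(p, q):
    """Edit distance over words, scaled to [0, 1] by the longer phrase."""
    a, b = p.split(), q.split()
    longest = max(len(a), len(b))
    return word_edit_distance(a, b) / longest if longest else 0.0

same = ordered_dissimilarity("topic phrase", "topic phrase")
swapped = ordered_dissimilarity("topic phrase", "phrase topic")
partial = ordered_dissimilarity("diverse topic phrase", "topic phrase")
```

Unlike a bag-of-words score, this measure treats "topic phrase" and "phrase topic" as maximally dissimilar even though they contain the same words.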

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Zhang, Benyu; Chen, Jilin; Chen, Zheng; Zeng, HuaJun; Wang, Jian
Primary Examiner(s): Cao, Phuong Thao
Application Number: US11/859,461
Publication Number: US 20080208840A1
Time in Patent Office: 1,838 Days
Field of Search: 707/2, 707/999.002, 707/723, 707/739, 707/748
US Class Current: 707/723
CPC Class Codes:
G06F 40/258 Heading extraction; Automat...
G06F 40/295 Named entity recognition
