Mining multilingual topics

US 8,825,648 B2
Filed: 04/15/2010
Issued: 09/02/2014
Est. Priority Date: 04/15/2010
Status: Active Grant

Abstract

Techniques for utilizing data mining technology to extract universal topics with multilingual representations from a multilingual database, and to organize existing or new documents in different languages by analyzing their respective topic distributions.

18 Claims

1. A method comprising:
- under a control of one or more processors, identifying multiple concept-units from a multi-language document corpus, a respective concept-unit including a set of documents in a plurality of languages describing a particular concept, the identifying including identifying one or more hyperlinks or references within a respective document that identify one or more other documents in one or more other languages relating to the particular concept; and
  
  modeling the concept-units of the multi-language document corpus by maintaining a separation of term-by-document matrices for the plurality of languages to create a generative model, the generative model representing:
  
  a plurality of universal topics, at least one respective universal topic being defined by a plurality of topic word distributions in the plurality of languages, at least one of the plurality of topic word distributions for a respective universal topic corresponding to a respective language from the plurality of languages and including one or more words in the respective language with corresponding probability values characterizing the respective universal topic; and
  
  a topic distribution for at least one concept-unit, the topic distribution for a respective concept-unit including one or more universal topics and their distributions for the respective concept-unit, the set of documents in the different plurality of languages of the respective concept-unit being constrained to share a common topic distribution.
- - 2. A method as recited in claim 1, further comprising inferring the plurality of universal topics from the documents of the concept-units based on the generative model, wherein the inferring comprises performing a latent Dirichlet allocation analysis.
  - 3. A method as recited in claim 1, wherein the generative model further represents:
    - the concept-units;
      
      the documents of the concept-units; and
      
      word distributions of the documents.
  - 4. A method as recited in claim 1, wherein identifying multiple concept-units comprises identifying hyperlinks within the documents that identify other documents in other languages relating to common concepts.
  - 5. A method as recited in claim 1, further comprising comparing a new document in a given language to the topic word distributions corresponding to the given language to estimate a topic distribution of the new document.
  - 6. A method as recited in claim 1, further comprising comparing new documents of different languages to the topic distributions to identify one or more groups of the new documents sharing common topics.
  - 7. A method as recited in claim 1, further comprising:
    - obtaining topic distributions of documents of a classified document corpus;
      
      obtaining topic distributions of documents of an unclassified document corpus;
      
      comparing topic distributions between the documents of the unclassified document corpus and the documents of the classified document corpus; and
      
      classifying one or more documents of the unclassified document corpus according to classifications of documents in the classified document corpus having common topic distributions with the one or more documents of the unclassified document corpus.
  - 8. A method as recited in claim 1, further comprising:
    - comparing a reference document of a first language and a plurality of documents of a second language to the topic distributions to identify topics of the documents; and
      
      recommending documents of the second language that are related to the reference document based on their identified topics.
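
The identifying step recited in claims 1 and 4 can be pictured as grouping documents that are joined by cross-language links. The sketch below is only an illustration under stated assumptions, not the patented implementation: the function name `build_concept_units`, the `docs` and `links` inputs, and the use of a union-find structure are all hypothetical choices made for this example.

```python
from collections import defaultdict


def build_concept_units(docs, links):
    """Group documents joined by cross-language links into concept-units.

    docs:  dict mapping a document id to its language code (assumed input).
    links: iterable of (doc_id, doc_id) pairs, e.g. inter-language hyperlinks.
    Returns a list of concept-units, each a sorted list of document ids.
    """
    parent = {d: d for d in docs}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        # Only links between known documents in different languages count.
        if a in parent and b in parent and docs[a] != docs[b]:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for d in docs:
        groups[find(d)].append(d)

    # A concept-unit must cover more than one language.
    return [sorted(g) for g in groups.values()
            if len({docs[d] for d in g}) > 1]
```

Documents with no cross-language link end up in single-language groups and are dropped, since a concept-unit as claimed contains documents in a plurality of languages.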

9. A method comprising:
- under a control of one or more processors, identifying multiple concept-units from a multi-language document corpus, a respective concept-unit including a set of documents in a plurality of languages describing a particular concept, the identifying including identifying one or more hyperlinks or references within a respective document that identify one or more other documents in one or more other languages relating to the particular concept;
  
  maintaining a separation of term-by-document matrices for the plurality of languages; and
  
  inferring a plurality of universal topics from the multiple concept-units, at least one respective universal topic being defined by a plurality of topic word distributions in the plurality of languages, at least one of the plurality of topic word distributions for the respective universal topic corresponding to a respective language from the plurality of languages and including one or more words in the respective language with corresponding probability values characterizing the respective universal topic.
- - 10. A method as recited in claim 9, wherein the inferring comprises performing a latent Dirichlet allocation analysis.
  - 11. A method as recited in claim 9, wherein the inferring comprises performing a probabilistic latent semantic analysis.
  - 12. A method as recited in claim 9, wherein the inferring comprises performing a latent Dirichlet allocation analysis while constraining the documents within a single concept-unit to share a common topic distribution.
  - 13. A method as described in claim 9, further comprising comparing a new document in a given language to the topic word distributions corresponding to the given language to estimate a topic distribution of the new document.
  - 14. A method as recited in claim 9, further comprising comparing new documents of different languages to the topic distributions to identify one or more groups of the new documents sharing common topics.
  - 15. A method as recited in claim 9, further comprising:
    - obtaining topic distributions of documents of a classified document corpus;
      
      obtaining topic distributions of documents of an unclassified document corpus;
      
      comparing topic distributions between the documents of the unclassified document corpus and the documents of the classified document corpus; and
      
      classifying one or more documents of the unclassified document corpus according to classifications of documents in the classified document corpus having common topic distributions with the one or more documents of the unclassified document corpus.
  - 16. A method as recited in claim 9, further comprising:
    - comparing a reference document of a first language and a plurality of documents of a second language to the topic distributions to identify topics of the documents; and
      
      recommending documents of the second language that are related to the reference document based on their identified topics.
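
One way to read claims 9, 10 and 12 together is as a latent Dirichlet allocation variant in which word counts are kept per language while all documents of a concept-unit draw on a single topic distribution. The following is a minimal sketch under that reading, using a collapsed Gibbs sampler. The sampler, the function name `fit_shared_topic_lda`, the input layout and the hyperparameters `alpha` and `beta` are assumptions made for illustration; the claims do not prescribe them.

```python
import random
from collections import defaultdict


def fit_shared_topic_lda(units, num_topics, iters=200,
                         alpha=0.1, beta=0.01, seed=0):
    """units: list of concept-units, each a dict {language: [token, ...]}.

    Returns (theta, phi): theta[u] is the topic distribution shared by every
    document in unit u; phi[lang][k] maps each word of that language to its
    probability under universal topic k.
    """
    rng = random.Random(seed)
    vocab = defaultdict(set)
    for unit in units:
        for lang, tokens in unit.items():
            vocab[lang].update(tokens)

    # One topic-count row per concept-unit, shared across its languages.
    unit_topic = [[0] * num_topics for _ in units]
    # Word counts are kept separately for each language.
    topic_word = {l: [defaultdict(int) for _ in range(num_topics)] for l in vocab}
    topic_total = {l: [0] * num_topics for l in vocab}

    assign = []  # one [unit index, language, word, topic] entry per token
    for u, unit in enumerate(units):
        for lang, tokens in unit.items():
            for w in tokens:
                k = rng.randrange(num_topics)
                assign.append([u, lang, w, k])
                unit_topic[u][k] += 1
                topic_word[lang][k][w] += 1
                topic_total[lang][k] += 1

    for _ in range(iters):
        for tok in assign:
            u, lang, w, k = tok
            # Remove the token's current assignment from the counts.
            unit_topic[u][k] -= 1
            topic_word[lang][k][w] -= 1
            topic_total[lang][k] -= 1
            v = len(vocab[lang])
            weights = [
                (unit_topic[u][j] + alpha)
                * (topic_word[lang][j][w] + beta)
                / (topic_total[lang][j] + v * beta)
                for j in range(num_topics)
            ]
            k = rng.choices(range(num_topics), weights=weights)[0]
            tok[3] = k
            unit_topic[u][k] += 1
            topic_word[lang][k][w] += 1
            topic_total[lang][k] += 1

    theta = [[(c + alpha) / (sum(row) + num_topics * alpha) for c in row]
             for row in unit_topic]
    phi = {
        l: [{w: (topic_word[l][k][w] + beta)
                / (topic_total[l][k] + len(vocab[l]) * beta)
             for w in vocab[l]}
            for k in range(num_topics)]
        for l in vocab
    }
    return theta, phi
```

Because `unit_topic` is indexed by concept-unit rather than by document, tokens from every language in a unit update the same row, which is how this sketch expresses the shared-topic-distribution constraint.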

17. A method comprising:
- identifying multiple concept-units from a multi-language document corpus, a respective concept-unit including a set of documents in a plurality of languages describing a particular concept, the identifying including identifying one or more hyperlinks or references within a respective document that identify one or more other documents in one or more other languages relating to the particular concept;
  
  maintaining a separation of term-by-document matrices for the plurality of languages;
  
  deriving a universal topic space from the multiple concept-units, the universal topic space including a plurality of universal topics, at least one respective universal topic being defined by a plurality of topic word distributions in the plurality of languages, at least one of the plurality of topic word distributions for the respective universal topic corresponding to a respective language from the plurality of languages and including one or more words in the respective language with corresponding probability values characterizing the respective universal topic; and
  
  analyzing one or more new documents of different languages to place them within the universal topic space.
- - 18. A method as recited in claim 17, wherein identifying the multiple concept-units comprises identifying hyperlinks within the documents that identify other documents in other languages relating to common concepts.
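
Placing a new document within the universal topic space, as in claim 17 and the comparisons of claims 5, 8, 13 and 16, could be sketched as follows. This is an illustrative assumption rather than the method of the specification: it holds the per-language topic word distributions fixed, estimates the document's topic mixture with a simple expectation-maximization loop, and compares documents across languages by cosine similarity. The names `infer_topic_distribution` and `cosine` are hypothetical.

```python
import math


def infer_topic_distribution(tokens, phi_lang, iters=50):
    """Estimate a topic mixture for one document.

    tokens:   list of words in the document's language.
    phi_lang: list with one dict per universal topic, mapping words of that
              language to probabilities (held fixed here).
    """
    k = len(phi_lang)
    theta = [1.0 / k] * k
    # Words that no topic can explain carry no information; skip them.
    known = [w for w in tokens if any(t.get(w, 0.0) > 0 for t in phi_lang)]
    if not known:
        return theta
    for _ in range(iters):
        counts = [0.0] * k
        for w in known:
            resp = [theta[j] * phi_lang[j].get(w, 0.0) for j in range(k)]
            total = sum(resp)
            if total == 0.0:
                continue  # guard against numerical underflow
            for j in range(k):
                counts[j] += resp[j] / total
        norm = sum(counts)
        if norm == 0.0:
            break
        theta = [c / norm for c in counts]
    return theta


def cosine(a, b):
    """Cosine similarity between two topic distributions."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```

Since every language shares the same topic indices, a mixture inferred from an English document and one inferred from a German document are directly comparable, so ranking second-language documents by `cosine` against a reference document gives one simple form of cross-language recommendation.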

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Ni, Xiaochuan; Sun, Jian-Tao; Chen, Zheng; Hu, Jian
Primary Examiner(s): KINSAUL, DANIEL W
Application Number: US12/760,844
Publication Number: US 20110258229A1
Time in Patent Office: 1,601 Days
Field of Search: None
US Class Current: 707/737
CPC Class Codes:
G06F 16/24   Querying
G06F 16/35   Clustering; Classification
G06F 16/36   Creation of semantic tools,...
G06F 40/284   Lexical analysis, e.g. toke...
