Mining multilingual topics
Abstract
Techniques are described for using data mining to extract universal topics with multilingual representations from a multilingual database and to organize existing or new documents in different languages by analyzing their respective topic distributions.
18 Claims
1. A method comprising:
- under a control of one or more processors, identifying multiple concept-units from a multi-language document corpus, a respective concept-unit including a set of documents in a plurality of languages describing a particular concept, the identifying including identifying one or more hyperlinks or references within a respective document that identify one or more other documents in one or more other languages relating to the particular concept; and
- modeling the concept-units of the multi-language document corpus by maintaining a separation of term-by-document matrices for the plurality of languages to create a generative model, the generative model representing:
  - a plurality of universal topics, at least one respective universal topic being defined by a plurality of topic word distributions in the plurality of languages, at least one of the plurality of topic word distributions for a respective universal topic corresponding to a respective language from the plurality of languages and including one or more words in the respective language with corresponding probability values characterizing the respective universal topic; and
  - a topic distribution for at least one concept-unit, the topic distribution for a respective concept-unit including one or more universal topics and their distributions for the respective concept-unit, the set of documents in the different plurality of languages of the respective concept-unit being constrained to share a common topic distribution.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
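The identifying and matrix-separation steps of claim 1 can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the toy `CORPUS`, the document ids, whitespace tokenization, and the use of union-find to group documents connected by cross-language links. Each language gets its own sparse count matrix, whose columns are indexed by concept-unit.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: doc id -> (language, text, linked doc ids in other languages).
CORPUS = {
    "en/Cat": ("en", "cat feline pet cat", ["de/Katze"]),
    "de/Katze": ("de", "katze haustier katze", ["fr/Chat"]),
    "fr/Chat": ("fr", "chat animal chat", []),
    "en/Car": ("en", "car engine road", ["de/Auto"]),
    "de/Auto": ("de", "auto motor strasse", []),
}


def find_concept_units(corpus):
    """Group documents that are connected by cross-language links (union-find)."""
    parent = {doc_id: doc_id for doc_id in corpus}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for doc_id, (_, _, links) in corpus.items():
        for target in links:
            if target in corpus:  # ignore links that leave the corpus
                parent[find(doc_id)] = find(target)

    units = defaultdict(list)
    for doc_id in corpus:
        units[find(doc_id)].append(doc_id)
    return sorted(sorted(members) for members in units.values())


def term_matrices_by_language(corpus, units):
    """Build one sparse term-count matrix per language; column j is concept-unit j."""
    matrices = defaultdict(lambda: defaultdict(Counter))  # lang -> unit -> term counts
    for j, members in enumerate(units):
        for doc_id in members:
            lang, text, _ = corpus[doc_id]
            matrices[lang][j].update(text.split())
    return matrices


units = find_concept_units(CORPUS)
matrices = term_matrices_by_language(CORPUS, units)
print(units)
print(dict(matrices["en"][1]))
```

Keeping the vocabularies apart means no translation dictionary or merged vocabulary is needed in this sketch. The languages are tied together only through the shared concept-unit index.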
9. A method comprising:
- under a control of one or more processors, identifying multiple concept-units from a multi-language document corpus, a respective concept-unit including a set of documents in a plurality of languages describing a particular concept, the identifying including identifying one or more hyperlinks or references within a respective document that identify one or more other documents in one or more other languages relating to the particular concept;
- maintaining a separation of term-by-document matrices for the plurality of languages; and
- inferring a plurality of universal topics from the multiple concept-units, at least one respective universal topic being defined by a plurality of topic word distributions in the plurality of languages, at least one of the plurality of topic word distributions for the respective universal topic corresponding to a respective language from the plurality of languages and including one or more words in the respective language with corresponding probability values characterizing the respective universal topic.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
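The claim does not fix an inference procedure. As one possible illustration, the sketch below swaps in a PLSA-style EM loop, which is an assumption and not necessarily the patented method. The toy `COUNTS` input, the function names, and the smoothing constant are all hypothetical. Each topic keeps a separate word distribution per language, while each concept-unit has a single topic distribution that pools evidence from all of its languages.

```python
import random

# Hypothetical input: COUNTS[lang][unit_index] = {term: count}, one matrix per language.
COUNTS = {
    "en": {0: {"cat": 3, "pet": 1}, 1: {"car": 2, "road": 2}},
    "de": {0: {"katze": 3, "haustier": 1}, 1: {"auto": 3, "strasse": 1}},
}


def normalize(weights):
    """Scale a dict of positive weights so that its values sum to 1."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


def fit_universal_topics(counts, n_units, n_topics, n_iter=50, seed=0):
    rng = random.Random(seed)
    vocab = {
        lang: sorted({term for doc in docs.values() for term in doc})
        for lang, docs in counts.items()
    }
    # One word distribution per (language, topic); one topic distribution per unit.
    word_given_topic = {
        lang: [normalize({t: rng.random() + 0.1 for t in terms}) for _ in range(n_topics)]
        for lang, terms in vocab.items()
    }
    topic_given_unit = [
        normalize({z: rng.random() + 0.1 for z in range(n_topics)}) for _ in range(n_units)
    ]
    for _ in range(n_iter):
        # Accumulators start at a tiny constant so no probability becomes exactly zero.
        word_acc = {
            lang: [dict.fromkeys(vocab[lang], 1e-12) for _ in range(n_topics)]
            for lang in counts
        }
        unit_acc = [dict.fromkeys(range(n_topics), 1e-12) for _ in range(n_units)]
        for lang, docs in counts.items():
            for unit, doc in docs.items():
                for term, n in doc.items():
                    # E-step: responsibility of each topic for this (term, unit) pair.
                    post = normalize({
                        z: topic_given_unit[unit][z] * word_given_topic[lang][z][term]
                        for z in range(n_topics)
                    })
                    for z, p in post.items():
                        word_acc[lang][z][term] += n * p  # stays language-specific
                        unit_acc[unit][z] += n * p        # pooled across languages
        # M-step: renormalize the expected counts.
        word_given_topic = {
            lang: [normalize(acc) for acc in topics] for lang, topics in word_acc.items()
        }
        topic_given_unit = [normalize(acc) for acc in unit_acc]
    return word_given_topic, topic_given_unit


word_given_topic, topic_given_unit = fit_universal_topics(COUNTS, n_units=2, n_topics=2)
for z in range(2):
    print(z, {lang: max(dists[z], key=dists[z].get) for lang, dists in word_given_topic.items()})
```

Because the unit-level accumulator is shared, topic `z` ends up describing the same concept in every language even though no word-level alignment is supplied.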
17. A method comprising:
- identifying multiple concept-units from a multi-language document corpus, a respective concept-unit including a set of documents in a plurality of languages describing a particular concept, the identifying including identifying one or more hyperlinks or references within a respective document that identify one or more other documents in one or more other languages relating to the particular concept;
- maintaining a separation of term-by-document matrices for the plurality of languages;
- deriving a universal topic space from the multiple concept-units, the universal topic space including a plurality of universal topics, at least one respective universal topic being defined by a plurality of topic word distributions in the plurality of languages, at least one of the plurality of topic word distributions for the respective universal topic corresponding to a respective language from the plurality of languages and including one or more words in the respective language with corresponding probability values characterizing the respective universal topic; and
- analyzing one or more new documents of different languages to place them within the universal topic space.
Dependent claims: 18
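The final step of claim 17, placing new documents within the universal topic space, could be sketched as a "fold-in" estimate: the topic word distributions stay fixed and only the new document's topic distribution is fitted. This is an assumed technique for illustration. The hand-written `TOPICS` table, the function name, and the `floor` probability for unseen words are hypothetical.

```python
# Hypothetical universal topic space: topic index -> language -> word probabilities.
TOPICS = [
    {"en": {"cat": 0.6, "pet": 0.4}, "de": {"katze": 0.6, "haustier": 0.4}},
    {"en": {"car": 0.5, "road": 0.5}, "de": {"auto": 0.7, "strasse": 0.3}},
]


def place_in_topic_space(tokens, lang, topics, n_iter=30, floor=1e-6):
    """Estimate P(topic | new document) while holding the topic word distributions fixed."""
    k = len(topics)
    theta = [1.0 / k] * k  # start from a uniform topic distribution
    if not tokens:
        return theta
    for _ in range(n_iter):
        acc = [0.0] * k
        for token in tokens:
            # Words missing from a topic's vocabulary get a small floor probability.
            weights = [theta[z] * topics[z][lang].get(token, floor) for z in range(k)]
            total = sum(weights)
            for z in range(k):
                acc[z] += weights[z] / total
        theta = [a / len(tokens) for a in acc]
    return theta


en_doc = place_in_topic_space(["cat", "pet", "cat"], "en", TOPICS)
de_doc = place_in_topic_space(["katze", "haustier"], "de", TOPICS)
de_car = place_in_topic_space(["auto", "strasse"], "de", TOPICS)
print(en_doc, de_doc, de_car)
```

In this toy setup the English and German pet documents land on nearly the same topic vector, so documents in different languages can be compared or grouped by their topic distributions alone.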
Specification