Identification of topics for online discussions based on language patterns
Abstract
A topic identification system identifies topics of online discussions by iteratively identifying topic words or keywords of the online discussions and identifying language patterns associated with those keywords. The topic identification system starts out with an initial set of keywords and identifies language patterns that each include a keyword. The topic identification system then uses the identified language patterns to identify additional keywords of the online discussion that match the patterns. The topic identification system then again identifies language patterns using the keywords including the newly identified keywords. The topic identification system may repeat the process of identifying language patterns and keywords until a termination criterion is satisfied.
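The alternation described in the abstract (patterns from keywords, then keywords from patterns, repeated until a termination criterion holds) can be sketched as a small driver loop. This is only an illustration under stated assumptions: the names `bootstrap_keywords`, `find_patterns`, `find_keywords`, and `max_rounds` are hypothetical, and the two termination conditions used here (no new keywords, or a round limit) are one possible choice, not necessarily the criterion the patent intends.

```python
def bootstrap_keywords(initial_keywords, find_patterns, find_keywords, max_rounds=10):
    """Alternate pattern discovery and keyword discovery.

    find_patterns(keywords) -> iterable of patterns   (hypothetical callable)
    find_keywords(patterns) -> iterable of keywords   (hypothetical callable)
    Stops when a round yields no new keyword or after max_rounds rounds
    (both are assumed termination criteria).
    """
    keywords = set(initial_keywords)
    for _ in range(max_rounds):
        patterns = find_patterns(keywords)
        found = set(find_keywords(patterns)) - keywords
        if not found:          # nothing new: treat as termination
            break
        keywords |= found      # next round uses old and new keywords together
    return keywords


# Toy stand-ins: each "pattern" is just a keyword, and each pattern
# proposes the next word in a fixed chain.
NEXT_WORD = {"camera": "lens", "lens": "tripod", "tripod": "tripod"}
result = bootstrap_keywords(
    {"camera"},
    find_patterns=lambda kws: kws,
    find_keywords=lambda pats: {NEXT_WORD[p] for p in pats},
)
print(sorted(result))
```

Passing the two discovery steps in as callables keeps the loop independent of how patterns are mined or matched, so different mining algorithms can be swapped in without touching the termination logic.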
17 Claims
1. A method in a computing device for identifying keywords from a corpus of sentences of words, the method comprising:
- storing an initial set of keywords as a current set of keywords;
- locating, from sentences of the corpus, words that are keywords of the current set of keywords and replacing each located word with an occurrence of a keyword symbol;
- for each occurrence of a keyword symbol of a sentence of the corpus, identifying a sequence segment that includes the occurrence of the keyword symbol along with words of the sentence that are adjacent to the keyword symbol;
- applying a pattern mining algorithm to the identified sequence segments to identify patterns of words adjacent to the occurrences of the keyword symbol by comparing words adjacent to an occurrence of a keyword symbol to words adjacent to other occurrences of the keyword symbol to derive patterns from the adjacent words, some of the identified patterns including the keyword symbol and others of the identified patterns not including the keyword symbol;
- filtering out from the identified patterns the identified patterns that do not include the keyword symbol;
- filtering out from the identified patterns the identified patterns that include only prepositions in addition to the keyword symbol;
- identifying, from the sentences of the corpus, a new current set of keywords that satisfy a keyword confidence criterion based on the identified patterns by applying each identified pattern to the sentences and when an identified pattern matches a sentence, designating the word of the sentence corresponding to the keyword symbol of the identified pattern as a keyword of the new current set of keywords; and
- repeating the locating of words, the identifying of sequence segments, the applying of the pattern matching algorithm to identify patterns, and the identifying of keywords using the new current set of keywords until a termination criterion is satisfied and then indicating that the keywords of the identified new current sets of keywords are keywords of the corpus.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
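The first steps of claim 1 (keyword-symbol replacement, sequence-segment extraction, and pattern mining) can be illustrated with a minimal sketch. Several things here are assumptions rather than anything the claim specifies: the placeholder token `<KW>`, the `window` size, whitespace tokenization without punctuation handling, and the use of contiguous n-gram counting as a simple stand-in for the pattern mining algorithm.

```python
from collections import Counter

KEYWORD_SYMBOL = "<KW>"  # hypothetical placeholder token


def symbolize(sentence, keywords):
    """Lowercase and split a sentence, replacing current keywords with the symbol."""
    return [KEYWORD_SYMBOL if w in keywords else w
            for w in sentence.lower().split()]


def sequence_segments(tokens, window=2):
    """For each keyword symbol, return it with up to `window` words on each side."""
    return [tuple(tokens[max(0, i - window): i + window + 1])
            for i, tok in enumerate(tokens) if tok == KEYWORD_SYMBOL]


def mine_patterns(segments, min_count=2):
    """Count contiguous n-grams (n >= 2) and keep those seen in >= min_count segments.

    N-gram counting is an assumed stand-in for a real pattern mining algorithm.
    """
    counts = Counter()
    for seg in segments:
        # A set, so each n-gram counts at most once per segment.
        grams = {seg[i:j] for i in range(len(seg))
                 for j in range(i + 2, len(seg) + 1)}
        counts.update(grams)
    return {gram for gram, c in counts.items() if c >= min_count}


corpus = ["I bought the camera last week", "She bought the laptop last month"]
keywords = {"camera", "laptop"}
segments = [seg for s in corpus
            for seg in sequence_segments(symbolize(s, keywords))]
patterns = mine_patterns(segments)
```

On this toy corpus the mined set contains patterns with the keyword symbol, such as `("bought", "the", "<KW>", "last")`, and one without it, `("bought", "the")`, which is the situation the two filtering steps of the claim are meant to handle.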
9. A computing device that identifies keywords from a corpus of sentences of words from online discussions, comprising:
- a keyword store containing keywords, the keyword store having an initial set of keywords;
- a corpus store containing sentences of the corpus;
- a component that identifies sequence segments of the sentences of the corpus, a sequence segment being a sequence of words that includes a keyword of the keyword store;
- a component that identifies, from the identified sequence segments, patterns of sequences of words that include a keyword by comparing a sequence segment to other sequence segments to determine whether a pattern can be derived from the sequence segments by applying a pattern mining algorithm to the identified sequence segments to identify patterns of words adjacent to the occurrences of the keywords based on comparison of words adjacent to an occurrence of a keyword to words adjacent to other occurrences of keywords to derive patterns from the adjacent words, some of the identified patterns including a keyword;
- a component that filters out from the identified patterns the identified patterns that do not include a keyword;
- a component that filters out from the identified patterns the identified patterns that include only prepositions in addition to a keyword;
- a component that identifies, from the sentences of the corpus, keywords within the identified patterns that satisfy a keyword confidence criterion and adds the identified keywords to the keyword store; and
- a component that determines whether a termination criterion is satisfied so that the iterative identification of sequence segments, patterns, and keywords is terminated.
Dependent claims: 10, 11, 12, 13, 14
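The two filtering components recited above can be sketched as a single predicate over mined patterns. The token `<KW>` and the `PREPOSITIONS` list are illustrative assumptions (the list is deliberately short and not exhaustive), and dropping a pattern that consists of the keyword symbol alone is an added assumption, since such a pattern carries no context.

```python
KEYWORD_SYMBOL = "<KW>"  # hypothetical placeholder token

# Illustrative, non-exhaustive preposition list (an assumption).
PREPOSITIONS = frozenset(
    {"of", "in", "on", "at", "to", "for", "with", "by", "from", "about"}
)


def keep_pattern(pattern):
    """Return True if a pattern survives both filters."""
    # Filter 1: the pattern must include the keyword symbol.
    if KEYWORD_SYMBOL not in pattern:
        return False
    # Filter 2: the remaining words must not all be prepositions.
    # all() of an empty list is True, so a bare symbol is dropped too.
    others = [w for w in pattern if w != KEYWORD_SYMBOL]
    return not all(w in PREPOSITIONS for w in others)


mined = [
    ("bought", "the", "<KW>"),   # kept: has symbol and non-preposition context
    ("of", "<KW>"),              # dropped: only a preposition besides the symbol
    ("bought", "the"),           # dropped: no keyword symbol
    ("in", "<KW>", "for"),       # dropped: only prepositions besides the symbol
]
kept = [p for p in mined if keep_pattern(p)]
```

Patterns such as `("of", "<KW>")` match almost any noun, so removing them keeps the later keyword identification step from proposing large numbers of unrelated words.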
15. A computer-readable medium containing instructions for controlling a computing device to identify topic information from a corpus of sentences of words, by a method comprising:
- storing an initial set of keywords; and
- repeating the steps of
  - identifying sequence segments of the sentences of the corpus, a sequence segment being a sequence of words that includes a keyword of the stored keywords;
  - identifying from the identified sequence segments patterns of sequences of words that include a keyword and that satisfy a pattern support criterion by comparing a sequence segment to other sequence segments to determine whether a pattern can be derived from the sequence segments by applying a pattern mining algorithm to the identified sequence segments to identify patterns of words adjacent to the occurrences of the keywords based on comparison of words adjacent to an occurrence of a keyword to words adjacent to other occurrences of keywords to derive patterns from the adjacent words, some of the identified patterns including a keyword;
  - filtering out from the identified patterns the identified patterns that do not include a keyword;
  - filtering out from the identified patterns the identified patterns that include only prepositions in addition to a keyword;
  - identifying from the sentences of the corpus keywords within the identified patterns that satisfy a keyword confidence criterion; and
  - storing the identified keywords
- until a termination criterion is satisfied.
Dependent claims: 16, 17
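Claim 15 names a pattern support criterion and a keyword confidence criterion without defining them here. The sketch below shows one plausible reading, and both definitions are assumptions: support is taken as the number of sequence segments that contain the pattern contiguously, and a candidate word is treated as confident when at least `min_patterns` distinct patterns propose it. The function names and the `<KW>` token are hypothetical.

```python
from collections import defaultdict

KEYWORD_SYMBOL = "<KW>"  # hypothetical placeholder token


def support(pattern, segments):
    """Number of segments containing `pattern` as a contiguous run (assumed definition)."""
    n = len(pattern)
    return sum(
        any(tuple(seg[i:i + n]) == tuple(pattern)
            for i in range(len(seg) - n + 1))
        for seg in segments
    )


def match_candidates(pattern, tokens):
    """Words of `tokens` aligned with the keyword symbol wherever `pattern` matches."""
    n = len(pattern)
    hits = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        # The keyword symbol acts as a wildcard; other positions must match exactly.
        if all(p == KEYWORD_SYMBOL or p == w for p, w in zip(pattern, window)):
            hits.extend(w for p, w in zip(pattern, window) if p == KEYWORD_SYMBOL)
    return hits


def confident_keywords(patterns, sentences, min_patterns=2):
    """Keep candidates proposed by at least `min_patterns` distinct patterns (assumed criterion)."""
    proposers = defaultdict(set)
    for pattern in patterns:
        for sentence in sentences:
            for word in match_candidates(pattern, sentence.lower().split()):
                proposers[word].add(pattern)
    return {w for w, ps in proposers.items() if len(ps) >= min_patterns}


patterns = [("bought", "the", "<KW>"), ("<KW>", "last")]
sentences = ["I bought the tablet last year", "He sold the tablet", "They met last week"]
confident = confident_keywords(patterns, sentences)
```

Here "tablet" is proposed by both patterns and passes, while "met" is proposed only by `("<KW>", "last")` and is rejected. Requiring agreement between patterns is one way to limit the drift that bootstrapping loops are prone to.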
Specification