Method and system for off-line detection of textual topical changes and topic identification via likelihood based methods for improved language modeling
First Claim
Patent Images
1. A computer system, comprising:
- at least one central processing unit (CPU);
at least one memory coupled to said at least one CPU;
a network connectable to said at least one CPU; and
a database, stored on said at least one memory, containing a plurality of textual data set of topics, wherein said at least one CPU executes first and second processes in forward and reverse directions, respectively, for extracting a segment having a predetermined size from a text, computing likelihood scores of a text in the segment for each topic, computing likelihood ratios based on said likelihood scores, comparing said likelihood ratios to a threshold and defining whether there is a change point at the current last word in the window.
1 Assignment
0 Petitions
Accused Products
Abstract
A system (and method) for off-line detection of textual topical changes includes at least one central processing unit (CPU), at least one memory coupled to the at least one CPU, a network connectable to the at least one CPU, and a database, stored on the at least one memory, containing a plurality of textual data set of topics. The CPU executes first and second processes in first and second directions, respectively, for extracting a segment having a predetermined size from a text, computing likelihood scores of a text in the segment for each topic, computing likelihood ratios, comparing them to a threshold, and defining whether there is a change point at the current last word in a window.
171 Citations
53 Claims
-
1. A computer system, comprising:
-
at least one central processing unit (CPU);
at least one memory coupled to said at least one CPU;
a network connectable to said at least one CPU; and
a database, stored on said at least one memory, containing a plurality of textual data set of topics, wherein said at least one CPU executes first and second processes in forward and reverse directions, respectively, for extracting a segment having a predetermined size from a text, computing likelihood scores of a text in the segment for each topic, computing likelihood ratios based on said likelihood scores, comparing said likelihood ratios to a threshold and defining whether there is a change point at the current last word in the window. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
wherein the segment T1 is biased to the right relative to T, and the segment T2 is biased to the left, and wherein forward segmentation assigns the left edge of T to a preceding topic and reverse segmentation assigns the right edge of T to a preceding topic, wherein averaging of the edges of T1 and T2 is performed. -
8. The method according to claim 7, wherein when T1 covers more than one occurrence of T2 such that T1 covers a sub-segment identified by a reverse search, surrounded by T2 on both sides, said sub-segment is considered as part of T2, and extraction of the edges of said topic by averaging edges of T1 and T2 is performed.
-
-
9. A method of performing off-line detection of textual topical changes and topic identification, comprising:
-
extracting a segment of a size one, two and more from a text;
computing likelihood scores of a text in the segment for each topic;
computing likelihood ratios based on said likelihood scores; and
comparing said likelihood ratios to a threshold and defining whether there is a change point at a current last word in a window, wherein said method is executed in forward and reverse directions. - View Dependent Claims (10, 11, 12, 13, 14, 15)
extracting a segment of size one, two and more from a text;
computing likelihood ratios based on the likelihood scores of a text in the segment for each topic;
comparing the likelihood ratios to a threshold; and
one of defining a new topic and determining that the topic is not in a battery of topics.
-
-
12. The method as in claim 9, wherein computing the likelihood score includes using a T-gram language model.
-
13. The method as in claim 9, further comprising:
translating segmented text using a topic labeling of each segment.
-
14. The method as in claim 9, further comprising:
labelling non-confusable segments.
-
15. The method as in claim 9, further comprising:
labelling confusable segments including using a mixture of language models.
-
16. A method of detecting topical changes in a textual segment, comprising:
-
evaluating text probabilities under each topic of a plurality of topics including segmenting textual data of said textual segment and identifying a topic of a current segment in different time direction; and
selecting a new topic when one of said text probabilities becomes larger than others of said text probabilities wherein said topic detection is performed off-line, wherein said segmenting includes segmenting said textual segment into homogenous segments and wherein said different time directions include moving from a beginning of a text to the end of said text and vice versa. - View Dependent Claims (17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51)
wherein said labelling includes marking topics moving from an end of a text to a beginning. -
18. The method according to claim 17, further comprising:
estimating probability scores for topics using marks that obtained by said labeling.
-
19. The method according to claim 16, wherein a cumulative sum-based method is used for detecting change-points in textual data and for estimating probabilities of sample distribution for topic identification.
-
20. The method according to claim 19, further comprising:
applying a change-point detection for detection of homogenous segments of textual data while moving in two different time directions, thereby to identify hidden regularities of textual components that are obtained for each time direction.
-
21. The method according to claim 20, wherein if said regularities coincide for both directions, then said regularities are used for topic labeling, and
wherein if said regularities do not coincide for both directions, then said regularities are used to build a mathematical model that reflect confusability in these regularities. -
22. The method according to claim 16, further comprising:
-
performing labelling for each of a plurality of directions; and
resolving confusable decisions obtained for different directions.
-
-
23. The method according to claim 16, further comprising:
performing off-line segmentation of textual data that uses a change-point detection such that off-line topic identification of textual data can be performed.
-
24. The method according to claim 16, wherein said method includes utilizing a change-point detection using a cumulative sum (CUSUM) technique.
-
25. The method according to claim 16, further comprising:
performing a one direction topic segmentation and labeling including a training procedure.
-
26. The method according to claim 25, wherein said training procedure comprises:
-
creating a battery of topics T1, T2, . . . , Tn, wherein each topic is characterized by one of frequencies of tokens, frequencies of combination of two words and three words, wherein a Kullback-Liebler distance between any two topics is defined as at least h, where h is a threshold, wherein probabilities of a tuple of words depend on time directions, and for each time direction, separate probabilities for a combination(s) of words are estimated;
creating a neutral topic T derived from textual data that include all topics and other general texts; and
finding typical durations of texts belonging to topics in the battery.
-
-
27. The method according to claim 26, wherein said training procedure further comprises:
-
examining estimated frequencies and comparing said estimated frequencies to all topics in the list, to determine a single topic as a predetermined topic and declaring said topic, wherein if a conclusion is reached that a current topic is not in the list, then {tilde over (T)} is declared as a current topic.
-
-
28. The method according to claim 27, wherein said training procedure further comprises:
-
once a topic {circumflex over (T)} has been declared, monitoring to detect a change in said topic, wherein during monitoring a likelihood of text is compared against {circumflex over (T)} until a regeneration point is reached.
-
-
29. The method according to claim 28, wherein said training procedure further comprises:
-
establishing a current topic including, once the topic change is detected, establishing a time of onset of the new topic;
without declaring a new topic, continuing to accumulate words until a topic having a predetermined high score is found, said continuing including;
finding a candidate topic Ti for which the likelihood of the text is maximal, and then finding the closest competing topic representing a topic that has the maximum of likelihoods corresponding to remaining topics excluding {tilde over (T)};
comparing a likelihood of the group of words in the candidate topic Ti against {T1, . . . }, and if the likelihood is higher by a given factor than the likelihood of a text to the closest competing topic, then Ti is declared as the current topic, and wherein if a conclusion is reached that a topic is not in the list, then {tilde over (T)} is declared as the current topic.
-
-
30. The method according to claim 29, wherein said training procedure further comprises:
restarting monitoring the current topic to detect a change.
-
31. The method according to claim 29, wherein with topic Tj be declared and an onset time thereof found, a new word is taken and k is set to 1 and, starting from a current word wm+1, go k words back and find probabilities Pm+1−
- i(Tj)=P(wm+1−
i|Tj),i=0,1, . . . ,k−
1, and a distribution {{circumflex over (P)}m−
k+2, . . . {circumflex over (P)}m+1} is constructed where
- i(Tj)=P(wm+1−
-
32. The method according to claim 27, further comprising:
identifying a topic of a given text segment wl, . . . , wm−
1.
-
33. The method according to claim 32, wherein said identifying includes:
-
verifying whether said segment is covered by a new topic that is not in the battery of topics, wherein if it is determined that the segment is covered by some topic from the battery, then said topic is identified.
-
-
34. The method according to claim 33, wherein said testing whether said topic is in the battery comprises:
-
for a given k corresponding to the segment associated with the current topic, computing for every j in the battery wherein if S(j)>
c for all j, then said topic is declared as not being in the battery and a switch is performed to the neutral topic {tilde over (T)}.
-
-
35. The method according to claim 34, wherein said topic identification further comprises:
-
establishing the topic by one of;
determining if
for some i, and if so, then Ti is declared as a new topic; and
determining if
for some i and l=m+l, m, . . . , m−
k+2, and if so, then Ti is declared as a new topic.
-
-
36. The method according to claim 35, wherein if it was neither established that the topic is not in the battery nor that the topic was identified, an additional point is taken by incrementing m, and it is retested of whether said topic is in the battery.
-
37. The method according to claim 27, further comprising:
resolving different labeling sets performed in different labelling procedures.
-
38. The method according to claim 37, wherein said resolving includes:
checking said labels which were obtained through different time directions, wherein if some labels from different time directions coincide and are located closely in a predetermined manner, then this label is chosen and is placed in a middle of a location of labels from two directions.
-
39. The method according to claim 37, wherein a sum is constructed of:
-
wherein if Sk>
c, then a change in topic is determined to have occurred and m−
k+2 is declared as the onset point, andwherein if Sk>
c is not determined, then k is incremented and the process is repeated.
-
-
40. The method according to claim 39, wherein if the regeneration point is reached and Sk>
- c is not determined, then the current topic Tj is continued, and
wherein if for a present point Sk<
0 for every k, then the present point is determined as a new regeneration point.
- c is not determined, then the current topic Tj is continued, and
-
41. The method according to claim 25, wherein in said labeling, when a label in one direction Tj1 divides a segment of words between two labels from a different direction Ti12 and Ti22 and with the label Tj1 different from labels Ti12 and Ti22, first and second topic labels are introduced,
wherein said first label includes a label Tj,i11,2 for a segment S1,1 that is covered by both Tj1 and Ti12, and said second label includes a label Tj,i21,2 for a segment S1,2 that is covered by both Tj1 and Ti22. -
42. The method according to claim 41, wherein, for each of said first and second labels and a new segment that is covered by said first and second labels, a new set of probabilities is attached including a mixture of previous probability sets that are related to previous topics, such that letting Pa=P(wk . . . wr|Tj1) be a probability for a string of words in the segment S1,1 labelled by a topic Tj1 and Pb=P(wk . . . wr|Ti1) be a probability for said string of words labelled by a topic Ti1, a new probability is defined for said string of words as Pab=P(wk . . . wr|Tj,i11,2)=α
- Pa+β
Pb,wherein α
=l1/l and β
=l2/l, where l1 is a duration of a sub-segment S1,1, l2 is a duration of a sub-segment S1,2 and l=l1+l2.
- Pa+β
-
43. The method according to claim 16, further comprising detecting an onset of a new topic;
- and
wherein with P(wl, wl+1, . . . wr|Tl) representing a likelihood measure for identifying a given string of words wl, wl+1, . . . wr in a context of a topic Ti from a battery, and with Fi(wl, wl+1, . . . wr)−
max {j=1 . . . n, Tj≠
Ti;
Tj≠
T}P(w;
. . . wr|Tj) and that a moment of time 1 corresponds to a regeneration point and having analyzed words from wl to wm and found no evidence that the declared topic Ti changed, determining whether the next word Wm+1 belongs to the same topic or that the topic has changed.
- and
-
44. The method according to claim 43, wherein said verification includes:
taking a segment of length l, and determining whether
-
45. The method according to claim 44, wherein if it is determined that
-
( w m + 1 ) P ( w m + 1 T i ) > c , then it is declared that topic is not Ti, and m+1 is declared as the onset point of a new topic.
-
-
46. The method according to claim 45, wherein if it is determined
-
( w m + 1 ) P ( w m + 1 T i ) is ≤ c , then a segment of length 2 is taken and it is determined whether
-
-
47. The method according to claim 46, wherein if it determined that
-
( w m , w m + 1 ) P ( w m , w m + 1 T i ) > c , then said topic is declared as not Ti, and m is declared as the onset point of a new topic.
-
-
48. The method according to claim 47, wherein if it is determined that
-
( w m , w m + 1 ) P ( w m , w m + 1 T i ) ≤ c , then a segment of length 3 is taken and it is determined whether
-
-
49. The method according to claim 48, further comprising determining whether a segment going back in time until the regeneration point was found for which:
-
50. The method according to claim 49, wherein if it is determined that
-
( w m + 2 - k … w m - 1 , w m , w m + 1 ) P ( w m + 2 - k … w m - 1 , w m , w m + 1 T i ) > c , then it is declared that the topic is not Ti, and m−
k+2 is declared as the onset point of a new topic, and wherein if it is determined thatthen it is declared that there is no evidence in change of topic.
-
-
51. The method according to claim 50, wherein if l is reached and in all the previous steps, including a regeneration point
-
( w m + 2 - k … w m - 1 , w m , w m + 1 ) P ( w m + 2 - k … w m - 1 , w m , w m + 1 T i ) < 1 then wm+1 is declared as a new regeneration point.
-
-
-
52. A method for detection of textual topical changes and topic identification, comprising:
-
forming a battery of topics from training data;
detecting topic changes in said text, using said battery and a first threshold ratio;
identifying topics in said text, using said battery and a second threshold ratio, and performing said identification using different time direction;
segmenting textual data into homogenous segments, wherein said topic identification is performed off-line, and wherein text probabilities under each topic are evaluated, and a new topic is selected when one of said probabilities become larger than other probabilities, and wherein said different time directions include moving from a beginning of a text to the end of said text and vice versa.
-
-
53. A signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a method for computer-implemented off-line detection of textual topical changes and topic identification, said method comprising:
-
extracting a segment of a predetermined size from a text;
computing likelihood scores of a text in the segment for each topic;
computing likelihood ratios;
comparing said likelihood ratios to a threshold and defining whether there is a change point at a current last word in a window; and
repeating said method in a reverse time direction, wherein said method of extracting, computing and comparing is first performed in a forward direction.
-
Specification