Method and system for off-line detection of textual topical changes and topic identification via likelihood based methods for improved language modeling

US 6,529,902 B1
Filed: 11/08/1999
Issued: 03/04/2003
Est. Priority Date: 11/08/1999
Status: Expired due to Term

First Claim

Patent Images

1. A computer system, comprising:

at least one central processing unit (CPU);

at least one memory coupled to said at least one CPU;

a network connectable to said at least one CPU; and

a database, stored on said at least one memory, containing a plurality of textual data set of topics, wherein said at least one CPU executes first and second processes in forward and reverse directions, respectively, for extracting a segment having a predetermined size from a text, computing likelihood scores of a text in the segment for each topic, computing likelihood ratios based on said likelihood scores, comparing said likelihood ratios to a threshold and defining whether there is a change point at the current last word in the window.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A system (and method) for off-line detection of textual topical changes includes at least one central processing unit (CPU), at least one memory coupled to the at least one CPU, a network connectable to the at least one CPU, and a database, stored on the at least one memory, containing a plurality of textual data set of topics. The CPU executes first and second processes in first and second directions, respectively, for extracting a segment having a predetermined size from a text, computing likelihood scores of a text in the segment for each topic, computing likelihood ratios, comparing them to a threshold, and defining whether there is a change point at the current last word in a window.

171 Citations

53 Claims

1. A computer system, comprising:
- at least one central processing unit (CPU);
  
  at least one memory coupled to said at least one CPU;
  
  a network connectable to said at least one CPU; and
  
  a database, stored on said at least one memory, containing a plurality of textual data set of topics, wherein said at least one CPU executes first and second processes in forward and reverse directions, respectively, for extracting a segment having a predetermined size from a text, computing likelihood scores of a text in the segment for each topic, computing likelihood ratios based on said likelihood scores, comparing said likelihood ratios to a threshold and defining whether there is a change point at the current last word in the window.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
- - 2. The system according to claim 1, wherein said at least one CPU further executes a process for extracting a segment having a predetermined size from a text, computing likelihood ratios, comparing the likelihood ratios to a threshold, and defining a new topic, or determining that the topic is not in a battery of topics.
  - 3. The system as in claim 1, wherein the likelihood score is computed using a T-gram language model.
  - 4. The system as in claim 1, wherein said at least one CPU further executes a process for translating segmented text using topic labeling of each segment.
  - 5. The system as in claim 1, wherein said at least one CPU further executes a process for labelling non-confusable segments.
  - 6. The system as in claim 1, wherein said at least one CPU executes the process for labeling of confusable segments using a mixture of language models.
  - 7. The method according to claim 1, wherein, with a text containing a topic T, said forward and reverse processes respectively produce corresponding segments T¹and T²for said forward and reverse directions,
- 8. The method according to claim 7, wherein when T¹covers more than one occurrence of T²such that T¹covers a sub-segment identified by a reverse search, surrounded by T²on both sides, said sub-segment is considered as part of T², and extraction of the edges of said topic by averaging edges of T¹and T²is performed.

9. A method of performing off-line detection of textual topical changes and topic identification, comprising:
- extracting a segment of a size one, two and more from a text;
  
  computing likelihood scores of a text in the segment for each topic;
  
  computing likelihood ratios based on said likelihood scores; and
  
  comparing said likelihood ratios to a threshold and defining whether there is a change point at a current last word in a window, wherein said method is executed in forward and reverse directions.
- View Dependent Claims (10, 11, 12, 13, 14, 15)
- - 10. The method according to claim 9, wherein said method is executed by at least one central processing unit.
  - 11. The method according to claim 9, further comprising:
12. The method as in claim 9, wherein computing the likelihood score includes using a T-gram language model.
13. The method as in claim 9, further comprising:
- translating segmented text using a topic labeling of each segment.
14. The method as in claim 9, further comprising:
- labelling non-confusable segments.
15. The method as in claim 9, further comprising:
- labelling confusable segments including using a mixture of language models.

16. A method of detecting topical changes in a textual segment, comprising:
- evaluating text probabilities under each topic of a plurality of topics including segmenting textual data of said textual segment and identifying a topic of a current segment in different time direction; and
  
  selecting a new topic when one of said text probabilities becomes larger than others of said text probabilities wherein said topic detection is performed off-line, wherein said segmenting includes segmenting said textual segment into homogenous segments and wherein said different time directions include moving from a beginning of a text to the end of said text and vice versa.
- View Dependent Claims (17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51)
- - 17. The method according to claim 16, further comprising labeling said topic using different labeling strategies,
- 18. The method according to claim 17, further comprising:
  - estimating probability scores for topics using marks that obtained by said labeling.
- 19. The method according to claim 16, wherein a cumulative sum-based method is used for detecting change-points in textual data and for estimating probabilities of sample distribution for topic identification.
- 20. The method according to claim 19, further comprising:
  - applying a change-point detection for detection of homogenous segments of textual data while moving in two different time directions, thereby to identify hidden regularities of textual components that are obtained for each time direction.
- 21. The method according to claim 20, wherein if said regularities coincide for both directions, then said regularities are used for topic labeling, andwherein if said regularities do not coincide for both directions, then said regularities are used to build a mathematical model that reflect confusability in these regularities.
- 22. The method according to claim 16, further comprising:
  - performing labelling for each of a plurality of directions; and
    
    resolving confusable decisions obtained for different directions.
- 23. The method according to claim 16, further comprising:
  - performing off-line segmentation of textual data that uses a change-point detection such that off-line topic identification of textual data can be performed.
- 24. The method according to claim 16, wherein said method includes utilizing a change-point detection using a cumulative sum (CUSUM) technique.
- 25. The method according to claim 16, further comprising:
  - performing a one direction topic segmentation and labeling including a training procedure.
- 26. The method according to claim 25, wherein said training procedure comprises:
  - creating a battery of topics T₁, T₂, . . . , T_n, wherein each topic is characterized by one of frequencies of tokens, frequencies of combination of two words and three words, wherein a Kullback-Liebler distance between any two topics is defined as at least h, where h is a threshold, wherein probabilities of a tuple of words depend on time directions, and for each time direction, separate probabilities for a combination(s) of words are estimated;
    
    creating a neutral topic T derived from textual data that include all topics and other general texts; and
    
    finding typical durations of texts belonging to topics in the battery.
- 27. The method according to claim 26, wherein said training procedure further comprises:
  - examining estimated frequencies and comparing said estimated frequencies to all topics in the list, to determine a single topic as a predetermined topic and declaring said topic, wherein if a conclusion is reached that a current topic is not in the list, then {tilde over (T)} is declared as a current topic.
- 28. The method according to claim 27, wherein said training procedure further comprises:
  - once a topic {circumflex over (T)} has been declared, monitoring to detect a change in said topic, wherein during monitoring a likelihood of text is compared against {circumflex over (T)} until a regeneration point is reached.
- 29. The method according to claim 28, wherein said training procedure further comprises:
  - establishing a current topic including, once the topic change is detected, establishing a time of onset of the new topic;
    
    without declaring a new topic, continuing to accumulate words until a topic having a predetermined high score is found, said continuing including;
    
    finding a candidate topic T_ifor which the likelihood of the text is maximal, and then finding the closest competing topic representing a topic that has the maximum of likelihoods corresponding to remaining topics excluding {tilde over (T)};
    
    comparing a likelihood of the group of words in the candidate topic T_iagainst {T₁, . . . }, and if the likelihood is higher by a given factor than the likelihood of a text to the closest competing topic, then T_iis declared as the current topic, and wherein if a conclusion is reached that a topic is not in the list, then {tilde over (T)} is declared as the current topic.
- 30. The method according to claim 29, wherein said training procedure further comprises:
  - restarting monitoring the current topic to detect a change.
- 31. The method according to claim 29, wherein with topic T_jbe declared and an onset time thereof found, a new word is taken and k is set to 1 and, starting from a current word w_m+1, go k words back and find probabilities P_m+1−
  - i^(Tj)=P(w_m+1−
    
    i|Tj),i=0,1, . . . ,k−
    
    1, and a distribution {{circumflex over (P)}m−
    
    k+2, . . . {circumflex over (P)}_m+1} is constructed where ${\hat{P}}_{i} = \frac{# of times word i appeared in the current segment of size k}{k} .$
- 32. The method according to claim 27, further comprising:
  - identifying a topic of a given text segment w_l, . . . , w_m−
    
    1.
- 33. The method according to claim 32, wherein said identifying includes:
  - verifying whether said segment is covered by a new topic that is not in the battery of topics, wherein if it is determined that the segment is covered by some topic from the battery, then said topic is identified.
- 34. The method according to claim 33, wherein said testing whether said topic is in the battery comprises:
  - for a given k corresponding to the segment associated with the current topic, computing for every j in the battery $S (j) = \sum_{i = 0}^{k - 1} \ln \frac{{\hat{P}}_{m - i + 1}}{P_{m - i + 1}^{(Tj)}}$ wherein if S(j)>
    
    c for all j, then said topic is declared as not being in the battery and a switch is performed to the neutral topic {tilde over (T)}.
- 35. The method according to claim 34, wherein said topic identification further comprises:
  - establishing the topic by one of;
    
    determining if $\frac{P (w_{m - k + 2} \dots w_{m + 1} \rangle T_{i})}{\max_{j \neq i} P (w_{l} \dots w_{m + 1} \rangle T_{j})} > c_{1}$
    
    for some i, and if so, then T_iis declared as a new topic; and
    
    determining if $\frac{P (w_{l} \dots w_{m + 1} \rangle T_{i})}{\max_{j \neq i} P (w_{l} \dots w_{m + 1} \rangle T_{j})} > c_{1}$
    
    for some i and l=m+l, m, . . . , m−
    
    k+2, and if so, then T_iis declared as a new topic.
- 36. The method according to claim 35, wherein if it was neither established that the topic is not in the battery nor that the topic was identified, an additional point is taken by incrementing m, and it is retested of whether said topic is in the battery.
- 37. The method according to claim 27, further comprising:
  - resolving different labeling sets performed in different labelling procedures.
- 38. The method according to claim 37, wherein said resolving includes:
  - checking said labels which were obtained through different time directions, wherein if some labels from different time directions coincide and are located closely in a predetermined manner, then this label is chosen and is placed in a middle of a location of labels from two directions.
- 39. The method according to claim 37, wherein a sum is constructed of:
  - $S_{k} = \sum_{i = 0}^{k - 1} \ln \frac{\hat{P} m - i + 1}{P_{m - i + 1}^{(Tj)}}$ wherein if S_k>
    
    c, then a change in topic is determined to have occurred and m−
    
    k+2 is declared as the onset point, and wherein if S_k>
    
    c is not determined, then k is incremented and the process is repeated.
- 40. The method according to claim 39, wherein if the regeneration point is reached and S_k>
  - c is not determined, then the current topic T_jis continued, andwherein if for a present point S_k<
    
    0 for every k, then the present point is determined as a new regeneration point.
- 41. The method according to claim 25, wherein in said labeling, when a label in one direction T_j¹divides a segment of words between two labels from a different direction T_i1²and T_i2²and with the label T_j¹different from labels T_i1²and T_i2², first and second topic labels are introduced,wherein said first label includes a label T_j,i1^1,2for a segment S_1,1that is covered by both T_j¹and T_i1², and said second label includes a label T_j,i2^1,2for a segment S_1,2that is covered by both T_j¹and T_i2².
- 42. The method according to claim 41, wherein, for each of said first and second labels and a new segment that is covered by said first and second labels, a new set of probabilities is attached including a mixture of previous probability sets that are related to previous topics, such that letting P_a=P(w_k. . . w_r|T_j¹) be a probability for a string of words in the segment S_1,1labelled by a topic T_j¹and P_b=P(w_k. . . w_r|T_i¹) be a probability for said string of words labelled by a topic T_i¹, a new probability is defined for said string of words as P_ab=P(w_k. . . w_r|T_j,i1^1,2)=α
  - P_a+β
    
    P_b,wherein α
    
    =l₁/l and β
    
    =l₂/l, where l₁is a duration of a sub-segment S_1,1, l₂is a duration of a sub-segment S_1,2and l=l₁+l₂.
- 43. The method according to claim 16, further comprising detecting an onset of a new topic;
  - andwherein with P(wl, wl+1, . . . wr|Tl) representing a likelihood measure for identifying a given string of words wl, wl+1, . . . wr in a context of a topic Ti from a battery, and with Fi(wl, wl+1, . . . wr)−
    
    max {j=1 . . . n, Tj≠
    
    Ti;
    
    Tj≠
    
    T}P(w;
    
    . . . wr|Tj) and that a moment of time 1 corresponds to a regeneration point and having analyzed words from wl to wm and found no evidence that the declared topic Ti changed, determining whether the next word Wm+1 belongs to the same topic or that the topic has changed.
- 44. The method according to claim 43, wherein said verification includes:
  - taking a segment of length l, and determining whether $\frac{F_{i} (w_{m + 1})}{P (w_{m + 1} \rangle T_{i})} > c .$
- 45. The method according to claim 44, wherein if it is determined that $F_{i}$
  - (wm+1)P
    
    (wm+1
    
    
    
    Ti)>
    
    c,then it is declared that topic is not T_i, and m+1 is declared as the onset point of a new topic.
- 46. The method according to claim 45, wherein if it is determined $F_{i}$
  - (wm+1)P(wm+1
    
    
    
    Ti)
    
    
    
    is
    
    ≤
    
    c,then a segment of length 2 is taken and it is determined whether $\frac{F_{i} (w_{m}, w_{m + 1})}{P (w_{m}, w_{m + 1} \rangle T_{i})} > c .$
- 47. The method according to claim 46, wherein if it determined that $F_{i}$
  - (wm,wm+1)P(wm,wm+1
    
    
    
    Ti)>
    
    c,then said topic is declared as not T_i, and m is declared as the onset point of a new topic.
- 48. The method according to claim 47, wherein if it is determined that $F_{i}$
  - (wm,wm+1)P(wm,wm+1
    
    
    
    Ti)≤
    
    c,then a segment of length 3 is taken and it is determined whether $\frac{F_{i} (w_{m - 1}, w_{m}, w_{m + 1})}{P (w_{m - 1}, w_{m}, w_{m + 1} \rangle T_{i})} > c .$
- 49. The method according to claim 48, further comprising determining whether a segment going back in time until the regeneration point was found for which:
  - $\frac{F_{i} (w_{m + 2 - k} \dots w_{m - 1}, w_{m}, w_{m + 1})}{P (w_{m + 2 - k} \dots w_{m - 1}, w_{m}, w_{m + 1} \rangle T_{i})} > c .$
- 50. The method according to claim 49, wherein if it is determined that $F_{i}$
  - (wm+2-k
    
    
    
    …
    
    
    
    
    
    wm-1,wm,wm+1)P(wm+2-k
    
    
    
    …
    
    
    
    
    
    wm-1,wm,wm+1
    
    
    
    Ti)>
    
    c,then it is declared that the topic is not T_i, and m−
    
    k+2 is declared as the onset point of a new topic, and wherein if it is determined that $\frac{F_{i} (w_{m + 2 - k} \dots w_{m - 1}, w_{m}, w_{m + 1})}{P (w_{m + 2 - k} \dots w_{m - 1}, w_{m}, w_{m + 1} \rangle T_{i})} \leq c,$ then it is declared that there is no evidence in change of topic.
- 51. The method according to claim 50, wherein if l is reached and in all the previous steps, including a regeneration point $F_{i}$
  - (wm+2-k
    
    
    
    …
    
    
    
    
    
    wm-1,wm,wm+1)P(wm+2-k
    
    
    
    …
    
    
    
    
    
    wm-1,wm,wm+1
    
    
    
    Ti)<
    
    1then w_m+1is declared as a new regeneration point.

52. A method for detection of textual topical changes and topic identification, comprising:
- forming a battery of topics from training data;
  
  detecting topic changes in said text, using said battery and a first threshold ratio;
  
  identifying topics in said text, using said battery and a second threshold ratio, and performing said identification using different time direction;
  
  segmenting textual data into homogenous segments, wherein said topic identification is performed off-line, and wherein text probabilities under each topic are evaluated, and a new topic is selected when one of said probabilities become larger than other probabilities, and wherein said different time directions include moving from a beginning of a text to the end of said text and vice versa.

53. A signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a method for computer-implemented off-line detection of textual topical changes and topic identification, said method comprising:
- extracting a segment of a predetermined size from a text;
  
  computing likelihood scores of a text in the segment for each topic;
  
  computing likelihood ratios;
  
  comparing said likelihood ratios to a threshold and defining whether there is a change point at a current last word in a window; and
  
  repeating said method in a reverse time direction, wherein said method of extracting, computing and comparing is first performed in a forward direction.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Yashchm, Emmanuel, Kanevsky, Dimitri
Primary Examiner(s)
Homere, Jean R.
Assistant Examiner(s)
Wassum, Luke S

Application Number

US09/435,995
Time in Patent Office

1,212 Days
Field of Search

707/1-5, 707/531, 704/9, 704/248-249, 704/253, 704/257
US Class Current

1/1
CPC Class Codes

G06F 16/355 Class or cluster creation o...

Y10S 707/99935 Query augmenting and refini...

Method and system for off-line detection of textual topical changes and topic identification via likelihood based methods for improved language modeling

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

171 Citations

53 Claims

Specification

Solutions

Use Cases

Quick Links

Method and system for off-line detection of textual topical changes and topic identification via likelihood based methods for improved language modeling

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

171 Citations

53 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links