Unsupervised automated topic detection, segmentation and labeling of conversations

US 20180239822A1
Filed: 12/07/2017
Published: 08/23/2018
Est. Priority Date: 02/20/2017
Status: Active Grant

Abstract

A method for information processing includes receiving in a computer a corpus of recorded conversations, with two or more speakers participating in each conversation. Respective frequencies of occurrence of multiple words in each of a plurality of chunks in each of the recorded conversations are computed. Based on the frequencies of occurrence of the words over the conversations in the corpus, an optimal set of topics to which the chunks can be assigned is derived, such that the optimal set maximizes a likelihood that the chunks will be generated by the topics in the set. A recorded conversation from the corpus is segmented using the derived topics into a plurality of segments, such that each segment is classified as belonging to a particular topic in the optimal set.
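
The frequency-counting step described in the abstract can be illustrated with a minimal sketch. It assumes a conversation is already transcribed into a list of utterance strings and that a chunk is a fixed number of consecutive utterances; the function name `chunk_word_frequencies`, the `chunk_size` parameter, and the tokenizing regular expression are illustrative assumptions, not details taken from the patent.

```python
import re
from collections import Counter

def chunk_word_frequencies(utterances, chunk_size=2):
    """Split a transcript into chunks and count word occurrences in each chunk.

    Assumption: a chunk is `chunk_size` consecutive utterances.
    """
    chunks = [utterances[i:i + chunk_size]
              for i in range(0, len(utterances), chunk_size)]
    frequencies = []
    for chunk in chunks:
        # Lowercase and keep alphabetic tokens only (a simplification).
        words = re.findall(r"[a-z']+", " ".join(chunk).lower())
        frequencies.append(Counter(words))
    return frequencies

# Hypothetical transcript with two speakers alternating.
conversation = [
    "Thanks for joining the call",
    "Happy to be here",
    "Let us talk about pricing",
    "The pricing depends on seats",
]
frequencies = chunk_word_frequencies(conversation, chunk_size=2)
```

Each `Counter` in `frequencies` holds the per-chunk word counts that a topic model would take as input.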

31 Citations

42 Claims

1. A method for information processing, comprising:
- receiving in a computer a corpus of recorded conversations, with two or more speakers participating in each conversation;
  
  computing, by the computer, respective frequencies of occurrence of multiple words in each of a plurality of chunks in each of the recorded conversations;
  
  based on the frequencies of occurrence of the words over the conversations in the corpus, deriving autonomously by the computer an optimal set of topics to which the chunks can be assigned such that the optimal set maximizes a likelihood that the chunks will be generated by the topics in the set;
  
  segmenting a recorded conversation from the corpus, using the derived topics into a plurality of segments, such that each segment is classified as belonging to a particular topic in the optimal set; and
  
  outputting a distribution of the segments and respective classifications of the segments into the topics over a duration of the recorded conversation.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
  - 2. The method according to claim 1, wherein deriving the optimal set of the topics comprises extracting the topics from the conversations by the computer without using a pre-classified training set.
  - 3. The method according to claim 1, wherein receiving the corpus comprises:
    - converting the conversations to a textual form;
      
      analyzing a syntax of the conversations in the textual form; and
      
      discarding from the corpus the conversations in which the analyzed syntax does not match syntactical rules of a target language.
  - 4. The method according to claim 1, wherein deriving the optimal set of topics comprises defining a target number of topics, and applying Latent Dirichlet Allocation to the corpus in order to derive the target number of the topics.
  - 5. The method according to claim 1, and comprising automatically assigning, by the computer, respective titles to the topics.
  - 6. The method according to claim 5, wherein automatically assigning the respective titles comprises, for each topic, extracting from the segments of the conversations in the corpus that are classified as belonging to the topic one or more n-grams that statistically differentiate the segments classified as belonging to the topic from the segments that belong to the remaining topics in the set, and selecting one of the extracted n-grams as a title for the topic.
  - 7. The method according to claim 1, wherein deriving the optimal set of the topics comprises computing, based on the frequencies of occurrence of the words in the chunks, respective probabilities of association between the words and the topics, and wherein segmenting the recorded conversation comprises classifying each segment according to the respective probabilities of association of the words occurring in the segment.
  - 8. The method according to claim 7, wherein computing the respective probabilities of association comprises computing respective word scores for each word with respect to each of the topics based on the probabilities of association, and wherein classifying each segment comprises:
    - for each chunk of the recorded conversation, deriving respective topic scores for the topics in the set by combining the word scores of the words occurring in the chunk with respect to each of the topics;
      
      classifying the chunks into topics based on the respective topic scores; and
      
      defining the segments by grouping together adjacent chunks that are classified as belonging to a common topic.
  - 9. The method according to claim 1, wherein outputting the distribution comprises displaying the distribution of the segments and respective classifications of the segments into the topics on a computer interface.
  - 10. The method according to claim 9, wherein displaying the distribution comprises presenting a timeline that graphically illustrates the respective classifications and durations of the segments during the recorded conversation.
  - 11. The method according to claim 10, wherein presenting the timeline comprises showing which of the speakers was speaking at each time during the recorded conversation.
  - 12. The method according to claim 1, wherein deriving the optimal set of topics comprises receiving seed words for one or more of the topics from a user of the computer.
  - 13. The method according to claim 1, and comprising automatically applying, by the computer, the distribution of the segments in predicting whether a given conversation is likely to result in a specified outcome.
  - 14. The method according to claim 1, and comprising automatically applying, by the computer, the distribution of the segments in assessing whether a given conversation follows a specified pattern.
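
The chunk classification and grouping recited in claims 7 and 8 can be sketched as follows. This is one possible reading, not the patented implementation: the word-topic probabilities in `WORD_GIVEN_TOPIC` are made-up values standing in for the output of a topic model such as the Latent Dirichlet Allocation named in claim 4, the word score is assumed to be a log probability, and the topic score is assumed to be the sum of word scores. The constant `FLOOR` for unseen words is also an assumption.

```python
import math
from itertools import groupby

# Hypothetical probabilities of association P(word | topic).
WORD_GIVEN_TOPIC = {
    "intro":   {"hello": 0.5, "thanks": 0.4, "price": 0.05, "discount": 0.05},
    "pricing": {"hello": 0.05, "thanks": 0.05, "price": 0.5, "discount": 0.4},
}
FLOOR = 1e-6  # assumed probability for words a topic has never seen

def topic_scores(chunk_words):
    """Combine per-word scores (log probabilities) into one score per topic."""
    return {
        topic: sum(math.log(probs.get(word, FLOOR)) for word in chunk_words)
        for topic, probs in WORD_GIVEN_TOPIC.items()
    }

def classify(chunk_words):
    """Assign the chunk to the topic with the highest score."""
    scores = topic_scores(chunk_words)
    return max(scores, key=scores.get)

def segment(chunks):
    """Group adjacent chunks with a common topic into (topic, first, last) segments."""
    labels = [classify(chunk) for chunk in chunks]
    segments, start = [], 0
    for topic, run in groupby(labels):
        length = len(list(run))
        segments.append((topic, start, start + length - 1))
        start += length
    return segments

chunks = [
    ["hello", "thanks"],
    ["thanks", "hello"],
    ["price", "discount"],
    ["discount", "price", "price"],
    ["thanks"],
]
segments = segment(chunks)
```

With these toy inputs, the five chunks collapse into three segments: chunks 0 to 1 as `intro`, chunks 2 to 3 as `pricing`, and chunk 4 as `intro` again.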

15. An information processing system, comprising:
- a memory, which is configured to store a corpus of recorded conversations, with two or more speakers participating in each conversation; and
  
  a processor, which is configured to compute respective frequencies of occurrence of multiple words in each of a plurality of chunks in each of the recorded conversations, and to derive autonomously, based on the frequencies of occurrence of the words over the conversations in the corpus, an optimal set of topics to which the chunks can be assigned such that the optimal set maximizes a likelihood that any given chunk will be assigned to a single topic in the set, and to segment a recorded conversation from the corpus, using the derived topics into a plurality of segments, such that each segment is classified as belonging to a particular topic in the optimal set, and to output a distribution of the segments and respective classifications of the segments into the topics over a duration of the recorded conversation.
- Dependent Claims (16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28)
  - 16. The system according to claim 15, wherein the processor is configured to extract the topics from the conversations without using a pre-classified training set.
  - 17. The system according to claim 15, wherein the processor is configured to convert the conversations to a textual form, to analyze a syntax of the conversations in the textual form, and to discard from the corpus the conversations in which the analyzed syntax does not match syntactical rules of a target language.
  - 18. The system according to claim 15, wherein the processor is configured to accept a definition of a target number of topics, and to apply Latent Dirichlet Allocation to the corpus in order to derive the target number of the topics.
  - 19. The system according to claim 15, wherein the processor is configured to automatically assign respective titles to the topics.
  - 20. The system according to claim 19, wherein the processor is configured to assign the respective titles by extracting, for each topic, from the segments of the conversations in the corpus that are classified as belonging to the topic one or more n-grams that statistically differentiate the segments classified as belonging to the topic from the segments that belong to the remaining topics in the set, and selecting one of the extracted n-grams as a title for the topic.
  - 21. The system according to claim 15, wherein the processor is configured to compute, based on the frequencies of occurrence of the words in the chunks, respective probabilities of association between the words and the topics, and to classify each segment according to the respective probabilities of association of the words occurring in the segment.
  - 22. The system according to claim 21, wherein the processor is configured to compute respective word scores for each word with respect to each of the topics based on the probabilities of association, to derive, for each chunk of the recorded conversation, respective topic scores for the topics in the set by combining the word scores of the words occurring in the chunk with respect to each of the topics, to classify the chunks into topics based on the respective topic score, and to define the segments by grouping together adjacent chunks that are classified as belonging to a common topic.
  - 23. The system according to claim 15, and comprising a display, wherein the processor is configured to present the distribution of the segments on the display.
  - 24. The system according to claim 23, wherein the processor is configured to present a timeline that graphically illustrates the respective classifications and durations of the segments during the recorded conversation.
  - 25. The system according to claim 24, wherein the timeline further shows which of the speakers was speaking at each time during the recorded conversation.
  - 26. The system according to claim 15, wherein the processor is configured to receive seed words for one or more of the topics from a user of the system and to apply the seed words in deriving the optimal set of topics.
  - 27. The system according to claim 15, wherein the processor is configured to automatically apply the distribution of the segments in predicting whether a given conversation is likely to result in a specified outcome.
  - 28. The system according to claim 15, wherein the processor is configured to automatically apply the distribution of the segments in assessing whether a given conversation follows a specified pattern.
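
The title assignment of claims 19 and 20 selects an n-gram that statistically differentiates one topic's segments from the rest. The sketch below is a simplified illustration under stated assumptions: it uses bigrams only, and it scores each bigram by an add-one-smoothed ratio of its relative frequency inside the topic to its relative frequency in the other topics. The claims do not specify the statistic, so this ratio, the function names, and the sample data are all hypothetical.

```python
from collections import Counter

def bigrams(words):
    """Return the adjacent word pairs of a token list."""
    return list(zip(words, words[1:]))

def pick_title(segments_by_topic, topic):
    """Pick the bigram that best separates `topic` from the remaining topics.

    Assumes the topic has at least one segment containing two or more words.
    """
    inside, outside = Counter(), Counter()
    for name, segs in segments_by_topic.items():
        target = inside if name == topic else outside
        for words in segs:
            target.update(bigrams(words))
    in_total = sum(inside.values()) or 1
    out_total = sum(outside.values()) or 1

    def score(gram):
        # Smoothed frequency ratio; an assumed stand-in for a real statistic.
        p_in = (inside[gram] + 1) / (in_total + 1)
        p_out = (outside[gram] + 1) / (out_total + 1)
        return p_in / p_out

    return " ".join(max(inside, key=score))

# Hypothetical tokenized segments, keyed by the topic they were classified into.
segments_by_topic = {
    "topic_0": [["our", "annual", "price", "plan"],
                ["the", "annual", "price", "is", "fixed"]],
    "topic_1": [["schedule", "a", "follow", "up"],
                ["the", "follow", "up", "call"]],
}
title_0 = pick_title(segments_by_topic, "topic_0")
title_1 = pick_title(segments_by_topic, "topic_1")
```

A production system would more likely use a significance test or an information-theoretic measure and would consider n-grams of several lengths; the ratio here only shows the shape of the comparison.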

29. A computer software product, comprising a non-transitory computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to store a corpus of recorded conversations, with two or more speakers participating in each conversation, to compute respective frequencies of occurrence of multiple words in each of a plurality of chunks in each of the recorded conversations, and to derive autonomously, based on the frequencies of occurrence of the words over the conversations in the corpus, an optimal set of topics to which the chunks can be assigned such that the optimal set maximizes a likelihood that any given chunk will be assigned to a single topic in the set, and to segment a recorded conversation from the corpus, using the derived topics into a plurality of segments, such that each segment is classified as belonging to a particular topic in the optimal set, and to output a distribution of the segments and respective classifications of the segments into the topics over a duration of the recorded conversation.
- Dependent Claims (30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42)
  - 30. The product according to claim 29, wherein the instructions cause the computer to extract the topics from the conversations without using a pre-classified training set.
  - 31. The product according to claim 29, wherein the instructions cause the computer to convert the conversations to a textual form, to analyze a syntax of the conversations in the textual form, and to discard from the corpus the conversations in which the analyzed syntax does not match syntactical rules of a target language.
  - 32. The product according to claim 29, wherein the instructions cause the computer to accept a definition of a target number of topics, and to apply Latent Dirichlet Allocation to the corpus in order to derive the target number of the topics.
  - 33. The product according to claim 29, wherein the instructions cause the computer to automatically assign respective titles to the topics.
  - 34. The product according to claim 33, wherein the instructions cause the computer to assign the respective titles by extracting, for each topic, from the segments of the conversations in the corpus that are classified as belonging to the topic one or more n-grams that statistically differentiate the segments classified as belonging to the topic from the segments that belong to the remaining topics in the set, and selecting one of the extracted n-grams as a title for the topic.
  - 35. The product according to claim 29, wherein the instructions cause the computer to compute, based on the frequencies of occurrence of the words in the chunks, respective probabilities of association between the words and the topics, and to classify each segment according to the respective probabilities of association of the words occurring in the segment.
  - 36. The product according to claim 35, wherein the instructions cause the computer to compute respective word scores for each word with respect to each of the topics based on the probabilities of association, to derive, for each chunk of the recorded conversation, respective topic scores for the topics in the set by combining the word scores of the words occurring in the chunk with respect to each of the topics, to classify the chunks into topics based on the respective topic score, and to define the segments by grouping together adjacent chunks that are classified as belonging to a common topic.
  - 37. The product according to claim 29, wherein the instructions cause the computer to present the distribution of the segments on a display.
  - 38. The product according to claim 37, wherein the instructions cause the computer to display a timeline that graphically illustrates the respective classifications and durations of the segments during the recorded conversation.
  - 39. The product according to claim 38, wherein the timeline further shows which of the speakers was speaking at each time during the recorded conversation.
  - 40. The product according to claim 29, wherein the instructions cause the computer to receive seed words for one or more of the topics from a user of the computer and to apply the seed words in deriving the optimal set of topics.
  - 41. The product according to claim 29, wherein the instructions cause the computer to automatically apply the distribution of the segments in predicting whether a given conversation is likely to result in a specified outcome.
  - 42. The product according to claim 29, wherein the instructions cause the computer to automatically apply the distribution of the segments in assessing whether a given conversation follows a specified pattern.
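
The final output step in each independent claim is a distribution of the segments over the duration of the conversation. A minimal sketch of one such summary follows, assuming each segment carries a topic label with start and end times in seconds; the tuple layout, the function name `topic_distribution`, and the sample timeline are illustrative assumptions.

```python
from collections import defaultdict

def topic_distribution(segments):
    """Return each topic's share of the total segmented time.

    segments: list of (topic, start_sec, end_sec) tuples, assumed non-empty
    and non-overlapping.
    """
    totals = defaultdict(float)
    for topic, start, end in segments:
        totals[topic] += end - start
    duration = sum(totals.values())
    return {topic: seconds / duration for topic, seconds in totals.items()}

# Hypothetical five-minute call that returns to the opening topic at the end.
timeline = [
    ("intro", 0.0, 60.0),
    ("pricing", 60.0, 240.0),
    ("intro", 240.0, 300.0),
]
shares = topic_distribution(timeline)
```

The ordered `timeline` list is what a graphical timeline such as the one in claims 10, 24, and 38 would render, while `shares` is the kind of aggregate that could feed the outcome prediction of claims 13, 27, and 41.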

Current Assignee: Gong I.O Ltd. (Gong Io, Inc.)
Original Assignee: Gong I.O Ltd. (Gong Io, Inc.)
Inventors: Reshef, Eilon; Marx, Zvi

Granted Patent: US 10,642,889 B2
CPC Class Codes

G06F 16/358   Browsing; Visualisation the...

G06F 16/61   Indexing; Data structures t...

G06F 16/685   using automatically derived...

G06F 40/211   Syntactic parsing, e.g. bas...

G06F 40/242   Dictionaries

G06F 40/284   Lexical analysis, e.g. toke...

G06F 40/35   Discourse or dialogue repre...

G06N 20/00   Machine learning

G06N 5/022   Knowledge engineering; Know...

G06N 7/01   Probabilistic graphical mod...

G10L 15/04   Segmentation; Word boundary...

G10L 15/1815   Semantic context, e.g. disa...

G10L 15/26   Speech to text systems G10L...
