Multimodal stream processing-based cognitive collaboration system

US 10,257,241 B2
Filed: 12/21/2016
Issued: 04/09/2019
Est. Priority Date: 12/21/2016
Status: Active Grant

Abstract

A collaboration system includes a stream processing engine and a Bot subsystem. The stream processing engine performs cognitive processing of multimodal input streams originated at one or more user devices in a communication session supported by a collaboration service to derive user-intent-based user requests and transmit the user requests over one or more networks. The Bot subsystem includes a stream receptor that directs the multimodal input streams from the user devices to the stream processing engine to enable the stream processing engine to derive the user requests. The Bot subsystem also includes a cognitive action interpreter to translate the user requests to action requests and issue the action requests to the collaboration service so as to initiate actions with respect to the communication session. The Bot subsystem also includes a cognitive responder to transmit, in response to the user requests, multimodal user responses to the one or more user devices.
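
The flow described in the abstract, from stream receptor to processor modules to a derived user request, can be illustrated with a minimal Python sketch. Every name here (`StreamChunk`, `StreamReceptor`, `StreamProcessingEngine`, `derive_intent`) is hypothetical and does not come from the patent; speech-to-text is stubbed out, and keyword matching stands in for a real natural language processor.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class StreamChunk:
    """Hypothetical unit of a multimodal input stream."""
    device_id: str
    mode: str     # "audio", "video", or "text"
    payload: str  # stand-in for real media bytes or typed text


def speech_to_text(chunk: StreamChunk) -> str:
    # Stub: assume the audio payload is already a transcript.
    return chunk.payload


def derive_intent(text: str) -> Optional[str]:
    # Toy keyword matcher standing in for a natural language processor.
    lowered = text.lower()
    if "start" in lowered and "meeting" in lowered:
        return "start_meeting"
    if "find" in lowered or "retrieve" in lowered:
        return "retrieve_document"
    return None


class StreamProcessingEngine:
    """Holds processor modules and derives user requests from intent."""

    def __init__(self) -> None:
        self.modules: Dict[str, Callable[[StreamChunk], Optional[str]]] = {
            "audio": lambda c: derive_intent(speech_to_text(c)),
            "text": lambda c: derive_intent(c.payload),
        }

    def process(self, chunk: StreamChunk) -> Optional[dict]:
        module = self.modules.get(chunk.mode)
        intent = module(chunk) if module else None
        if intent is None:
            return None  # no module for this mode, or no intent found
        return {"device_id": chunk.device_id, "intent": intent, "mode": chunk.mode}


class StreamReceptor:
    """Receives chunks from devices and directs them to the engine."""

    def __init__(self, engine: StreamProcessingEngine) -> None:
        self.engine = engine

    def receive(self, chunk: StreamChunk) -> Optional[dict]:
        return self.engine.process(chunk)


receptor = StreamReceptor(StreamProcessingEngine())
request = receptor.receive(StreamChunk("dev-1", "text", "Please start the meeting"))
```

In this sketch the receptor simply forwards each chunk, and the engine selects a module by mode. A video module (for example, face detection) is deliberately omitted, so video chunks yield no request.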

20 Claims

1. A system comprising:
- a stream processing engine including a plurality of processor modules configured to perform cognitive processing of multimodal input streams including input audio, input video, and input text, originated at one or more user devices associated with a communication session supported by a collaboration service, wherein the plurality of processor modules includes a speech-to-text module to convert speech in the input audio to converted text, and a natural language processor to derive user intent from the input text and the converted text is available, and wherein the stream processing engine is further configured to derive user requests associated with the communication session based on the user intent and to transmit the user requests over one or more networks; and
  
  a Bot subsystem configured to communicate with the stream processing engine, the collaboration service, and the one or more user devices over the one or more networks, the Bot subsystem including a collection of Bots configured as computer programs that run automated tasks over the one or more networks to implement;
  
  a stream receptor to receive the multimodal input streams from the one or more user devices and direct the multimodal input streams to an appropriate one or ones of the plurality of processor modules of the stream processing engine to enable the stream processing engine to derive the user requests;
  
  a cognitive action interpreter to translate the user requests to corresponding action requests and issue the action requests to the collaboration service so as to initiate actions with respect to the communication session; and
  
  a cognitive responder to transmit, in response to the user requests, multimodal user responses, including audio, video, and text user responses to the one or more user devices.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
- - 2. The system of claim 1, wherein the plurality of processor modules of the stream processing engine further include:
    - an entity extractor to access, based on the user intent from the natural language processor, information associated with the communication session and the collaboration service, wherein the stream processing engine is further configured to derive the user request based on the user intent and the information accessed by the entity extractor.
  - 3. The system of claim 1, wherein the plurality of processor modules of the stream processing engine further include:
    - a knowledge graph module to generate a knowledge graph associated with the user intent to resolve the user intent based on first information associated with the collaboration service if the user intent from the natural language processor is incomplete; and
      
      a disambiguator to disambiguate the user intent based on second information associated with the collaboration service if the user intent from the natural language processor would otherwise result in multiple ambiguous user requests.
  - 4. The system of claim 1, wherein the plurality of processor modules of the stream processing engine further include:
    - an object recognition module to detect faces in the input video and determine whether the faces that are detected are known based on a database of faces associated with the collaboration service, wherein the stream processing engine is configured to perform the cognitive processing to derive the user requests based on the user intent and the detected faces that are determined to be recognized.
  - 5. The system of claim 1, wherein the stream processing engine and the cognitive responder cooperate to convert speech requests in the input audio to speech responses, and the Bot subsystem further includes a conversation manager to determine a contextual conversation sequence associated with the communication session and classify as the contextual conversation sequence a sequence including at least some of the speech requests and at least some of the speech responses that share the same conversation context.
  - 6. The system of claim 1, wherein the processor modules of the stream processing engine further include a machine learning module configured to receive the user requests and learn rules with which to derive the user requests.
  - 7. The system of claim 1, wherein:
    - the stream processing engine is configured to identify for each user request a mode of the multimodal input stream, as audio, video, or text, from which the user request was primarily derived and transmit to the cognitive responder an indication of the mode with the user request; and
      
      the cognitive responder is configured to transmit the user response in response to the user request using audio, video, or text using the mode indicated in the user request.
  - 8. The system of claim 1, wherein:
    - the communication session is a video conference session and the collaboration service is a video conference service;
      
      the stream processing engine is configured to derive from one of the multimodal input streams one of the user requests as a start video conference request; and
      
      the cognitive action interpreter is configured to translate the start video conference request to a corresponding action request that is able to be understood by the video conference service and transmit the request to the video conference service so as to cause the video conference service to start the video conference session.
  - 9. The system of claim 1, wherein:
    - the communication session is a document retrieval session and the collaboration service is a document management system;
      
      the stream processing engine is configured to derive from one of the multimodal input streams one of the user requests as a document retrieval request identifying a specific document to be retrieved; and
      
      the cognitive action interpreter is configured to translate the document retrieval request to a corresponding action request that is able to be understood by the document management system and transmit the action request to the documentation management system so as to cause the document management system to retrieve the specific document.
  - 10. The system of claim 1, wherein:
    - the communication session is a web-based meeting in a personal meeting room (PMR) and the collaboration service is a web-based meeting service;
      
      the stream processing engine is configured to derive from one of the multimodal input streams one of the user requests as a request to start the web-based meeting; and
      
      the cognitive action interpreter is configured to translate the request to a corresponding action request that is able to be understood by the web-based meeting service and transmit the action request to the web-based meeting service so as to cause the web-based meeting service to start the web-based meeting.
  - 11. The system of claim 1, wherein each of the user requests includes:
    - a source identifier for the stream processing engine, a destination identifier for the Bot subsystem, an identifier of one of the one or more user devices serviced by the user request, information for a user response or information for an action request.
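
As an illustration of claims 7 and 11, the sketch below models a user request carrying the fields recited in claim 11 plus the mode indication of claim 7, and a responder that replies in the indicated mode. The field names, the `VALID_MODES` tuple, and the `respond` function are assumptions made for this example; the claims do not specify a concrete data format.

```python
from dataclasses import dataclass
from typing import Optional

VALID_MODES = ("audio", "video", "text")


@dataclass(frozen=True)
class UserRequest:
    """Hypothetical message layout loosely following claims 7 and 11."""
    source_id: str        # identifies the stream processing engine
    destination_id: str   # identifies the Bot subsystem
    device_id: str        # user device serviced by the request
    mode: str             # mode the request was primarily derived from
    response_info: Optional[str] = None  # information for a user response
    action_info: Optional[str] = None    # information for an action request

    def __post_init__(self) -> None:
        if self.mode not in VALID_MODES:
            raise ValueError(f"unknown mode: {self.mode!r}")
        # Assumption: a request carries response info, action info, or both.
        if self.response_info is None and self.action_info is None:
            raise ValueError("request needs response_info or action_info")


def respond(request: UserRequest) -> dict:
    """Toy cognitive responder: reply using the mode named in the request."""
    return {
        "device_id": request.device_id,
        "mode": request.mode,
        "body": request.response_info or "OK",
    }


req = UserRequest("spe-1", "bot-1", "dev-7", "audio", response_info="Meeting started")
reply = respond(req)
```

Carrying the mode inside the request keeps the responder stateless: it does not need to know which stream produced the request in order to answer in kind.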

12. A method comprising:
- at a plurality of processor modules of a stream processing engine, performing cognitive processing of multimodal input streams including input audio, input video, and input text, originated at one or more user devices associated with a communication session supported by a collaboration service, wherein the performing cognitive processing includes converting speech in the input audio to converted text, deriving user intent from the input text and the converted text is available, deriving user requests associated with the communication session based on the user intent, and transmitting the user requests over one or more networks; and
  
  at a Bot subsystem configured to communicate with the stream processing engine, the collaboration service, and the one or more user devices over the one or more networks, the Bot subsystem including a collection of Bots configured as computer programs that run automated tasks over the one or more networks;
  
  receiving the multimodal input streams from the one or more user devices and directing the multimodal input streams to an appropriate one or ones of the plurality of processor modules of the stream processing engine to enable the stream processing engine to derive the user requests;
  
  translating the user requests to corresponding action requests and issuing the action requests to the collaboration service so as to initiate actions with respect to the communication session; and
  
  transmitting, in response to the user requests, multimodal user responses, including audio, video, and text user responses to the one or more user devices.
- View Dependent Claims (13, 14, 15, 18)
- - 13. The method of claim 12, further comprising, at the stream processing engine:
    - accessing, based on the user intent, information associated with the communication session and the collaboration service, wherein the deriving the user requests further includes deriving the user requests based on the user intent and the accessed information.
  - 14. The method of claim 12, further comprising, at the stream processing engine:
    - detecting faces in the input video and determine whether the faces that are detected are known based on a database of faces associated with the collaboration service, wherein the deriving the user requests further includes deriving the user requests based on the user intent and the detected faces that are determined to be recognized.
  - 15. The method of claim 12, further comprising:
    - at the stream processing engine, identifying for each user request a mode of the multimodal input stream, as audio, video, or text, from which the user request was primarily derived and transmitting to the cognitive responder an indication of the mode via the user request; and
      
      at the cognitive responder, transmitting the user response in response to the user request using audio, video, or text using the mode indicated in the user request.
  - 18. The method of claim 12, wherein the processor modules of the stream processing engine further include a machine learning module configured to receive the user requests and learn rules with which to derive the user requests.
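
The translating step of claim 12, in which user requests become action requests that the collaboration service can understand, might look like the following sketch. The `ACTION_TEMPLATES` table, the service and operation names, and the dictionary layout are all hypothetical; the patent does not define a wire format for action requests.

```python
from typing import Dict

# Hypothetical mapping from a derived intent to a service-specific operation.
ACTION_TEMPLATES: Dict[str, Dict[str, str]] = {
    "start_video_conference": {
        "service": "video_conference",
        "operation": "start_session",
    },
    "retrieve_document": {
        "service": "document_management",
        "operation": "get_document",
    },
}


def translate(user_request: dict) -> dict:
    """Toy cognitive action interpreter: user request -> action request."""
    intent = user_request["intent"]
    template = ACTION_TEMPLATES.get(intent)
    if template is None:
        raise KeyError(f"no action template for intent {intent!r}")
    action = dict(template)  # copy so the shared template is not mutated
    action["parameters"] = dict(user_request.get("parameters", {}))
    action["device_id"] = user_request["device_id"]
    return action


action = translate({
    "intent": "retrieve_document",
    "device_id": "dev-1",
    "parameters": {"title": "Q3 report"},
})
```

A lookup table keeps the interpreter independent of any one collaboration service: supporting another service in this sketch means adding a template rather than changing the translation logic.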

16. One or more non-transitory processor readable media storing instructions that, when executed by a processor, cause the processor to:
- implement a stream processing engine including a plurality of processor modules, the instructions to cause the processor to implement the stream processing engine including instructions to cause the processor to perform cognitive processing of multimodal input streams including input audio, input video, and input text, originated at one or more user devices associated with a communication session supported by a collaboration service, wherein the instructions to cause the processor to implement the stream processing engine include instructions to cause the processor to convert speech in the input audio to converted text, derive user intent from the input text and the converted text is available, derive user requests associated with the communication session based on the user intent, and transmit the user requests; and
  
  implement a collection of Bots of a Bot subsystem configured to communicate with the stream processing engine, the collaboration service, and the one or more user devices, the instructions to cause the processor to implement the collection of Bots configured as computer programs that run automated tasks over the one or more networks, including instructions to cause the processor to;
  
  receive the multimodal input streams from the one or more user devices and direct the multimodal input streams to an appropriate one or ones of the plurality of processor modules of the stream processing engine to enable the stream processing engine to derive the user requests;
  
  translate the user requests to corresponding action requests and issue the action requests to the collaboration service so as to initiate actions with respect to the communication session; and
  
  transmit, in response to the user requests, multimodal user responses, including audio, video, and text user responses to the one or more user devices.
- View Dependent Claims (17, 19, 20)
- - 17. The non-transitory processor readable media of claim 16, wherein the instructions to cause the processor to implement the stream processing engine include further instructions to cause the processor to:
    - access, based on the user intent, information associated with the communication session and the collaboration service; and
      
      derive the user requests based on the user intent and the accessed information.
  - 19. The non-transitory processor readable media of claim 16, further comprising instructions to implement a machine learning module of the stream processing engine configured to receive the user requests and learn rules with which to derive the user requests.
  - 20. The non-transitory processor readable media of claim 16, wherein the instructions to implement the stream processing engine include instructions to cause the processor to:
    - detect faces in the input video and determine whether the faces that are detected are known based on a database of faces associated with the collaboration service, wherein the instructions to cause the processor to derive the user requests further include instructions to cause the processor to derive the user requests based on the user intent and the detected faces that are determined to be known.

Current Assignee
Cisco Technology, Inc. (Cisco Systems, Inc.)
Original Assignee
Cisco Technology, Inc. (Cisco Systems, Inc.)
Inventors
Griffin, Keith
Primary Examiner(s)
Yohannes, Tesfay

Application Number
US15/386,857
Publication Number
US 20180176269A1
Time in Patent Office
839 Days
Field of Search
709/204, 706/18, 706/46
US Class Current
CPC Class Codes

G10L 15/26   Speech to text systems G10L...

H04L 65/403   Arrangements for multi-part...

H04L 65/70   Media network packetisation

H04L 67/02   based on web technology, e....
