Systems and methods for using latent variable modeling for multi-modal video indexing

US 9,542,934 B2
Filed: 02/27/2014
Issued: 01/10/2017
Est. Priority Date: 02/27/2014
Status: Active Grant

2 Assignments

0 Petitions

Abstract

A computer-implemented method performed in connection with a computerized system incorporating a processing unit and a memory, the computer-implemented method involving: using the processing unit to generate a multi-modal language model for co-occurrence of spoken words and displayed text in the plurality of videos; selecting at least a portion of a first video; extracting a plurality of spoken words from the selected portion of the first video; extracting a first displayed text from the selected portion of the first video; and using the processing unit and the generated multi-modal language model to rank the extracted plurality of spoken words based on probability of occurrence conditioned on the extracted first displayed text.
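The ranking step in the abstract can be illustrated with a minimal sketch. It assumes a plain co-occurrence count model with additive smoothing, which is a stand-in chosen for brevity and not the latent variable model named in the title. The function names `build_cooccurrence` and `rank_spoken_words`, the smoothing constant `alpha`, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def build_cooccurrence(segments):
    """Count how often each spoken word co-occurs with each text word.

    `segments` is an iterable of (spoken_words, text_words) pairs, one pair
    per video segment, each a list of lowercase tokens (assumed input format).
    """
    counts = defaultdict(Counter)  # counts[text_word][spoken_word]
    for spoken_words, text_words in segments:
        for t in set(text_words):
            for w in set(spoken_words):
                counts[t][w] += 1
    return counts


def rank_spoken_words(counts, spoken_words, text_words, alpha=0.01):
    """Rank spoken words by a smoothed estimate of P(word | displayed text).

    The estimate averages P(word | t) over the displayed-text words t that
    were seen when the counts were built. This averaging is an assumption of
    the sketch, not something the source specifies.
    """
    candidates = sorted(set(spoken_words))
    vocab = {w for c in counts.values() for w in c} | set(candidates)
    scores = dict.fromkeys(candidates, 0.0)
    observed = [t for t in set(text_words) if t in counts]
    for t in observed:
        total = sum(counts[t].values()) + alpha * len(vocab)
        for w in candidates:
            scores[w] += (counts[t][w] + alpha) / total / len(observed)
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy corpus: (spoken words, displayed text) per segment.
training = [
    (["gradient", "descent", "converges", "okay"], ["gradient", "descent"]),
    (["gradient", "step", "size"], ["gradient", "step"]),
    (["matrix", "rank", "okay"], ["matrix", "factorization"]),
]
model = build_cooccurrence(training)
ranked = rank_spoken_words(model, ["okay", "gradient", "matrix"], ["gradient"])
print([word for word, _ in ranked])
```

Raw co-occurrence counts favor frequent filler words, so a practical system would likely need stop-word handling or a richer model. The sketch only shows the shape of the conditioning step.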

5 Citations

19 Claims

1. A computer-implemented method performed in connection with a computerized system comprising a processing unit and a memory, the computer-implemented method comprising:
a. using the processing unit to generate a multi-modal language model for co-occurrence of spoken words in the plurality of videos and an external text associated with the plurality of videos;

b. selecting at least a portion of a first video;

c. extracting a plurality of spoken words from the selected portion of the first video;

d. obtaining a first external text associated with the selected portion of the first video, wherein the obtained first external text is separate and distinct from a representation of the extracted plurality of spoken words; and

e. using the processing unit and the generated multi-modal language model to rank the extracted plurality of spoken words based on probability of occurrence conditioned on the obtained first external text.

2. The computer-implemented method of claim 1, wherein the obtaining the first external text comprises extracting the first external text from a text displayed in the selected portion of the first video.

3. The computer-implemented method of claim 1, wherein the external text is displayed in at least one of the plurality of videos.

4. The computer-implemented method of claim 1, wherein the external text is contained in a content associated with at least one of the plurality of videos.

5. The computer-implemented method of claim 1, wherein each of the plurality of videos comprises a plurality of presentation slides comprising the external text.

6. The computer-implemented method of claim 1, wherein generating the multi-modal language model comprises extracting all spoken words from all of the plurality of videos and extracting the external text displayed in the plurality of videos and calculating a plurality of probabilities of co-occurrence of each of the extracted spoken words and each of the extracted external text.

7. The computer-implemented method of claim 1, wherein the multi-modal language model is stored in a matrix form.

8. The computer-implemented method of claim 1, wherein the plurality of spoken words is extracted from the selected portion of the first video using automated speech recognition (ASR).

9. The computer-implemented method of claim 1, wherein the plurality of spoken words are extracted from the selected portion of the first video using close captioning (CC) information associated with the first video.

10. The computer-implemented method of claim 1, wherein obtaining the first external text associated with the selected portion of the first video comprises detecting slides in the selected portion of the first video and extracting the first external text from the detected slides using optical character recognition (OCR).

11. The computer-implemented method of claim 1, further comprising providing the ranked plurality of spoken words to the user;

receiving from the user a selection of at least one of the provided plurality of spoken words; and

using the received selection of the at least one of the provided plurality of spoken words as an annotation for the first video.

12. The computer-implemented method of claim 11, further comprising using the annotation to index at least some of the plurality of videos.

13. The computer-implemented method of claim 1, further comprising using the ranked extracted plurality of spoken words to index at least some of the plurality of videos.

14. The computer-implemented method of claim 1, further comprising using top ranked words from the ranked extracted plurality of spoken words to index at least some of the plurality of videos.

15. The computer-implemented method of claim 1, wherein the extracted plurality of spoken words comprise a phrase.

16. The computer-implemented method of claim 1, wherein the extracted plurality of spoken words comprise a sentence.

17. The computer-implemented method of claim 1, wherein the selected portion of the first video comprises a contextually meaningful segment of the first video.

18. A non-transitory computer-readable medium embodying a set of computer-executable instructions, which, when executed in connection with a computerized system comprising a processing unit and a memory, cause the computerized system to perform a method comprising:
a. using the processing unit to generate a multi-modal language model for co-occurrence of spoken words in the plurality of videos and an external text associated with the plurality of videos;

b. selecting at least a portion of a first video;

c. extracting a plurality of spoken words from the selected portion of the first video;

d. obtaining a first external text associated with the selected portion of the first video, wherein the obtained first external text is separate and distinct from a representation of the extracted plurality of spoken words; and

e. using the processing unit and the generated multi-modal language model to rank the extracted plurality of spoken words based on probability of occurrence conditioned on the obtained first external text.

19. A computerized system comprising a processing unit and a memory storing a set of instructions, the set of instructions comprising instructions for:
a. using the processing unit to generate a multi-modal language model for co-occurrence of spoken words in the plurality of videos and an external text associated with the plurality of videos;

b. selecting at least a portion of a first video;

c. extracting a plurality of spoken words from the selected portion of the first video;

d. obtaining a first external text associated with the selected portion of the first video, wherein the obtained first external text is separate and distinct from a representation of the extracted plurality of spoken words; and

e. using the processing unit and the generated multi-modal language model to rank the extracted plurality of spoken words based on probability of occurrence conditioned on the obtained first external text.
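
Claims 13 and 14 recite using the ranked spoken words, or only the top ranked ones, to index videos. A minimal sketch of that idea, assuming an inverted index keyed by word, might look like the following. The names `build_index` and `search`, the `top_k` cutoff, and the `(video_id, start_seconds)` segment keys are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def build_index(ranked_by_segment, top_k=3):
    """Map each top-ranked word to the segments where it ranked highly.

    `ranked_by_segment` maps a (video_id, start_seconds) key to that
    segment's spoken words, already ordered from most to least probable.
    """
    index = defaultdict(list)
    for segment_key, ranked_words in ranked_by_segment.items():
        # Only the first top_k words of each segment become index terms.
        for word in ranked_words[:top_k]:
            index[word].append(segment_key)
    return dict(index)


def search(index, query):
    """Return the segments indexed under a single query word."""
    return index.get(query.lower(), [])


# Toy input: ranked spoken words for three segments of two videos.
ranked_by_segment = {
    ("lecture1", 0): ["gradient", "descent", "okay"],
    ("lecture1", 300): ["matrix", "rank", "okay"],
    ("lecture2", 0): ["gradient", "step", "size"],
}
index = build_index(ranked_by_segment, top_k=2)
print(search(index, "Gradient"))
print(search(index, "okay"))
```

Cutting each segment off at `top_k` keeps low-ranked filler such as "okay" out of the index, which is the practical effect of indexing on top ranked words only.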

Current Assignee: Fujifilm Business Innovation Corp. (Fujifilm Holdings Corporation)
Original Assignee: Fuji Xerox Company Limited (Xerox Holdings Corp.)
Inventors: Cooper, Matthew L.; Joshi, Dhiraj; Chen, Huizhong
Primary Examiner(s): Pham, Thierry L
Application Number: US14/192,861
Publication Number: US 20150243276A1
Time in Patent Office: 1,048 Days
Field of Search: 704/231, 704/235, 704/251, 704/255, 704/257
US Class Current: 1/1

CPC Class Codes
G06F 16/7834: using audio features
G06N 7/01: Probabilistic graphical mod...
G06V 20/40: in video content extracting...
G06V 20/635: Overlay text, e.g. embedded...
G10L 15/05: Word boundary detection
G11B 27/00: Editing; Indexing; Addressi...
G11B 27/28: by using information signal...
H04N 21/234336: by media transcoding, e.g. ...
H04N 21/440236: by media transcoding, e.g. ...
H04N 7/0882: for the transmission of cha...
