SYSTEMS AND METHODS FOR USING LATENT VARIABLE MODELING FOR MULTI-MODAL VIDEO INDEXING
Abstract
A computer-implemented method performed in connection with a computerized system incorporating a processing unit and a memory, the computer-implemented method involving: using the processing unit to generate a multi-modal language model for co-occurrence of spoken words and displayed text in the plurality of videos; selecting at least a portion of a first video; extracting a plurality of spoken words from the selected portion of the first video; extracting a first displayed text from the selected portion of the first video; and using the processing unit and the generated multi-modal language model to rank the extracted plurality of spoken words based on probability of occurrence conditioned on the extracted first displayed text.
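By way of illustration only, the ranking step summarized above may be sketched as follows. This is not the patented model: it substitutes a plain add-alpha smoothed co-occurrence count for the latent variable model named in the title. The function names `build_cooccurrence_model` and `rank_spoken_words`, the smoothing parameter `alpha`, and the choice to average the conditional probability over the displayed words are all assumptions introduced for this sketch.

```python
from collections import Counter, defaultdict


def build_cooccurrence_model(segments):
    """Count how often each spoken word co-occurs with each displayed word.

    `segments` is an iterable of (spoken_words, displayed_words) pairs,
    one pair per video segment. Hypothetical helper, not from the patent.
    """
    pair_counts = defaultdict(Counter)  # displayed word -> spoken word counts
    vocab = set()                       # all spoken words seen in training
    for spoken, displayed in segments:
        vocab.update(spoken)
        for text_word in set(displayed):
            pair_counts[text_word].update(spoken)
    return pair_counts, vocab


def rank_spoken_words(spoken, displayed, model, alpha=1.0):
    """Rank spoken words by smoothed P(spoken word | displayed text).

    Assumption: the conditional probability is averaged over the displayed
    words, with add-alpha smoothing so unseen pairs get a small nonzero score.
    """
    pair_counts, vocab = model
    vocab_size = len(vocab | set(spoken))
    scores = {}
    for word in set(spoken):
        total = 0.0
        for text_word in displayed:
            counts = pair_counts.get(text_word, Counter())
            total += (counts[word] + alpha) / (sum(counts.values()) + alpha * vocab_size)
        scores[word] = total / max(len(displayed), 1)
    # Highest probability first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))
```

Under these assumptions, a spoken word that frequently accompanies the text displayed on screen in the training videos is ranked above filler words and words tied to unrelated text, which is the behavior the abstract describes for the conditioned ranking.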
19 Claims
1. A computer-implemented method performed in connection with a computerized system comprising a processing unit and a memory, the computer-implemented method comprising:
a. using the processing unit to generate a multi-modal language model for co-occurrence of spoken words in the plurality of videos and an external text associated with the plurality of videos;
b. selecting at least a portion of a first video;
c. extracting a plurality of spoken words from the selected portion of the first video;
d. obtaining a first external text associated with the selected portion of the first video; and
e. using the processing unit and the generated multi-modal language model to rank the extracted plurality of spoken words based on probability of occurrence conditioned on the obtained first external text.
8. The computer-implemented method of claim 1, wherein the plurality of spoken words is extracted from the selected portion of the first video using automated speech recognition (ASR).
9. The computer-implemented method of claim 1, wherein the plurality of spoken words are extracted from the selected portion of the first video using closed captioning (CC) information associated with the first video.
10. The computer-implemented method of claim 1, wherein obtaining the first external text associated with the selected portion of the first video comprises detecting slides in the selected portion of the first video and extracting the first external text from the detected slides using optical character recognition (OCR).
15. The computer-implemented method of claim 1, wherein the extracted plurality of spoken words comprise a phrase.
16. The computer-implemented method of claim 1, wherein the extracted plurality of spoken words comprise a sentence.
17. The computer-implemented method of claim 1, wherein the selected portion of the first video comprises a contextually meaningful segment of the first video.
18. A non-transitory computer-readable medium embodying a set of computer-executable instructions, which, when executed in connection with a computerized system comprising a processing unit and a memory, cause the computerized system to perform a method comprising:
a. using the processing unit to generate a multi-modal language model for co-occurrence of spoken words in the plurality of videos and an external text associated with the plurality of videos;
b. selecting at least a portion of a first video;
c. extracting a plurality of spoken words from the selected portion of the first video;
d. obtaining a first external text associated with the selected portion of the first video; and
e. using the processing unit and the generated multi-modal language model to rank the extracted plurality of spoken words based on probability of occurrence conditioned on the obtained first external text.
19. A computerized system comprising a processing unit and a memory storing a set of instructions, the set of instructions comprising instructions for:
a. using the processing unit to generate a multi-modal language model for co-occurrence of spoken words in the plurality of videos and an external text associated with the plurality of videos;
b. selecting at least a portion of a first video;
c. extracting a plurality of spoken words from the selected portion of the first video;
d. obtaining a first external text associated with the selected portion of the first video; and
e. using the processing unit and the generated multi-modal language model to rank the extracted plurality of spoken words based on probability of occurrence conditioned on the obtained first external text.
Specification