Simultaneous tracking and text recognition in video frames

US 9,064,174 B2
Filed: 10/18/2012
Issued: 06/23/2015
Est. Priority Date: 10/18/2012
Status: Active Grant

Abstract

Architecture that enables optical character recognition (OCR) of text in video frames at the rate at which the frames are received. Additionally, conflation is performed on multiple text recognition results in the frame sequence. The architecture comprises an OCR text recognition engine and a tracker system; the tracker system establishes a common coordinate system in which OCR results from different frames may be compared and/or combined. From a set of sequential video frames, a keyframe is chosen from which the reference coordinate system is established. An estimated transformation from keyframe coordinates to subsequent video frames is computed using the tracker system. When text recognition is completed for any subsequent frame, the result coordinates can be related to the keyframe using the inverse transformation from the processed frame to the reference keyframe. The results can be rendered for viewing as the results are obtained.
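
The coordinate bookkeeping described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the estimated keyframe-to-frame transformation is a 2D affine map stored as a 3x3 homogeneous matrix (the abstract does not name a transformation model), and the function names `invert_affine`, `apply_transform`, and `box_to_keyframe` are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]   # 3x3 homogeneous matrix, last row [0, 0, 1]
Point = Tuple[float, float]


def invert_affine(m: Matrix) -> Matrix:
    """Invert an affine matrix [[a, b, tx], [c, d, ty], [0, 0, 1]]."""
    (a, b, tx), (c, d, ty) = m[0], m[1]
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("transformation is not invertible")
    ia, ib = d / det, -b / det
    ic, id_ = -c / det, a / det
    return [
        [ia, ib, -(ia * tx + ib * ty)],
        [ic, id_, -(ic * tx + id_ * ty)],
        [0.0, 0.0, 1.0],
    ]


def apply_transform(m: Matrix, p: Point) -> Point:
    """Apply an affine matrix to a 2D point."""
    x, y = p
    return (
        m[0][0] * x + m[0][1] * y + m[0][2],
        m[1][0] * x + m[1][1] * y + m[1][2],
    )


def box_to_keyframe(key_to_frame: Matrix, box: List[Point]) -> List[Point]:
    """Map OCR box corners from a processed frame back to keyframe coordinates.

    key_to_frame is the tracker's estimate (assumed affine here) that takes
    keyframe coordinates to the processed frame; its inverse relates the
    recognition result back to the reference keyframe.
    """
    frame_to_key = invert_affine(key_to_frame)
    return [apply_transform(frame_to_key, p) for p in box]
```

For example, if the tracker estimates that a frame is shifted by (10, -5) relative to the keyframe, a word corner found at (110, 45) in that frame maps back to (100, 50) in keyframe coordinates, where it can be compared with results from other frames.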

20 Claims

1. A system, comprising:
- a text recognition component configured for recognition of text on a sequence of video frames, the text recognition component configured to receive a selected frame of the sequence of video frames and perform text recognition processing of the selected frame to output a selected frame result;
  
  a tracker component configured to select a keyframe from the sequence of video frames based on stability criteria applied to incoming frames and to establish a reference coordinate system relative to the selected keyframe, the selected frame result mapped back to the reference coordinate system of the keyframe, the tracker component configured to apply keyframe coordinates to subsequent video frames to enable accumulation of best results for text recognition rendering and viewing; and
  
  a microprocessor configured to execute computer-executable instructions associated with at least one of the text recognition component or the tracker component.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
  - 2. The system of claim 1, wherein the tracker component is configured to estimate a transformation between the reference coordinate system and the selected frame result.
  - 3. The system of claim 1, wherein the reference coordinate system relates recognized text coordinates of the text of the selected frame back to the keyframe based on an estimated transformation established between the keyframe and the selected frame.
  - 4. The system of claim 1, further comprising a selection component configured to select a frame result for rendering with the keyframe.
  - 5. The system of claim 1, further comprising a conflation component configured to combine the selected frame result with a previously-accumulated frame result of another frame of the sequence of video frames.
  - 6. The system of claim 5, wherein the conflation component employs statistical error correction to improve the conflated frame results.
  - 7. The system of claim 5, wherein the conflated frame result is rendered on a display.
  - 8. The system of claim 1, wherein the selected frame result is rendered directly into the selected frame.
  - 9. The system of claim 1, wherein the selected frame result and a word bounding box are stored according to the reference coordinate system relative to the keyframe or the selected frame.
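
One way to picture the conflation component of claims 5 and 6 is confidence-weighted voting over candidates observed at the same keyframe location. The sketch below is an assumption made for illustration: the claims do not specify the statistical method, and the `Conflator` class, the grid-cell keying, and the confidence scores are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Cell = Tuple[int, int]                 # quantized word position in keyframe coordinates
Observation = Tuple[Cell, str, float]  # (cell, recognized text, confidence)


class Conflator:
    """Accumulates per-frame OCR results and keeps the best-supported text per cell."""

    def __init__(self) -> None:
        # cell -> candidate text -> summed confidence across frames
        self._votes: Dict[Cell, Dict[str, float]] = defaultdict(
            lambda: defaultdict(float)
        )

    def add_frame_result(self, observations: Iterable[Observation]) -> None:
        """Merge one frame's results (already mapped to keyframe coordinates)."""
        for cell, text, confidence in observations:
            self._votes[cell][text] += confidence

    def best_results(self) -> Dict[Cell, str]:
        """Return the candidate with the highest accumulated confidence per cell."""
        return {
            cell: max(candidates, key=candidates.get)
            for cell, candidates in self._votes.items()
        }
```

With three frames reporting "STOP" (0.6), "ST0P" (0.5), and "STOP" (0.7) for the same cell, the accumulated result is "STOP", because a misread in a single frame is outvoted by consistent readings in the others.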

10. A method performed by a computer system executing machine-readable instructions in a hardware memory, the method comprising acts of:
- receiving a selected frame of a sequence of video frames for text recognition processing;
  
  choosing a keyframe from the sequence of video frames based on an application of stability criteria to incoming images;
  
  establishing a reference coordinate system relative to the keyframe for applying keyframe coordinates to subsequent video frames;
  
  recognition processing the selected frame to output a selected frame result;
  
  computing an estimated transformation between the keyframe and the selected frame result based on the reference coordinate system to create a keyframe result;
  
  storing the keyframe result of the selected frame for presentation to enable accumulation of best results for rendering and viewing; and
  
  configuring at least one processor to perform the acts of receiving, choosing, establishing, recognition processing, computing, and storing.
- Dependent Claims (11, 12, 13, 14, 15)
  - 11. The method of claim 10, further comprising combining the keyframe result with a previously-accumulated keyframe result to create new accumulated keyframe results.
  - 12. The method of claim 11, further comprising presenting the new accumulated keyframe results after each frame.
  - 13. The method of claim 10, further comprising concurrently performing the acts of choosing, establishing, recognition processing, and computing.
  - 14. The method of claim 10, further comprising tracking features in the selected frame to compute the estimated transformation.
  - 15. The method of claim 10, further comprising rendering frame results directly into tracked video frames.
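
The act of choosing a keyframe "based on an application of stability criteria to incoming images" might look like the sketch below. The claims do not define the criteria, so this example assumes one plausible choice: the mean displacement of tracked feature points between consecutive frames must stay under a threshold for several frames in a row. The names `mean_displacement` and `choose_keyframe` and the default thresholds are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


def mean_displacement(prev: Sequence[Point], curr: Sequence[Point]) -> float:
    """Average distance moved by corresponding tracked points between two frames."""
    if not prev or len(prev) != len(curr):
        raise ValueError("frames must share the same non-empty set of tracked points")
    return sum(math.dist(p, c) for p, c in zip(prev, curr)) / len(prev)


def choose_keyframe(
    tracks: Sequence[Sequence[Point]],
    max_motion: float = 2.0,
    stable_frames: int = 3,
) -> Optional[int]:
    """Return the index of the first frame that ends a run of stable frames.

    tracks[i] holds the tracked point positions in frame i. A frame counts as
    stable when its mean displacement from the previous frame is at most
    max_motion pixels. Returns None if no sufficiently long run occurs.
    """
    run = 0
    for i in range(1, len(tracks)):
        if mean_displacement(tracks[i - 1], tracks[i]) <= max_motion:
            run += 1
            if run >= stable_frames:
                return i
        else:
            run = 0  # motion spike: restart the stability count
    return None
```

Requiring a run of stable frames rather than a single one is a design choice in this sketch: it avoids picking a keyframe during a brief pause in otherwise fast camera motion.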

16. A method performed by a computer system executing machine-readable instructions in a hardware memory, the method comprising acts of:
- selecting a keyframe based on an application of stability criteria to incoming images;
  
  establishing a common coordinate system and a transformation based on the keyframe that relate subsequent video frames of a sequence of video frames to coordinates of the keyframe;
  
  concurrently with the act of establishing, performing text recognition processing of the video frames to compute frame text results;
  
  relating the frame text results back to the coordinates of the keyframe using the transformation;
  
  conflating the frame text results to determine an optimum frame text result for presentation;
  
  storing the keyframe result of the selected frame for presentation to enable accumulation of best results for rendering and viewing; and
  
  configuring at least one processor to perform the acts of selecting, establishing, performing, relating, and conflating.
- Dependent Claims (17, 18, 19, 20)
  - 17. The method of claim 16, further comprising an act of registering the recognized frame text results to realtime video by applying the transformation from the keyframe coordinates to a latest frame being processed.
  - 18. The method of claim 16, further comprising an act of combining a selected frame text result with previously-accumulated frame text result of another frame, as part of the act of conflating.
  - 19. The method of claim 16, further comprising an act of establishing a transformation for each video frame to relate associated frame text results to the keyframe.
  - 20. The method of claim 16, further comprising an act of tracking the recognized frame text results while asynchronously performing text recognition processing over the video frames.
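
Claims 17 and 20 describe tracking that continues while text recognition runs asynchronously, with results then registered to the latest frame. The sketch below illustrates that flow under strong simplifying assumptions: the transformation is a pure translation, and `fake_ocr` and `track` are hypothetical stand-ins for a real recognition engine and tracker, wired so that the scene drifts 2 pixels to the right per frame.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Offset = Tuple[float, float]   # translation-only keyframe-to-frame transform (assumption)
Word = Tuple[str, Offset]      # (text, position)


def track(frame_index: int) -> Offset:
    """Hypothetical tracker: the scene drifts 2 px right per frame after the keyframe."""
    return (2.0 * frame_index, 0.0)


def fake_ocr(frame_index: int) -> List[Word]:
    """Hypothetical slow OCR: reports 'EXIT' at its position in the given frame."""
    dx, dy = track(frame_index)
    return [("EXIT", (100.0 + dx, 50.0 + dy))]


def run_pipeline(num_frames: int, ocr_frame: int) -> List[Word]:
    """Track every frame while OCR of one selected frame runs in the background."""
    key_to_frame: Dict[int, Offset] = {}
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fake_ocr, ocr_frame)  # recognition runs asynchronously
        for i in range(num_frames):                 # tracker keeps pace with the video
            key_to_frame[i] = track(i)
        words = pending.result()

    # Relate the OCR result back to keyframe coordinates (inverse transform).
    dx, dy = key_to_frame[ocr_frame]
    in_keyframe = [(text, (x - dx, y - dy)) for text, (x, y) in words]

    # Register the result to the latest frame (forward transform).
    lx, ly = key_to_frame[num_frames - 1]
    return [(text, (x + lx, y + ly)) for text, (x, y) in in_keyframe]
```

Calling `run_pipeline(6, 1)` recognizes the word in frame 1, maps it to keyframe position (100, 50), and then places it at (110, 50) in frame 5, the latest frame, so the overlay follows the text even though recognition finished several frames after it started.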

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors: Nister, David; Schaffalitzky, Frederik; Grabner, Michael; Ashman, Matthew S.; Vugdelija, Milan; Stojiljkovic, Ivan
Primary Examiner(s): Abdi, Amara
Application Number: US13/654,841
Publication Number: US 20140112527A1
Time in Patent Office: 978 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
- G06V 20/635 Overlay text, e.g. embedded...
- G06V 30/10 Character recognition
