Information processing device, information processing method and program

US 9,280,709 B2
Filed: 08/02/2011
Issued: 03/08/2016
Est. Priority Date: 08/11/2010
Status: Active Grant


Abstract

The present invention relates to an information processing device, an information processing method, and a program capable of easily adding an annotation to content.

A feature amount extracting unit 21 extracts an image feature amount of each frame of an image of learning content and extracts word frequency information regarding frequency of appearance of each word in a description text describing a content of the image of the learning content (for example, a text of a caption) as a text feature amount of the description text. A model learning unit 22 learns an annotation model, which is a multi-stream HMM, by using an annotation sequence for annotation, which is a multi-stream including the image feature amount of each frame and the text feature amount. The present invention may be applied when adding the annotation to the content such as a television broadcast program, for example.
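As a rough illustration of the text feature amount described above, the sketch below slides a fixed-length time window over timestamped caption text and computes the relative frequency of each word per window. The `(time_in_seconds, text)` caption format, the function name `window_word_frequencies`, and the whitespace tokenizer are assumptions made for illustration; none of them comes from the patent.

```python
from collections import Counter


def window_word_frequencies(captions, window_len, step):
    """Return one relative word-frequency dict per window position.

    captions: list of (time_in_seconds, text) pairs (hypothetical format).
    window_len, step: window length and shift interval, in seconds.
    """
    if not captions:
        return []
    end_time = max(t for t, _ in captions)
    features = []
    start = 0.0
    while start <= end_time:
        # Treat all caption words that fall inside the window as one document.
        words = [
            w.lower()
            for t, text in captions
            if start <= t < start + window_len
            for w in text.split()
        ]
        counts = Counter(words)
        total = sum(counts.values())
        # Normalize counts to relative frequencies; an empty window gives {}.
        features.append({w: c / total for w, c in counts.items()} if total else {})
        start += step
    return features
```

For example, `window_word_frequencies([(0, "goal goal kick"), (5, "kick")], 10, 5)` yields one dictionary per window start time, each summing to 1 over the words seen in that window.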

20 Claims

1. An information processing device, comprising:
- one or more processors configured to:
  
  extract an image feature amount of each frame of an image of learning content;
  
  extract word frequency information regarding frequency of appearance of each word in a description text describing a content of the image of the learning content as a text feature amount of the description text;
  
  learn an annotation model, which is a multi-stream HMM (hidden Markov model), by using an annotation sequence for annotation, which is a multi-stream including the image feature amount and the text feature amount; and
  
  obtain an inter-state distance from one state to another state of the annotation model such that an error is minimized between i) the inter-state distance and ii) a Euclidean distance from the one state to the another state on a model map on which states of the annotation model are arranged.
  - 2. The information processing device according to claim 1, wherein the learning content includes a text of a caption, and the description text is the text of the caption included in the learning content.
  - 3. The information processing device according to claim 2, wherein the one or more processors are configured to:
    - extract words included in the text of the caption displayed in a window as one document while shifting the window of a predetermined time length at regular intervals; and
      
      extract multinomial distribution, which represents a frequency of appearance of each word in the document, as the text feature amount.
  - 4. The information processing device according to claim 2, wherein the one or more processors are configured to add an annotation to target content by using the annotation model.
  - 5. The information processing device according to claim 4, wherein the one or more processors are configured to:
    - extract words included in the text of the caption displayed in a window as one document while shifting the window of a predetermined time length at regular intervals;
      
      extract multinomial distribution, which represents a frequency of appearance of each word in the document, as the text feature amount;
      
      extract the image feature amount of each frame of the image of the target content;
      
      compose the annotation sequence by using the image feature amount;
      
      obtain a maximum likelihood state sequence in which the annotation sequence is observed in the annotation model; and
      
      select a word with a highest frequency in the multinomial distribution observed in a state corresponding to a target frame out of states of the maximum likelihood state sequence as the annotation to be added to the target frame.
  - 6. The information processing device according to claim 2, wherein the one or more processors are configured to search a keyword frame from target content from which the keyword frame, which is a frame with a predetermined keyword, is to be searched by using the annotation model.
  - 7. The information processing device according to claim 6, wherein the one or more processors are configured to:
    - extract words included in the text of the caption displayed in a window as one document while shifting the window of a predetermined time length at regular intervals;
      
      extract multinomial distribution, which represents a frequency of appearance of each word in the document, as the text feature amount;
      
      extract the image feature amount of each frame of the image of the target content;
      
      compose the annotation sequence by using the image feature amount;
      
      obtain a maximum likelihood state sequence in which the annotation sequence is observed in the annotation model; and
      
      select, when a frequency of the predetermined keyword is highest in the multinomial distribution observed in a state corresponding to a target frame of the target content out of states of the maximum likelihood state sequence, the target frame as the keyword frame.
  - 8. The information processing device according to claim 2, wherein the one or more processors are configured to display an annotation to be added to a frame of target content to which the annotation is to be added by using the annotation model.
  - 9. The information processing device according to claim 8, wherein the one or more processors are configured to:
    - extract words included in the text of the caption displayed in a window as one document while shifting the window of a predetermined time length at regular intervals;
      
      extract multinomial distribution, which represents a frequency of appearance of each word in the document, as the text feature amount;
      
      extract the image feature amount of each frame of the image of the target content;
      
      compose the annotation sequence by using the image feature amount;
      
      obtain a state corresponding to each frame of the target content by obtaining a maximum likelihood state sequence in which the annotation sequence is observed in the annotation model;
      
      obtain the annotation to be added to the frame corresponding to the state based on the multinomial distribution; and
      
      display the annotation to be added to each frame of the target content corresponding to each state of the annotation model.
  - 10. The information processing device according to claim 9, wherein the one or more processors are configured to:
    - obtain the inter-state distance from the one state to the another state of the annotation model based on state transition probability from the one state to the another state;
      
      obtain a state coordinate, which is a coordinate of a position of a state on the model map;
      
      display the model map, on which the corresponding state is arranged at the state coordinate; and
      
      display a representative image, which represents the frame corresponding to each state of the annotation model, and the annotation to be added to the frame corresponding to each state of the annotation model on the model map.
  - 11. The information processing device according to claim 2, wherein the one or more processors are configured to:
    - perform dimension reduction to reduce a dimension of the image feature amount and the text feature amount; and
      
      learn the annotation model by using the multi-stream, including the image feature amount and the text feature amount after the dimension reduction, as the annotation sequence.
  - 12. The information processing device according to claim 11, wherein the one or more processors are configured to:
    - obtain basis space data of a basis space for an image which has a dimension lower than a dimension of the image feature amount for mapping the image feature amount;
      
      perform the dimension reduction of the image feature amount based on the basis space data of the basis space;
      
      obtain basis space data of a basis space for text of which dimension is lower than a dimension of the text feature amount for mapping the text feature amount; and
      
      perform the dimension reduction of the text feature amount based on the basis space data of the basis space for text.
  - 13. The information processing device according to claim 12, wherein the one or more processors are configured to:
    - obtain a code book used for vector quantization as the basis space data of the basis space for image by using the image feature amount; and
      
      obtain a code representing a centroid vector as the image feature amount after the dimension reduction by performing the vector quantization of the image feature amount by using the code book.
  - 14. The information processing device according to claim 12, wherein the one or more processors are configured to:
    - extract words included in the text of the caption displayed in a window as one document while shifting the window of a predetermined time length at regular intervals;
      
      extract a frequency of appearance of each word in the document as the text feature amount;
      
      obtain a parameter of LDA (latent Dirichlet allocation) as the basis space data of the basis space for text by learning the LDA by using the document obtained from the learning content; and
      
      convert the text feature amount obtained from the document to topic likelihood, which is likelihood of each latent topic of the LDA for the document to obtain a topic label representing the latent topic for which topic likelihood is maximum after the dimension reduction.
  - 15. The information processing device according to claim 14, wherein the one or more processors are configured to:
    - add an annotation to target content by using the annotation model;
      
      generate a word dictionary of the words appearing in the document by using the document obtained from the learning content;
      
      create a topic-to-frequently appearing word table of each word with an appearance frequency greater than or equal to a predetermined threshold in the latent topic of the LDA and the corresponding appearance frequency of each word by using occurrence probability of each word in the word dictionary in each latent topic of the LDA;
      
      extract the image feature amount of each frame of the image of the target content;
      
      perform the dimension reduction;
      
      compose the annotation sequence by using the image feature amount after the dimension reduction;
      
      obtain a maximum likelihood state sequence in which the annotation sequence is observed in the annotation model;
      
      select the latent topic represented by the topic label with the maximum topic likelihood in a state corresponding to a target frame out of states of the maximum likelihood state sequence as a frame topic representing a content of the target frame; and
      
      select a word with an appearance frequency greater than or equal to the predetermined threshold in the frame topic as the annotation to be added to the target frame based on the topic-to-frequently appearing word table.
  - 16. The information processing device according to claim 14, wherein the one or more processors are configured to:
    - search a keyword frame from target content from which the keyword frame, which is a frame with a predetermined keyword, is to be searched by using the annotation model;
      
      generate a word dictionary of the words appearing in the document by using the document obtained from the learning content;
      
      create a topic-to-frequently appearing word table of each word with an appearance frequency greater than or equal to a predetermined threshold in the latent topic of the LDA and the corresponding appearance frequency by using occurrence probability of each word in the word dictionary in each latent topic of the LDA;
      
      extract the image feature amount of each frame of the image of the target content;
      
      perform the dimension reduction;
      
      compose the annotation sequence by using the image feature amount after the dimension reduction;
      
      obtain a maximum likelihood state sequence in which the annotation sequence is observed in the annotation model;
      
      select the latent topic represented by the topic label with the maximum topic likelihood in a state corresponding to a target frame out of states of the maximum likelihood state sequence as a frame topic representing a content of the target frame;
      
      obtain an appearance frequency of the predetermined keyword in the frame topic based on the topic-to-frequently appearing word table; and
      
      select, when the appearance frequency of the predetermined keyword is greater than or equal to the predetermined threshold, the target frame as the keyword frame.
  - 17. The information processing device according to claim 14, wherein the one or more processors are configured to:
    - display an annotation to be added to a frame of target content by using the annotation model;
      
      generate a word dictionary of the words appearing in the document by using the document obtained from the learning content;
      
      create a topic-to-frequently appearing word table of each word with an appearance frequency greater than or equal to a predetermined threshold in the latent topic of the LDA and the corresponding appearance frequency of each word by using occurrence probability of each word in the word dictionary in each latent topic of the LDA;
      
      extract the image feature amount of each frame of the image of the target content;
      
      perform the dimension reduction;
      
      compose the annotation sequence by using the image feature amount after the dimension reduction;
      
      obtain a state corresponding to each frame of the target content by obtaining a maximum likelihood state sequence in which the annotation sequence is observed in the annotation model;
      
      select the latent topic represented by the topic label with the maximum topic likelihood as a frame topic of the frame corresponding to the state of the target content;
      
      obtain a word with an appearance frequency greater than or equal to the predetermined threshold in the frame topic as the annotation to be added to the frame of which content is represented by the frame topic based on the topic-to-frequently appearing word table; and
      
      display the annotation to be added to the frame corresponding to the state for each state of the annotation model.
  - 18. The information processing device according to claim 14, wherein the one or more processors are configured to:
    - display an annotation to be added to a frame of target content by using the annotation model;
      
      generate a word dictionary of the words appearing in the document by using the document obtained from the learning content;
      
      create a topic-to-frequently appearing word table of each word with an appearance frequency greater than or equal to a predetermined threshold in the latent topic of the LDA and the corresponding appearance frequency of each word by using occurrence probability of each word in the word dictionary in each latent topic of the LDA;
      
      extract the image feature amount of each frame of the image of the target content;
      
      perform the dimension reduction;
      
      compose the annotation sequence by using the image feature amount after the dimension reduction;
      
      obtain a state corresponding to each frame of the target content by obtaining a maximum likelihood state sequence in which the annotation sequence is observed in the annotation model;
      
      select the latent topic represented by the topic label with the maximum topic likelihood as a frame topic of the frame corresponding to the state of the target content;
      
      obtain a word with an appearance frequency greater than or equal to the predetermined threshold in the latent topic as the annotation to be added to the frame of which frame topic is the latent topic based on the topic-to-frequently appearing word table; and
      
      display the annotation to be added to the frame of which the frame topic is the latent topic.
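The annotation steps recited in claims 5 to 9 (obtain a maximum likelihood state sequence, then pick the most frequent word of the state matched to each frame) can be sketched roughly as follows. This is a minimal sketch, not the patented implementation: it assumes the per-frame, per-state log-likelihoods of the image feature amount are already computed, and the names `viterbi`, `annotate_frames`, and `state_word_freq` are hypothetical.

```python
def viterbi(log_start, log_trans, log_emit):
    """Return the most likely state sequence for the observed frames.

    log_start[i]:    log initial probability of state i.
    log_trans[i][j]: log transition probability from state i to state j.
    log_emit[t][j]:  log-likelihood of frame t's observation in state j.
    """
    n_states = len(log_start)
    score = [log_start[j] + log_emit[0][j] for j in range(n_states)]
    back = []  # back[t-1][j] = best predecessor of state j at frame t
    for t in range(1, len(log_emit)):
        prev = score
        pointers, score = [], []
        for j in range(n_states):
            best = max(range(n_states), key=lambda i: prev[i] + log_trans[i][j])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][j] + log_emit[t][j])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def annotate_frames(path, state_word_freq):
    """For each frame, pick the most frequent word of the state it maps to.

    state_word_freq[s] is a dict word -> frequency for state s (assumed to
    hold the multinomial distribution learned for that state).
    """
    return [max(state_word_freq[s], key=state_word_freq[s].get) for s in path]
```

A keyword-frame search in the style of claim 7 could reuse the same state path and keep only the frames whose selected word equals the keyword.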

19. An information processing method to be performed by an information processing device, the information processing method comprising:
- extracting an image feature amount of each frame of an image of learning content;
  
  extracting word frequency information regarding frequency of appearance of each word in a description text describing a content of the image of the learning content as a text feature amount of the description text;
  
  learning an annotation model, which is a multi-stream HMM (hidden Markov model), by using an annotation sequence for annotation, which is a multi-stream including the image feature amount and the text feature amount; and
  
  obtaining an inter-state distance from one state to another state of the annotation model such that an error is minimized between i) the inter-state distance and ii) a Euclidean distance from the one state to the another state on a model map on which states of the annotation model are arranged.
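The last step above asks for map positions whose Euclidean distances match the inter-state distances as closely as possible. One common way to express such an error is a Sammon-style stress, and the sketch below minimizes it by plain gradient descent. The claims do not state that this exact error function or optimizer is used, so treat both as assumptions; the function names, the learning rate, and the iteration count are likewise hypothetical, and the inter-state distance matrix `dist` is taken as given (with strictly positive off-diagonal entries).

```python
import math
import random


def sammon_error(dist, coords):
    """Sammon-style error between target distances and 2-D map distances."""
    n = len(dist)
    total = sum(dist[i][j] for i in range(n) for j in range(i + 1, n))
    err = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d_map = math.dist(coords[i], coords[j])
            err += (dist[i][j] - d_map) ** 2 / dist[i][j]
    return err / total


def fit_model_map(dist, n_iter=1000, lr=0.1, seed=0):
    """Gradient descent on the Sammon-style error; returns 2-D coordinates."""
    rng = random.Random(seed)
    n = len(dist)
    total = sum(dist[i][j] for i in range(n) for j in range(i + 1, n))
    coords = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(n_iter):
        grads = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                # Guard against division by zero for coincident points.
                d_map = max(math.dist(coords[i], coords[j]), 1e-9)
                # Partial derivative of the error w.r.t. coords[i] for pair (i, j).
                coeff = -2.0 * (dist[i][j] - d_map) / (dist[i][j] * d_map * total)
                for k in range(2):
                    grads[i][k] += coeff * (coords[i][k] - coords[j][k])
        for i in range(n):
            for k in range(2):
                coords[i][k] -= lr * grads[i][k]
    return coords
```

Calling `fit_model_map(dist)` on a small distance matrix returns one `[x, y]` pair per state, and `sammon_error(dist, coords)` reports how well the layout preserves the target distances.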

20. A non-transitory computer-readable medium having stored thereon, a set of computer-executable instructions for causing a computer to perform steps comprising:
- extracting an image feature amount of each frame of an image of learning content;
  
  extracting word frequency information regarding frequency of appearance of each word in a description text describing a content of the image of the learning content as a text feature amount of the description text;
  
  learning an annotation model, which is a multi-stream HMM (hidden Markov model), by using an annotation sequence for annotation, which is a multi-stream including the image feature amount and the text feature amount; and
  
  obtaining an inter-state distance from one state to another state of the annotation model such that an error is minimized between i) the inter-state distance and ii) a Euclidean distance from the one state to the another state on a model map on which states of the annotation model are arranged.
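Claims 1, 19, and 20 each recite a multi-stream HMM. In the usual formulation of such a model, the observation likelihood of a state is the product of the per-stream likelihoods, each raised to a stream weight, which becomes a weighted sum in the log domain. The sketch below shows that combination for discrete symbols per stream. The data layout, the function name `multistream_log_likelihood`, and the example weights are assumptions for illustration and are not specified in the claims.

```python
import math


def multistream_log_likelihood(state_emissions, observation, weights):
    """Weighted sum of per-stream log-likelihoods for a single state.

    state_emissions: one dict per stream mapping symbol -> probability.
    observation:     one symbol per stream (e.g. image code, text label).
    weights:         one non-negative stream weight per stream.
    """
    total = 0.0
    for probs, symbol, weight in zip(state_emissions, observation, weights):
        p = probs.get(symbol, 0.0)
        if p == 0.0:
            if weight > 0:
                # An impossible symbol in a weighted stream rules the state out.
                return -math.inf
            continue  # a zero-weight stream is ignored entirely
        total += weight * math.log(p)
    return total
```

Setting a stream's weight to 0 removes it from the score, which is one possible way to evaluate target content that carries only the image stream; whether the patent handles this case that way is not stated in the claims.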

Current Assignee: Sony Corporation (Sony Group Corp.)
Original Assignee: Sony Corporation (Sony Group Corp.)
Inventors: Suzuki, Hirotaka; Ito, Masato
Primary Examiner(s): Mills, Frank D
Application Number: US13/814,170
Publication Number: US 20130163860A1
Time in Patent Office: 1,680 Days
Field of Search: 715/230
US Class Current: 1/1
CPC Class Codes:
G06F 16/739   in form of a video summary,...
G06F 16/745   the internal structure of a...
G06F 16/7844   using original textual cont...
G06V 20/41   Higher-level, semantic clus...
G06V 2201/10   Recognition assisted with m...
G06V 30/224   of printed characters havin...
