CONTEXT-BASED METADATA GENERATION AND AUTOMATIC ANNOTATION OF ELECTRONIC MEDIA IN A COMPUTER NETWORK

US 20160034512A1
Filed: 08/04/2015
Published: 02/04/2016
Est. Priority Date: 08/04/2014
Status: Active Grant

Abstract

Computerized systems for automating content annotation (e.g., tag creation and/or expansion) for low-content items within a computer network by leveraging intelligence of other data sources within a network to generate secondary content (e.g., a “context”) for items (e.g., documents) for use in a tagging process. For example, based on user assigned tags for an item, secondary content information can be generated and used to determine a new list of candidate tags for the item. Additionally, the context of an input item may be compared against the respective contexts of a plurality of other items to determine respective levels of similarity between the input item and each of the plurality of other items in order to annotate the input item. Techniques involving web-distance based clustering and leveraging crowd-sourced information sources to remove noisy data from annotated results are also described.
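
As a rough illustration of the context-generation idea in the abstract, the sketch below builds a "context" for a low-content item by querying a search backend with the item's title and joining the returned snippets. The `toy_search` function, the `build_context` helper, and the in-memory index are hypothetical stand-ins, not taken from the patent; a real system would presumably query an actual search engine or academic database.

```python
from typing import Callable, List

# Hypothetical toy corpus standing in for a search engine or academic database.
TOY_INDEX = {
    "graph mining": [
        "Frequent subgraph discovery in large networks",
        "Community detection and graph clustering",
    ],
    "topic models": ["Latent Dirichlet Allocation for document collections"],
}


def toy_search(query: str) -> List[str]:
    """Return snippets whose index key shares at least one word with the query."""
    words = set(query.lower().split())
    hits: List[str] = []
    for key, snippets in TOY_INDEX.items():
        if words & set(key.split()):
            hits.extend(snippets)
    return hits


def build_context(title: str, search: Callable[[str], List[str]], top_k: int = 5) -> str:
    """Generate a context: the title plus up to top_k snippets returned for it."""
    snippets = search(title)[:top_k]
    return " ".join([title] + snippets)


context = build_context("Scalable graph mining", toy_search)
```

Passing the search backend as a callable keeps the sketch independent of any particular data source, which matches the claims' allowance for either an academic database or a public search engine.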

31 Claims

1. A method comprising:
- generating, by a computing device and based at least in part on information about an input item, a context of the input item;
  
  comparing, by the computing device, the context of the input item to respective contexts of a plurality of other items to determine respective levels of similarity between the input item and each of the plurality of other items; and
  
  annotating the input item with information derived from at least one of the plurality of other items based at least in part on the respective levels of similarity between the input item and each of the plurality of other items.
- - 2. The method of claim 1, further comprising:
    - generating, by the computing device, the respective contexts of the plurality of other items.
  - 3. The method of claim 1, wherein generating the context of the input item comprises:
    - performing a query for the input item using the information about the input item; and
      
      adding, to the context of the input item, at least one item from a response to the query for the input item.
  - 4. The method of claim 3, wherein performing the query for the input item comprises searching an academic database that includes at least one of:
    - books, research papers, or journal articles.
  - 5. The method of claim 3, wherein performing the query for the input item comprises searching the Internet using a publicly available search engine.
  - 6. The method of claim 1, wherein comparing the context of the input item to the respective contexts of the plurality of other items to determine the respective levels of similarity between the input item and each of the plurality of other items comprises:
    - generating, based at least in part on the context of the input item, a vector representing a semantic makeup of the context of the input item; and
      
      determining respective cosine similarities between the vector representing the semantic makeup of the context of the input item and each of a plurality of respective vectors representing respective semantic makeups of the respective contexts of the plurality of other items.
  - 7. The method of claim 6, wherein the vector representing the semantic makeup of the context of the input item comprises at least one of:
    - a Term Frequency-Inverse Document Frequency (TF-IDF) vector, a Latent Dirichlet Allocation TF-IDF (LDA-TFIDF) vector, or a Latent Semantic Indexing TF-IDF (LSI-TFIDF) vector.
  - 8. The method of claim 1, wherein the plurality of other items comprises a plurality of topics, and wherein annotating the input item with information derived from at least one of the plurality of other items comprises tagging the input item with at least one topic from the plurality of topics.
  - 9. The method of claim 1, wherein the plurality of other items comprises a plurality of topics, the method further comprising generating the plurality of topics by:
    - performing a query for each potential topic from a set of potential topics using the potential topic; and
      
      adding, to the plurality of topics, at least one item from a respective response to the query for the potential topic.
  - 10. The method of claim 9, further comprising obtaining the set of potential topics from a database of research requests.
  - 11. The method of claim 9, wherein performing the respective query for the potential topic comprises searching a crowd-sourced information site using the potential topic.
  - 12. The method of claim 1, wherein the plurality of other items comprises a plurality of content items, each of the plurality of content items being tagged with one or more existing tags.
  - 13. The method of claim 12, wherein annotating the input item with information derived from the at least one of the plurality of content items comprises tagging the input item with at least one existing tag with which a content item from the plurality of content items is tagged.
  - 14. The method of claim 12, wherein the plurality of content items comprises a subset of content items, the method further comprising:
    - determining, as part of the subset of content items, each content item that is tagged with at least one existing tag with which the input item is also tagged;
      
      generating, based at least in part on the one or more existing tags with which each of the subset of content items is tagged, a set of candidate tags; and
      
      ranking, based at least in part on the respective levels of similarity between the input item and each of the subset of content items, the set of candidate tags to form a ranked set of candidate tags, wherein annotating the input item with the information derived from the at least one of the subset of content items is based at least in part on the ranked set of candidate tags.
  - 15. The method of claim 14, further comprising:
    - clustering the set of candidate tags in a semantic space; and
      
      pruning, based at least on the clustering, the set of candidate tags by retaining only candidate tags that are contained within a particular cluster, the particular cluster containing existing tags of the input item.
  - 16. The method of claim 15, wherein clustering the set of candidate tags comprises clustering the set of candidate tags using a web-distance metric.
  - 17. The method of claim 1, wherein the input item comprises a research paper.
  - 18. The method of claim 1, wherein the information about the input item comprises a document title.
  - 19. The method of claim 1, wherein the information about the input item comprises an existing tag with which the input item is tagged.
  - 20. The method of claim 1, further comprising:
    - generating, by the computing device, a topic database for use in annotating input items, the topic database specifying a plurality of topics;
      
      constructing, by the computing device, a global context for each topic in the topic database; and
      
      ranking, by the computing device, each topic in the topic database by comparing the respective context for each of the topics to the context for the input item.
  - 21. The method of claim 20, wherein constructing a global context comprises:
    - accessing a plurality of external data sources to retrieve potential topics;
      
      aggregating the topics into a topic database;
      
      conducting a query in a publicly available search engine for each topic in the topic database;
      
      constructing the global context based at least in part on the results of the query.
  - 22. The method of claim 20, wherein performing the normalization process includes conducting a query for the one or more topics in the topic database using a crowd-sourced information source, and updating the topic database based at least in part on the results of the search query.
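
The comparison step recited in claims 6, 7, and 20 (vectorize each context, then rank by cosine similarity) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: whitespace tokenization, a smoothed IDF formula, and the helper names `tfidf_vectors`, `cosine`, and `rank_topics` are all choices made for this example, not details from the patent. It covers only the plain TF-IDF variant, not the LDA or LSI variants named in claim 7.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: naive lowercase whitespace tokenization.
    return text.lower().split()


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors (term -> weight) for a list of context strings."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF keeps weights positive even for terms found in every document.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_topics(input_context: str, topic_contexts: Dict[str, str]) -> List[Tuple[str, float]]:
    """Rank topics by cosine similarity between their context and the input context."""
    names = list(topic_contexts)
    vectors = tfidf_vectors([input_context] + [topic_contexts[n] for n in names])
    query_vec = vectors[0]
    scored = [(name, cosine(query_vec, vec)) for name, vec in zip(names, vectors[1:])]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


topic_contexts = {
    "data mining": "frequent pattern mining clustering graph data",
    "computer vision": "image segmentation object detection camera",
}
ranked = rank_topics("graph clustering of frequent patterns in data", topic_contexts)
```

The input item would then be annotated with the top-ranked topic or topics; how many to keep, and whether to apply a minimum similarity cutoff, is left open here.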

23. A method comprising:
- identifying, by a computing device, one or more external data sources;
  
  retrieving, by the computing device, topics from the identified one or more external data sources;
  
  aggregating, by the computing device, the topics into a topic database;
  
  generalizing, by the computing device, the topic database by performing a normalization process;
  
  conducting a query, by the computing device in a publicly available search engine, for each topic in the generalized topic database;
  
  constructing, by the computing device and based at least in part on the results of the query, a topic context database;
  
  identifying, by the computing device, information about one or more input items, wherein the information about the input item comprises a title;
  
  conducting a query, by the computing device in a publicly available search engine, for content in the title for the one or more input items;
  
  constructing, by the computing device and based at least in part on the results of the query, a respective title context for the one or more input items;
  
  comparing, by the computing device, the title context and the one or more topics in the topic context database using a text similarity computation model;
  
  determine, by the computing device and based at least in part on the results of the comparison, one or more topics from the topic context database with which to annotate the one or more input items;
  
  annotating, by the computing device, the one or more input items based on the determined topics.
- - 24. The method of claim 23, wherein performing the normalization process includes conducting a query for the one or more topics in the topic database using a crowd-sourced information source, and updating the topic database based at least in part on the results of the search query.
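
One way to picture the normalization and topic-context steps of claims 23 and 24 is sketched below. Everything concrete here is an assumption made for illustration: the `TOY_REDIRECTS` alias table stands in for a lookup against a crowd-sourced information source, `toy_lookup`, `normalize_topics`, and `build_topic_contexts` are hypothetical helper names, and dropping topics the source does not recognize is a design choice of this sketch rather than something the claims require.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical alias table standing in for a crowd-sourced site's redirect lookup.
TOY_REDIRECTS = {
    "ml": "machine learning",
    "machine-learning": "machine learning",
    "nlp": "natural language processing",
}


def toy_lookup(name: str) -> Optional[str]:
    """Return the canonical title for a topic, or None if the source has no entry."""
    key = name.strip().lower()
    if key in TOY_REDIRECTS:
        return TOY_REDIRECTS[key]
    if key in TOY_REDIRECTS.values():
        return key
    return None


def normalize_topics(raw_topics: Iterable[str],
                     lookup: Callable[[str], Optional[str]]) -> List[str]:
    """Map raw topics to canonical names, drop unknown ones, and deduplicate in order."""
    canonical_topics: Dict[str, None] = {}
    for topic in raw_topics:
        canonical = lookup(topic)
        if canonical is not None:
            canonical_topics.setdefault(canonical, None)
    return list(canonical_topics)


def build_topic_contexts(topics: Iterable[str],
                         search: Callable[[str], List[str]]) -> Dict[str, str]:
    """Query the search backend once per topic and join the results into a context."""
    return {topic: " ".join(search(topic)) for topic in topics}


topics = normalize_topics(
    ["ML", "machine-learning", "NLP", "qwerty", "Machine Learning"], toy_lookup
)
topic_context_db = build_topic_contexts(topics, lambda t: [f"snippet about {t}"])
```

The resulting topic context database can then be compared against a title context with any text similarity model.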

25. A method comprising:
- identifying, by a computing device, tags that have been previously assigned to one or more input items;
  
  generating, by a computing device, a database of secondary content based on the identified tags;
  
  generating, by a computing device and based on the contents of the secondary content database, a list of candidate new tags;
  
  remove, by a computing device, noisy tags from the list;
  
  determine, by a computing device, a final list of tags with which to annotate the one or more input items.
- - 26. The method of claim 25, wherein the process of removing noisy tags includes clustering the list of candidate new tags and pruning the list based at least in part on the results of the clustering.
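
The noise-removal step of claims 15, 16, and 26 (cluster the candidate tags, then keep only those that share a cluster with the item's existing tags) might look like the sketch below. The claims say only "web-distance metric"; this example assumes the Normalized Google Distance, a well-known metric of that kind, and substitutes a hypothetical table of page counts (`HITS`, `TOTAL_PAGES`) for real search-engine hit counts. The single-link threshold clustering and the threshold value of 0.5 are likewise assumptions.

```python
import math
from typing import Iterable, List, Set

# Hypothetical page counts standing in for search-engine hit counts.
TOTAL_PAGES = 1_000_000
HITS = {
    frozenset(["clustering"]): 10_000,
    frozenset(["classification"]): 12_000,
    frozenset(["data mining"]): 15_000,
    frozenset(["cooking"]): 20_000,
    frozenset(["clustering", "data mining"]): 5_000,
    frozenset(["classification", "data mining"]): 5_000,
    frozenset(["clustering", "classification"]): 4_000,
    frozenset(["clustering", "cooking"]): 10,
}


def web_distance(x: str, y: str) -> float:
    """Normalized Google Distance from hit counts; infinite if terms never co-occur."""
    if x == y:
        return 0.0
    fx = HITS.get(frozenset([x]), 0)
    fy = HITS.get(frozenset([y]), 0)
    fxy = HITS.get(frozenset([x, y]), 0)
    if fx == 0 or fy == 0 or fxy == 0:
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(TOTAL_PAGES) - min(lx, ly))


def cluster_tags(tags: Iterable[str], threshold: float) -> List[Set[str]]:
    """Single-link clustering: tags within `threshold` of each other are linked."""
    unvisited = list(tags)
    clusters: List[Set[str]] = []
    while unvisited:
        frontier = [unvisited.pop(0)]
        cluster = set(frontier)
        while frontier:
            current = frontier.pop()
            linked = [t for t in unvisited if web_distance(current, t) <= threshold]
            for tag in linked:
                unvisited.remove(tag)
                cluster.add(tag)
                frontier.append(tag)
        clusters.append(cluster)
    return clusters


def prune_candidates(candidates: List[str], existing_tags: List[str],
                     threshold: float = 0.5) -> List[str]:
    """Keep only candidate tags that fall in a cluster containing an existing tag."""
    all_tags = list(dict.fromkeys(existing_tags + candidates))
    keep: Set[str] = set()
    for cluster in cluster_tags(all_tags, threshold):
        if cluster & set(existing_tags):
            keep |= cluster
    return [t for t in candidates if t in keep and t not in existing_tags]


pruned = prune_candidates(["clustering", "classification", "cooking"], ["data mining"])
```

With these toy counts, "cooking" rarely co-occurs with the other tags, so it lands in its own cluster and is pruned as noise.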

27. A computing device having a processor configured to:
- generate, based at least in part on information about an input item, a context of the input item;
  
  compare the context of the input item to respective contexts of a plurality of other items to determine respective levels of similarity between the input item and each of the plurality of other items; and
  
  annotate the input item with information derived from at least one of the plurality of other items based at least in part on the respective levels of similarity between the input item and each of the plurality of other items.
- - 28. The computing device of claim 27 further configured to:
    - perform a query for the input item using the information about the input item; and
      
      add, to the context of the input item, at least one item from a response to the query for the input item.
  - 29. The computing device of claim 27, wherein comparing the context of the input item to the respective contexts of the plurality of other items to determine the respective levels of similarity between the input item and each of the plurality of other items comprises:
    - generating, based at least in part on the context of the input item, a vector representing a semantic makeup of the context of the input item; and
      
      determining respective cosine similarities between the vector representing the semantic makeup of the context of the input item and each of a plurality of respective vectors representing respective semantic makeups of the respective contexts of the plurality of other items.
  - 30. The computing device of claim 27 further configured to:
    - generate a topic database for use in annotating input items, the topic database specifying a plurality of topics;
      
      construct a global context for each topic in the topic database; and
      
      rank each topic in the topic database by comparing the respective context for each of the topics to the context for the input item.

31. A computer-readable storage medium encoded with instructions that, when executed, cause at least one processor to:
- generate, based at least in part on information about an input item, a context of the input item;
  
  compare the context of the input item to respective contexts of a plurality of other items to determine respective levels of similarity between the input item and each of the plurality of other items; and
  
  annotate the input item with information derived from at least one of the plurality of other items based at least in part on the respective levels of similarity between the input item and each of the plurality of other items.

Current Assignee
Regents of The University of Minnesota (University of Minnesota)
Original Assignee
Regents of The University of Minnesota (University of Minnesota)
Inventors
Srivastava, Jaideep; Singhal, Ayush; Kasturi, Ravindra

Granted Patent

US 10,146,862 B2
US Class Current

1/1
CPC Class Codes

G06F 16/345 Summarisation for human users

G06F 16/35 Clustering; Classification
