Scalable Incremental Semantic Entity and Relatedness Extraction from Unstructured Text

US 20110264997A1
Filed: 04/21/2010
Published: 10/27/2011
Est. Priority Date: 04/21/2010
Status: Abandoned Application

Abstract

A search engine for documents containing text may process text using a statistical language model, classify the text based on entropy, and create suffix trees or other mappings of the text for each classification. From the suffix trees or mappings, a graph may be constructed with relationship strengths between different words or text strings. The graph may be used to determine search results, and may be browsed or navigated before viewing search results. As new documents are added, they may be processed and added to the suffix trees; the graph may then be created on demand in response to a search request. The graph may be represented as an adjacency matrix, and a transitive closure algorithm may process the adjacency matrix as a background process.
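
The pipeline described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not in the source: the "entropy value" is taken to be the surprisal of a word under a tiny unigram model, an inverted index stands in for the suffix trees or other mappings, and relationship strength is the number of items two words share. The names `BACKGROUND`, `UNSEEN_PROB`, `surprisal`, `index_item`, `build_adjacency`, and `related` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical unigram probabilities standing in for the statistical
# language model; a real model would be far larger.
BACKGROUND = {"the": 0.05, "of": 0.03, "graph": 0.0005,
              "entropy": 0.0002, "suffix": 0.0001}
UNSEEN_PROB = 1e-6  # assumed floor probability for words missing from the model


def surprisal(word):
    """Information content in bits, used here as the 'entropy value'."""
    return -math.log2(BACKGROUND.get(word, UNSEEN_PROB))


def index_item(index, item_id, text, cutoff):
    """Add the words of one item whose entropy exceeds the cutoff."""
    for word in text.lower().split():
        if surprisal(word) > cutoff:
            index[word].add(item_id)


def build_adjacency(index):
    """Weight each word pair by the number of items the two words share."""
    words = sorted(index)
    position = {w: i for i, w in enumerate(words)}
    matrix = [[0] * len(words) for _ in words]
    for a, b in combinations(words, 2):
        shared = len(index[a] & index[b])
        matrix[position[a]][position[b]] = shared
        matrix[position[b]][position[a]] = shared
    return words, matrix


def related(words, matrix, query):
    """Return (word, strength) pairs related to the query, strongest first."""
    row = matrix[words.index(query)]
    hits = [(w, s) for w, s in zip(words, row) if s > 0]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


index = defaultdict(set)
index_item(index, "doc1", "the entropy of the graph", cutoff=8.0)
index_item(index, "doc2", "the suffix graph", cutoff=8.0)
words, matrix = build_adjacency(index)
print(related(words, matrix, "graph"))
```

New items can be added to `index` at any time, and `build_adjacency` can be rerun on demand, which loosely mirrors the incremental behavior the abstract describes.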

24 Citations

20 Claims

1. A method performed on a computer processor, said method comprising:
- receiving an item comprising text strings;
- determining an item identifier for said item;
- processing said text strings with a statistical language model to:
  - identify text elements;
  - determine text element identifiers for said text elements; and
  - assign an entropy value to each of said elements;
- selecting a first subset of said text elements, each of said text elements in said first subset having an entropy value greater than a first predefined entropy value;
- adding each of said text elements to a first data structure, said first data structure comprising said text element identifiers and said item identifier;
- creating an adjacency matrix representing a graph comprising vertices representing said text elements and edges representing weighted relationships, said weighted relationships being determined from said first data structure; and
- receiving a search query for a first text element and responding with search results derived from said adjacency matrix.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12):
  - 2. The method of claim 1 further comprising:
    - performing transitive closure on said adjacency matrix using a first algorithm to populate said adjacency matrix with additional values.
  - 3. The method of claim 2, said first algorithm being the Floyd-Warshall algorithm.
  - 4. The method of claim 1, said first data structure comprising a suffix tree comprising edges representing said text elements and nodes comprising said item identifier.
  - 5. The method of claim 1, said first data structure comprising a phrase inverted index data structure.
  - 6. The method of claim 1 further comprising:
    - selecting a second subset of said text elements, each of said text elements in said second subset having an entropy value greater than a second predefined entropy value;
    - adding each of said second subset of text elements to a second data structure, said second data structure comprising said text elements and said item identifier; and
    - said edges in said graph being further determined from said first data structure and said second data structure.
  - 7. The method of claim 6 further comprising:
    - said edges being determined in part by applying a first weighting to said first data structure and a second weighting to said second data structure prior to determining said edges.
  - 8. The method of claim 1 further comprising:
    - performing noise reduction on said item prior to said processing.
  - 9. The method of claim 1, said text elements comprising at least one of a group composed of:
    - unigrams;
    - bigrams; and
    - trigrams.
  - 10. The method of claim 1 further comprising:
    - identifying a first text element;
    - determining a synonym for said first text element; and
    - adding said synonym to said first subset of text elements.
  - 11. The method of claim 1 further comprising:
    - examining said item to determine a formatting characteristic for a first text item; and
    - weighting said first text item based on said formatting characteristic.
  - 12. The method of claim 11, said formatting characteristic comprising at least one of:
    - a title;
    - a heading;
    - a font effect; and
    - a font modifier.
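
Claims 2 and 3 describe populating the adjacency matrix with additional values through transitive closure, with Floyd-Warshall as the named algorithm. The sketch below shows one way this could look. The claims do not say how relationship strengths map to path lengths, so the conversion `distance = 1 / strength` is an assumption, and the function names and sample matrix are hypothetical.

```python
import math


def strengths_to_distances(strengths):
    """Assumed conversion: distance = 1 / strength, no edge = infinity."""
    n = len(strengths)
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                dist[i][j] = 0.0
            elif strengths[i][j] > 0:
                dist[i][j] = 1.0 / strengths[i][j]
    return dist


def floyd_warshall(dist):
    """Return all-pairs shortest distances for a square distance matrix."""
    n = len(dist)
    closed = [row[:] for row in dist]  # leave the observed matrix untouched
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = closed[i][k] + closed[k][j]
                if via_k < closed[i][j]:
                    closed[i][j] = via_k
    return closed


# Vertices: 0 = "entropy", 1 = "graph", 2 = "suffix".
# "entropy" and "suffix" never co-occur, so their observed strength is 0.
observed = [
    [0, 2, 0],
    [2, 0, 1],
    [0, 1, 0],
]
closed = floyd_warshall(strengths_to_distances(observed))
print(closed[0][2])  # inferred distance between "entropy" and "suffix"
```

In this toy matrix the closure fills in a finite distance between two elements that were never observed together, which is the kind of additional value claim 2 refers to.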

13. A system comprising:
- a document adapter that:
  - receives an item comprising text elements; and
  - creates an item identifier for said item;
- an input adapter that:
  - parses said item into text elements; and
  - for each of said text elements, assigns a text element identifier;
- a language model processor that:
  - assigns an entropy value to each of said text elements based on a statistical language model;
- a database engine that:
  - selects a first subset of said text elements, each of said text elements in said first subset having an entropy value greater than a first predefined entropy value;
  - adds each of said text elements to a first data structure, said first data structure comprising said text element identifiers and said item identifier; and
  - creates an adjacency matrix representing a graph comprising vertices representing said text elements and edges representing weighted relationships, said weighted relationships being determined from said first data structure;
- a query engine that:
  - receives a first query comprising a first text element; and
  - returns results derived from said adjacency matrix, said results comprising observed results.
- Dependent claims (14, 15, 16, 17):
  - 14. The system of claim 13 further comprising:
    - a background processor that:
      - locks a first row of said adjacency matrix;
      - while said first row is locked, performs transitive closure on said first row of said adjacency matrix using a first algorithm that determines a shortest path between two of said vertices in said graph; and
      - unlocks said first row when said transitive closure is completed on said first row.
  - 15. The system of claim 14, said language model processor using a plurality of said statistical language models to determine said entropy value.
  - 16. The system of claim 15, one of said statistical language models being a specialized language model.
  - 17. The system of claim 13, said item being at least one of a group composed of:
    - a group of documents;
    - a document; and
    - a subsection of a document.
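
The row-locking background processor of claim 14 can be sketched as follows. Claim 14 does not name the shortest-path algorithm, so Dijkstra's algorithm is used here purely as a stand-in, and the one-lock-per-row layout, the `LockedMatrix` class, and its method names are all hypothetical. Matrix entries are treated as distances, with infinity meaning no direct edge.

```python
import heapq
import math
import threading


class LockedMatrix:
    """Distance matrix with one lock per row (an assumed layout)."""

    def __init__(self, observed):
        self.observed = observed                  # direct edge distances, read-only
        self.rows = [row[:] for row in observed]  # rows the worker rewrites
        self.locks = [threading.Lock() for _ in observed]

    def close_row(self, source):
        """Lock one row, fill it with shortest distances, then unlock it."""
        with self.locks[source]:
            best = [math.inf] * len(self.observed)
            best[source] = 0.0
            heap = [(0.0, source)]
            while heap:
                dist, node = heapq.heappop(heap)
                if dist > best[node]:
                    continue  # stale heap entry
                for nxt, weight in enumerate(self.observed[node]):
                    if dist + weight < best[nxt]:
                        best[nxt] = dist + weight
                        heapq.heappush(heap, (best[nxt], nxt))
            self.rows[source] = best

    def read_row(self, source):
        """Readers take the same row lock, so they never see a partial row."""
        with self.locks[source]:
            return list(self.rows[source])


INF = math.inf
matrix = LockedMatrix([
    [0.0, 0.5, INF],
    [0.5, 0.0, 1.0],
    [INF, 1.0, 0.0],
])


def background_closure():
    for row in range(len(matrix.rows)):
        matrix.close_row(row)


worker = threading.Thread(target=background_closure)
worker.start()
worker.join()
print(matrix.read_row(0))
```

Locking a single row at a time keeps the rest of the matrix available to queries while the closure runs, which appears to be the motivation for the per-row locking in the claim.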

18. A method performed on a computer processor, said method comprising:
- receiving an item comprising text strings;
- determining an item identifier for said item;
- processing said text strings with a statistical language model to:
  - identify text elements;
  - determine text element identifiers for said text elements; and
  - assign an entropy value to each of said elements;
- determining a plurality of entropy level cutoffs;
- creating a plurality of groups of said text elements, each of said plurality of groups having an entropy value greater than one of said plurality of entropy level cutoffs;
- adding each of said group of text elements to a corresponding data structure comprising said text element identifiers and said item identifier;
- creating a graph comprising vertices representing said text elements and edges representing weighted relationships, said weighted relationships being determined from each of said corresponding data structures; and
- receiving a search query for a first text element and responding with search results derived from said graph, said search results being observed search results.
- Dependent claims (19, 20):
  - 19. The method of claim 18 further comprising:
    - applying a first weighting to a first corresponding data structure and a second weighting to a second corresponding data structure when creating said graph.
  - 20. The method of claim 19 further comprising:
    - generating an adjacency matrix from said graph using a first algorithm that determines a shortest path between two of said vertices in said graph; and
    - in response to said search query, responding with second search results derived from said adjacency matrix, said second search results comprising inferred search results.
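
Claims 18 and 19 group text elements by several entropy cutoffs, keep one data structure per group, and weight each structure when building the graph. A minimal sketch under stated assumptions follows: entropy values are supplied directly rather than computed, each data structure is a plain inverted index, and an edge weight is the weighted sum of shared-item counts. The function names, cutoffs, and weights are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def group_by_cutoffs(entropies, cutoffs):
    """Map each cutoff to the set of elements whose entropy exceeds it."""
    return {c: {w for w, h in entropies.items() if h > c} for c in cutoffs}


def add_item(indexes, item_id, entropies, cutoffs):
    """Add one item's elements to the index that corresponds to each group."""
    for cutoff, members in group_by_cutoffs(entropies, cutoffs).items():
        for word in members:
            indexes[cutoff][word].add(item_id)


def weighted_graph(indexes, weights):
    """Sum weighted shared-item counts across the per-cutoff indexes."""
    edges = defaultdict(float)
    for cutoff, index in indexes.items():
        for a, b in combinations(sorted(index), 2):
            shared = len(index[a] & index[b])
            if shared:
                edges[(a, b)] += weights[cutoff] * shared
    return dict(edges)


cutoffs = [8.0, 12.0]
weights = {8.0: 1.0, 12.0: 2.0}  # assumed: the stricter group counts double
indexes = {c: defaultdict(set) for c in cutoffs}
add_item(indexes, "doc1", {"graph": 11.0, "entropy": 12.3, "suffix": 13.3}, cutoffs)
edges = weighted_graph(indexes, weights)
print(edges)
```

Here the pair that survives both cutoffs ends up with a heavier edge than pairs that survive only the looser one, which is one plausible reading of how the weightings in claim 19 could shape the graph.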

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Gherman, Sorin; Mukerjee, Kunal
Application Number: US12/764,107
Publication Number: US 20110264997A1
US Class Current: 715/256
CPC Class Codes: G06F 16/3334 Selection or weighting of t...
