Scalable Incremental Semantic Entity and Relatedness Extraction from Unstructured Text
Abstract
A search engine for documents containing text may process text using a statistical language model, classify the text based on entropy, and create suffix trees or other mappings of the text for each classification. From the suffix trees or mappings, a graph may be constructed with relationship strengths between different words or text strings. The graph may be used to determine search results, and may be browsed or navigated before viewing search results. As new documents are added, they may be processed and added to the suffix trees, then the graph may be created on demand in response to a search request. The graph may be represented as an adjacency matrix, and a transitive closure algorithm may process the adjacency matrix as a background process.
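As an illustration of the transitive closure step mentioned in the abstract, the following is a minimal sketch using Warshall's algorithm on a boolean adjacency matrix. The abstract does not name a particular closure algorithm or say how edge weights are handled, so the choice of Warshall's algorithm, the unweighted matrix, and the three example terms are all assumptions made for illustration.

```python
def transitive_closure(adj):
    """Warshall's algorithm on a boolean adjacency matrix (list of lists).

    Assumption: edges are treated as unweighted; the abstract does not say
    how relationship strengths are combined along a path.
    """
    n = len(adj)
    reach = [row[:] for row in adj]  # copy so the input matrix is not modified
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach


# Hypothetical three-term graph: "graph" - "matrix" - "closure"
terms = ["graph", "matrix", "closure"]
adj = [
    [False, True, False],
    [True, False, True],
    [False, True, False],
]
closure = transitive_closure(adj)
```

In this example "graph" and "closure" share no direct edge, but the closure marks them as related because both connect to "matrix".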
24 Citations
20 Claims
1. A method performed on a computer processor, said method comprising:
receiving an item comprising text strings;
determining an item identifier for said item;
processing said text strings with a statistical language model to:
identify text elements;
determining text element identifiers for said text elements; and
assign an entropy value to each of said elements;
selecting a first subset of said text elements, each of said text elements in said first subset having an entropy value greater than a first predefined entropy value;
adding each of said text elements to a first data structure, said first data structure comprising said text element identifiers and said item identifier;
creating an adjacency matrix representing a graph comprising vertices representing said text elements and edges representing weighted relationships, said weighted relationships being determined from said first data structure; and
receiving a search query for a first text element and responding with search results derived from said adjacency matrix.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
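The steps of claim 1 can be sketched roughly as follows. This is not the patented implementation: the unigram probability table stands in for a statistical language model, the per-element "entropy value" is taken to be self-information in bits, the first data structure is a simple inverted index, and edge weights count shared items. The names `UNIGRAM_P`, `DEFAULT_P`, `THRESHOLD_BITS`, and `Index` are hypothetical, and each of these choices is an assumption that the claim text does not specify.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical unigram probabilities standing in for a statistical language model.
UNIGRAM_P = {"the": 0.05, "of": 0.03, "graph": 0.0004,
             "entropy": 0.0001, "matrix": 0.0003}
DEFAULT_P = 1e-5       # assumed probability for words missing from the table
THRESHOLD_BITS = 8.0   # assumed "first predefined entropy value"


def surprisal(word):
    """Self-information in bits, used here as the per-element entropy value."""
    return -math.log2(UNIGRAM_P.get(word, DEFAULT_P))


class Index:
    def __init__(self):
        self.element_ids = {}             # text element -> text element identifier
        self.postings = defaultdict(set)  # element identifier -> item identifiers

    def add_item(self, item_id, text):
        """Keep only elements whose entropy value exceeds the threshold."""
        for word in text.lower().split():
            if surprisal(word) > THRESHOLD_BITS:
                eid = self.element_ids.setdefault(word, len(self.element_ids))
                self.postings[eid].add(item_id)

    def adjacency(self):
        """Symmetric matrix; weight = number of items two elements share."""
        n = len(self.element_ids)
        matrix = [[0] * n for _ in range(n)]
        for a, b in combinations(range(n), 2):
            shared = len(self.postings[a] & self.postings[b])
            matrix[a][b] = matrix[b][a] = shared
        return matrix

    def search(self, word):
        """Return (related element, weight) pairs, strongest first."""
        eid = self.element_ids.get(word)
        if eid is None:
            return []
        row = self.adjacency()[eid]
        names = {i: w for w, i in self.element_ids.items()}
        hits = [(names[j], w) for j, w in enumerate(row) if w > 0]
        return sorted(hits, key=lambda h: (-h[1], h[0]))


index = Index()
index.add_item("doc1", "the entropy of the graph")
index.add_item("doc2", "the graph matrix")
index.add_item("doc3", "entropy of the graph matrix")
print(index.search("entropy"))
```

Rebuilding the matrix inside `search` mirrors the abstract's note that the graph may be created on demand in response to a search request; a production system would presumably cache it.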
13. A system comprising:
a document adapter that:
receives an item comprising text elements; and
creates an item identifier for said item;
an input adapter that:
parses said item into text elements; and
for each of said text elements, assigns a text element identifier;
a language model processor that:
assigns an entropy value to each of said text elements based on a statistical language model;
a database engine that:
selects a first subset of said text elements, each of said text elements in said first subset having an entropy value greater than a first predefined entropy value;
adds each of said text elements to a first data structure, said first data structure comprising said text element identifiers and said item identifier; and
creates an adjacency matrix representing a graph comprising vertices representing said text elements and edges representing weighted relationships, said weighted relationships being determined from said first data structure;
a query engine that:
receives a first query comprising a first text element; and
returns results derived from said adjacency matrix, said results comprising observed results.
Dependent claims: 14, 15, 16, 17
18. A method performed on a computer processor, said method comprising:
receiving an item comprising text strings;
determining an item identifier for said item;
processing said text strings with a statistical language model to:
identify text elements;
determining text element identifiers for said text elements; and
assign an entropy value to each of said elements;
determining a plurality of entropy level cutoffs;
creating a plurality of groups of said text elements, each of said plurality of groups having an entropy value greater than one of said plurality of entropy level cutoffs;
adding each of said groups of text elements to a corresponding data structure comprising said text element identifiers and said item identifier;
creating a graph comprising vertices representing said text elements and edges representing weighted relationships, said weighted relationships being determined from each of said corresponding data structures; and
receiving a search query for a first text element and responding with search results derived from said graph, said search results being observed search results.
Dependent claims: 19, 20
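Claim 18 differs from claim 1 mainly in using several entropy level cutoffs, each with its own data structure. One possible reading is sketched below, under stated assumptions: the cutoff values are invented, an element joins every group whose cutoff it exceeds (a cumulative interpretation), each data structure is an inverted index, and an edge weight sums the shared-item counts across all structures. The names `CUTOFFS`, `groups_for`, `build_structures`, and `edge_weight` are hypothetical, and the claim does not prescribe any of these choices.

```python
from collections import defaultdict

CUTOFFS = [6.0, 10.0, 14.0]  # hypothetical entropy level cutoffs, in bits


def groups_for(entropy_bits, cutoffs=CUTOFFS):
    """Return every cutoff the value exceeds (a cumulative reading of the claim)."""
    return [c for c in cutoffs if entropy_bits > c]


def build_structures(items, entropies, cutoffs=CUTOFFS):
    """Build one inverted index per cutoff.

    items: {item identifier: [text elements]}
    entropies: {text element: entropy value in bits}
    """
    structures = {c: defaultdict(set) for c in cutoffs}
    for item_id, elements in items.items():
        for element in elements:
            for c in groups_for(entropies[element], cutoffs):
                structures[c][element].add(item_id)
    return structures


def edge_weight(structures, a, b):
    """Sum shared-item counts for elements a and b across all cutoff structures."""
    return sum(len(s[a] & s[b]) for s in structures.values() if a in s and b in s)


# Hypothetical entropy values and items for illustration only.
entropies = {"of": 5.1, "graph": 11.3, "matrix": 11.7, "entropy": 13.3}
items = {"doc1": ["entropy", "of", "graph"], "doc2": ["graph", "matrix"]}
structures = build_structures(items, entropies)
print(edge_weight(structures, "graph", "matrix"))
```

Under this reading, a pair of elements that clears more cutoffs accumulates a larger weight, so higher-entropy elements contribute more strongly to the graph.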
Specification