Concept indexing among database of documents using machine learning techniques
First Claim
1. A computing system for identifying concepts of interests to a user in specific segments of a plurality of documents each having one or more separate segments, the computing system including:
one or more hardware computer processors configured to execute software instructions; and
one or more storage devices storing software instructions configured for execution by the one or more hardware computer processors in order to cause the computing system to:
identify a plurality of segments within the plurality of documents, wherein at least some of the plurality of documents each include two or more segments, wherein identifying segments includes analyzing the plurality of documents for features indicative of possible section headings, including at least one of: casing, spacing, punctuation, common words, or groups of words;
access a concept hierarchy including a plurality of concepts of interest to the user, the concept hierarchy further including concept keywords associated with respective concepts;
for each concept, determine statistical likelihoods that respective identified segments are associated with the concept, the statistical likelihoods each based on at least one of, for each combination of a particular concept and a particular segment:
a density of particular concept keywords in the particular segment, wherein the density is based at least on a ratio of a quantity of particular concept keywords in the particular segment to a quantity of words in the particular segment; or
a distribution of particular concept keywords within the particular segment, wherein the distribution is based on at least one of a longest span in the particular segment without any mention of particular concept keywords or a median gap between consecutive mentions of respective concept keywords in the particular segment; and
store in a concept indexing database the plurality of concepts and the statistical likelihoods that respective concepts are in each of the determined respective segments, wherein the concept indexing database is usable to identify, in response to a user query for a specific concept, a ranked listing of one or more segments having highest statistical likelihoods of being associated with the specific concept.
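The density and distribution measures recited in this claim can be illustrated with a minimal Python sketch. The function name `segment_features`, the tokenizer, and the restriction to single-word keywords are assumptions made for illustration; the claim does not prescribe how the measures are computed or how they are combined into a statistical likelihood.

```python
import re
from statistics import median

def segment_features(text, keywords):
    """Compute keyword density, longest keyword-free span, and median gap.

    Hypothetical helper: assumes single-word keywords and a simple
    lowercase word tokenizer.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    kw = {k.lower() for k in keywords}
    hits = [i for i, w in enumerate(words) if w in kw]
    n = len(words)

    # Density: ratio of keyword occurrences to total words in the segment.
    density = len(hits) / n if n else 0.0

    # Longest span without any keyword mention, counted in words,
    # including the stretches before the first and after the last mention.
    bounds = [-1] + hits + [n]
    longest_span = max(b - a - 1 for a, b in zip(bounds, bounds[1:]))

    # Median gap between consecutive mentions (None with fewer than two).
    gaps = [b - a for a, b in zip(hits, hits[1:])]
    median_gap = median(gaps) if gaps else None

    return {"density": density, "longest_span": longest_span, "median_gap": median_gap}

features = segment_features(
    "oil prices rose as oil supply fell and crude demand grew",
    {"oil", "crude"},
)
```

A high density together with a short longest span and a small median gap would suggest that the concept is discussed throughout the segment rather than mentioned once in passing.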
Abstract
Systems and techniques for indexing and/or querying a database are described herein. Discrete sections and/or segments from documents may be determined by a concept indexing system. The segments may be indexed by concept and/or higher-level category of interest to a user. A user may query the segments by one or more concepts. The segments may be analyzed to rank the segments by statistical accuracy and/or relatedness to one or more particular concepts. The rankings may be used for presentation of search results in a user interface. Furthermore, segments and/or documents may be ranked based on recency decay functions that distinguish between segments that maintain their relevance over time in contrast with temporal segments whose relevance decays quicker over time, for example.
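The recency decay idea in the abstract can be sketched as follows. The abstract does not state the form of the decay functions, so the exponential half-life model, the `kind` labels, and the half-life values below are all assumptions chosen only to show how segments that stay relevant could be ranked differently from temporal segments of the same age.

```python
import math

# Hypothetical half-lives in days: evergreen segments decay slowly,
# temporal segments decay quickly.
HALF_LIFE_DAYS = {"evergreen": 3650.0, "temporal": 30.0}

def decayed_score(base_score, age_days, half_life_days):
    """Exponential decay: the score halves every half_life_days (assumed model)."""
    return base_score * math.exp(-math.log(2) * age_days / half_life_days)

def rank_by_recency(segments):
    """Sort segments by decayed score, highest first.

    Each segment is a dict with 'id', 'score', 'age_days', and 'kind' keys.
    """
    return sorted(
        segments,
        key=lambda s: decayed_score(s["score"], s["age_days"], HALF_LIFE_DAYS[s["kind"]]),
        reverse=True,
    )

ranked = rank_by_recency([
    {"id": "earnings", "score": 0.8, "age_days": 90, "kind": "temporal"},
    {"id": "profile", "score": 0.8, "age_days": 90, "kind": "evergreen"},
])
```

With equal base scores and equal ages, the evergreen segment ranks first because its score has barely decayed, while the temporal segment has lost three half-lives of relevance.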
287 Citations
20 Claims
1. A computing system for identifying concepts of interests to a user in specific segments of a plurality of documents each having one or more separate segments (recited in full under First Claim above).
Dependent claims: 2, 3, 4, 5, 6, 7
8. A computing system for information retrieval, the computing system comprising:
one or more hardware computer processors programmed, via software instructions, to:
access a plurality of documents, each document from the plurality of documents associated with one or more words;
identify, from the plurality of documents, a plurality of segments, wherein each segment of the plurality of segments is identified based at least on analyzing one or more features of each document from the plurality of documents, the one or more features comprising at least one of casing, spacing, punctuation, or common words, and wherein each segment of the plurality of segments is at least associated with a portion of a respective document;
access a plurality of concepts of interest for identification within the plurality of segments;
access a mapping from respective ones of the plurality of concepts to respective keywords from an initial keyword set;
determine a first set of segments from the plurality of segments based at least on the initial keyword set, respective ones from the initial keyword set corresponding to respective words from the first set of segments;
determine a related keyword set based at least on identifying respective words from the first set of segments that were not present in the initial set of keywords;
update the mapping to include associations between respective ones of the plurality of concepts and respective related keywords;
determine a second set of segments from the plurality of segments based at least on the related keyword set, respective ones from the related keyword set corresponding to respective words from the second set of segments;
index the plurality of concepts, wherein respective ones of the plurality of concepts are associated with at least one segment from the first set of segments or the second set of segments, wherein the association between respective ones of the plurality of concepts and the at least one segment is based at least on the mapping;
determine a ranking associated with a first segment from the first set of segments or the second set of segments, and a first concept from the plurality of concepts, wherein the ranking is based on at least one of:
a density of first concept keywords in the first segment, wherein the density is based at least on a ratio of a quantity of first concept keywords in the first segment to a quantity of words in the first segment; or
a distribution of first concept keywords within the first segment, wherein the distribution is based on at least one of a longest span in the first segment without any mention of first concept keywords or a median gap between consecutive mentions of respective first concept keywords in the first segment; and
store the index and the ranking in a non-transitory computer storage.
Dependent claims: 9, 10, 11
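The keyword expansion steps of claim 8 (initial keywords, first set of segments, related keywords, updated mapping, second set of segments) can be sketched in Python. The names `expand_keywords`, `STOPWORDS`, and `min_count` are hypothetical, and the stopword filter and frequency threshold are assumptions added to keep the related keyword set small; the claim itself only requires that related keywords be words from the first set of segments that were not in the initial set.

```python
import re
from collections import Counter

# Hypothetical stopword list used to filter candidate related keywords.
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "as", "with"})

def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def expand_keywords(segments, mapping, min_count=2):
    """segments: {segment_id: text}; mapping: {concept: set of initial keywords}.

    Returns (updated_mapping, first_sets, second_sets), each keyed by concept.
    """
    tokens = {sid: set(tokenize(text)) for sid, text in segments.items()}
    updated, first_sets, second_sets = {}, {}, {}
    for concept, initial in mapping.items():
        initial = {k.lower() for k in initial}
        # First set: segments containing at least one initial keyword.
        first = {sid for sid, toks in tokens.items() if toks & initial}
        # Related keywords: words from the first set that are not initial
        # keywords, kept only if they occur in at least min_count of those
        # segments (assumed filter).
        counts = Counter(
            w for sid in first for w in tokens[sid]
            if w not in initial and w not in STOPWORDS
        )
        related = {w for w, c in counts.items() if c >= min_count}
        # Second set: segments matched by the related keywords.
        second = {sid for sid, toks in tokens.items() if toks & related}
        updated[concept] = initial | related
        first_sets[concept], second_sets[concept] = first, second
    return updated, first_sets, second_sets

segments = {
    "s1": "Oil output rose as OPEC met",
    "s2": "OPEC cut oil quotas",
    "s3": "OPEC ministers gather in Vienna",
    "s4": "Wheat harvest improved",
}
updated, first_sets, second_sets = expand_keywords(segments, {"energy": {"oil"}})
```

In this toy input, "opec" co-occurs with "oil" in two segments, so it is added to the mapping and pulls in a third segment that never mentions "oil" directly.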
12. A computer-implemented method for information retrieval, the computer-implemented method comprising:
identifying a plurality of segments within a plurality of documents, wherein identifying segments includes analyzing the plurality of documents for features indicative of possible section headings, including at least one of: casing, spacing, punctuation, common words, or groups of words;
accessing a concept hierarchy including a plurality of concepts of interest to the user, the concept hierarchy further including concept keywords associated with respective concepts;
for each concept, determining statistical likelihoods that respective identified segments are associated with the concept, the statistical likelihoods each based on at least one of, for each combination of a particular concept and a particular segment:
a density of particular concept keywords in the particular segment, wherein the density is based at least on a ratio of a quantity of particular concept keywords in the particular segment to a quantity of words in the particular segment; or
a distribution of particular concept keywords within the particular segment, wherein the distribution is based on at least one of a longest span in the particular segment without any mention of particular concept keywords or a median gap between consecutive mentions of respective concept keywords in the particular segment;
generating an index from the plurality of concepts and the statistical likelihoods that respective concepts are in each of the determined respective segments; and
storing the index in a non-transitory computer storage.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20
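The segment identification step, which looks for features indicative of section headings (casing, spacing, punctuation, common words), could be approximated by a simple scoring heuristic. Everything below is an illustrative assumption: the word list `COMMON_HEADING_WORDS`, the one-point-per-feature scoring, the length penalty, and the threshold are not taken from the claims, which leave the analysis method open.

```python
# Hypothetical list of words that often open a section heading.
COMMON_HEADING_WORDS = frozenset(
    {"overview", "summary", "introduction", "outlook", "conclusion", "risks"}
)

def heading_score(line, prev_blank):
    """Score a line on casing, spacing, punctuation, and common words."""
    text = line.strip()
    if not text:
        return 0
    words = text.split()
    score = 0
    if text.isupper() or text.istitle():               # casing
        score += 1
    if prev_blank:                                     # spacing
        score += 1
    if not text.endswith((".", ",", ";", "?", "!")):   # punctuation
        score += 1
    if words[0].lower().strip(":") in COMMON_HEADING_WORDS:  # common words
        score += 1
    if len(words) > 8:                                 # long lines are rarely headings
        score -= 2
    return score

def split_segments(document, threshold=3):
    """Split a document into (heading, body) pairs at likely headings."""
    segments, heading, body = [], None, []
    prev_blank = True
    for line in document.splitlines():
        if heading_score(line, prev_blank) >= threshold:
            if heading is not None or body:
                segments.append((heading, " ".join(body)))
            heading, body = line.strip(), []
        elif line.strip():
            body.append(line.strip())
        prev_blank = not line.strip()
    if heading is not None or body:
        segments.append((heading, " ".join(body)))
    return segments

doc = (
    "Market Overview\n\nOil prices rose sharply this quarter.\nDemand stayed firm.\n\n"
    "RISKS\n\nSupply disruptions remain possible.\n"
)
segments = split_segments(doc)
```

A heuristic like this will misfire on short title-cased body lines, which is one reason a production system might learn the feature weights from labeled documents instead of fixing them by hand.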
Specification