Concept indexing among database of documents using machine learning techniques

US 9,348,920 B1
Filed: 06/22/2015
Issued: 05/24/2016
Est. Priority Date: 12/22/2014
Status: Active Grant

Abstract

Systems and techniques for indexing and/or querying a database are described herein. Discrete sections and/or segments from documents may be determined by a concept indexing system. The segments may be indexed by concept and/or higher-level category of interest to a user. A user may query the segments by one or more concepts. The segments may be analyzed to rank the segments by statistical accuracy and/or relatedness to one or more particular concepts. The rankings may be used for presentation of search results in a user interface. Furthermore, segments and/or documents may be ranked based on recency decay functions that distinguish between segments that maintain their relevance over time in contrast with temporal segments whose relevance decays quicker over time, for example.
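
The recency decay described above can be illustrated with a small sketch. The claims in this record mention a logistic function and a count of temporal words, but they do not give the exact form, so everything below is an assumption made for illustration: the word list `TEMPORAL_WORDS`, the parameters `midpoint_days` and `base_steepness`, and the rule that each temporal word steepens the curve.

```python
import math
import re

# Hypothetical list of temporal words; the record does not enumerate them.
TEMPORAL_WORDS = {"today", "yesterday", "tomorrow", "currently", "recently"}


def count_temporal_words(segment_text):
    """Count words in the segment that appear in the assumed temporal word list."""
    words = re.findall(r"[a-z']+", segment_text.lower())
    return sum(1 for w in words if w in TEMPORAL_WORDS)


def recency_score(age_days, temporal_word_count,
                  midpoint_days=30.0, base_steepness=0.05):
    """Logistic decay between 0 and 1: 0.5 at the midpoint age, lower when older.

    Assumption: each temporal word steepens the curve, so time-bound segments
    lose relevance faster than segments with no temporal wording.
    """
    steepness = base_steepness * (1 + temporal_word_count)
    exponent = steepness * (age_days - midpoint_days)
    exponent = min(exponent, 700.0)  # keep math.exp from overflowing
    return 1.0 / (1.0 + math.exp(exponent))
```

Under these assumptions, a 60-day-old segment containing several temporal words scores lower than an equally old segment containing none, which matches the distinction the abstract draws between temporal segments and segments that stay relevant.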

287 Citations

20 Claims

1. A computing system for identifying concepts of interests to a user in specific segments of a plurality of documents each having one or more separate segments, the computing system including:
- one or more hardware computer processors configured to execute software instructions; and
  
  one or more storage devices storing software instructions configured for execution by the one or more hardware computer processors in order to cause the computing system to:
  
  identify a plurality of segments within the plurality of documents, wherein at least some of the plurality of documents each include two or more segments, wherein identifying segments includes analyzing the plurality of documents for features indicative of possible section headings, including at least one of:
  
  casing, spacing, punctuation, common words, or groups of words;
  
  access a concept hierarchy including a plurality of concepts of interest to the user, the concept hierarchy further including concept keywords associated with respective concepts;
  
  for each concept, determine statistical likelihoods that respective identified segments are associated with the concept, the statistical likelihoods each based on at least one of, for each combination of a particular concept and a particular segment:
  
  a density of particular concept keywords in the particular segment, wherein the density is based at least on a ratio of a quantity of particular concept keywords in the particular segment to a quantity of words in the particular segment; or
  
  a distribution of particular concept keywords within the particular segment, wherein the distribution is based on at least one of a longest span in the particular segment without any mention of particular concept keywords or a median gap between consecutive mentions of respective concept keywords in the particular segment; and
  
  store in a concept indexing database the plurality of concepts and the statistical likelihoods that respective concepts are in each of the determined respective segments, wherein the concept indexing database is usable to identify, in response to a user query for a specific concept, a ranked listing of one or more segments having highest statistical likelihoods of being associated with the specific concept.
- - 2. The computing system of claim 1, wherein the concept keywords associated with respective concepts include user generated initial keywords, and the software instructions are further configured to:
    - identify related keywords using machine learning analysis of at least some of the identified plurality of segments; and
      
      add any identified related keywords to the concept keywords associated with the respective concept.
  - 3. The computing system of claim 1, wherein the ranked listing is based at least on recency scores associated with the one or more segments.
  - 4. The computing system of claim 3, wherein execution of the software instructions by the one or more hardware computer processors cause the computing system to:
    - determine a quantity of temporal words within a first segment, wherein a first recency score, from the recency scores, for the first segment is based at least on a logistic function and the quantity of temporal words.
  - 5. The computing system of claim 1, wherein the distribution of particular concept keywords within the particular segment is further based at least on a relationship G, wherein relationship G is defined substantially as:
  - 6. The computing system of claim 1, wherein identifying the plurality of segments within the plurality of documents further comprises merging a first segment and a second segment, wherein a determination to merge the first segment and the second segment is based on a cosine distance calculation above a threshold, the cosine distance calculation based at least on a first word vector associated with the first segment and a second word vector associated with the second segment.
  - 7. The computing system of claim 1, wherein execution of the software instructions by the one or more hardware computer processors cause the computing system to:
    - access a first concept and a second concept;
      
      generate a ranking of an association between a first segment and an intersection of the first concept and the second concept, wherein the ranking is based at least on a relationship R, wherein relationship R is defined substantially as:
      
      $\alpha \cdot \text{geometric mean}(P) \cdot \text{quantity of } P - (1 - \alpha) \cdot \text{sum}(O)$,
      
      where $\alpha$ is a constant, P comprises a first density of at least the first concept and the second concept in the first segment, and O comprises a second density of one or more other concepts in the first segment, wherein the one or more other concepts do not include at least the first concept and the second concept.
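
The density and distribution features recited in claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the tokenizer, the function names, and the restriction to single-word keywords are all assumptions, and the claim does not say how the features are combined into a statistical likelihood, so no combination is attempted here.

```python
import re
from statistics import median


def tokenize(text):
    """Lowercase word tokenizer (an assumed, deliberately simple choice)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_positions(words, keywords):
    """Indices of words that are concept keywords (single-word keywords only)."""
    wanted = {k.lower() for k in keywords}
    return [i for i, w in enumerate(words) if w in wanted]


def density(words, keywords):
    """Ratio of keyword occurrences to total words in the segment."""
    if not words:
        return 0.0
    return len(keyword_positions(words, keywords)) / len(words)


def longest_span_without_keyword(words, keywords):
    """Length of the longest run of consecutive non-keyword words."""
    wanted = {k.lower() for k in keywords}
    longest = run = 0
    for w in words:
        run = 0 if w in wanted else run + 1
        longest = max(longest, run)
    return longest


def median_gap(words, keywords):
    """Median number of words between consecutive keyword mentions.

    Returns None when the segment has fewer than two mentions.
    """
    positions = keyword_positions(words, keywords)
    if len(positions) < 2:
        return None
    return median(b - a - 1 for a, b in zip(positions, positions[1:]))
```

For example, with `words = tokenize("Oil prices rose. Oil demand fell while crude supply held steady.")` and keywords `{"oil", "crude"}`, three of the eleven words are keywords, so a segment that mentions the concept often and evenly gets a high density, a short longest span, and a small median gap.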

8. A computing system for information retrieval, the computing system comprising:
- one or more hardware computer processors programmed, via software instructions, to:
  
  access a plurality of documents, each document from the plurality of documents associated with one or more words;
  
  identify, from the plurality of documents, a plurality of segments, wherein each segment of the plurality of segments is identified based at least on analyzing one or more features of each document from the plurality of documents, the one or more features comprising at least one of casing, spacing, punctuation, or common words, and wherein each segment of the plurality of segments is at least associated with a portion of a respective document;
  
  access a plurality of concepts of interest for identification within the plurality of segments;
  
  access a mapping from respective ones of the plurality of concepts to respective keywords from an initial keyword set;
  
  determine a first set of segments from the plurality of segments based at least on the initial keyword set, respective ones from the initial keyword set corresponding to respective words from the first set of segments;
  
  determine a related keyword set based at least on identifying respective words from the first set of segments that were not present in the initial set of keywords;
  
  update the mapping to include associations between respective ones of the plurality of concepts and respective related keywords;
  
  determine a second set of segments from the plurality of segments based at least on the related keyword set, respective ones from the related keyword set corresponding to respective words from the second set of segments;
  
  index the plurality of concepts, wherein respective ones of the plurality of concepts are associated with at least one segment from the first set of segments or the second set of segments, wherein the association between respective ones of the plurality of concepts and the at least one segment is based at least on the mapping;
  
  determine a ranking associated with a first segment from the first set of segments or the second set of segments, and a first concept from the plurality of concepts, wherein the ranking is based on at least one of:
  
  a density of first concept keywords in the first segment, wherein the density is based at least on a ratio of a quantity of first concept keywords in the first segment to a quantity of words in the first segment; or
  
  a distribution of first concept keywords within the first segment, wherein the distribution is based on at least one of a longest span in the first segment without any mention of first concept keywords or a median gap between consecutive mentions of respective first concept keywords in the first segment; and
  
  store the index and the ranking in a non-transitory computer storage.
- - 9. The computing system of claim 8, wherein the one or more hardware processors are further programmed, via the software instructions, to:
    - receive input comprising at least the first concept;
      
      in response to the input, query the non-transitory computer storage to retrieve a result set based on the first concept and the index, the result set comprising at least the first segment and a second segment; and
      
      transmit the retrieved result set for presentation in a user interface, wherein the ranking affects the presentation of the first segment and the second segment in the user interface.
  - 10. The computing system of claim 9, wherein the ranking is further based at least on a relationship G, wherein relationship G is defined substantially as:
  - 11. The computing system of claim 9, wherein the input further comprises a second concept, and wherein the ranking is further based at least on a relationship R, wherein relationship R is defined substantially as:
    - $\alpha \cdot \text{geometric mean}(P) \cdot \text{quantity of } P - (1 - \alpha) \cdot \text{sum}(O)$,
      
      where $\alpha$ is a constant, P comprises a first density of at least the first concept and the second concept in the first segment, and O comprises a second density of one or more other concepts in the first segment, wherein the one or more other concepts do not include at least the first concept and the second concept.
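
Relationship R, as recited in claims 7, 11, and 19, can be sketched in a few lines. The sketch assumes that P is a list of per-concept densities for the queried concepts, that O is a list of densities for all other concepts found in the segment, and that the garbled "sum(0)" in the source text means the sum over O. The function name and the default value of the constant are hypothetical.

```python
import math


def relationship_r(queried_densities, other_densities, alpha=0.5):
    """alpha * geometric_mean(P) * len(P) - (1 - alpha) * sum(O).

    queried_densities: densities P of the queried concepts in the segment.
    other_densities:   densities O of every other concept in the segment.
    alpha:             the constant from the claim (0.5 is an assumed default).
    """
    if not queried_densities:
        raise ValueError("at least one queried concept density is required")
    count = len(queried_densities)
    # Geometric mean of non-negative densities; zero if any concept is absent.
    geometric_mean = math.prod(queried_densities) ** (1.0 / count)
    reward = alpha * geometric_mean * count
    penalty = (1.0 - alpha) * sum(other_densities)
    return reward - penalty
```

Because the geometric mean is zero whenever one queried concept is missing, a segment only earns a positive reward when every queried concept appears in it, while the second term penalizes segments crowded with unrelated concepts.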

12. A computer-implemented method for information retrieval, the computer-implemented method comprising:
- identifying a plurality of segments within a plurality of documents, wherein identifying segments includes analyzing the plurality of documents for features indicative of possible section headings, including at least one of:
  
  casing, spacing, punctuation, common words, or groups of words;
  
  accessing a concept hierarchy including a plurality of concepts of interest to the user, the concept hierarchy further including concept keywords associated with respective concepts;
  
  for each concept, determining statistical likelihoods that respective identified segments are associated with the concept, the statistical likelihoods each based on at least one of, for each combination of a particular concept and a particular segment:
  
  a density of particular concept keywords in the particular segment, wherein the density is based at least on a ratio of a quantity of particular concept keywords in the particular segment to a quantity of words in the particular segment; or
  
  a distribution of particular concept keywords within the particular segment, wherein the distribution is based on at least one of a longest span in the particular segment without any mention of particular concept keywords or a median gap between consecutive mentions of respective concept keywords in the particular segment;
  
  generating an index from the plurality of concepts and the statistical likelihoods that respective concepts are in each of the determined respective segments; and
  
  storing the index in a non-transitory computer storage.
- - 13. The computer-implemented method of claim 12, further comprising:
    - receiving input comprising at least one search concept;
      
      in response to the input, query the non-transitory computer storage to retrieve a result set based on the at least one search concept and the index, the result set comprising at least one segment; and
      
      transmit the retrieved result set for presentation in a user interface.
  - 14. The computer-implemented method of claim 13, wherein the result set comprises at least a first segment and a second segment, the computer-implemented method further comprising:
    - determining a first ranking associated with at least the first segment; and
      
      transmitting the first ranking, wherein the first ranking affects the presentation of the first segment relative to the second segment in the user interface.
  - 15. The computer-implemented method of claim 14, wherein the at least one search concept comprises a first search concept and a second search concept, and wherein determining the first ranking associated with at least the first segment comprises:
    - determining a first quantity of the first search concept in the first segment, and a second quantity of the second search concept in the first segment;
      
      accessing first histogram data associated with the first search concept, and second histogram data associated with the second search concept; and
      
      determining a first percentile of the first quantity from the first histogram data, and a second percentile of the second quantity from the second histogram data, wherein the first ranking comprises first and second weightings based at least on the first and second percentiles.
  - 16. The computer-implemented method of claim 15, wherein the first weighting is further based at least on a relationship W, wherein relationship W is defined substantially as:
    - T*C, where T comprises the first percentile, and C comprises the first quantity of the first search concept in the first segment.
  - 17. The computer-implemented method of claim 14, wherein determining the first ranking associated with at least the first segment comprises:
    - determining a quantity of temporal words within the first segment; and
      
      determining a recency score for the first segment based at least on a logistic function and the quantity of temporal words, wherein the first ranking comprises the recency score.
  - 18. The computer-implemented method of claim 17, further comprising:
    - accessing histogram data associated with the first search concept, wherein the histogram data indicates segments per a time unit associated with the first search concept; and
      
      selecting the logistic function from a set of logistic functions where the histogram data is within a threshold associated with the logistic function.
  - 19. The computer-implemented method of claim 14, wherein the input further comprises a second concept, and wherein the first ranking is further based at least on a relationship R, wherein relationship R is defined substantially as:
    - $\alpha \cdot \text{geometric mean}(P) \cdot \text{quantity of } P - (1 - \alpha) \cdot \text{sum}(O)$,
      
      where $\alpha$ is a constant, P comprises a first density of at least the first concept and the second concept in the first segment, and O comprises a second density of one or more other concepts in the first segment, wherein the one or more other concepts do not include at least the first concept and the second concept.
  - 20. The computer-implemented method of claim 12, wherein the distribution of particular concept keywords within the particular segment is further based at least on a relationship G, wherein relationship G is defined substantially as:
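
The percentile weighting of claims 15 and 16 (relationship W, defined as T*C) can be sketched as below. This block does not attempt relationship G, whose definition is missing from this record. The claims do not specify the layout of the histogram data, so the sketch assumes it is simply a list of per-segment occurrence counts previously observed for the concept, and that the percentile is the fraction of those counts at or below the current count. The function names are hypothetical.

```python
from bisect import bisect_right


def percentile_of(quantity, observed_quantities):
    """Fraction of observed per-segment counts that are <= quantity."""
    if not observed_quantities:
        return 0.0
    ordered = sorted(observed_quantities)
    return bisect_right(ordered, quantity) / len(ordered)


def weighting(quantity, observed_quantities):
    """Relationship W = T * C.

    T is the percentile of the concept's count within the histogram data,
    and C is the count of the search concept in the segment.
    """
    return percentile_of(quantity, observed_quantities) * quantity
```

With two search concepts, each concept would get its own weighting from its own histogram data, so a count that is unusually high for a rarely mentioned concept contributes more than the same count for a concept that is mentioned everywhere.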

Current Assignee: Palantir Technologies Incorporated
Original Assignee: Palantir Technologies Incorporated
Inventors: Kesin, Max
Primary Examiner(s): Alam, Shahid
Application Number: US14/746,671
Time in Patent Office: 337 Days
Field of Search: 707/669, 707/812
US Class Current: 1/1

CPC Class Codes:
G06F 16/248: Presentation of query results
G06F 16/282: Hierarchical databases, e.g...
G06F 16/31: Indexing; Data structures t...
G06F 16/334: Query execution G06F16/335 ...
G06F 16/338: Presentation of query results
G06F 16/353: into predefined classes
G06F 16/367: Ontology
G06F 16/40: of multimedia data, e.g. sl...
G06F 16/93: Document management systems
G06F 16/951: Indexing; Web crawling tech...
G06F 16/9535: Search customisation based ...
G06F 16/9538: Presentation of query results
G06N 20/00: Machine learning
