Corpus search systems and methods

US 10,102,274 B2
Filed: 03/17/2014
Issued: 10/16/2018
Est. Priority Date: 03/17/2014
Status: Active Grant

Abstract

A corpus of texts relating to a domain of knowledge may be searched by determining noun-pair proximity scores measuring associations between pairs of nouns that appear in the corpus and that are semantically related to the domain of knowledge. When a search term is received, the noun-pair proximity scores may be used (at least in part) to identify one or more related nouns that are strongly associated with the search term within the corpus. One or more texts may be selected from the corpus, namely texts in which the search term and the related nouns appear near each other in one or more places. The selected texts may be categorized and/or clustered based on the related nouns before being returned for presentation as search results.
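The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`related_nouns`, `select_texts`, `search`), the whitespace tokenization, the 5-token window used for "near each other", and the sample corpus and scores are all assumptions made for illustration.

```python
from collections import defaultdict

def related_nouns(scores, term, top_n=2):
    """Return the nouns most strongly associated with `term`, strongest first."""
    totals = defaultdict(float)
    for (a, b), score in scores.items():
        if a == term:
            totals[b] += score
        elif b == term:
            totals[a] += score
    # Sort by descending score, breaking ties alphabetically.
    return sorted(totals, key=lambda n: (-totals[n], n))[:top_n]

def select_texts(corpus, term, noun, window=5):
    """Return ids of texts where `term` and `noun` occur within `window` tokens."""
    hits = []
    for text_id, text in corpus.items():
        tokens = text.lower().split()
        term_pos = [i for i, t in enumerate(tokens) if t == term]
        noun_pos = [i for i, t in enumerate(tokens) if t == noun]
        if any(abs(i - j) <= window for i in term_pos for j in noun_pos):
            hits.append(text_id)
    return hits

def search(corpus, scores, term):
    """Group matching text ids by related noun (a simple form of categorization)."""
    return {noun: select_texts(corpus, term, noun)
            for noun in related_nouns(scores, term)}

# Hypothetical corpus and precomputed noun-pair proximity scores.
corpus = {
    "t1": "aspirin reduces fever in adults",
    "t2": "aspirin may cause bleeding",
    "t3": "fever is common in children",
}
scores = {
    ("aspirin", "fever"): 3.0,
    ("aspirin", "bleeding"): 2.0,
    ("fever", "children"): 1.0,
}
results = search(corpus, scores, "aspirin")
```

Here `results` groups text ids under each related noun, which loosely mirrors the categorization step; a real system would presumably use a proper tokenizer and a tuned proximity window.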

26 Citations

18 Claims

1. A computer-implemented method for searching a corpus of texts relating to a domain of knowledge, the method comprising:
- determining, by said computer, a multiplicity of noun-pair proximity scores measuring associations between pairs of nouns that appear in said corpus of texts and that are semantically related to said domain of knowledge;
  
  obtaining, by said computer, a search term related to said domain of knowledge;
  
  identifying, by said computer based at least in part on said multiplicity of noun-pair proximity scores, a related noun that is strongly associated with said search term within said corpus of texts;
  
  selecting, by said computer, a plurality of texts from said corpus of texts, wherein, in each of said plurality of texts, said search term and said related noun appear near each other in at least one place; and
  
  providing, by said computer, data associated with said plurality of texts for presentation as search results; and
  
  wherein determining said multiplicity of noun-pair proximity scores comprises, for a given text of said corpus of texts:
  
  parsing said given text to identify an independent clause that appears in said given text and that includes at least a first noun and a second noun;
  
  determining a measure of intra-clause proximity based at least in part on said first noun's relationship to said second noun within said independent clause; and
  
  assigning said determined measure of intra-clause proximity to a noun-pair-score data structure corresponding to said first noun and said second noun.
- - 2. The computer-implemented method of claim 1, wherein parsing said given text to identify said independent clause comprises:
    - parsing said given text using a general-purpose grammatical parser to determine a multiplicity of part-of-speech tags corresponding respectively to a multiplicity of words of said given text; and
      
      correcting errors in said multiplicity of part-of-speech tags according to a domain-of-knowledge-specific correction algorithm.
  - 3. The computer-implemented method of claim 1, wherein determining said determined measure of intra-clause proximity comprises:
    - indicating a high measure when said first noun is adjacent to said second noun within said independent clause;
      
      indicating a medium measure when said first noun is separated from said second noun within said independent clause by a linking word; and
      
      otherwise indicating a low measure.
  - 4. The computer-implemented method of claim 1, wherein determining said multiplicity of noun-pair proximity scores further comprises, for a given text of said corpus of texts:
    - determining a second independent clause that appears in said given text and that includes at least a third noun and a fourth noun;
      
      determining a synonym or hypernym corresponding to said third noun;
      
      determining a second measure of intra-clause proximity based at least in part on said third noun's relationship to said fourth noun within a fourth independent clause; and
      
      assigning said second determined measure of intra-clause proximity to a second noun-pair-score data structure corresponding to said synonym or hypernym and said fourth noun.
  - 5. The method of claim 1, wherein identifying said related noun comprises:
    - identifying a plurality of proximate nouns that appear near said search term in at least some of said texts;
      
      ranking said plurality of proximate nouns based at least in part on ranking factors including how frequently and how proximately each of said plurality of proximate nouns appear in relation to said search term across said at least some of said texts; and
      
      selecting a high-ranking one of said plurality of proximate nouns.
  - 6. The computer-implemented method of claim 5, wherein said ranking factors further comprise how statistically likely it is that each of said plurality of proximate nouns relates to a subject of a text in which it appears.
  - 7. The computer-implemented method of claim 1, wherein providing data associated with said plurality of texts comprises:
    - for each of said plurality of texts, identifying a noun phrase that includes said related noun and that appears in said at least one place; and
      
      providing said noun phrase for each of said plurality of texts for presentation as search results.
  - 8. The computer-implemented method of claim 7, further comprising generating a user interface that categorizes said search results into a plurality of categories according to said related noun and one or more other related nouns that are also strongly associated with said search term within said corpus of texts.
  - 9. The computer-implemented method of claim 8, wherein for at least one of said plurality of categories, said user interface sub-categorizes said search results according to said noun phrase and one or more other noun phrases, each of which includes said related noun and appears near said search term in one of said plurality of texts.
  - 10. The computer-implemented method of claim 1, wherein providing data associated with said plurality of texts comprises:
    - for each of said plurality of texts, obtaining at least one contextual snippet of text surrounding said at least one place, said at least one contextual snippet including said search term and said related noun; and
      
      providing said at least one contextual snippet for each of said plurality of texts for presentation as search results.
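The tiered measure recited in claim 3 (high when adjacent, medium when separated by a linking word, low otherwise) can be sketched as follows. The numeric values, the `LINKING_WORDS` set, the `(word, tag)` clause representation, and the choice to accumulate measures in a dictionary keyed by noun pair are assumptions for illustration; the claims do not specify them.

```python
from collections import defaultdict
from itertools import combinations

HIGH, MEDIUM, LOW = 3, 2, 1  # assumed numeric values for the three tiers
LINKING_WORDS = {"of", "in", "for", "and", "with", "is"}  # assumed set

def intra_clause_proximity(clause, i, j):
    """Tiered proximity of the nouns at positions i < j in a tagged clause."""
    between = clause[i + 1 : j]
    if not between:
        return HIGH  # the two nouns are adjacent
    if len(between) == 1 and between[0][0].lower() in LINKING_WORDS:
        return MEDIUM  # separated by exactly one linking word
    return LOW

def score_clause(clause, scores):
    """Add the measure for every noun pair in `clause` to `scores`."""
    noun_idx = [k for k, (_, tag) in enumerate(clause) if tag == "NOUN"]
    for i, j in combinations(noun_idx, 2):
        pair = tuple(sorted((clause[i][0].lower(), clause[j][0].lower())))
        scores[pair] += intra_clause_proximity(clause, i, j)
    return scores

# Hypothetical pre-tagged independent clause.
clause = [
    ("inhibition", "NOUN"), ("of", "ADP"), ("kinase", "NOUN"),
    ("activity", "NOUN"), ("strongly", "ADV"), ("reduces", "VERB"),
    ("growth", "NOUN"),
]
pair_scores = score_clause(clause, defaultdict(int))
```

In this example "kinase" and "activity" are adjacent and receive the high measure, "inhibition" and "kinase" are separated only by "of" and receive the medium measure, and every other pair receives the low measure.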

11. A computing apparatus for searching a corpus of texts relating to a domain of knowledge, the apparatus comprising a processor and a memory storing instructions that, when executed by the processor, configure the apparatus to:
- determine a multiplicity of noun-pair proximity scores measuring associations between pairs of nouns that appear in said corpus of texts and that are semantically related to said domain of knowledge;
  
  obtain a search term related to said domain of knowledge;
  
  identify, based at least in part on said multiplicity of noun-pair proximity scores, a related noun that is strongly associated with said search term within said corpus of texts;
  
  select from said corpus of texts a plurality of texts, wherein, in each of said plurality of texts, said search term and said related noun appear near each other in at least one place; and
  
  provide data associated with said plurality of texts for presentation as search results, and
  
  wherein the instructions that configure the apparatus to determine said multiplicity of noun-pair proximity scores further comprise instructions configuring the apparatus to, for a given text of said corpus of texts:
  
  parse said given text to identify an independent clause that appears in said given text and that includes at least a first noun and a second noun;
  
  determine a measure of intra-clause proximity based at least in part on said first noun's relationship to said second noun within said independent clause; and
  
  assign said determined measure of intra-clause proximity to a noun-pair-score data structure corresponding to said first noun and said second noun.
- - 12. The apparatus of claim 11, wherein the instructions that configure the apparatus to parse said given text to identify said independent clause further comprise instructions configuring the apparatus to:
    - parse said given text using a general-purpose grammatical parser to determine a multiplicity of part-of-speech tags corresponding respectively to a multiplicity of words of said given text; and
      
      correct errors in said multiplicity of part-of-speech tags according to a domain-of-knowledge-specific correction algorithm.
  - 13. The apparatus of claim 11, wherein the instructions that configure the apparatus to determine said determined measure of intra-clause proximity further comprise instructions configuring the apparatus to:
    - indicate a high measure when said first noun is adjacent to said second noun within said independent clause;
      
      indicate a medium measure when said first noun is separated from said second noun within said independent clause by a linking word; and
      
      otherwise indicate a low measure.
  - 14. The apparatus of claim 11, wherein the instructions that configure the apparatus to determine said multiplicity of noun-pair proximity scores further comprise instructions configuring the apparatus to, for a given text of said corpus of texts:
    - determine a second independent clause that appears in said given text and that includes at least a third noun and a fourth noun;
      
      determine a synonym or hypernym corresponding to said third noun;
      
      determine a second measure of intra-clause proximity based at least in part on said third noun's relationship to said fourth noun within a fourth independent clause; and
      
      assign said second determined measure of intra-clause proximity to a second noun-pair-score data structure corresponding to said synonym or hypernym and said fourth noun.
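The synonym or hypernym step recited in claims 4 and 14 can be sketched as a lookup that generalizes a noun before its measure is recorded. The `HYPERNYMS` table and the function name `assign_with_hypernym` are hypothetical, and accumulating (rather than overwriting) the measure is an assumption; a real system might instead consult a lexical database.

```python
from collections import defaultdict

# Hypothetical domain lexicon mapping a noun to a synonym or hypernym.
HYPERNYMS = {"aspirin": "analgesic", "ibuprofen": "analgesic"}

def assign_with_hypernym(scores, third_noun, fourth_noun, measure):
    """Record `measure` under the (synonym-or-hypernym, fourth noun) pair."""
    generalized = HYPERNYMS.get(third_noun, third_noun)
    scores[(generalized, fourth_noun)] += measure
    return scores

scores_by_pair = defaultdict(float)
assign_with_hypernym(scores_by_pair, "aspirin", "fever", 3.0)
assign_with_hypernym(scores_by_pair, "ibuprofen", "fever", 2.0)
```

Both measures land on the pair `("analgesic", "fever")`, so evidence from specific nouns strengthens the association recorded for the more general noun.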

15. A non-transitory computer-readable storage medium having stored thereon instructions including instructions that, when executed by a processor, configure the processor to:
- determine a multiplicity of noun-pair proximity scores measuring associations between pairs of nouns that appear in texts and that are semantically related to a domain of knowledge;
  
  obtain a search term related to said domain of knowledge;
  
  identify, based at least in part on said multiplicity of noun-pair proximity scores, a related noun that is strongly associated with said search term within a corpus of texts;
  
  select from said corpus of texts a plurality of texts, wherein, in each of said plurality of texts, said search term and said related noun appear near each other in at least one place; and
  
  provide data associated with said plurality of texts for presentation as search results, and
  
  wherein the instructions that configure the processor to determine said multiplicity of noun-pair proximity scores further comprise instructions configuring the processor to, for a given text of said corpus of texts:
  
  parse said given text to identify an independent clause that appears in said given text and that includes at least a first noun and a second noun;
  
  determine a measure of intra-clause proximity based at least in part on said first noun's relationship to said second noun within said independent clause; and
  
  assign said determined measure of intra-clause proximity to a noun-pair-score data structure corresponding to said first noun and said second noun.
- - 16. The non-transitory computer-readable storage medium of claim 15, wherein the instructions that configure the processor to parse said given text to identify said independent clause further comprise instructions configuring the processor to:
    - parse said given text using a general-purpose grammatical parser to determine a multiplicity of part-of-speech tags corresponding respectively to a multiplicity of words of said given text; and
      
      correct errors in said multiplicity of part-of-speech tags according to a domain-of-knowledge-specific correction algorithm.
  - 17. The non-transitory computer-readable storage medium of claim 15, wherein the instructions that configure the processor to determine said determined measure of intra-clause proximity further comprise instructions configuring the processor to:
    - indicate a high measure when said first noun is adjacent to said second noun within said independent clause;
      
      indicate a medium measure when said first noun is separated from said second noun within said independent clause by a linking word; and
      
      otherwise indicate a low measure.
  - 18. The non-transitory computer-readable storage medium of claim 15, wherein the instructions that configure the processor to determine said multiplicity of noun-pair proximity scores further comprise instructions configuring the processor to, for a given text of said corpus of texts:
    - determine a second independent clause that appears in said given text and that includes at least a third noun and a fourth noun;
      
      determine a synonym or hypernym corresponding to said third noun;
      
      determine a second measure of intra-clause proximity based at least in part on said third noun's relationship to said fourth noun within a fourth independent clause; and
      
      assign said second determined measure of intra-clause proximity to a second noun-pair-score data structure corresponding to said synonym or hypernym and said fourth noun.
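The two-stage tagging recited in claims 2, 12, and 16 (a general-purpose parser followed by a domain-specific correction) can be sketched as below. The function `general_purpose_tag` is only a stand-in for a real tagger, and the bigram-keyed `DOMAIN_CORRECTIONS` table is a hypothetical form of correction algorithm; the claims do not say how the corrections are expressed.

```python
def general_purpose_tag(words):
    """Stand-in for a general-purpose tagger: lexicon lookup, NOUN by default."""
    lexicon = {"the": "DET", "lead": "VERB", "compound": "ADJ",
               "binds": "VERB", "to": "ADP"}
    return [(w, lexicon.get(w.lower(), "NOUN")) for w in words]

# Hypothetical domain rule: in this domain, "lead compound" is ADJ + NOUN.
DOMAIN_CORRECTIONS = {("lead", "compound"): ("ADJ", "NOUN")}

def correct_tags(tagged):
    """Overwrite tags for word bigrams listed in DOMAIN_CORRECTIONS."""
    corrected = list(tagged)
    for k in range(len(corrected) - 1):
        key = (corrected[k][0].lower(), corrected[k + 1][0].lower())
        if key in DOMAIN_CORRECTIONS:
            first_tag, second_tag = DOMAIN_CORRECTIONS[key]
            corrected[k] = (corrected[k][0], first_tag)
            corrected[k + 1] = (corrected[k + 1][0], second_tag)
    return corrected

words = "The lead compound binds to kinase".split()
raw_tags = general_purpose_tag(words)
fixed_tags = correct_tags(raw_tags)
```

After correction, "compound" is tagged as a noun, so a later clause-level scoring step would be able to pair it with "kinase".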

Current Assignee: NLPCore LLC
Original Assignee: NLPCore LLC
Inventors: Mittal, Varun
Primary Examiner(s): Vital, Pierre M
Application Number: US14/216,059
Publication Number: US 20150261850A1
Time in Patent Office: 1,674 Days
Field of Search: 707724
CPC Class Codes:
G06F 16/24578   using ranking
G06F 16/285   Clustering or classification
G06F 16/3344   using natural language anal...