CORPUS SEARCH SYSTEMS AND METHODS

US 20150261850A1
Filed: 03/17/2014
Published: 09/17/2015
Est. Priority Date: 03/17/2014
Status: Active Grant

Abstract

A corpus of texts relating to a domain of knowledge may be searched by determining noun-pair proximity scores that measure associations between pairs of nouns that appear in the corpus and are semantically related to the domain of knowledge. When a search term is received, the noun-pair proximity scores may be used, at least in part, to identify one or more related nouns that are strongly associated with the search term within the corpus. One or more texts in which the search term and the related nouns appear near each other in one or more places may then be selected from the corpus. The selected texts may be categorized and/or clustered based on the related nouns before being returned for presentation as search results.
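
The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the regex tokenizer, the token-window definition of "near each other", the `window` parameter, the function names, and the sample corpus are all assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lowercase word tokens; a stand-in for real linguistic tokenization."""
    return re.findall(r"[a-z0-9]+", text.lower())

def appears_near(tokens, term, related, window=5):
    """True if `term` and `related` occur within `window` tokens of each other."""
    term_pos = [i for i, t in enumerate(tokens) if t == term]
    rel_pos = [i for i, t in enumerate(tokens) if t == related]
    return any(abs(i - j) <= window for i in term_pos for j in rel_pos)

def select_texts(corpus, term, related, window=5):
    """Return ids of texts in which the two words co-occur nearby at least once."""
    return [doc_id for doc_id, text in corpus.items()
            if appears_near(tokenize(text), term, related, window)]

# Hypothetical three-document corpus, keyed by document id.
corpus = {
    "a": "The kinase inhibitor reduced tumor growth in mice.",
    "b": "A kinase was purified. Many pages later, an unrelated inhibitor "
         "appears in a separate and distant context.",
    "c": "Tumor samples were stored at low temperature.",
}
selected = select_texts(corpus, "kinase", "inhibitor", window=3)
```

Under these assumptions only document `a` is selected: document `b` contains both words, but they are more than three tokens apart.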

142 Citations

21 Claims

1. A computer-implemented method for searching a corpus of texts relating to a domain of knowledge, the method comprising:
- determining, by said computer, a multiplicity of noun-pair proximity scores measuring associations between pairs of nouns that appear in said texts and that are semantically related to said domain of knowledge;
  
  obtaining, by said computer, a search term related to said domain of knowledge;
  
  identifying, by said computer based at least in part on said multiplicity of noun-pair proximity scores, a related noun that is strongly associated with said search term within said corpus;
  
  selecting, by said computer, from said corpus a plurality of texts, in each of which said search term and said related noun appear near each other in at least one place; and
  
  providing, by said computer, data associated with said plurality of selected texts for presentation as search results.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11):
  - 2. The method of claim 1, wherein determining said multiplicity of noun-pair proximity scores comprises, for a given text of said corpus:
    - determining an independent clause that appears in said given text and that includes at least a first noun and a second noun;
      
      determining a measure of intra-clause proximity based at least in part on said first noun's relationship to said second noun within said independent clause; and
      
      assigning said determined measure of intra-clause proximity to a noun-pair-score data structure corresponding to said first noun and said second noun.
  - 3. The method of claim 2, wherein determining said independent clause comprises:
    - parsing said given text using a general-purpose grammatical parser to determine a multiplicity of part-of-speech tags corresponding respectively to a multiplicity of words of said given text; and
      
      correcting errors in said multiplicity of part-of-speech tags according to a domain-of-knowledge-specific correction algorithm.
  - 4. The method of claim 2, wherein determining said determined measure of intra-clause proximity comprises:
    - indicating a high measure when said first noun is adjacent to said second noun within said independent clause;
      
      indicating a medium measure when said first noun is separated from said second noun within said independent clause by a linking word; and
      
      otherwise indicating a low measure.
  - 5. The method of claim 1, wherein determining said multiplicity of noun-pair proximity scores further comprises, for a given text of said corpus:
    - determining a second independent clause that appears in said given text and that includes at least a third noun and a fourth noun;
      
      determining a synonym or hypernym corresponding to said third noun;
      
      determining a second measure of intra-clause proximity based at least in part on said third noun's relationship to said fourth noun within a fourth independent clause; and
      
      assigning said second determined measure of intra-clause proximity to a second noun-pair-score data structure corresponding to said synonym or hypernym and said fourth noun.
  - 6. The method of claim 1, wherein identifying said related noun comprises:
    - identifying a plurality of proximate nouns that appear near said search term in at least some of said texts;
      
      ranking said plurality of proximate nouns based at least in part on ranking factors including how frequently and how proximately each of said plurality of proximate nouns appear in relation to said search term across said at least some of said texts; and
      
      selecting a high-ranking one of said plurality of proximate nouns.
  - 7. The method of claim 6, wherein said ranking factors further comprise how statistically likely it is that each of said plurality of proximate nouns relates to a subject of a text in which it appears.
  - 8. The method of claim 1, wherein providing data associated with said plurality of selected texts comprises:
    - for each of said plurality of selected texts, identifying a noun phrase that includes said related noun and that appears in said at least one place; and
      
      providing said noun phrase for each of said plurality of selected texts for presentation as search results.
  - 9. The method of claim 8, further comprising generating a user interface that categorizes said search results into a plurality of categories according to said related noun and one or more other related nouns that are also strongly associated with said search term within said corpus.
  - 10. The method of claim 9, wherein for at least one of said plurality of categories, said user interface sub-categorizes said search results according to said noun phrase and one or more other noun phrases, each of which includes said related noun and appears near said search term in one of said plurality of selected texts.
  - 11. The method of claim 1, wherein providing data associated with said plurality of selected texts comprises:
    - for each of said plurality of selected texts, obtaining at least one contextual snippet of text surrounding said at least one place, said at least one contextual snippet including said search term and said related noun; and
      
      providing said at least one contextual snippet for each of said plurality of selected texts for presentation as search results.
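
The scoring and ranking steps recited in claims 2, 4, and 6 can be illustrated with a minimal Python sketch. It is not the patented implementation: the numeric values for the high, medium, and low measures, the `LINKING_WORDS` set, the `NOUN` tag name, the summing of measures into a per-pair total, and all function names are assumptions made for illustration. Clauses are assumed to arrive already split and part-of-speech tagged.

```python
from collections import defaultdict
from itertools import combinations

# Assumed numeric values for the high / medium / low measures of claim 4.
HIGH, MEDIUM, LOW = 3, 2, 1
# Assumed set of "linking words"; the claims do not enumerate them.
LINKING_WORDS = {"of", "in", "for", "and", "with"}

def intra_clause_proximity(clause, i, j):
    """Measure for the nouns at token positions i < j of one tagged clause."""
    between = clause[i + 1:j]
    if not between:
        return HIGH                      # nouns are adjacent
    if len(between) == 1 and between[0][0] in LINKING_WORDS:
        return MEDIUM                    # separated by a single linking word
    return LOW                           # anything else

def score_clause(clause, scores):
    """Add the measure for every noun pair in `clause` to `scores`."""
    noun_positions = [k for k, (_, tag) in enumerate(clause) if tag == "NOUN"]
    for i, j in combinations(noun_positions, 2):
        pair = frozenset((clause[i][0], clause[j][0]))
        scores[pair] += intra_clause_proximity(clause, i, j)

def rank_related(scores, term):
    """Nouns paired with `term`, strongest accumulated score first."""
    related = {}
    for pair, score in scores.items():
        if term in pair and len(pair) == 2:
            (other,) = pair - {term}
            related[other] = score
    return sorted(related, key=lambda n: (-related[n], n))

# Hypothetical pre-tagged clauses: lists of (word, tag) pairs.
clauses = [
    [("kinase", "NOUN"), ("inhibitor", "NOUN"), ("blocks", "VERB"), ("growth", "NOUN")],
    [("inhibitor", "NOUN"), ("of", "ADP"), ("kinase", "NOUN")],
]
scores = defaultdict(int)                # the noun-pair-score data structure
for clause in clauses:
    score_clause(clause, scores)

ranking = rank_related(scores, "kinase")
```

Summing the per-clause measures is one simple way to let both frequency and proximity influence the ranking: here "inhibitor" accumulates 3 + 2 = 5 against "kinase", while "growth" accumulates only 1.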

12. A computing apparatus for searching a corpus of texts relating to a domain of knowledge, the apparatus comprising a processor and a memory storing instructions that, when executed by the processor, configure the apparatus to:
- determine a multiplicity of noun-pair proximity scores measuring associations between pairs of nouns that appear in said texts and that are semantically related to said domain of knowledge;
  
  obtain a search term related to said domain of knowledge;
  
  identify, based at least in part on said multiplicity of noun-pair proximity scores, a related noun that is strongly associated with said search term within said corpus;
  
  select from said corpus a plurality of texts, in each of which said search term and said related noun appear near each other in at least one place; and
  
  provide data associated with said plurality of selected texts for presentation as search results.
- Dependent claims (13, 14, 15, 16):
  - 13. The apparatus of claim 12, wherein the instructions that configure the apparatus to determine said multiplicity of noun-pair proximity scores further comprise instructions configuring the apparatus to, for a given text of said corpus:
    - determine an independent clause that appears in said given text and that includes at least a first noun and a second noun;
      
      determine a measure of intra-clause proximity based at least in part on said first noun's relationship to said second noun within said independent clause; and
      
      assign said determined measure of intra-clause proximity to a noun-pair-score data structure corresponding to said first noun and said second noun.
  - 14. The apparatus of claim 13, wherein the instructions that configure the apparatus to determine said independent clause further comprise instructions configuring the apparatus to:
    - parse said given text using a general-purpose grammatical parser to determine a multiplicity of part-of-speech tags corresponding respectively to a multiplicity of words of said given text; and
      
      correct errors in said multiplicity of part-of-speech tags according to a domain-of-knowledge-specific correction algorithm.
  - 15. The apparatus of claim 13, wherein the instructions that configure the apparatus to determine said determined measure of intra-clause proximity further comprise instructions configuring the apparatus to:
    - indicate a high measure when said first noun is adjacent to said second noun within said independent clause;
      
      indicate a medium measure when said first noun is separated from said second noun within said independent clause by a linking word; and
      
      otherwise indicate a low measure.
  - 16. The apparatus of claim 12, wherein the instructions that configure the apparatus to determine said multiplicity of noun-pair proximity scores further comprise instructions configuring the apparatus to, for a given text of said corpus:
    - determine a second independent clause that appears in said given text and that includes at least a third noun and a fourth noun;
      
      determine a synonym or hypernym corresponding to said third noun;
      
      determine a second measure of intra-clause proximity based at least in part on said third noun's relationship to said fourth noun within a fourth independent clause; and
      
      assign said second determined measure of intra-clause proximity to a second noun-pair-score data structure corresponding to said synonym or hypernym and said fourth noun.
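
The synonym-or-hypernym assignment recited in claims 5 and 16 can be illustrated with a minimal Python sketch. The `CANONICAL` lexicon, its entries, and the function names are hypothetical; the claims do not say how a synonym or hypernym is determined, so a fixed lookup table stands in for that step.

```python
from collections import defaultdict

# Hypothetical domain lexicon mapping a noun to a synonym or hypernym.
CANONICAL = {"neoplasm": "tumor", "carcinoma": "tumor", "mouse": "rodent"}

def canonical(noun):
    """Return the mapped synonym or hypernym, or the noun itself if unmapped."""
    return CANONICAL.get(noun, noun)

def assign_with_canonical(scores, third_noun, fourth_noun, measure):
    """Credit `measure` to the (synonym-or-hypernym, fourth noun) pair."""
    pair = frozenset((canonical(third_noun), fourth_noun))
    scores[pair] += measure
    return pair

scores = defaultdict(int)
assign_with_canonical(scores, "carcinoma", "biopsy", 3)
assign_with_canonical(scores, "neoplasm", "biopsy", 2)
```

Under these assumptions both observations are pooled under the single pair ("tumor", "biopsy"), so evidence spread across variant wordings is not fragmented.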

17. A non-transitory computer-readable storage medium having stored thereon instructions including instructions that, when executed by a processor, configure the processor to:
- determine a multiplicity of noun-pair proximity scores measuring associations between pairs of nouns that appear in texts and that are semantically related to a domain of knowledge;
  
  obtain a search term related to said domain of knowledge;
  
  identify, based at least in part on said multiplicity of noun-pair proximity scores, a related noun that is strongly associated with said search term within a corpus;
  
  select from said corpus a plurality of texts, in each of which said search term and said related noun appear near each other in at least one place; and
  
  provide data associated with said plurality of selected texts for presentation as search results.
- Dependent claims (18, 19, 20, 21):
  - 18. The non-transitory computer-readable storage medium of claim 17, wherein the instructions that configure the processor to determine said multiplicity of noun-pair proximity scores further comprise instructions configuring the processor to, for a given text of said corpus:
    - determine an independent clause that appears in said given text and that includes at least a first noun and a second noun;
      
      determine a measure of intra-clause proximity based at least in part on said first noun's relationship to said second noun within said independent clause; and
      
      assign said determined measure of intra-clause proximity to a noun-pair-score data structure corresponding to said first noun and said second noun.
  - 19. The non-transitory computer-readable storage medium of claim 18, wherein the instructions that configure the processor to determine said independent clause further comprise instructions configuring the processor to:
    - parse said given text using a general-purpose grammatical parser to determine a multiplicity of part-of-speech tags corresponding respectively to a multiplicity of words of said given text; and
      
      correct errors in said multiplicity of part-of-speech tags according to a domain-of-knowledge-specific correction algorithm.
  - 20. The non-transitory computer-readable storage medium of claim 18, wherein the instructions that configure the processor to determine said determined measure of intra-clause proximity further comprise instructions configuring the processor to:
    - indicate a high measure when said first noun is adjacent to said second noun within said independent clause;
      
      indicate a medium measure when said first noun is separated from said second noun within said independent clause by a linking word; and
      
      otherwise indicate a low measure.
  - 21. The non-transitory computer-readable storage medium of claim 17, wherein the instructions that configure the processor to determine said multiplicity of noun-pair proximity scores further comprise instructions configuring the processor to, for a given text of said corpus:
    - determine a second independent clause that appears in said given text and that includes at least a third noun and a fourth noun;
      
      determine a synonym or hypernym corresponding to said third noun;
      
      determine a second measure of intra-clause proximity based at least in part on said third noun's relationship to said fourth noun within a fourth independent clause; and
      
      assign said second determined measure of intra-clause proximity to a second noun-pair-score data structure corresponding to said synonym or hypernym and said fourth noun.
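
The parse-then-correct step recited in claims 3, 14, and 19 can be illustrated with a minimal Python sketch. The claims do not describe the correction algorithm, so a simple per-word override table is assumed here. The biomedical domain, the `DOMAIN_TAG_OVERRIDES` entries, the tag names, and the sample tagger output are all hypothetical and do not come from any real parser.

```python
# Hypothetical domain lexicon: words a general-purpose tagger might mislabel
# in biomedical text, mapped to the tag they are assumed to carry in this domain.
DOMAIN_TAG_OVERRIDES = {"p53": "NOUN", "knockout": "NOUN", "western": "ADJ"}

def correct_tags(tagged, overrides=DOMAIN_TAG_OVERRIDES):
    """Return (word, tag) pairs with domain-specific overrides applied."""
    return [(word, overrides.get(word.lower(), tag)) for word, tag in tagged]

# Illustrative (made-up) output of a general-purpose tagger.
general = [("p53", "NUM"), ("knockout", "VERB"), ("mice", "NOUN"), ("died", "VERB")]
corrected = correct_tags(general)
```

With these assumed overrides, "p53" and "knockout" are retagged as nouns, so later steps can treat them as candidates for noun pairs, while tags for words outside the lexicon pass through unchanged.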

Current Assignee: NLPCore LLC
Original Assignee: NLPCore LLC
Inventors: MITTAL, Varun
Granted Patent: US 10,102,274 B2

Field of Search
US Class Current: 1/1
CPC Class Codes:
- G06F 16/24578   using ranking
- G06F 16/285   Clustering or classification
- G06F 16/3344   using natural language anal...
