Automated collective term and phrase index

US 9,864,741 B2
Filed: 09/23/2015
Issued: 01/09/2018
Est. Priority Date: 09/23/2014
Status: Active Grant

Abstract

Knowledge automation techniques may include selecting a knowledge element from a knowledge corpus of an enterprise for extraction of n-grams, and deriving a term vector comprising terms in the knowledge element. Based at least on a frequency of occurrence of each term in the knowledge element, key terms are identified in the term vector. Thereafter, the identified key terms are used to extract one or more n-grams from the knowledge element. Each of the extracted n-grams is scored as a function of at least a frequency of occurrence of each of the n-grams across the knowledge corpus of the enterprise, and based on the scoring, one or more of the n-grams is added to a collective term and phrase index.
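
The flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the stop-word list, the `top_k` key-term rule, the bigram window, and the `min_score` cutoff are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use proper NLP filtering.
STOP_WORDS = {"a", "an", "the", "is", "of", "per", "each", "and", "to", "in"}

def tokenize(text):
    """Lowercase word tokens; a deliberately simple stand-in for real NLP."""
    return re.findall(r"[a-z0-9]+", text.lower())

def term_vector(tokens):
    """Map each non-stop-word term to its frequency in the knowledge element."""
    return Counter(t for t in tokens if t not in STOP_WORDS)

def key_terms(vector, top_k=2):
    """Pick the top_k most frequent terms (assumed selection rule)."""
    return {term for term, _ in vector.most_common(top_k)}

def extract_ngrams(tokens, keys, n=2):
    """Collect n-token windows that contain a key term and no stop word."""
    grams = set()
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if any(t in keys for t in window) and not any(t in STOP_WORDS for t in window):
            grams.add(" ".join(window))
    return grams

def corpus_frequency(ngram, corpus_tokens, n=2):
    """Count occurrences of the n-gram across every element of the corpus."""
    total = 0
    for tokens in corpus_tokens:
        for i in range(len(tokens) - n + 1):
            if " ".join(tokens[i:i + n]) == ngram:
                total += 1
    return total

def build_index(corpus, element_id, min_score=2):
    """Score the n-grams of one element and keep those at or above min_score."""
    corpus_tokens = [tokenize(doc) for doc in corpus]
    tokens = corpus_tokens[element_id]
    keys = key_terms(term_vector(tokens))
    index = {}
    for gram in extract_ngrams(tokens, keys):
        s = corpus_frequency(gram, corpus_tokens)
        if s >= min_score:
            index[gram] = s
    return index
```

For a two-document corpus in which "term vector" appears three times overall, `build_index` on the first document would add only that phrase to the index, since every other candidate bigram occurs once.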

34 Citations

20 Claims

1. A method comprising:
- obtaining, by a computer system, data files from a knowledge corpus of an enterprise;
  
  identifying, by the computer system, key terms within the data files;
  
  determining, by the computer system, for each identified key term, a frequency of occurrence and location within the data files;
  
  generating, by the computer system, knowledge units from the data files based on the determined frequencies of occurrence and the determined locations of the key terms in the data files;
  
  selecting, by the computing system, a knowledge unit from the generated knowledge units for extraction of n-grams;
  
  deriving, by the computing system, a term vector for the knowledge unit based at least on the determined frequencies of occurrence and the determined locations of the key terms in the knowledge unit;
  
  identifying, by the computing system, the key terms in the term vector based at least on the frequency of occurrence of each key term in the knowledge unit;
  
  extracting, by the computing system, n-grams using the key terms in the term vector;
  
  scoring, by the computing system, each of the extracted n-grams as a function of at least a frequency of occurrence of each of the n-grams across the knowledge corpus of the enterprise; and
  
  adding, by the computing system, one or more of the extracted n-grams to an index based on the scoring.
- - 2. The method of claim 1, wherein the deriving the term vector includes modeling the key terms identified in the knowledge unit as the term vector, and wherein a value for each key term in the term vector is calculated as a function of at least the frequency of occurrence of the key term and the position of each occurrence of the key term in the knowledge unit.
  - 3. The method of claim 2, wherein the deriving the term vector further includes performing natural language processing on the key terms in the knowledge unit, and filtering the key terms in the knowledge unit based on the natural language processing.
  - 4. The method of claim 1, wherein the extracting the one or more n-grams using the identified key terms includes:
    - identifying one or more terms adjacent to each key term in the knowledge unit;
      
      performing natural language processing on the one or more terms adjacent to each key term and the key terms;
      
      calculating a probability of the one or more terms adjacent to each key term in the knowledge unit as preceding or following the key term based on a function of the natural language processing;
      
      when the probability of the one or more terms being adjacent to the key term is greater than a minimum threshold probability, extracting an n-gram comprising the one or more terms and the key term; and
      
      when the probability of the one or more terms being adjacent to the key term is less than the minimum threshold probability, extracting an n-gram comprising only the key term.
  - 5. The method of claim 4, wherein the calculating the probability of the one or more terms adjacent to each key term in the knowledge unit as preceding or following the key term is based on the function of the natural language processing and a frequency of occurrence of the one or more terms adjacent to each key term.
  - 6. The method of claim 1, wherein the scoring each of the extracted n-grams is a function of the frequency of occurrence of the n-gram, a recency of the n-gram, and a commonality of the n-gram across the knowledge corpus of the enterprise.
  - 7. The method of claim 1, wherein the adding the one or more of the extracted n-grams to the index includes:
    - determining a total number of n-grams extracted for the knowledge unit;
      
      determining a top percentage of the n-grams; and
      
      adding the top percentage of the n-grams to the index.
  - 8. The method of claim 1, wherein the adding the one or more of the extracted n-grams to the index includes:
    - setting a minimum threshold score;
      
      determining whether the score for each of the extracted n-grams is above the minimum threshold score; and
      
      when the score for an n-gram is above the minimum threshold score, adding the n-gram to the index.
  - 9. The method of claim 1, wherein the index is a corporate dictionary comprising a set of n-grams that identify each knowledge unit within the knowledge corpus of the enterprise, and the set of n-grams comprises the one or more of the extracted n-grams added to the index.
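
Claims 2 and 4 can be illustrated with a short Python sketch. It rests on assumptions that the claims do not fix: the position weighting (earlier occurrences count more) is invented for the example, and the "function of the natural language processing" is replaced by a plain relative-frequency estimate of how often a neighbour precedes or follows the key term. All function and parameter names are hypothetical.

```python
from collections import Counter

def position_weighted_vector(tokens, keys):
    """Value per key term from its frequency and the position of each occurrence.

    Assumed weighting: an occurrence at position p in a unit of n tokens
    contributes 1 - p/n, so earlier mentions count more.
    """
    n = len(tokens)
    vector = Counter()
    for pos, tok in enumerate(tokens):
        if tok in keys:
            vector[tok] += 1.0 - pos / n
    return dict(vector)

def extract_with_adjacency(tokens, keys, min_prob=0.5):
    """Extend a key term with a neighbour only if the adjacency is probable.

    The probability is estimated as count(neighbour next to key) / count(key),
    a stand-in for the NLP-based function named in the claim.
    """
    key_counts = Counter(t for t in tokens if t in keys)
    before = Counter()  # (preceding term, key term) pairs
    after = Counter()   # (key term, following term) pairs
    for i, tok in enumerate(tokens):
        if tok in keys:
            if i > 0:
                before[(tokens[i - 1], tok)] += 1
            if i + 1 < len(tokens):
                after[(tok, tokens[i + 1])] += 1

    ngrams = set()
    for key, count in key_counts.items():
        extended = False
        for (prev, k), c in before.items():
            if k == key and c / count > min_prob:
                ngrams.add(f"{prev} {key}")
                extended = True
        for (k, nxt), c in after.items():
            if k == key and c / count > min_prob:
                ngrams.add(f"{key} {nxt}")
                extended = True
        if not extended:
            # Below the threshold: the n-gram is the key term alone.
            ngrams.add(key)
    return ngrams
```

With the tokens of "knowledge unit a knowledge unit b knowledge pack" and the key term "knowledge", "unit" follows the key term in two of three occurrences, so the sketch extracts "knowledge unit"; raising `min_prob` to 0.9 leaves only "knowledge".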

10. A non-transitory computer-readable storage memory storing a plurality of instructions executable by one or more processors, the plurality of instructions comprising:
- instructions that cause the one or more processors to obtain data files from a knowledge corpus of an enterprise;
  
  instructions that cause the one or more processors to identify key terms within the data files;
  
  instructions that cause the one or more processors to determine, for each identified key term, a frequency of occurrence and location within the data files;
  
  instructions that cause the one or more processors to generate knowledge units from the data files based on the determined frequencies of occurrence and the determined locations of the key terms in the data files;
  
  instructions that cause the one or more processors to select a knowledge unit from the generated knowledge units for extraction of n-grams;
  
  instructions that cause the one or more processors to derive a term vector comprising terms in the knowledge unit based at least on the determined frequencies of occurrence and the determined locations of the key terms in the knowledge unit;
  
  instructions that cause the one or more processors to identify the key terms in the term vector based at least on the frequency of occurrence of each key term in the knowledge unit;
  
  instructions that cause the one or more processors to calculate a probability of one or more terms adjacent to each key term in the knowledge unit as preceding or following the key term based on a function of natural language processing;
  
  instructions that cause the one or more processors to extract an n-gram comprising the one or more terms and the key term when the probability of the one or more terms being adjacent to the key term is greater than a minimum threshold probability;
  
  instructions that cause the one or more processors to extract an n-gram comprising only the key term when the probability of the one or more terms being adjacent to the key term is less than the minimum threshold probability;
  
  instructions that cause the one or more processors to score each of the extracted n-grams as a function of at least a frequency of occurrence of each of the n-grams across the knowledge corpus of the enterprise; and
  
  instructions that cause the one or more processors to add one or more of the extracted n-grams to an index based on the scoring.
- - 11. The non-transitory computer-readable storage memory of claim 10, wherein the plurality of instructions further comprise:
    - instructions that cause the one or more processors to model the key terms identified in the knowledge unit as the term vector; and
      
      wherein a value for each key term in the term vector is calculated as a function of at least the frequency of occurrence of the term and the position of each occurrence of the key term in the knowledge unit.
  - 12. The non-transitory computer-readable storage memory of claim 11, wherein the plurality of instructions further comprise:
    - instructions that cause the one or more processors to perform natural language processing on the key terms in the knowledge unit; and
      
      instructions that cause the one or more processors to filter the key terms in the knowledge unit based on the natural language processing.
  - 13. The non-transitory computer-readable storage memory of claim 10, wherein the calculating the probability of the one or more terms adjacent to each key term in the knowledge unit as preceding or following the key term is based on the function of the natural language processing and a frequency of occurrence of the one or more terms adjacent to each key term.
  - 14. The non-transitory computer-readable storage memory of claim 10, wherein the scoring each of the extracted n-grams is a function of the frequency of occurrence of the n-gram, a recency of the n-gram, and a commonality of the n-gram across the knowledge corpus of the enterprise.
  - 15. The non-transitory computer-readable storage memory of claim 10, wherein the adding the one or more of the extracted n-grams to the index includes:
    - determining a total number of n-grams extracted for the knowledge unit;
      
      determining a top percentage of the n-grams; and
      
      adding the top percentage of the n-grams to the index.
  - 16. The non-transitory computer-readable storage memory of claim 10, wherein the adding the one or more of the extracted n-grams to the index includes:
    - setting a minimum threshold score;
      
      determining whether the score for each of the extracted n-grams is above the minimum threshold score; and
      
      when the score for an n-gram is above the minimum threshold score, adding the n-gram to the index.
  - 17. The non-transitory computer-readable storage memory of claim 10, wherein the index is a corporate dictionary comprising a set of n-grams that identify each knowledge unit within the knowledge corpus of the enterprise, and the set of n-grams comprises the one or more of the extracted n-grams added to the index.
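
The scoring and selection options in claims 14 to 16 (and their method counterparts, claims 6 to 8) might look like the following Python sketch. The claims only say the score is a function of frequency, recency, and commonality; the multiplicative formula, the half-life decay, and the treatment of commonality as the share of knowledge units containing the n-gram are assumptions made here, as are all names and parameters.

```python
import math

def score_ngram(frequency, days_since_last_seen, units_containing, total_units,
                half_life_days=30.0):
    """Combine frequency, recency and commonality into one score (assumed formula)."""
    # Recency halves every half_life_days since the n-gram was last seen.
    recency = 0.5 ** (days_since_last_seen / half_life_days)
    # Commonality: share of knowledge units in the corpus that contain the n-gram.
    commonality = units_containing / total_units
    return frequency * recency * commonality

def select_top_percentage(scores, percentage):
    """Keep the highest-scoring `percentage` (0-100) of the extracted n-grams."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = math.ceil(len(ranked) * percentage / 100)
    return ranked[:keep]

def select_above_threshold(scores, min_score):
    """Keep every n-gram whose score is strictly above the minimum threshold."""
    return [gram for gram, s in scores.items() if s > min_score]
```

The two selectors correspond to the two alternatives in the claims: a relative cutoff that adapts to how many n-grams a knowledge unit produced, and an absolute cutoff that does not.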

18. A system comprising:
- one or more processors; and
  
  a memory coupled with and readable by the one or more processors, the memory configured to store a set of instructions which, when executed by the one or more processors, causes the one or more processors to:
  
  obtain data files from a knowledge corpus of an enterprise;
  
  identify key terms within the data files;
  
  determine, for each identified key term, a frequency of occurrence and location within the data files;
  
  generate knowledge units from the data files based on the determined frequencies of occurrence and the determined locations of the key terms in the data files;
  
  select a knowledge unit from the generated knowledge units;
  
  identify one or more knowledge units from the knowledge units that are similar to the selected knowledge unit;
  
  combine the identified one or more knowledge units and the selected knowledge unit into a knowledge pack;
  
  select a knowledge unit or the knowledge pack for extraction of n-grams;
  
  derive a term vector comprising terms in the knowledge unit or the knowledge pack based at least on the determined frequencies of occurrence and the determined locations of the key terms in the knowledge unit or the knowledge pack;
  
  identify the key terms in the term vector based at least on a frequency of occurrence of each key term in the knowledge unit or the knowledge pack;
  
  extract n-grams using the identified key terms in the term vector;
  
  score each of the extracted n-grams as a function of at least a frequency of occurrence of each of the n-grams across the knowledge corpus of the enterprise; and
  
  add one or more of the extracted n-grams to an index based on the scoring.
- - 19. The system of claim 18, wherein the extracting the one or more n-grams using the identified key terms includes:
    - identifying one or more terms adjacent to each key term in the knowledge unit or the knowledge pack;
      
      performing natural language processing on the one or more terms adjacent to each key term and the key terms;
      
      calculating a probability of the one or more terms adjacent to each key term in the knowledge unit or the knowledge pack as preceding or following the key term based on a function of the natural language processing;
      
      when the probability of the one or more terms being adjacent to the key term is greater than a minimum threshold probability, extracting an n-gram comprising the one or more terms and the key term; and
      
      when the probability of the one or more terms being adjacent to the key term is less than the minimum threshold probability, extracting an n-gram comprising only the key term.
  - 20. The system of claim 18, wherein the one or more processors are further caused to provide an interface through which a user can interact with the index, the interface including links to the knowledge unit or the knowledge pack that includes the key terms for the one or more n-grams added to the index.
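
Claim 18 groups similar knowledge units into a knowledge pack but does not say how similarity is measured. As one possible reading, the sketch below compares raw term-frequency vectors with cosine similarity and packs every unit that clears a threshold together with the selected unit. The similarity measure, the `min_similarity` value, and the function names are assumptions for illustration only.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def build_knowledge_pack(units, selected, min_similarity=0.5):
    """Return the selected unit plus the names of all sufficiently similar units.

    `units` maps a unit name to its text; whitespace splitting stands in
    for real tokenization.
    """
    vectors = {name: Counter(text.lower().split()) for name, text in units.items()}
    pack = [selected]
    for name, vec in vectors.items():
        if name != selected and cosine(vectors[selected], vec) >= min_similarity:
            pack.append(name)
    return pack
```

The term vector for a pack could then be derived by concatenating the text of its member units and processing the result as a single knowledge unit.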

Current Assignee: Social Communications Company
Original Assignee: Prysm, Inc.
Inventors: Mahmud, Gazi; Banda, Seenu; Liang, Deanna
Primary Examiner(s): Guerra-Erazo, Edgar
Application Number: US14/862,621
Publication Number: US 20160085742A1
Time in Patent Office: 839 Days
CPC Class Codes:
G06F 40/242 Dictionaries
G06F 40/289 Phrasal analysis, e.g. fini...
