Automated collective term and phrase index
First Claim
1. A method comprising:
obtaining, by a computer system, data files from a knowledge corpus of an enterprise;
identifying, by the computer system, key terms within the data files;
determining, by the computer system, for each identified key term, a frequency of occurrence and location within the data files;
generating, by the computer system, knowledge units from the data files based on the determined frequencies of occurrence and the determined locations of the key terms in the data files;
selecting, by computing system, a knowledge unit from the generated knowledge units for extraction of n-grams;
deriving, by the computing system, a term vector for the knowledge unit based at least on the determined frequencies of occurrence and the determined locations of the key terms in the knowledge unit;
identifying, by the computing system, the key terms in the term vector based at least on the frequency of occurrence of each key term in the knowledge unit;
extracting, by the computing system, n-grams using the key terms in the term vector;
scoring, by the computing system, each of the extracted n-grams as a function of at least a frequency of occurrence of each of the n-grams across the knowledge corpus of the enterprise; and
adding, by the computing system, one or more of the extracted n-grams to an index based on the scoring.
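The "frequency of occurrence and location" step can be illustrated with a minimal sketch. The claim does not say how key terms are chosen or how a location is represented, so the function names, the regex tokenizer, and the use of token offsets as "locations" below are all assumptions made for illustration, not the patented method.

```python
import re
from collections import defaultdict

def term_occurrences(files, key_terms):
    """Map each key term to {file name: [token positions]}.

    `files` is a hypothetical {name: text} mapping standing in for data
    files obtained from the knowledge corpus; positions are token offsets.
    """
    wanted = {term.lower() for term in key_terms}
    occurrences = defaultdict(dict)
    for name, text in files.items():
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for position, token in enumerate(tokens):
            if token in wanted:
                occurrences[token].setdefault(name, []).append(position)
    return dict(occurrences)

def frequency(occurrences, term):
    """Total number of occurrences of `term` across all files."""
    return sum(len(positions) for positions in occurrences.get(term, {}).values())

files = {
    "a.txt": "Reset the router. The router reboots.",
    "b.txt": "Router firmware update",
}
occ = term_occurrences(files, ["router", "firmware"])
```

Here `occ["router"]` records where the term appears in each file, and `frequency(occ, "router")` gives its total count; a later step could use both to decide how to split files into knowledge units.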
4 Assignments
0 Petitions
Abstract
Knowledge automation techniques may include selecting a knowledge element from a knowledge corpus of an enterprise for extraction of n-grams, and deriving a term vector comprising terms in the knowledge element. Based at least on a frequency of occurrence of each term in the knowledge element, key terms are identified in the term vector. Thereafter, the identified key terms are used to extract one or more n-grams from the knowledge element. Each of the extracted n-grams is scored as a function of at least a frequency of occurrence of each of the n-grams across the knowledge corpus of the enterprise, and based on the scoring, one or more of the n-grams is added to a collective term and phrase index.
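The flow described in the abstract (term vector, key terms, n-gram extraction, corpus-wide scoring, indexing) can be sketched roughly as follows. This is an illustrative sketch only: the tokenizer, the fixed bigram window, the raw corpus count used as the score, and the `min_count` and `min_score` cutoffs are assumptions, since the abstract says only that scoring is "a function of at least a frequency of occurrence".

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def term_vector(text):
    """Term -> count for one knowledge element."""
    return Counter(tokenize(text))

def key_terms(vector, min_count=2):
    """Assumed rule: a term is 'key' if it occurs at least min_count times."""
    return {term for term, count in vector.items() if count >= min_count}

def extract_ngrams(text, keys, n=2):
    """Collect n-token windows that contain at least one key term."""
    tokens = tokenize(text)
    grams = set()
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if any(token in keys for token in window):
            grams.add(" ".join(window))
    return grams

def score(ngram, corpus):
    """Assumed score: raw number of occurrences across the whole corpus."""
    target = ngram.split()
    n = len(target)
    total = 0
    for document in corpus:
        tokens = tokenize(document)
        total += sum(1 for i in range(len(tokens) - n + 1)
                     if tokens[i:i + n] == target)
    return total

def build_index(element, corpus, min_score=2):
    """Add n-grams whose corpus score reaches min_score to the index."""
    keys = key_terms(term_vector(element))
    index = {}
    for gram in extract_ngrams(element, keys):
        gram_score = score(gram, corpus)
        if gram_score >= min_score:
            index[gram] = gram_score
    return index

corpus = [
    "reset the wireless router",
    "the wireless router needs a wireless router reset",
    "printer setup guide",
]
index = build_index(corpus[1], corpus)
```

With this toy corpus, only the bigrams that recur across documents survive the cutoff. A real system would likely weight the score (for example against document length or n-gram length) rather than use a raw count.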
34 Citations
20 Claims
1. A method comprising the steps recited under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A non-transitory computer-readable storage memory storing a plurality of instructions executable by one or more processors, the plurality of instructions comprising:
instructions that cause the one or more processors to obtain data files from a knowledge corpus of an enterprise;
instructions that cause the one or more processors to identify key terms within the data files;
instructions that cause the one or more processors to determine, for each identified key term, a frequency of occurrence and location within the data files;
instructions that cause the one or more processors to generate knowledge units from the data files based on the determined frequencies of occurrence and the determined locations of the key terms in the data files;
instructions that cause the one or more processors to select a knowledge unit from the generated knowledge units for extraction of n-grams;
instructions that cause the one or more processors to derive a term vector comprising terms in the knowledge unit based at least on the determined frequencies of occurrence and the determined locations of the key terms in the knowledge unit;
instructions that cause the one or more processors to identify the key terms in the term vector based at least on the frequency of occurrence of each key term in the knowledge unit;
instructions that cause the one or more processors to calculate a probability of one or more terms adjacent to each key term in the knowledge unit as preceding or following the key term based on a function of natural language processing;
instructions that cause the one or more processors to extract an n-gram comprising the one or more terms and the key term when the probability of the one or more terms being adjacent to the key term is greater than a minimum threshold probability;
instructions that cause the one or more processors to extract an n-gram comprising only the key term when the probability of the one or more terms being adjacent to the key term is less than the minimum threshold probability;
instructions that cause the one or more processors to score each of the extracted n-grams as a function of at least a frequency of occurrence of each of the n-grams across the knowledge corpus of the enterprise; and
instructions that cause the one or more processors to add one or more of the extracted n-grams to an index based on the scoring.
Dependent claims: 11, 12, 13, 14, 15, 16, 17
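The threshold rule in claim 10 (extend the key term with an adjacent term when the adjacency probability is high enough, otherwise keep the key term alone) might look like the sketch below. The claim leaves the "function of natural language processing" unspecified; here a plain bigram relative-frequency estimate, count(neighbor, key) / count(key), is swapped in as an assumption, and `extract_with_threshold` and `min_prob` are hypothetical names.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def extract_with_threshold(text, keys, min_prob=0.5):
    """Return n-grams built around key terms using an adjacency threshold."""
    tokens = tokenize(text)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    ngrams = set()
    for i, token in enumerate(tokens):
        if token not in keys:
            continue
        extended = False
        # Probability that this particular term follows the key term.
        if i + 1 < len(tokens):
            following = tokens[i + 1]
            if bigrams[(token, following)] / unigrams[token] > min_prob:
                ngrams.add(f"{token} {following}")
                extended = True
        # Probability that this particular term precedes the key term.
        if i > 0:
            preceding = tokens[i - 1]
            if bigrams[(preceding, token)] / unigrams[token] > min_prob:
                ngrams.add(f"{preceding} {token}")
                extended = True
        # No neighbor cleared the threshold: keep the key term on its own.
        if not extended:
            ngrams.add(token)
    return ngrams

text = ("wireless router setup. wireless router reset. "
        "cable setup and cable reset and cable")
result = extract_with_threshold(text, {"router", "cable"}, min_prob=0.7)
```

In this example "router" is always preceded by "wireless", so the bigram is extracted, while "cable" has no sufficiently consistent neighbor and is kept as a unigram. A probability exactly equal to the threshold is treated here as not exceeding it, a case the claim does not address.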
18. A system comprising:
one or more processors; and
a memory coupled with and readable by the one or more processors, the memory configured to store a set of instructions which, when executed by the one or more processors, causes the one or more processors to:
obtain data files from a knowledge corpus of an enterprise;
identify key terms within the data files;
determine, for each identified key term, a frequency of occurrence and location within the data files;
generate knowledge units from the data files based on the determined frequencies of occurrence and the determined locations of the key terms in the data files;
select a knowledge unit from the generated knowledge units;
identify one or more knowledge units from the knowledge units that are similar to the selected knowledge unit;
combine the identified one or more knowledge units and the selected knowledge unit into a knowledge pack;
select a knowledge unit or the knowledge pack for extraction of n-grams;
derive a term vector comprising terms in the knowledge unit or the knowledge pack based at least on the determined frequencies of occurrence and the determined locations of the key terms in the knowledge unit or the knowledge pack;
identify the key terms in the term vector based at least on a frequency of occurrence of each key term in the knowledge unit or the knowledge pack;
extract n-grams using the identified key terms in the term vector;
score each of the extracted n-grams as a function of at least a frequency of occurrence of each of the n-grams across the knowledge corpus of the enterprise; and
add one or more of the extracted n-grams to an index based on the scoring.
Dependent claims: 19, 20
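Claim 18 adds a grouping step: knowledge units similar to a selected unit are combined into a knowledge pack. The claim does not define the similarity measure, so the sketch below assumes cosine similarity over raw term-count vectors; `build_pack` and `min_similarity` are hypothetical names introduced for illustration.

```python
import math
import re
from collections import Counter

def term_vector(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def build_pack(selected, units, min_similarity=0.5):
    """Combine the selected unit with every sufficiently similar unit."""
    base = term_vector(selected)
    pack = [selected]
    for unit in units:
        if unit is selected:
            continue  # the selected unit is already in the pack
        if cosine(base, term_vector(unit)) >= min_similarity:
            pack.append(unit)
    return pack

units = ["router reset steps", "router reset guide", "printer ink order"]
pack = build_pack(units[0], units)
```

The resulting pack can then be treated like a single larger knowledge unit when deriving the term vector and extracting n-grams.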
Specification