Segmenting text for searching
Abstract
Methods, systems, and apparatus, including computer program products, for segmenting text for searching are disclosed. In one implementation, a method is provided. The method includes receiving text; segmenting the text into one or more unigrams; filtering the one or more unigrams to identify one or more core unigrams; and generating a searchable resource, including: for each of the one or more core unigrams: identifying a stem, indexing the stem, and associating one or more second n-grams with the indexed stem. Each of the one or more second n-grams is derived from the text and includes a core unigram that is related to the indexed stem.
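The indexing flow summarized in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the stop-word list, the toy suffix-stripping `stem` function, the `max_n` limit, and the choice to treat every non-stop-word as a "core unigram" are all hypothetical stand-ins.

```python
import re
from collections import defaultdict

# Assumption: a tiny stop-word list stands in for the filtering step.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "for", "is"}


def segment(text):
    """Split text into lowercase unigrams (single words)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(word):
    """Toy suffix stripper; a placeholder for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(text, max_n=3):
    """Map the stem of each core unigram to the n-grams of the text containing it."""
    unigrams = segment(text)
    index = defaultdict(set)
    for i, word in enumerate(unigrams):
        if word in STOP_WORDS:  # filtering: keep only core unigrams
            continue
        key = stem(word)
        for n in range(1, max_n + 1):
            # every window of n words that covers position i
            first = max(0, i - n + 1)
            last = min(i, len(unigrams) - n)
            for start in range(first, last + 1):
                index[key].add(" ".join(unigrams[start:start + n]))
    return dict(index)


index = build_index("Searching the indexed books")
print(sorted(index["book"]))
```

Each key is a stem, and each value is the set of n-grams from the text that contain the corresponding core unigram, which mirrors the "associating one or more second n-grams with the indexed stem" step.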
20 Claims
1. A computer-implemented method comprising:
receiving, at a computer system comprising a processor, text;
segmenting, at the computer system, the text into one or more unigrams;
filtering, at the computer system, the one or more unigrams to identify one or more core unigrams; and
generating, at the computer system, a searchable data structure, wherein the generating includes, for each specific unigram of the one or more core unigrams:
identifying a stem of the specific unigram,
indexing the stem,
obtaining (i) grammar information for the specific unigram, (ii) language information for the specific unigram, and (iii) description information for the specific unigram, and
associating one or more n-grams with the indexed stem,
wherein each of the one or more n-grams is derived from the text, the grammar information for the specific unigram, the language information for the specific unigram, and the description information for the specific unigram, and
wherein each of the one or more n-grams includes a core unigram that is related to the indexed stem.
(Dependent claims: 2, 3, 4, 5)

6. A computer-implemented method comprising:
receiving, at a computer system comprising a processor, text;
segmenting, at the computer system, the text into one or more unigrams;
filtering, at the computer system, the one or more unigrams to identify one or more core unigrams, wherein the filtering includes, for each specific unigram of the one or more core unigrams:
identifying a stem of the specific unigram,
searching an index to identify one or more n-grams that are related to the stem, wherein the index includes (i) grammar information for the specific unigram, (ii) language information for the specific unigram, and (iii) description information for the specific unigram, and
comparing the one or more n-grams to the text to identify a group of the one or more n-grams that are included in the text;
searching, at the computer system, a searchable data structure for each n-gram in the group; and
providing, at the computer system, a content associated with each of the n-grams found in the searchable data structure, wherein the content includes an image, audio, video, text, or a document.
(Dependent claims: 7, 8)

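The lookup flow recited in claim 6 might look roughly like the following sketch. All names here are hypothetical: `stem_index` stands in for the index of n-grams keyed by stem, `content_store` stands in for the searchable data structure that maps n-grams to content, and the stop-word list and toy `stem` function are placeholder assumptions. The grammar, language, and description information that the claim places in the index is omitted for brevity.

```python
import re

# Assumption: a tiny stop-word list stands in for the filtering step.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "for", "is"}


def segment(text):
    """Split text into lowercase unigrams (single words)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(word):
    """Toy suffix stripper; a placeholder for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matching_ngrams(text, stem_index):
    """Return indexed n-grams that are related to a core unigram and occur in the text."""
    unigrams = segment(text)
    padded = " " + " ".join(unigrams) + " "  # padding gives whole-word matching
    group = []
    for word in unigrams:
        if word in STOP_WORDS:
            continue
        for ngram in stem_index.get(stem(word), ()):
            if f" {ngram} " in padded and ngram not in group:
                group.append(ngram)
    return group


def lookup_content(text, stem_index, content_store):
    """Search the content store for each matching n-gram and collect what is found."""
    return {
        ngram: content_store[ngram]
        for ngram in matching_ngrams(text, stem_index)
        if ngram in content_store
    }


# Hypothetical data: stems mapped to candidate n-grams, n-grams mapped to content.
stem_index = {
    "york": ["new york", "new york city", "york"],
    "hotel": ["hotel", "boutique hotel"],
}
content_store = {"new york": "doc:nyc-guide", "hotel": "img:hotel.png"}

print(lookup_content("Cheap hotel in New York", stem_index, content_store))
```

Comparing candidate n-grams against the input text before the content search keeps n-grams such as "new york city" or "boutique hotel", which share a stem with the query but do not occur in it, out of the result.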
9. A computer program product, encoded on a tangible program carrier, operable to cause a data processing apparatus to perform operations comprising:
receiving text;
segmenting the text into one or more unigrams;
filtering the one or more unigrams to identify one or more core unigrams; and
generating a searchable data structure, wherein the generating includes, for each specific unigram of the one or more core unigrams:
identifying a stem of the specific unigram,
indexing the stem,
obtaining (i) grammar information for the specific unigram, (ii) language information for the specific unigram, and (iii) description information for the specific unigram, and
associating one or more n-grams with the indexed stem,
wherein each of the one or more n-grams is derived from the text, the grammar information for the specific unigram, the language information for the specific unigram, and the description information for the specific unigram, and
wherein each of the one or more n-grams includes a core unigram that is related to the indexed stem.
(Dependent claims: 10, 11, 12, 13)

14. A computer program product, encoded on a tangible program carrier, operable to cause a data processing apparatus to perform operations comprising:
receiving text;
segmenting the text into one or more unigrams;
filtering the one or more unigrams to identify one or more core unigrams, wherein the filtering includes, for each specific unigram of the one or more core unigrams:
identifying a stem of the specific unigram,
searching an index to identify one or more n-grams that are related to the stem, wherein the index includes (i) grammar information for the specific unigram, (ii) language information for the specific unigram, and (iii) description information for the specific unigram, and
comparing the one or more n-grams to the text to identify a group of the one or more n-grams that are included in the text;
searching a searchable data structure for each n-gram in the group; and
providing a content associated with each of the n-grams found in the searchable data structure, wherein the content includes an image, audio, video, text, or a document.
(Dependent claims: 15, 16)

17. A computer-implemented method, comprising:
receiving, at a computer system including a processor, a training data including a plurality of n-grams, wherein each of the plurality of n-grams includes one or more words;
filtering, at the computer system, the training data by removing stop-words to obtain a modified training data;
generating, at the computer system, a first data structure using the modified training data, the data structure including one or more sets of n-grams, each having one or more subsets of n-grams, wherein the data structure further includes, for each specific n-gram: (i) a stem of the n-gram, (ii) grammar information for the specific n-gram, (iii) language information for the specific n-gram, and (iv) description information for the specific n-gram;
indexing, at the computer system, the first data structure according to the stems of the n-grams to obtain a searchable data structure;
receiving, at the computer system, a first n-gram and a request to search the searchable data structure using the first n-gram;
searching, at the computer system, the searchable data structure to obtain one or more second n-grams related to the first n-gram; and
outputting, at the computer system, the one or more second n-grams.
(Dependent claims: 18, 19, 20)

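As a rough illustration of claim 17, the sketch below builds a stem-keyed structure from training n-grams and then retrieves related n-grams for a query. It rests on several assumptions that the claim does not settle: the `Entry` record, the stop-word list, the toy `stem` function, and the rule that two n-grams are "related" when they share at least one word stem are all hypothetical. The nesting of sets and subsets of n-grams recited in the claim is not modeled.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumption: a tiny stop-word list stands in for the stop-word removal step.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "for", "is"}


def stem(word):
    """Toy suffix stripper; a placeholder for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


@dataclass(frozen=True)
class Entry:
    """Hypothetical record holding the per-n-gram fields listed in the claim."""
    ngram: str
    stems: tuple
    grammar: str
    language: str
    description: str


def content_words(ngram):
    """Lowercase the n-gram and drop stop-words."""
    return [w for w in ngram.lower().split() if w not in STOP_WORDS]


def build_searchable(training):
    """Index training rows (ngram, grammar, language, description) by word stem."""
    index = defaultdict(list)
    for ngram, grammar, language, description in training:
        words = content_words(ngram)
        if not words:
            continue  # the n-gram consisted only of stop-words
        entry = Entry(" ".join(words), tuple(stem(w) for w in words),
                      grammar, language, description)
        for s in set(entry.stems):
            index[s].append(entry)
    return index


def related_ngrams(index, first_ngram):
    """Return n-grams sharing at least one stem with first_ngram, excluding itself."""
    words = content_words(first_ngram)
    query = " ".join(words)
    seen, results = set(), []
    for word in words:
        for entry in index.get(stem(word), []):
            if entry.ngram != query and entry.ngram not in seen:
                seen.add(entry.ngram)
                results.append(entry.ngram)
    return results


# Hypothetical training rows.
training = [
    ("walking shoes", "noun phrase", "en", "footwear for walking"),
    ("walked the dog", "verb phrase", "en", "took the dog out"),
    ("shoe store", "noun phrase", "en", "a shop that sells shoes"),
]
searchable = build_searchable(training)
print(related_ngrams(searchable, "walking shoes"))
```

Keying the structure by stem lets inflected forms such as "walking" and "walked" resolve to the same bucket, which is one plausible reading of why the claim indexes by stem rather than by surface form.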
Specification