Conceptual document analysis and characterization
Abstract
Data files are received from data sources that include textual content. The data files are categorized using a taxonomy of categories, where each category has sample textual content that defines a concept for the category. The categorizing includes comparing the textual content of the data file with the sample textual content for the category. A file score is calculated for each data file to compare the degree of similarity between the defined concept of the category and a determined concept for the data file. Each data file is associated with the category if the file score is equal to or greater than a pre-determined minimum score for the category. A portion of the data file and/or file score is provided.
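The scoring and thresholding described in the abstract can be sketched as follows. This is a minimal illustration, not the patented method: the text here does not say how the "degree of similarity" is computed, so bag-of-words cosine similarity is assumed as a stand-in, and the names `term_vector`, `file_score`, and `categorize` are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lower-cased term counts; an assumed stand-in for a 'concept'."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def file_score(file_text, sample_text):
    """Cosine similarity in [0, 1] between the two term-count vectors."""
    a, b = term_vector(file_text), term_vector(sample_text)
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def categorize(files, categories):
    """Associate each file with every category whose minimum score it meets.

    files:      {file_id: text}
    categories: {name: (sample_text, min_score)}
    returns:    {name: [(file_id, score), ...]}
    """
    result = {name: [] for name in categories}
    for name, (sample_text, min_score) in categories.items():
        for file_id, text in files.items():
            score = file_score(text, sample_text)
            if score >= min_score:  # "equal to or greater than"
                result[name].append((file_id, score))
    return result
```

A file can land in more than one category, because each category applies its own minimum score independently.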
20 Claims
1. A method comprising:
receiving, by at least one data processor, a plurality of data files from a plurality of data sources that comprise textual content;
categorizing, by the at least one data processor, the plurality of data files into a taxonomy of categories in which each category has associated sample textual content defining a concept for the category and each category associated with a memory-optimized structure that comprises a collection of at least one identification corresponding to at least one of the plurality of data files, the categorizing comprising, for each category;
comparing, by the at least one data processor, for each of the plurality of data files, the textual content of the data file with the sample textual content for the category;
calculating, by the at least one data processor, based on the comparing and for each of the plurality of data files, a file score corresponding to the degree of similarity between the defined concept of the category and a determined concept for the data file; and
generating, by the at least one data processor, the identification stored in the memory-optimized structure that comprises the collection by at least associating, for each of the plurality of data files, the data file with the category if the file score is equal to or greater than a pre-determined minimum score for the category; and
providing, by the at least one data processor, at least a portion of the data file and/or the associated file score.
Dependent claims: 2-11
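The claim leaves the form of the "memory-optimized structure" open. One possible reading, sketched here under that assumption, is a per-category sorted array of unsigned integer file identifications, which is compact and supports binary-search lookups. The class name `CategoryIndex` and its methods are hypothetical and are not taken from the patent.

```python
from array import array
from bisect import bisect_left


class CategoryIndex:
    """Identifications of the files associated with one category.

    Assumption: identifications are non-negative integers, stored in a
    sorted, duplicate-free array of unsigned ints.
    """

    def __init__(self, min_score):
        self.min_score = min_score
        self.file_ids = array("I")

    def offer(self, file_id, score):
        """Store file_id if score meets the minimum; return True if associated."""
        if score < self.min_score:
            return False
        pos = bisect_left(self.file_ids, file_id)
        if pos == len(self.file_ids) or self.file_ids[pos] != file_id:
            self.file_ids.insert(pos, file_id)  # keep the array sorted
        return True

    def __contains__(self, file_id):
        pos = bisect_left(self.file_ids, file_id)
        return pos < len(self.file_ids) and self.file_ids[pos] == file_id
```

For dense identification ranges a bitmap would likely be smaller still; the sorted array is chosen here only because it stays simple for sparse identifications.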
12. A non-transitory computer program product storing instructions that, when executed by at least one programmable processor, cause the at least one programmable processor to perform operations comprising:
receiving a plurality of data files from a plurality of data sources that comprise textual content;
categorizing the plurality of data files into a taxonomy of categories in which each category has associated sample textual content defining a concept for the category and each category associated with a memory-optimized structure that comprises a collection of at least one identification corresponding to at least one of the plurality of data files, the categorizing comprising, for each category;
comparing, for each of the plurality of data files, the textual content of the data file with the sample textual content for the category;
calculating, based on the comparing and for each of the plurality of data files, a file score corresponding to the degree of similarity between the defined concept of the category and a determined concept for the data file; and
generating the identification stored in the memory-optimized structure that comprises the collection by at least associating, for each of the plurality of data files, the data file with the category if the file score is equal to or greater than a pre-determined minimum score for the category; and
providing at least a portion of the data file and/or the associated file score.
Dependent claims: 13-16
17. A system comprising:
at least one programmable data processor device; and
memory storing instructions which, when executed by the at least one programmable data processor, result in operations comprising;
receiving a plurality of data files from a plurality of data sources that comprise textual content;
categorizing the plurality of data files into a taxonomy of categories in which each category has associated sample textual content defining a concept for the category and each category associated with a memory-optimized structure that comprises a collection of at least one identification corresponding to at least one of the plurality of data files, the categorizing comprising, for each category;
comparing, for each of the plurality of data files, the textual content of the data file with the sample textual content for the category;
calculating, based on the comparing and for each of the plurality of data files, a file score corresponding to the degree of similarity between the defined concept of the category and a determined concept for the data file; and
generating the identification stored in the memory-optimized structure that comprises the collection by at least associating, for each of the plurality of data files, the data file with the category if the file score is equal to or greater than a pre-determined minimum score for the category; and
providing at least a portion of the data file and/or the associated file score.
Dependent claims: 18-20
Specification