Identifying categories within textual data
Abstract
A computer-implemented method according to one embodiment includes identifying a plurality of documents associated with a predetermined subject, where each of the plurality of documents contains textual data, analyzing the textual data of each of the plurality of documents to identify one or more categories within the plurality of the documents, and returning the one or more categories identified within the plurality of the documents.
20 Claims
1. A computer-implemented method, comprising:
identifying a plurality of documents associated with a predetermined subject, where:
each of the plurality of documents contains textual data, and
the predetermined subject includes one or more terms identifying common subject matter shared by each of the plurality of documents;
analyzing the textual data of each of the plurality of documents to identify one or more categories within the plurality of the documents, the analyzing including:
refining the textual data by removing one or more words from the textual data that have a predetermined frequency and a predetermined significance, to create refined textual data,
transforming the refined textual data into an array, and
determining the one or more categories from the array, where each of the one or more categories includes a plurality of topic vectors that each include one or more identified keywords and a frequency of the one or more keywords within the refined textual data;
linking each of the one or more categories to the predetermined subject;
returning the one or more categories identified within the plurality of the documents as categories indicative of the predetermined subject; and
classifying additional textual data, utilizing the one or more categories, including comparing the additional textual data to the one or more categories to determine a probability that the additional textual data is associated with the predetermined subject linked to the one or more categories.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
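The refining and transforming steps recited in claim 1 can be illustrated with a minimal sketch. The claim does not say how frequency or significance is measured, so everything here is an assumption: the hypothetical `LOW_SIGNIFICANCE` word list stands in for "predetermined significance", the hypothetical `max_doc_fraction` threshold stands in for "predetermined frequency", a word is dropped when either condition holds, and the "array" is taken to be a document-by-term count matrix.

```python
from collections import Counter

# Hypothetical low-significance words; the claim does not enumerate any.
LOW_SIGNIFICANCE = {"the", "a", "an", "of", "to", "and", "is", "in"}


def refine(documents, max_doc_fraction=0.9):
    """Drop words that are low-significance or that appear in more than
    max_doc_fraction of the documents (the assumed frequency threshold)."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    limit = max_doc_fraction * len(documents)
    return [
        [w for w in tokens if w not in LOW_SIGNIFICANCE and doc_freq[w] <= limit]
        for tokens in tokenized
    ]


def to_array(refined):
    """Build a document-by-term count matrix and its column vocabulary."""
    vocabulary = sorted({w for tokens in refined for w in tokens})
    index = {w: i for i, w in enumerate(vocabulary)}
    array = [[0] * len(vocabulary) for _ in refined]
    for row, tokens in zip(array, refined):
        for w in tokens:
            row[index[w]] += 1
    return vocabulary, array


docs = [
    "the engine of the car is loud",
    "the car needs a new engine",
    "the bakery sells bread and the car park is full",
]
refined = refine(docs)
vocabulary, array = to_array(refined)
```

In this toy input, "car" occurs in every document and is removed by the frequency threshold, while "the", "of", and "is" are removed by the significance list.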
11. A computer program product for identifying one or more categories within textual data of each of a plurality of documents, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, wherein the computer readable storage medium is not a transitory signal per se, the program instructions executable by a processor to cause the processor to perform a method comprising:
identifying, by the processor, a plurality of documents associated with a predetermined subject, where:
each of the plurality of documents contains the textual data, and
the predetermined subject includes one or more terms identifying common subject matter shared by each of the plurality of documents;
analyzing, by the processor, the textual data of each of the plurality of documents to identify the one or more categories within the plurality of the documents, the analyzing including:
refining the textual data by removing one or more words from the textual data that have a predetermined frequency and a predetermined significance, to create refined textual data,
transforming the refined textual data into an array, and
determining the one or more categories from the array, where each of the one or more categories includes a plurality of topic vectors that each include one or more identified keywords and a frequency of the one or more keywords within the refined textual data;
linking each of the one or more categories to the predetermined subject;
returning, by the processor, the one or more categories identified within the plurality of the documents as categories indicative of the predetermined subject; and
classifying, by the processor, additional textual data, utilizing the one or more categories, including comparing the additional textual data to the one or more categories to determine a probability that the additional textual data is associated with the predetermined subject linked to the one or more categories.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19
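The step of determining categories from the array, with each category holding several topic vectors of keywords and frequencies, can be sketched as follows. The claim does not name a topic-modeling technique, so this example swaps in a deliberately simple stand-in: each row of the array becomes one topic vector, and rows are grouped greedily whenever they share a keyword. The function names `topic_vector` and `determine_categories` are hypothetical.

```python
def topic_vector(row, vocabulary):
    """Keep only the keywords that occur, paired with their frequency."""
    return {vocabulary[i]: count for i, count in enumerate(row) if count > 0}


def determine_categories(array, vocabulary):
    """Greedily group topic vectors that share at least one keyword.
    This is an assumed stand-in, not a technique stated in the claim."""
    categories = []  # each category is a list of topic vectors
    for row in array:
        vector = topic_vector(row, vocabulary)
        if not vector:
            continue  # a row with no surviving keywords carries no topic
        for category in categories:
            if any(vector.keys() & other.keys() for other in category):
                category.append(vector)
                break
        else:
            # No existing category shares a keyword: start a new one.
            categories.append([vector])
    return categories


# Toy array: columns follow the order of `vocabulary`.
vocabulary = ["bread", "engine", "loud", "new", "oven"]
array = [
    [0, 1, 1, 0, 0],  # engine, loud
    [0, 1, 0, 1, 0],  # engine, new
    [2, 0, 0, 0, 1],  # bread (twice), oven
    [1, 0, 0, 0, 1],  # bread, oven
]
categories = determine_categories(array, vocabulary)
```

Here the first two rows form one category and the last two form another, so each category contains a plurality of topic vectors, each mapping keywords to their frequency.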
20. A system, comprising:
a processor; and
logic integrated with the processor, executable by the processor, or integrated with and executable by the processor, the logic being configured to:
identify a plurality of documents associated with a predetermined subject, where:
each of the plurality of documents contains textual data, and
the predetermined subject includes one or more terms identifying common subject matter shared by each of the plurality of documents;
analyze the textual data of each of the plurality of documents to identify one or more categories within the plurality of the documents, the analyzing including:
refining the textual data by removing one or more words from the textual data that have a predetermined frequency and a predetermined significance, to create refined textual data,
transforming the refined textual data into an array, and
determining the one or more categories from the array, where each of the one or more categories includes a plurality of topic vectors that each include one or more identified keywords and a frequency of the one or more keywords within the refined textual data;
link each of the one or more categories to the predetermined subject;
return the one or more categories identified within the plurality of the documents as categories indicative of the predetermined subject; and
classify additional textual data, utilizing the one or more categories, including comparing the additional textual data to the one or more categories to determine a probability that the additional textual data is associated with the predetermined subject linked to the one or more categories.
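The linking and classifying steps can be sketched in the same spirit. The claim does not say how the probability is computed, so this example assumes a cosine similarity between the new text's word counts and each category's summed keyword frequencies, and uses the best match as an uncalibrated score in the range 0 to 1. The subject name, the sample categories, and the function names are all hypothetical.

```python
import math
from collections import Counter


def category_profile(category):
    """Sum keyword frequencies across a category's topic vectors."""
    profile = Counter()
    for vector in category:
        profile.update(vector)  # Counter.update adds counts from a mapping
    return profile


def cosine(a, b):
    """Cosine similarity between two keyword-to-count mappings."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def subject_score(text, categories):
    """Best similarity between the text and any linked category.
    Assumed stand-in for the claimed probability; it is not calibrated."""
    counts = Counter(text.lower().split())
    return max((cosine(counts, category_profile(c)) for c in categories), default=0.0)


# Hypothetical subject linked to two hypothetical categories.
linked = {
    "car maintenance": [
        [{"engine": 1, "loud": 1}, {"engine": 1, "new": 1}],
        [{"tire": 2, "pressure": 1}, {"tire": 1, "worn": 1}],
    ]
}
score_related = subject_score("engine engine new", linked["car maintenance"])
score_unrelated = subject_score("quarterly tax filing", linked["car maintenance"])
```

Text that shares keywords with a linked category scores close to 1, and text with no overlapping keywords scores 0. A real system would likely need a calibrated model before treating the score as a probability.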
Specification