Identifying categories within textual data

US 10,157,178 B2
Filed: 02/05/2016
Issued: 12/18/2018
Est. Priority Date: 02/06/2015
Status: Active Grant

Abstract

A computer-implemented method according to one embodiment includes identifying a plurality of documents associated with a predetermined subject, where each of the plurality of documents contains textual data, analyzing the textual data of each of the plurality of documents to identify one or more categories within the plurality of the documents, and returning the one or more categories identified within the plurality of the documents.

20 Claims

1. A computer-implemented method, comprising:
- identifying a plurality of documents associated with a predetermined subject, where:
  - each of the plurality of documents contains textual data, and
  - the predetermined subject includes one or more terms identifying common subject matter shared by each of the plurality of documents;
- analyzing the textual data of each of the plurality of documents to identify one or more categories within the plurality of the documents, the analyzing including:
  - refining the textual data by removing one or more words from the textual data that have a predetermined frequency and a predetermined significance, to create refined textual data,
  - transforming the refined textual data into an array, and
  - determining the one or more categories from the array, where each of the one or more categories includes a plurality of topic vectors that each include one or more identified keywords and a frequency of the one or more keywords within the refined textual data;
- linking each of the one or more categories to the predetermined subject;
- returning the one or more categories identified within the plurality of the documents as categories indicative of the predetermined subject; and
- classifying additional textual data, utilizing the one or more categories, including comparing the additional textual data to the one or more categories to determine a probability that the additional textual data is associated with the predetermined subject linked to the one or more categories.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10):
  - 2. The computer-implemented method of claim 1, wherein the plurality of documents include one or more of web page content and scanned document content.
  - 3. The computer-implemented method of claim 1, wherein the plurality of documents is included within a grouping that includes a database that groups identifiers of a location of each of the plurality of documents within a centralized location.
  - 4. The computer-implemented method of claim 1, wherein analyzing the textual data for each of the plurality of documents includes performing automatic language detection on the textual data to determine a language in which the textual data is written, where only textual data written in a predetermined language is included within the refined textual data.
  - 5. The computer-implemented method of claim 1, further comprising identifying and removing from one or more of the plurality of topic vectors textual data that is included in a number of topic vectors below a threshold number.
  - 6. The computer-implemented method of claim 1, wherein analyzing the textual data for each of the plurality of documents includes removing one or more duplicate documents within the plurality of documents.
  - 7. The computer-implemented method of claim 1, further comprising stemming one or more words within the textual data by removing one or more plural or verb conjugation endings.
  - 8. The computer-implemented method of claim 1, further comprising:
    - randomizing an order of the plurality of documents;
    - dividing the plurality of documents into training documents and test documents;
    - analyzing the textual data of the training documents, including determining the plurality of topic vectors for the training documents;
    - identifying and removing from one or more of the plurality of topic vectors textual data that is included in a number of topic vectors below a threshold number; and
    - empirically verifying the threshold number against the test documents.
  - 9. The computer-implemented method of claim 1, wherein analyzing the textual data for each of the plurality of documents includes performing a latent dirichlet allocation (LDA) analysis on the refined textual data to identify the one or more categories.
  - 10. The computer-implemented method of claim 9, wherein performing the LDA analysis on the textual data includes transforming the textual data into a bag-of-words array and determining the one or more categories from the bag-of-words array.
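
The refinement and array steps recited in claims 1, 6, 7, and 10 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names (`stem`, `refine`, `bag_of_words`), the `max_doc_fraction` cutoff, and the naive suffix rules are all assumptions made for illustration, and the sketch models only the "predetermined frequency" test, not the "predetermined significance" test. It stops at the bag-of-words array and does not perform the LDA analysis itself.

```python
import re
from collections import Counter


def stem(word):
    # Naive suffix stripping for plural and verb endings (hypothetical rules,
    # far cruder than a real stemmer such as Porter's).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def refine(documents, max_doc_fraction=0.8):
    # Remove exact duplicate documents, keeping the first occurrence of each.
    unique_docs = list(dict.fromkeys(documents))
    tokenized = [[stem(w) for w in re.findall(r"[a-z]+", d.lower())]
                 for d in unique_docs]
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    limit = max_doc_fraction * len(tokenized)
    # Assumption: a word "has the predetermined frequency" when it occurs
    # in more than max_doc_fraction of the documents.
    return [[w for w in tokens if doc_freq[w] <= limit] for tokens in tokenized]


def bag_of_words(refined):
    # One row per document, one column per vocabulary word, cells are counts.
    vocab = sorted({w for tokens in refined for w in tokens})
    index = {w: i for i, w in enumerate(vocab)}
    rows = []
    for tokens in refined:
        row = [0] * len(vocab)
        for w in tokens:
            row[index[w]] += 1
        rows.append(row)
    return vocab, rows


docs = [
    "The loans are approved",
    "The bank approved the loan",
    "The bank approved the loan",  # duplicate, dropped by refine()
    "The weather is sunny",
]
vocab, rows = bag_of_words(refine(docs))
print(vocab)  # ['approv', 'are', 'bank', 'is', 'loan', 'sunny', 'weather']
```

In this example "the" appears in every remaining document and is removed, while the duplicate document contributes nothing to the counts. The resulting array is the kind of input a topic model would then consume.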

11. A computer program product for identifying one or more categories within textual data of each of a plurality of documents, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, wherein the computer readable storage medium is not a transitory signal per se, the program instructions executable by a processor to cause the processor to perform a method comprising:
- identifying, by the processor, a plurality of documents associated with a predetermined subject, where:
  - each of the plurality of documents contains the textual data, and
  - the predetermined subject includes one or more terms identifying common subject matter shared by each of the plurality of documents;
- analyzing, by the processor, the textual data of each of the plurality of documents to identify the one or more categories within the plurality of the documents, the analyzing including:
  - refining the textual data by removing one or more words from the textual data that have a predetermined frequency and a predetermined significance, to create refined textual data,
  - transforming the refined textual data into an array, and
  - determining the one or more categories from the array, where each of the one or more categories includes a plurality of topic vectors that each include one or more identified keywords and a frequency of the one or more keywords within the refined textual data;
- linking each of the one or more categories to the predetermined subject;
- returning, by the processor, the one or more categories identified within the plurality of the documents as categories indicative of the predetermined subject; and
- classifying, by the processor, additional textual data, utilizing the one or more categories, including comparing the additional textual data to the one or more categories to determine a probability that the additional textual data is associated with the predetermined subject linked to the one or more categories.
- Dependent claims (12, 13, 14, 15, 16, 17, 18, 19):
  - 12. The computer program product of claim 11, wherein the plurality of documents include one or more of web page content and scanned document content.
  - 13. The computer program product of claim 11, wherein the plurality of documents is included within a grouping that includes a database that groups identifiers of a location of each of the plurality of documents within a centralized location.
  - 14. The computer program product of claim 11, wherein analyzing, by the processor, the textual data for each of the plurality of documents includes performing, by the processor, automatic language detection on the textual data to determine a language in which the textual data is written, where only textual data written in a predetermined language is included within the refined textual data.
  - 15. The computer program product of claim 11, further comprising identifying and removing from one or more of the plurality of topic vectors textual data that is included in a number of topic vectors below a threshold number, by the processor.
  - 16. The computer program product of claim 11, wherein analyzing, by the processor, the textual data for each of the plurality of documents includes removing, by the processor, one or more duplicate documents within the plurality of documents.
  - 17. The computer program product of claim 11, further comprising stemming one or more words within the textual data by removing one or more plural or verb conjugation endings, by the processor.
  - 18. The computer program product of claim 11, further comprising:
    - randomizing, by the processor, an order of the plurality of documents;
    - dividing, by the processor, the plurality of documents into training documents and test documents;
    - analyzing, by the processor, the textual data of the training documents, including determining the plurality of topic vectors for the training documents;
    - identifying and removing from one or more of the plurality of topic vectors textual data that is included in a number of topic vectors below a threshold number, by the processor; and
    - empirically verifying the threshold number against the test documents, by the processor.
  - 19. The computer program product of claim 11, wherein analyzing, by the processor, the textual data for each of the plurality of documents includes performing a latent dirichlet allocation (LDA) analysis on the refined textual data to identify the one or more categories.
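
The training and test procedure of claims 8 and 18, together with the pruning of claims 5 and 15, might look roughly like the following Python sketch. Everything here is an assumption for illustration: the claims do not say how the split is sized, how a topic vector is stored, or what "empirically verifying" measures. The sketch stores each topic vector as a keyword-to-frequency dictionary and uses a hypothetical `coverage` score (the share of test documents containing at least one surviving keyword) as a stand-in check on the threshold.

```python
import random
from collections import Counter


def split_documents(documents, test_fraction=0.25, seed=0):
    # Randomize the order, then divide into training and test documents.
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def prune_topic_vectors(topic_vectors, threshold):
    # Count how many topic vectors contain each keyword, then drop keywords
    # that appear in fewer than `threshold` topic vectors.
    membership = Counter(k for tv in topic_vectors for k in tv)
    return [{k: f for k, f in tv.items() if membership[k] >= threshold}
            for tv in topic_vectors]


def coverage(topic_vectors, test_documents):
    # Hypothetical verification metric: fraction of test documents that
    # contain at least one keyword still present after pruning.
    keywords = {k for tv in topic_vectors for k in tv}
    hits = sum(1 for doc in test_documents
               if keywords & set(doc.lower().split()))
    return hits / len(test_documents) if test_documents else 0.0


topic_vectors = [
    {"loan": 5, "bank": 3},
    {"loan": 2, "interest": 4},
    {"loan": 1, "bank": 2},
]
pruned = prune_topic_vectors(topic_vectors, threshold=2)
print(pruned)  # [{'loan': 5, 'bank': 3}, {'loan': 2}, {'loan': 1, 'bank': 2}]
print(coverage(pruned, ["the loan was repaid", "sunny weather"]))  # 0.5
```

Repeating the prune-and-score step for several candidate thresholds and keeping the one that performs best on the test documents is one plausible reading of the verification step.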

20. A system, comprising:
- a processor; and
- logic integrated with the processor, executable by the processor, or integrated with and executable by the processor, the logic being configured to:
  - identify a plurality of documents associated with a predetermined subject, where:
    - each of the plurality of documents contains textual data, and
    - the predetermined subject includes one or more terms identifying common subject matter shared by each of the plurality of documents;
  - analyze the textual data of each of the plurality of documents to identify one or more categories within the plurality of the documents, the analyzing including:
    - refining the textual data by removing one or more words from the textual data that have a predetermined frequency and a predetermined significance, to create refined textual data,
    - transforming the refined textual data into an array, and
    - determining the one or more categories from the array, where each of the one or more categories includes a plurality of topic vectors that each include one or more identified keywords and a frequency of the one or more keywords within the refined textual data;
  - link each of the one or more categories to the predetermined subject;
  - return the one or more categories identified within the plurality of the documents as categories indicative of the predetermined subject; and
  - classify additional textual data, utilizing the one or more categories, including comparing the additional textual data to the one or more categories to determine a probability that the additional textual data is associated with the predetermined subject linked to the one or more categories.
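
The final classification step shared by claims 1, 11, and 20 compares new text to the categories and produces a probability. The claims do not specify the comparison, so the Python sketch below swaps in a simple naive-Bayes-style scoring as one possible reading: keyword frequencies from all topic vectors are pooled into a smoothed word distribution, compared against a uniform background over an assumed vocabulary, and converted to a posterior probability. The function name `subject_probability` and the parameters `vocab_size`, `prior`, and `alpha` are hypothetical.

```python
import math
from collections import Counter


def subject_probability(text, topic_vectors, vocab_size, prior=0.5, alpha=1.0):
    # Pool keyword frequencies from every topic vector linked to the subject.
    counts = Counter()
    for tv in topic_vectors:
        counts.update(tv)
    total = sum(counts.values())
    tokens = text.lower().split()

    # Log-likelihood of the text under the subject model (Laplace smoothing;
    # vocab_size is assumed to be at least the number of distinct keywords).
    ll_subject = sum(
        math.log((counts[t] + alpha) / (total + alpha * vocab_size))
        for t in tokens
    )
    # Assumption: the background model is uniform over the vocabulary.
    ll_background = len(tokens) * math.log(1.0 / vocab_size)

    log_odds = math.log(prior / (1.0 - prior)) + ll_subject - ll_background
    # Numerically stable logistic function to turn log-odds into a probability.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    z = math.exp(log_odds)
    return z / (1.0 + z)


topic_vectors = [{"loan": 5, "bank": 3}, {"interest": 4, "loan": 2}]
p_related = subject_probability("loan interest bank", topic_vectors, vocab_size=100)
p_unrelated = subject_probability("weather sunny today", topic_vectors, vocab_size=100)
print(p_related > p_unrelated)  # True
```

Text built from the subject's keywords scores close to 1, while text with no keyword overlap falls below the 0.5 prior. Any other calibrated comparison, such as inferring topic proportions with a trained topic model, would serve the same role.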

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: McManis, Jr., Charles E.; Smith, Douglas A.
Primary Examiner(s): DANG, THANH HA T
Application Number: US15/017,403
Publication Number: US 20160232226A1
Time in Patent Office: 1,047 Days
Field of Search: 707740

CPC Class Codes
- G06F 16/313: Selection or weighting of t...
- G06F 16/35: Clustering; Classification
- G06F 16/353: into predefined classes
- G06F 16/381: using identifiers, e.g. bar...
- G06F 40/216: using statistical methods
- G06F 40/263: Language identification
- G06F 40/30: Semantic analysis
