Method for automatic deduction of rules for matching content to categories

US 7,047,236 B2
Filed: 12/31/2002
Issued: 05/16/2006
Est. Priority Date: 12/31/2002
Status: Expired due to Fees

Abstract

Accordingly, the invention is a method for automatic deduction of rules for matching document content to a category within a strange taxonomy, which allows the document to be automatically classified into a proper category for storage in that strange taxonomy. The method includes the steps of spidering the taxonomy to determine its structure and contents, extracting keywords from documents within the strange taxonomy, formulating rules for determining the category from the extracted keywords, and applying the rules to classify a new document whose keywords have been extracted. The taxonomy is strange because the user has no knowledge of its internal structure and needs no such knowledge. The taxonomy may be flat or may be hierarchical, the latter having rules formulated at each level for proceeding to the next level. Variations for creating new and refurbishing old document management systems are disclosed.
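
As an illustration only, the steps named in the abstract (extract keywords from already-categorized documents, formulate keyword-to-category rules, apply them to a new document) can be sketched in Python. This is a minimal sketch under stated assumptions, not the patented implementation: the stop-word list, the tokenizer, the vote-counting rule, and the function names `extract_keywords`, `deduce_rules`, and `classify` are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; the patent does not specify one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "on"}

def extract_keywords(text):
    """Lower-case the text and return its alphabetic tokens minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def deduce_rules(pairings):
    """Build a keyword -> {category: document count} mapping from (text, category) pairs."""
    rules = defaultdict(Counter)
    for text, category in pairings:
        for keyword in set(extract_keywords(text)):
            rules[keyword][category] += 1
    return rules

def classify(text, rules):
    """Return the category with the most keyword votes, or None if nothing matches."""
    votes = Counter()
    for keyword in set(extract_keywords(text)):
        votes.update(rules.get(keyword, {}))
    return votes.most_common(1)[0][0] if votes else None

# Illustrative (document text, category) pairings, as a spider might produce them.
pairings = [
    ("Quarterly revenue and profit report", "finance"),
    ("Annual budget and revenue forecast", "finance"),
    ("Server outage and network incident report", "operations"),
]
rules = deduce_rules(pairings)
print(classify("Revenue forecast for next quarter", rules))  # prints "finance"
```

Counting documents per keyword is only one possible rule form; the claims leave the exact rule representation open.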

20 Claims

1. A method of classifying document content within a strange taxonomy, the strange taxonomy comprising a plurality of first categories in a computer document storage organizational scheme and a plurality of first documents, each first document tagged with at least one first category according to the strange taxonomy, the method comprising the steps of:
- spidering the strange taxonomy to generate at least one pairing of each first document with each first category with which the each first document is tagged, said strange taxonomy having an internal organizational structure that cannot be viewed by a user who is interacting with the strange taxonomy;
  
  creating a rule generation document representing each of the at least one pairings;
  
  parsing a second document according to the rule generation document; and
  
  classifying the parsed second document into a particular first category, said classifying comprising submitting the parsed second document to a classification engine.
- Dependent claims (2, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16):
  - 2. The method of claim 1, wherein the step of spidering the plurality of first documents comprises spidering to retrieve at least one of metadata, a storage location, and a category tag.
  - 3. The method of claim 1, wherein the step of spidering the plurality of first documents tagged with at least one first category according to the strange taxonomy comprises the steps of:
    - spidering the strange taxonomy with a first spider, the first spider adapted to the strange taxonomy being spidered;
      
      creating a third document using the first spider, the third document describing the strange taxonomy, the third document comprising a link to each of the first documents; and
      
      spidering the strange taxonomy with a second spider by spidering the third document created by the first spider, the second spider operable to access each of the first documents through the links in the third document.
  - 4. The method of claim 3, wherein the step of creating the third document comprises creating an XML document.
  - 7. The method of claim 3, further comprising making the third document available for use by document-searching software.
  - 8. The method of claim 1, wherein the step of creating a rule generation document comprises the steps of:
    - receiving a plurality of first-document-category pairings produced by the spidering step;
      
      extracting at least one of a keyword and a pattern of keywords from each of the first documents within the plurality of first documents;
      
      associating each at least one of a keyword and a pattern of keywords in each of the first documents with the at least one first category of the first document from which the at least one of a keyword and a pattern of keywords was extracted; and
      
      generating rules for mapping at least one of a keyword and a pattern of keywords to the first category.
  - 9. The method of claim 8, wherein the step of associating each at least one of a keyword and a pattern of keywords in each of the first documents with the at least one first category of the first document from which the at least one of a keyword and a pattern of keywords was extracted further comprises parsing each first document.
  - 10. The method of claim 8, wherein the step of associating each at least one of a keyword and a pattern of keywords in each of the first documents with the at least one first category of the first document from which the at least one of a keyword and a pattern of keywords was extracted further comprises reading keywords from the metadata of each first document.
  - 11. The method of claim 1, wherein the rule generation document comprises rules for mapping from at least one of a keyword and a pattern of keywords to one or more first categories, the step of parsing a second document according to the rule generation document comprises the steps of:
    - parsing the second document to determine at least one of a keyword and a pattern of keywords;
      
      looking up the at least one of a keyword and a pattern of keywords of the second document in the rule generation document to find at least one of the first categories associated with the at least one of a keyword and a pattern of keywords of the second document;
      
      scoring the found at least one first category according to a predetermined criteria; and
      
      determining from the scoring the at least one first category comprising the classification of the second document.
  - 12. The method of claim 11, wherein the step of scoring according to a predetermined criteria comprises scoring by at least one of:
    - similarity to at least one pattern of keywords associated with a first category;
      
      frequency of keywords in a first category;
      
      commonality of keywords among documents in a first category;
      
      absence of particular keywords among documents in a first category; and
      
      uniqueness of keywords in a first category.
  - 13. The method of claim 12, wherein the step of determining from the scoring at least one first category further comprises the steps of selecting one of:
    - a) the at least one first category having a score comprising an extrema among the alternatives;
      
      b) at least one first category having a score in a predetermined relationship to a predetermined threshold score; and
      
      c) at least one first category having a particular predetermined score.
  - 14. The method of claim 13, wherein the step of selecting further comprises selecting the at least one first category having the first-in-time score meeting the selection criteria.
  - 15. The method of claim 1, wherein the step of classifying the parsed second document into at least one first category comprises at least one of the steps of adding data to the metadata of the second document identifying the at least one first category, tagging the second document according to the taxonomy, and storing the second document in a location associated with the at least one first category.
  - 16. The method of claim 1, wherein the step of classifying the parsed second document into a first category further comprises tagging the parsed second document.
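
The lookup, scoring, and selection steps of claims 11 to 13 might look roughly like the following Python sketch. It assumes a rule table that maps each keyword to per-category weights (standing in for "frequency of keywords in a first category") and shows two of the selection options: the extremum and a threshold. The table contents and the function names are hypothetical.

```python
from collections import Counter

def score_categories(keywords, rules):
    """Sum, per category, the rule weights of every keyword found in the document."""
    scores = Counter()
    for keyword in keywords:
        for category, weight in rules.get(keyword, {}).items():
            scores[category] += weight
    return scores

def select_by_extremum(scores):
    """Option (a): the category with the highest score, or None if nothing matched."""
    return max(scores, key=scores.get) if scores else None

def select_by_threshold(scores, threshold):
    """Option (b): every category whose score meets or exceeds the threshold."""
    return sorted(c for c, s in scores.items() if s >= threshold)

# Hypothetical rule table: keyword -> {category: weight}.
rules = {
    "invoice": {"finance": 3, "legal": 1},
    "contract": {"legal": 3},
    "payment": {"finance": 2},
}
scores = score_categories(["invoice", "payment", "contract"], rules)
print(select_by_extremum(scores))      # prints "finance" (5 against 4)
print(select_by_threshold(scores, 4))  # prints ['finance', 'legal']
```

A threshold can return several categories, which matches the "at least one first category" wording; the extremum option returns exactly one.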

5. A method of classifying document content within a taxonomy, the taxonomy comprising a plurality of first categories in a computer document storage organizational scheme and a plurality of first documents, each first document tagged with at least one first category according to the taxonomy, the method comprising the steps of:
- spidering the taxonomy to generate at least one pairing of each first document with each first category with which the each first document is tagged;
  
  creating a rule generation document representing each of the at least one pairings;
  
  parsing a second document according to the rule generation document; and
  
  classifying the parsed second document into a particular first category, said classifying comprising submitting the parsed second document to a classification engine, wherein the taxonomy comprises a strange taxonomy and wherein the step of spidering the plurality of first documents tagged with at least one first category according to the taxonomy comprises the steps of:
  
  spidering the strange taxonomy with a first spider, the first spider adapted to the strange taxonomy being spidered;
  
  creating a third document using the first spider, the third document describing the strange taxonomy, the third document comprising a link to each of the first documents; and
  
  spidering the strange taxonomy with a second spider by spidering the third document created by the first spider, the second spider operable to access each of the first documents through the links in the third document, wherein the steps of spidering the strange taxonomy with the first spider and creating a third document comprise steps taken after the second document is classified into the taxonomy, the second document thereby becoming a first document within the plurality of first documents.

6. A method of classifying document content within a taxonomy, the taxonomy comprising a plurality of first categories in a computer document storage organizational scheme and a plurality of first documents, each first document tagged with at least one first category according to the taxonomy, the method comprising the steps of:
- spidering the taxonomy to generate at least one pairing of each first document with each first category with which the each first document is tagged;
  
  creating a rule generation document representing each of the at least one pairings;
  
  parsing a second document according to the rule generation document; and
  
  classifying the parsed second document into a particular first category, said classifying comprising submitting the parsed second document to a classification engine, wherein the taxonomy comprises a strange taxonomy and wherein the step of spidering the plurality of first documents tagged with at least one first category according to the taxonomy comprises the steps of:
  
  spidering the strange taxonomy with a first spider, the first spider adapted to the strange taxonomy being spidered;
  
  creating a third document using the first spider, the third document describing the strange taxonomy, the third document comprising a link to each of the first documents; and
  
  spidering the strange taxonomy with a second spider by spidering the third document created by the first spider, the second spider operable to access each of the first documents through the links in the third document, wherein the step of spidering the strange taxonomy with a second spider comprises the step of spidering the strange taxonomy with a second spider after the second document is presented for classification within the taxonomy.
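
The two-spider arrangement recited in claims 3 to 6 can be illustrated with a short Python sketch. It assumes an in-memory dictionary standing in for the repository, made-up `doc://` links, and a hypothetical `fetch` helper; none of these come from the patent. Following claim 4, the intermediate "third document" is XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a repository whose layout only the first spider understands.
REPOSITORY = {
    "finance": {"doc://fin/1": "Quarterly revenue report"},
    "legal": {"doc://leg/1": "Supplier contract terms"},
}

def first_spider(repository):
    """Walk the repository and describe it as XML with one link per document."""
    root = ET.Element("taxonomy")
    for category, documents in repository.items():
        node = ET.SubElement(root, "category", name=category)
        for link in documents:
            ET.SubElement(node, "document", href=link)
    return ET.tostring(root, encoding="unicode")

def fetch(link):
    """Hypothetical loader; a real spider would read from a network or file system."""
    for documents in REPOSITORY.values():
        if link in documents:
            return documents[link]
    raise KeyError(link)

def second_spider(xml_text):
    """Read only the XML description and return (document text, category) pairings."""
    pairings = []
    for node in ET.fromstring(xml_text).iter("category"):
        for doc in node.iter("document"):
            pairings.append((fetch(doc.get("href")), node.get("name")))
    return pairings

description = first_spider(REPOSITORY)
pairings = second_spider(description)
print(pairings)
```

Only `first_spider` depends on the repository's layout; `second_spider` works from the XML alone, which is one plausible reading of why the claims separate the two.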

17. A method of classifying document content within a taxonomy, the taxonomy comprising a plurality of first categories in a computer document storage organizational scheme and a plurality of first documents, each first document tagged with at least one first category according to the taxonomy, the method comprising the steps of:
- spidering the taxonomy to generate at least one pairing of each first document with each first category with which the each first document is tagged;
  
  creating a rule generation document representing each of the at least one pairings;
  
  parsing a second document according to the rule generation document; and
  
  classifying the parsed second document into a particular first category, said classifying the parsed second document into the particular first category comprising submitting the parsed second document to a classification engine, wherein the taxonomy comprises a plurality of strange taxonomies, and further wherein:
  
  the step of creating a rule generation document comprises generating a single rule generation document for the plurality of strange taxonomies; and
  
  the step of classifying the parsed second document into at least one first category comprises the steps of:
  
  classifying the parsed second document into one strange taxonomy within the plurality of strange taxonomies; and
  
  classifying the parsed second document into one category within the plurality of categories within the strange taxonomy;
  
  the method operable to select one strange taxonomy among the plurality of strange taxonomies within which to classify the second document.

18. A method of classifying document content within a taxonomy, the taxonomy comprising a plurality of first categories in a computer document storage organizational scheme and a plurality of first documents, each first document tagged with at least one first category according to the taxonomy, the method comprising the steps of:
- spidering the taxonomy to generate at least one pairing of each first document with each first category with which the each first document is tagged;
  
  creating a rule generation document representing each of the at least one pairings;
  
  parsing a second document according to the rule generation document; and
  
  classifying the parsed second document into a particular first category, said classifying the parsed second document into the particular first category comprising submitting the parsed second document to a classification engine, wherein the taxonomy comprises a hierarchy of strange taxonomies, and further wherein:
  
  the step of creating a rule generation document comprises at least one of:
  
  generating at least one rule within the rule generation document for each strange taxonomy within the hierarchy of strange taxonomies; and
  
  creating a rule generation document for each level of the hierarchy of strange taxonomies; and
  
  the step of classifying the parsed second document into at least one first category comprises the steps of:
  
  classifying the parsed second document into at least one strange taxonomy within the hierarchy of strange taxonomies; and
  
  classifying the parsed second document into at least one first category within the at least one strange taxonomy within the hierarchy of strange taxonomies.
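
One way to picture the two-level classification of claim 18 is a rule set per level: one to choose a taxonomy, then one per taxonomy to choose a category. The Python sketch below assumes simple keyword-count matching and invented rule tables (`TAXONOMY_RULES`, `CATEGORY_RULES`); the claim itself does not prescribe either.

```python
def best_match(keywords, rules):
    """Return the label matched by the most keywords, or None when nothing matches."""
    counts = {}
    for keyword in keywords:
        for label in rules.get(keyword, ()):
            counts[label] = counts.get(label, 0) + 1
    return max(counts, key=counts.get) if counts else None

# Hypothetical top-level rules: keyword -> taxonomies.
TAXONOMY_RULES = {
    "invoice": ["accounting"],
    "ledger": ["accounting"],
    "patch": ["engineering"],
    "server": ["engineering"],
}
# Hypothetical second-level rules, one table per taxonomy: keyword -> categories.
CATEGORY_RULES = {
    "accounting": {"invoice": ["payables"], "ledger": ["reporting"]},
    "engineering": {"patch": ["releases"], "server": ["infrastructure"]},
}

def classify_hierarchically(keywords):
    """Pick a taxonomy first, then a category inside it; None if no taxonomy matches."""
    taxonomy = best_match(keywords, TAXONOMY_RULES)
    if taxonomy is None:
        return None
    return taxonomy, best_match(keywords, CATEGORY_RULES[taxonomy])

print(classify_hierarchically(["server", "outage"]))  # prints ('engineering', 'infrastructure')
```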

19. A method of classifying document content within a taxonomy, the taxonomy comprising a plurality of first categories in a computer document storage organizational scheme and a plurality of first documents, each first document tagged with at least one first category according to the taxonomy, the method comprising the steps of:
- spidering the taxonomy to generate at least one pairing of each first document with each first category with which the each first document is tagged;
  
  creating a rule generation document representing each of the at least one pairings;
  
  parsing a second document according to the rule generation document; and
  
  classifying the parsed second document into a particular first category, said classifying the parsed second document into the particular first category comprising submitting the parsed second document to a classification engine, wherein the rule generation document comprises rules for mapping from at least one of a keyword and a pattern of keywords to one or more first categories, and wherein the step of parsing the second document according to the rule generation document comprises the steps of:
  
  finding no keywords in the parsed second document similar to keywords in the rule generation document;
  
  creating a new category within the taxonomy; and
  
  classifying the second document in the new category.
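
The fallback in claim 19 (no matching keywords, so a new category is created) could be sketched as follows in Python. Several points are assumptions rather than anything the claim states: exact keyword equality stands in for "similar", the new category's name is generated from the taxonomy size, and the new document's keywords seed the rules for the new category.

```python
def classify_or_create(doc_id, keywords, rules, taxonomy):
    """Place doc_id in a matching category, or in a newly created one if none matches."""
    matches = [rules[k] for k in keywords if k in rules]
    if matches:
        # Choose the category named by the largest number of matching keywords.
        category = max(set(matches), key=matches.count)
    else:
        # Hypothetical naming scheme; the claim does not say how the category is named.
        category = f"new-category-{len(taxonomy) + 1}"
        taxonomy[category] = []
        # Assumption: the new document's keywords become rules for the new category.
        for keyword in keywords:
            rules[keyword] = category
    taxonomy[category].append(doc_id)
    return category

rules = {"invoice": "finance", "contract": "legal"}
taxonomy = {"finance": ["d1"], "legal": ["d2"]}
first = classify_or_create("d3", ["invoice", "overdue"], rules, taxonomy)
second = classify_or_create("d4", ["telescope", "orbit"], rules, taxonomy)
third = classify_or_create("d5", ["orbit"], rules, taxonomy)
print(first, second, third)  # prints "finance new-category-3 new-category-3"
```

Because the seeded rules are kept, the third document lands in the category created for the second one instead of spawning another.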

20. A method for categorizing the content of a new document within a strange taxonomy, the strange taxonomy comprising a plurality of first categories and a plurality of first documents within at least one of the first categories, wherein a root node for the strange taxonomy has been provided, the plurality of first documents being stored on a computer-readable storage device, the method being implemented through execution of computer readable program code by a processor of a computer system, said computer readable program code being stored on a computer usable medium, the method comprising the steps of:
- automatically spidering the strange taxonomy to identify each first category and each document among the plurality of first documents classified within each respective first category;
  
  automatically forming pairs for each of the first documents, each pair comprising one of the first documents and the category within which the one of the first documents is classified;
  
  automatically extracting at least one of a keyword and a pattern of keywords from each of the first documents in each of the first categories;
  
  automatically associating at least one of a keyword and a pattern of keywords extracted from each of the first documents within each of the first categories with the first category in which the first documents are classified;
  
  automatically generating rules, each rule mapping at least one of a keyword and patterns of keywords to the first category in which the first documents containing the at least one of a keywords and a pattern of keywords are classified;
  
  automatically parsing an unclassified document to determine new keywords therein; and
  
  automatically classifying the unclassified document into at least one of a new category and a first category having documents containing at least one of keywords and patterns of keywords similar to the new keywords.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Gosby, Desiree D. G.; Conroy, William F.
Primary Examiner(s): Metjahic, Safet
Assistant Examiner(s): DANG, THANH HA T
Application Number: US10/335,351
Publication Number: US 20040139059A1
Time in Patent Office: 1,232 Days
Field of Search: 707/3, 707/101
US Class Current: 1/1
CPC Class Codes:
G06F 16/93 Document management systems
Y10S 707/99933 Query processing, i.e. sear...
