Very-large-scale automatic categorizer for web content

US 6,826,576 B2
Filed: 09/25/2001
Issued: 11/30/2004
Est. Priority Date: 05/07/2001
Status: Expired due to Term



Abstract

A method and apparatus for efficiently classifying and categorizing data objects such as electronic text, graphics, and audio based documents within very-large-scale hierarchical classification trees is provided. In accordance with one embodiment of the invention, a first node of a plurality of nodes of a subject hierarchy is selected. Previously classified data objects corresponding to a selected first node of a subject hierarchy as well as any associated sub-nodes of the selected node are aggregated to form a content class of data objects. Similarly, data objects corresponding to sibling nodes of the selected node and any associated sub-nodes of the sibling nodes are then aggregated to form an anti-content class of data objects. Features are then extracted from each of the content class of data objects and the anti-content class of data objects to facilitate characterization of said previously classified data objects.
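The aggregation step described in the abstract can be illustrated with a minimal Python sketch. The `Node` structure, the function names, and the sample hierarchy are hypothetical and chosen only for illustration; the patent text does not prescribe any particular data structure. The sketch follows the "all sub-nodes" and "all sibling nodes" reading of the aggregation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """One node of a subject hierarchy (hypothetical structure, for illustration only)."""
    name: str
    documents: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add_child(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def subtree_documents(node: Node) -> List[str]:
    """Collect the documents filed at a node and at all of its sub-nodes."""
    docs = list(node.documents)
    for child in node.children:
        docs.extend(subtree_documents(child))
    return docs


def content_class(node: Node) -> List[str]:
    """Documents of the selected node plus its sub-nodes."""
    return subtree_documents(node)


def anti_content_class(node: Node) -> List[str]:
    """Documents of the selected node's siblings plus the siblings' sub-nodes."""
    if node.parent is None:
        return []  # a root node has no siblings
    docs: List[str] = []
    for sibling in node.parent.children:
        if sibling is not node:
            docs.extend(subtree_documents(sibling))
    return docs


# Small made-up hierarchy: Top -> {Sports -> Soccer, Arts -> Music}
root = Node("Top")
sports = root.add_child(Node("Sports", ["sports overview"]))
soccer = sports.add_child(Node("Soccer", ["soccer rules"]))
arts = root.add_child(Node("Arts", ["arts overview"]))
music = arts.add_child(Node("Music", ["jazz history"]))

print(content_class(sports))       # documents under Sports
print(anti_content_class(sports))  # documents under the sibling Arts
```

Selecting `sports` puts the Sports and Soccer documents in the content class and the Arts and Music documents in the anti-content class; feature extraction would then operate on those two collections.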


22 Claims

1. A method of training a classifier system by utilizing previously classified data objects comprising one or more electronic documents including at least one of a text document, an image file, an audio sequence, a video sequence, and a hybrid document including a combination of text and images, said previously classified data objects being organized into a subject hierarchy of a plurality of nodes, the method comprising:

selecting one node of the plurality of nodes;

aggregating those of the previously classified data objects corresponding to the selected node and any associated sub-nodes of the selected node, to form a content class of data objects, said content class of data objects comprising a content class of the one or more electronic documents;

aggregating those of the previously classified data objects corresponding to any associated sibling nodes of the selected node and any associated sub-nodes of the sibling nodes to form an anti-content class of data objects, said anti-content class of data objects comprising an anti-content class of the one or more electronic documents; and

extracting features from at least one of the content class of data objects and the anti-content class of data objects to facilitate characterization of said previously classified data objects.

2. The method of claim 1, wherein said node is a root node.
3. The method of claim 1, wherein said aggregating those of the previously classified data objects corresponding to the selected node and any associated sub-nodes comprises aggregating those of the previously classified data objects corresponding to the selected node and all of said associated sub-nodes.
4. The method of claim 1, wherein said aggregating those of the previously classified data objects corresponding to any associated sibling nodes comprises aggregating those of the previously classified data objects corresponding to all of said associated sibling nodes.
5. The method of claim 1, wherein said aggregating those of the previously classified data objects corresponding to any associated sibling nodes of the selected node and any associated sub-nodes of the sibling nodes comprises aggregating those of the previously classified data objects corresponding to all of said associated sibling nodes of the selected node and all of said associated sub-nodes.
6. The method of claim 1, wherein said extracting features further comprises determining which of said extracted features are salient features and creating said content and anti-content class of data objects based upon said salient features.
7. The method of claim 6, wherein said determining which extracted features are salient further comprises:
8. The method of claim 7, wherein the cumulative frequencies of occurrence of said first set of features is approximately 20 percent of the cumulative frequencies of occurrence of said second set of features, and the cumulative frequencies of occurrence of said second set of features is approximately 80 percent of the cumulative frequencies of occurrence of said first set of features.
9. The method of claim 6, wherein said salient features are n-gram salient features.
10. The method of claim 6, wherein said extracted features are determined to be salient based upon mutual information techniques.

11. An apparatus comprising:

a storage medium having stored therein a plurality of programming instructions designed to implement a plurality of functions of a category name service for providing a category name to a data object, including first one or more functions to select a first node of a hierarchically organized classifier having a plurality of nodes and one or more previously classified data objects associated with each of said plurality of nodes, aggregate those of the previously classified data objects corresponding to the selected node and any associated sub-nodes of the selected node to form a content class of data objects, aggregate those of the previously classified data objects corresponding to any associated sibling nodes of the selected node and any associated sub-nodes of the sibling nodes to form an anti-content class of data objects, extract features from at least one of the content class of data objects and the anti-content class of data objects to facilitate characterization of said previously classified data objects; and

a processor coupled to the storage medium to execute the programming instructions.

12. The apparatus of claim 11, wherein said first node is a root node.
13. The apparatus of claim 11, wherein said plurality of instructions to aggregate those of the previously classified data objects corresponding to the selected node and any associated sub-nodes further comprise instructions to aggregate those of the previously classified data objects corresponding to the selected node and all of said associated sub-nodes.
14. The apparatus of claim 11, wherein said plurality of instructions to aggregate those of the previously classified data objects corresponding to any associated sibling nodes further comprise instructions to aggregate those of the previously classified data objects corresponding to all of said associated sibling nodes.
15. The apparatus of claim 11, wherein said plurality of instructions to aggregate those of the previously classified data objects corresponding to any associated sibling nodes of the selected node and any associated sub-nodes of the sibling nodes further comprise instructions to aggregate those of the previously classified data objects corresponding to all of said associated sibling nodes of the selected node and all of said associated sub-nodes.
16. The apparatus of claim 11, wherein said plurality of instructions to extract features further comprise instructions to determine which of said extracted features are salient features and creating said content and anti-content class of data objects based upon said salient features.
17. The apparatus of claim 16, wherein said plurality of instructions to determine which extracted features are salient further comprise instructions to
18. The apparatus of claim 17, wherein the cumulative frequencies of occurrence of said first set of features is approximately 20 percent of the cumulative frequencies of occurrence of said second set of features, and the cumulative frequencies of occurrence of said second set of features is approximately 80 percent of the cumulative frequencies of occurrence of said first set of features.
19. The apparatus of claim 16, wherein said salient features are n-gram salient features.
20. The method of claim 16, wherein said salient features are determined based upon mutual information techniques.
21. The apparatus of claim 11, wherein said data object comprises an electronic document.
22. The apparatus of claim 21, wherein said electronic document comprises at least one of a text document, an image file, an audio sequence, a video sequence, and a hybrid document including a combination of text and images.
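
Claims 9, 10, 19, and 20 name n-gram salient features and mutual information techniques without giving a formula. The sketch below is one possible reading, not the patented procedure: it assumes word-level n-grams, treats each n-gram as a present/absent feature per document, and scores it with the standard mutual information between feature presence and class membership (content versus anti-content). All function names and sample documents are hypothetical.

```python
import math
from collections import Counter
from typing import List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int = 1) -> Set[NGram]:
    """Set of word n-grams in a document (assumes whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def mutual_information(n11: int, n10: int, n01: int, n00: int) -> float:
    """Mutual information (in bits) between feature presence and class.

    n11: content docs containing the feature
    n10: anti-content docs containing the feature
    n01: content docs lacking the feature
    n00: anti-content docs lacking the feature
    """
    total = n11 + n10 + n01 + n00
    if total == 0:
        return 0.0
    mi = 0.0
    # Each tuple: (joint count, marginal count for the feature value, marginal count for the class)
    for joint, feat_count, class_count in (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ):
        if joint > 0:
            mi += (joint / total) * math.log2(joint * total / (feat_count * class_count))
    return mi


def rank_features(content_docs: List[str], anti_docs: List[str],
                  n: int = 1) -> List[Tuple[NGram, float]]:
    """Rank n-gram features by mutual information, highest first."""
    content_sets = [word_ngrams(d, n) for d in content_docs]
    anti_sets = [word_ngrams(d, n) for d in anti_docs]
    in_content = Counter(f for s in content_sets for f in s)
    in_anti = Counter(f for s in anti_sets for f in s)
    scores = {}
    for feat in set(in_content) | set(in_anti):
        n11, n10 = in_content[feat], in_anti[feat]
        scores[feat] = mutual_information(
            n11, n10, len(content_sets) - n11, len(anti_sets) - n10)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up documents standing in for a content class and an anti-content class
ranked = rank_features(["goal keeper saves", "late goal wins"],
                       ["jazz trio plays", "jazz club opens"])
for feature, score in ranked[:2]:
    print(feature, round(score, 3))
```

In this toy input, "goal" appears in every content document and "jazz" in every anti-content document, so both separate the two classes perfectly and receive the highest score; features that occur in only one document rank lower.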


Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Lulich, Daniel P.; Guilak, Farzin G.
Primary Examiner(s): AMSBURY, WAYNE P
Application Number: US09/963,178
Publication Number: US 20020174095A1
Time in Patent Office: 1,162 Days
Field of Search: 707/102, 707/7, 707/6, 706/14
US Class Current: 707/740

CPC Class Codes
G06F 16/3323   using document space presen...
G06F 16/951   Indexing; Web crawling tech...
G06F 16/954   Navigation, e.g. using cate...
Y10S 707/914   Video
Y10S 707/915   Image
Y10S 707/916   Audio
Y10S 707/917   Text
Y10S 707/955   Object-oriented
Y10S 707/956   Hierarchical
Y10S 707/99937   Sorting
Y10S 707/99943   Generating database or data...
