Method and apparatus for automatically determining salient features for object classification

US 6,938,025 B1
Filed: 09/25/2001
Issued: 08/30/2005
Est. Priority Date: 05/07/2001
Status: Expired due to Fees

3 Assignments

0 Petitions

Abstract

A method and apparatus for automatically determining salient features for object classification is provided. In accordance with one embodiment, one or more unique features are extracted from a first content group of objects to form a first feature list, and one or more unique features are extracted from a second anti-content group of objects to form a second feature list. A ranked list of features is then created by applying statistical differentiation between unique features of the first feature list and unique features of the second feature list. A set of salient features is then identified from the resulting ranked list of features.
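
As a rough illustration of the extraction step described in the abstract, the sketch below builds one feature list per group. It assumes that a feature is a lowercase run of alphanumeric characters and that a feature list maps each unique feature to the number of objects containing it. The function name `extract_feature_list` and the sample documents are hypothetical and do not come from the patent.

```python
import re
from collections import Counter

def extract_feature_list(documents):
    """Map each unique feature to the number of documents that contain it.

    Assumption: a feature is a run of alphanumeric characters, lowercased.
    """
    counts = Counter()
    for text in documents:
        # set() ensures a feature is counted at most once per document
        counts.update(set(re.findall(r"[a-z0-9]+", text.lower())))
    return counts

# Hypothetical sample data: a content group and an anti-content group
content = ["jazz guitar chords", "jazz piano scales"]
anti_content = ["rock guitar riffs"]

first_feature_list = extract_feature_list(content)
second_feature_list = extract_feature_list(anti_content)
```

Counting documents rather than raw occurrences keeps the lists compatible with the per-object counts that the dependent claims rely on.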

130 Citations

31 Claims

1. A method for classifying one or more electronic documents, said method comprising:
- extracting one or more unique features from a first content group of data objects representing a first group of electronic documents to form a first feature list;
  
  extracting one or more unique features from a second anti-content group of data objects representing a second group of electronic documents to form a second feature list;
  
  identifying those unique features of said first feature list that are not present in said second feature list;
  
  identifying those unique features of said first feature list that are also present in said second feature list;
  
  creating a ranked list of features by applying statistical differentiation between unique features of said first feature list and unique features of said second feature list, wherein those unique features of said first feature list that are not present in said second feature list are ranked higher within said ranked list as compared to those unique features of said first feature list that are also present in said second feature list;
  
  identifying a set of salient features from said ranked list of features, wherein the set of salient features distinguishes the first group of electronic documents from the second group of electronic documents; and
  
  classifying the first group of electronic documents and the second group of electronic documents based on the set of salient features.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 15)
- - 2. The method of claim 1, further comprising:
    - determining a first total number of data objects comprising said first content group of data objects; and
      
      determining a second total number of data objects comprising said second anti-content group of data objects.
  - 3. The method of claim 2, further comprising:
    - determining, for each of said one or more unique features forming said first feature list, a first number of data objects of said first content group of data objects that contain at least one instance of each respective said one or more unique features of said first feature list; and
      
      determining, for each of said one or more unique features forming said second feature list, a second number of data objects of said second anti-content group of data objects that contain at least one instance of each respective said one or more unique features of said second feature list.
  - 4. The method of claim 3, further comprising:
    - applying a probabilistic function to each of those unique features of said first feature list that are also present in said second feature list to obtain a result vector, wherein said probabilistic function comprises a ratio of the first number of data objects divided by said first total number of data objects, to said second number of data objects divided by said second total number of data objects; and
      
      ordering those unique features of said first feature list that are also present in said second feature list within said ranked list based at least in part upon the result vector of said probabilistic function.
  - 5. The method of claim 3, wherein those unique features of said first feature list that are not present in said second feature list are further ranked based upon the first number of data objects.
  - 6. The method of claim 1, wherein identifying said set of salient features from said ranked list of features comprises selecting a first N contiguous features of said ranked list of features.
  - 7. The method of claim 1, wherein identifying said set of salient features from said ranked list of features comprises selecting a last M contiguous features of said ranked list of features.
  - 8. The method of claim 1, wherein each of said unique features comprises a grouping of one or more alphanumeric characters.
  - 9. The method of claim 1, further comprising:
    - classifying a new data object as being most related to one of said first content group of data objects and said second anti-content group of data objects based at least in part upon said set of salient features.
  - 10. The method of claim 1, wherein said first content group of data objects comprises those data objects corresponding to a selected node of a subject hierarchy having a plurality of nodes and any associated sub-nodes of the selected node;
    - and wherein said second anti-content group of data objects comprises those data objects corresponding to any associated sibling nodes of the selected node and any associated sub-nodes of the sibling nodes.
  - 15. The method of claim 1, wherein identifying as salient comprises selecting a last M consecutive unique features from said ranked list of unique features.
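
The ranking and selection recited in claims 1 through 7 can be sketched as follows. This is one possible reading, not the patented implementation: the names `rank_features` and `select_salient` are hypothetical, the descending sort direction for the ratio of claim 4 is an assumption (the claim only says the order is based "at least in part" on the result), and alphabetical tie-breaking is added purely to make the output deterministic.

```python
def rank_features(first_counts, second_counts, first_total, second_total):
    """Rank features of the first list; features absent from the second list come first."""
    exclusive = [f for f in first_counts if f not in second_counts]
    shared = [f for f in first_counts if f in second_counts]

    # Claim 5: exclusive features are further ranked by first-group object count.
    exclusive.sort(key=lambda f: (-first_counts[f], f))

    # Claim 4: ratio of (first count / first total) to (second count / second total).
    def ratio(f):
        return (first_counts[f] / first_total) / (second_counts[f] / second_total)

    # Assumption: a larger ratio ranks higher.
    shared.sort(key=lambda f: (-ratio(f), f))
    return exclusive + shared

def select_salient(ranked, n=0, m=0):
    """Claims 6 and 7: the first N and the last M contiguous features."""
    head = ranked[:n]
    tail = ranked[-m:] if m > 0 else []
    return head, tail

# Hypothetical per-feature object counts for two groups of four objects each
first_counts = {"jazz": 4, "chord": 3, "guitar": 2, "music": 4}
second_counts = {"guitar": 3, "music": 4, "rock": 4}

ranked = rank_features(first_counts, second_counts, first_total=4, second_total=4)
head, tail = select_salient(ranked, n=2, m=1)
```

With these sample counts, "jazz" and "chord" lead the list because they never occur in the second group, while "guitar" falls to the end because it is relatively more common there.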

11. A method for classifying one or more electronic documents, the method comprising:
- identifying one or more unique features that are members of a first data class, said first data class comprising a first group of electronic documents;
  
  examining a second data class to identify those of said one or more unique features that are also members of said second data class, and those of said one or more unique features that are not members of said second data class, said second data class comprising a second group of electronic documents;
  
  generating a ranked list of unique features having an order based upon membership of each of said one or more unique features within said second data class, wherein those of said unique features that are not members of said second data class are ranked higher in said ranked list than those of said unique features that are also members of said second data class;
  
  identifying as salient one or more of said ranked list of unique features, wherein said one or more of said ranked list of unique features identified as salient distinguish the first group of electronic documents from the second group of electronic documents; and
  
  classifying the first group of electronic documents from the second group of electronic documents based on said one or more of said ranked list of unique features identified as salient.
- View Dependent Claims (12, 13, 14)
- - 12. The method of claim 11, further comprising:
    - determining, for each of said ranked list of unique features, a number of objects within said first data class that contain each respective unique feature.
  - 13. The method of claim 12, wherein generating a ranked list further comprises ranking those of said unique features that belong to a greater number of objects of said first data class higher in said ranked list than those of said unique features that belong to a lesser number of objects within said first data class.
  - 14. The method of claim 11, wherein identifying as salient comprises selecting a first set of N consecutive unique features from said ranked list of unique features.
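
Claim 11 ends with a classifying step based on the features identified as salient, but the claims do not fix a scoring rule. One simple possibility, offered purely as an assumption, is to count how many salient features appear in a document and compare the count with a threshold. The function `classify`, its `threshold` parameter, and the labels "first" and "second" are hypothetical.

```python
import re

def classify(text, salient, threshold=1):
    """Label a document by how many salient features it contains.

    Assumption: features are lowercase alphanumeric tokens, and a document
    with at least `threshold` salient features belongs to the first class.
    """
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    hits = len(tokens & set(salient))
    return "first" if hits >= threshold else "second"

# Hypothetical salient set, e.g. the first N features of a ranked list
salient = ["jazz", "chord"]
label_a = classify("Jazz chord chart", salient)
label_b = classify("rock riffs", salient)
```

A weighted score, for example one that uses rank position, would be an equally plausible reading of the claim language.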

16. An apparatus for classifying one or more electronic documents, said apparatus comprising:
- a storage medium having stored therein a plurality of programming instructions designed to implement a plurality of functions of a category name service for providing a category name to a data object, including first one or more functions to extract one or more unique features from a first content group of data objects representing a first group of electronic documents to form a first feature list, extract one or more unique features from a second anti-content group of data objects representing a second group of electronic documents to form a second feature list, identify those unique features of said first feature list that are not present in said second feature list, identify those unique features of said first feature list that are also present in said second feature list, create a ranked list of features by applying statistical differentiation between unique features of said first feature list and unique features of said second feature list, wherein those unique features of said first feature list that are not present in said second feature list are ranked higher within said ranked list as compared to those unique features of said first feature list that are also present in said second feature list, identify a set of salient features from said ranked list of features, wherein said set of salient features distinguishes the first group of electronic documents from the second group of electronic documents, and classify the first group of electronic documents and the second group of electronic documents based on the set of salient features.
- View Dependent Claims (17, 18, 19, 20, 21, 22, 23, 24, 25, 26)
- - 17. The apparatus of claim 16, wherein each of said first content group of data objects and said second anti-content group of data objects comprises one or more data objects.
  - 18. The apparatus of claim 16, wherein said plurality of instructions further comprises instructions to determine a first total number of data objects comprising said first content group of data objects, and determine a second total number of data objects comprising said second anti-content group of data objects.
  - 19. The apparatus of claim 16, wherein said plurality of instructions further comprises instructions to determine, for each of said one or more unique features forming said first feature list, a first number of data objects of said first content group of data objects that contain at least one instance of each respective said one or more unique features of said first feature list, and determine, for each of said one or more unique features forming said second feature list, a second number of data objects of said second anti-content group of data objects that contain at least one instance of each respective said one or more unique features of said second feature list.
  - 20. The apparatus of claim 17, wherein said plurality of instructions further comprises instructions to apply a probabilistic function to each of those unique features of said first feature list that are also present in said second feature list to obtain a result vector, wherein said probabilistic function comprises a ratio of the first number of data objects divided by said first total number of data objects, to said second number of documents divided by said second total number of data objects, and order those unique features of said first feature list that are also present in said second feature list within said ranked list based at least in part upon the result vector of said probabilistic function.
  - 21. The apparatus of claim 17, wherein those unique features of said first feature list that are not present in said second feature list are further ranked based upon the first number of data objects.
  - 22. The apparatus of claim 16, wherein said plurality of instructions to identify said set of salient features from said ranked list of features further comprises instructions to select a first N contiguous features of said ranked list of features.
  - 23. The apparatus of claim 16, wherein said plurality of instructions to identify said set of salient features from said ranked list of features further comprises instructions to select a last M contiguous features of said ranked list of features.
  - 24. The apparatus of claim 16, wherein each of said unique features comprises a grouping of one or more alphanumeric characters.
  - 25. The apparatus of claim 16, wherein said plurality of instructions further comprises instructions to classify a new data object as being most related to one of said first content group of data objects and said second anti-content group of data objects based at least in part upon said set of salient features.
  - 26. The apparatus of claim 16, wherein said first content group of data objects comprises those data objects corresponding to a selected node of a subject hierarchy having a plurality of nodes and any associated sub-nodes of the selected node;
    - and wherein said second anti-content group of data objects comprises those data objects corresponding to any associated sibling nodes of the selected node and any associated sub-nodes of the sibling nodes.
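
The grouping recited in claim 26 (and in method claim 10) can be illustrated with a small subject hierarchy. The sketch below assumes the hierarchy is stored as a dictionary of nodes, each with its own documents and a list of child nodes. That layout, the helper names `collect` and `content_and_anti_content`, and the sample tree are all hypothetical.

```python
def collect(tree, node):
    """Return the documents at `node` and at all of its sub-nodes."""
    docs = list(tree[node]["docs"])
    for child in tree[node]["children"]:
        docs.extend(collect(tree, child))
    return docs

def content_and_anti_content(tree, parent, selected):
    """Content: selected node plus sub-nodes. Anti-content: siblings plus their sub-nodes."""
    content = collect(tree, selected)
    anti_content = []
    for sibling in tree[parent]["children"]:
        if sibling != selected:
            anti_content.extend(collect(tree, sibling))
    return content, anti_content

# Hypothetical subject hierarchy; document names are placeholders
tree = {
    "music": {"docs": ["m0"], "children": ["jazz", "rock"]},
    "jazz": {"docs": ["j1"], "children": ["bebop"]},
    "bebop": {"docs": ["b1", "b2"], "children": []},
    "rock": {"docs": ["r1"], "children": ["punk"]},
    "punk": {"docs": ["p1"], "children": []},
}

content, anti_content = content_and_anti_content(tree, "music", "jazz")
```

Documents attached directly to the parent node ("m0" here) fall into neither group, which matches the claim wording that names only the selected node, its siblings, and their sub-nodes.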

27. An apparatus comprising:
- a storage medium having stored therein a plurality of programming instructions designed to implement a plurality of functions including first one or more functions to identify one or more unique features that are members of a first data class, said first data class comprising a first group of electronic documents, examine a second data class to identify those of said one or more unique features that are also members of said second data class, and those of said one or more unique features that are not members of said second data class, said second data class comprising a second group of electronic documents, generate a ranked list of unique features having an order based upon membership of each of said one or more unique features within said second data class, wherein those of said unique features that are not members of said second data class are ranked higher in said ranked list than those of said unique features that are also members of said second data class, and identify as salient one or more of said ranked list of unique features, wherein said salient distinguishes the first group of electronic documents from the second group of electronic documents, and classify the first group of electronic documents and the second group of electronic documents based on the set of salient features.
- View Dependent Claims (28, 29, 30, 31)
- - 28. The apparatus of claim 27, wherein said plurality of instructions further comprises instructions to determine, for each of said ranked list of unique features, a number of objects within said first data class that contain each respective unique feature.
  - 29. The apparatus of claim 28, wherein said plurality of instructions to generate a ranked list further comprises instructions to rank those of said unique features that belong to a greater number of objects of said first data class higher in said ranked list than those of said unique features that belong to a lesser number of objects within said first data class.
  - 30. The apparatus of claim 27, wherein said plurality of instructions to identify as salient further comprises instructions to select a first set of N consecutive unique features from said ranked list of unique features.
  - 31. The apparatus of claim 27, wherein said plurality of instructions to identify as salient further comprises instructions to select a last set of M consecutive unique features from said ranked list of unique features.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Lulich, Daniel P.; Guilak, Farzin G.
Primary Examiner(s): STARKS, WILBERT L
Application Number: US09/963,170
Time in Patent Office: 1,435 Days
Field of Search: 706/45, 382/190, 707/1, 707/6
US Class Current: 706/45
CPC Class Codes: G06F 16/285 (Clustering or classification); Y10S 707/99931 (Database or file accessing)
