INFORMATION CLASSIFICATION PARADIGM

US 20070112756A1
Filed: 03/15/2006
Published: 05/17/2007
Est. Priority Date: 11/15/2005
Status: Active Grant

2 Assignments

0 Petitions

Abstract

A mechanism to classify source documents into one of two categories: likely to contain desired information or unlikely to contain desired information. Generally, some form of rule-based classification is used in conjunction with deeper analysis that applies advanced techniques to difficult cases. The rule-based classification is generally good for eliminating cases from further consideration and for identifying documents of interest based on generally discernible relationships between data or on the presence or absence of data. The deeper analysis is used to uncover more complex relationships between data that may identify documents of interest. Portions of the process may use the entire document, while other portions may use only a portion of the document.

32 Citations

20 Claims

1. A method comprising:
- classifying an initial set of source documents into one of at least three groups based at least in part on at least one classification identifier, the at least three groups comprising a first group containing documents of interest, a second group containing documents not of interest and a third group containing documents where the interest level is undetermined; and
  
  for each document classified into the third group:
  
  extracting from a document at least one feature vector; and
  
  classifying the document into either the first group or the second group based on the at least one feature vector.
- Dependent claims (2, 3, 4):
  - 2. The method of claim 1 wherein the classification identifier is dependent on at least language or locale or both.
  - 3. The method of claim 1 wherein the initial set of source documents is classified into the three groups by:
    - selecting one of the source documents;
      
      examining the source document to determine the presence or absence of the classification identifier;
      
      based on the presence or absence of the classification identifier, classifying the source document either into the second group or identifying the source document as requiring further examination;
      
      if the source document requires further examination, then extracting from the source document a snippet based at least in part on the classification identifier; and
      
      searching the snippet for further information and based on the presence or absence of the information, classifying the source document into either the first group or the third group.
  - 4. The method of claim 3 wherein the further information is dependent on at least language or locale or both.
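
The rule-based first pass described in claim 3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `RULES` table, the `en-US` identifiers and keywords, the 40-character snippet radius, and the names `Group` and `triage` are all assumptions chosen for illustration, since the claims leave them open and only say they may depend on language or locale.

```python
import re
from enum import Enum


class Group(Enum):
    OF_INTEREST = 1      # first group
    NOT_OF_INTEREST = 2  # second group
    UNDETERMINED = 3     # third group


# Hypothetical locale-dependent rules; the claims do not list any values.
RULES = {
    "en-US": {"identifiers": ["$", "USD"], "keywords": ["price", "sale", "buy"]},
}
SNIPPET_RADIUS = 40  # assumed number of characters kept on each side


def triage(text, locale="en-US"):
    """Classify one document into one of the three groups using rules only."""
    rules = RULES[locale]
    lowered = text.lower()
    pattern = "|".join(re.escape(i.lower()) for i in rules["identifiers"])
    positions = [m.start() for m in re.finditer(pattern, lowered)]
    if not positions:
        # No classification identifier: the document is not of interest.
        return Group.NOT_OF_INTEREST
    for pos in positions:
        # Extract a snippet around the identifier and search it for keywords.
        snippet = lowered[max(0, pos - SNIPPET_RADIUS): pos + SNIPPET_RADIUS]
        if any(keyword in snippet for keyword in rules["keywords"]):
            return Group.OF_INTEREST
    # Identifier present but no supporting keyword: leave for deeper analysis.
    return Group.UNDETERMINED
```

Documents returned as `Group.UNDETERMINED` are the ones claim 1 passes on to feature-vector classification.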

5. A method comprising:
- determining whether at least one currency identifier exists within a document and based on the presence or absence of the at least one currency identifier, classifying the document either as of interest or as requiring further examination;
  
  if the document requires further examination then generating at least one feature vector representing a number of characteristics of the information in the document; and
  
  determining if the generated at least one feature vector indicates that the document is likely to be product related.
- Dependent claims (6, 7, 8):
  - 6. The method as set forth in claim 5 wherein determining whether at least one currency identifier exists within a document further comprises extracting from the document at least one price snippet representing a number of document elements surrounding at least one currency identifier.
  - 7. The method as set forth in claim 5 further comprising identifying whether one or more keyword indicators exist near the currency identifier.
  - 8. The method as set forth in claim 7 further comprising classifying the document as likely of interest if at least one keyword is identified.
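
As a rough illustration of claims 6 to 8, the sketch below extracts a price snippet as a window of tokens around each currency identifier and then checks the window for keyword indicators. The currency pattern, the keyword set, the whitespace tokenization, and the function names are assumptions made for this example; the claims do not define "document elements" or list any keywords.

```python
import re

# Assumed currency identifiers: a few symbols and ISO codes.
CURRENCY_RE = re.compile(r"[$€£¥]|\b(?:USD|EUR|GBP)\b")
# Hypothetical keyword indicators suggesting the amount is a price.
PRICE_KEYWORDS = {"price", "sale", "discount", "shipping"}


def price_snippets(text, window=5):
    """Return the tokens surrounding each token that holds a currency identifier."""
    tokens = text.split()
    return [
        tokens[max(0, i - window): i + window + 1]
        for i, token in enumerate(tokens)
        if CURRENCY_RE.search(token)
    ]


def has_price_keyword(snippet):
    """True if any token in the snippet is a keyword indicator."""
    return any(token.lower().strip(".,:;!?") in PRICE_KEYWORDS for token in snippet)


text = "Special offer today only. Sale price: $19.99 with free shipping on all orders."
snippets = price_snippets(text, window=2)
likely_of_interest = any(has_price_keyword(s) for s in snippets)
```

Here the document elements are whitespace-separated tokens; an implementation working on HTML might instead treat DOM nodes as the elements.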

9. At least one computer-readable medium having executable instructions stored thereon comprising:
- a first classifier adapted to classify an initial set of source documents into one of at least three groups, a first group containing documents of interest, a second group containing documents not of interest and a third group containing documents where the interest level is undetermined, the first classifier having input comprising at least one language dependent classification identifier used to make its classification decisions;
  
  a feature extractor adapted to extract features from a document classified into the third group and to form at least one feature vector; and
  
  a second classifier adapted to receive input comprising the at least one feature vector and language dependent model information and further adapted to classify the document associated with the at least one feature vector into either the first group or the second group based on the at least one feature vector and the language dependent model information.
- Dependent claims (10, 11, 12, 13, 14, 15, 16):
  - 10. The at least one computer readable medium of claim 9 wherein the first classifier performs the method of:
    - determining the presence or absence of the at least one language dependent classification identifier in a source document;
      
      if the at least one classification identifier is absent within the source document then classifying the source document into the second group;
      
      if the at least one classification identifier is present within the source document then extracting a snippet from the document and determining the presence or absence of at least one keyword within the snippet;
      
      if at least one keyword is present within the snippet, then classifying the source document into the first group otherwise classifying the document into the third group.
  - 11. The at least one computer readable medium of claim 9 wherein the at least one language dependent classification identifier comprises a currency identifier.
  - 12. The at least one computer readable medium of claim 11 wherein the at least one keyword comprises information that would indicate the currency identifier is associated with a price.
  - 13. The at least one computer readable medium of claim 9 wherein the at least one feature vector comprises a ratio metric calculated by taking the ratio of the count of at least one type of HTML tag to the total number of HTML tags in a document.
  - 14. The at least one computer readable medium of claim 9 wherein the at least one feature vector comprises at least one of either:
    - an image element;
      
      a link element;
      
      a text element;
      
      or a ratio metric.
  - 15. The at least one computer readable medium of claim 9 further comprising a snippet extractor adapted to extract a snippet of the document which is then used by the feature extractor to extract features for the at least one feature vector.
  - 16. The at least one computer readable medium of claim 15 wherein the snippet is selected based on a subset of the total type of structures available in a source document.
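
The ratio metric of claim 13 can be sketched with Python's standard `html.parser` module. Which tag types to count, and the choice to count start tags only, are assumptions for illustration; the claim only requires the count of at least one tag type divided by the total number of HTML tags.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags by name (names are lowercased by HTMLParser)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_ratio(html, tag_types):
    """Ratio of the count of the given tag types to the total number of tags."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values())
    if total == 0:
        return 0.0  # no tags at all, so the ratio is defined as zero here
    return sum(parser.counts[t] for t in tag_types) / total


def feature_vector(html):
    """Assumed three-component vector: image, link, and table tag ratios."""
    return [
        tag_ratio(html, {"img"}),
        tag_ratio(html, {"a"}),
        tag_ratio(html, {"table", "tr", "td"}),
    ]


page = '<div><img src="a.png"><a href="/buy">Buy</a><p>Text</p></div>'
vector = feature_vector(page)
```

Per claim 15, the same function could be applied to an extracted snippet rather than the whole page.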

17. A system comprising:
- a first classifier adapted to classify an initial set of source documents into one of at least three groups, a first group containing documents of interest, a second group containing documents not of interest and a third group containing documents where the interest level is undetermined, through the following method:
  
  determining the presence or absence of at least one classification identifier in a source document from the initial set of source documents;
  
  if the at least one classification identifier is absent within the source document then classifying the source document into the second group;
  
  if the at least one classification identifier is present within the source document then extracting a snippet from the document and determining the presence or absence of at least one keyword within the snippet;
  
  if at least one keyword is present within the snippet, then classifying the source document into the first group otherwise classifying the document into the third group;
  
  a feature extractor adapted to extract features from the snippet associated with a document classified into the third group and to form at least one feature vector from the snippet; and
  
  a second classifier adapted to receive input comprising the at least one feature vector and model information and further adapted to classify the document associated with the at least one feature vector into either the first group or the second group based on the at least one feature vector and the model information.
- Dependent claims (18, 19, 20):
  - 18. The system of claim 17 wherein the at least one feature vector comprises a ratio metric calculated by taking the ratio of the count of at least one type of HTML tag to the total number of HTML tags in a document.
  - 19. The system of claim 17 wherein the at least one feature vector comprises at least one of either:
    - an image element;
      
      a link element;
      
      a text element;
      
      or a ratio metric.
  - 20. The at least one computer readable medium of claim 17 wherein the at least one language dependent classification identifier comprises a currency identifier.
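
The second classifier in claims 9 and 17 takes a feature vector plus model information and assigns the document to the first or second group. The claims do not name a model type, so the sketch below swaps in a simple logistic (linear) scorer purely as an assumption. The weights, bias, threshold, and the per-language `MODELS` table are hypothetical values, not taken from the patent.

```python
import math

# Hypothetical model information, keyed by language as in claim 9.
MODELS = {
    "en": {"weights": [2.0, 1.0, -0.5], "bias": -1.0},
}


def second_classifier(features, language="en", threshold=0.5):
    """Assign an undetermined document to the 'first' or 'second' group."""
    model = MODELS[language]
    score = model["bias"] + sum(w * x for w, x in zip(model["weights"], features))
    probability = 1.0 / (1.0 + math.exp(-score))  # logistic squashing
    return "first" if probability >= threshold else "second"
```

Any trained binary classifier with the same input and output shape could stand in for this scorer; only documents left in the third group by the first classifier would reach it.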

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Ma, Wei-Ying; Nie, Zaiqing; Wen, Ji-Rong; Sun, Yan-Feng; Jiang, Renkuan
Granted Patent: US 7,529,748 B2
US Class Current: 1/1

CPC Class Codes

G06F 16/353   into predefined classes

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99937   Sorting
