Information classification paradigm

US 7,529,748 B2
Filed: 03/15/2006
Issued: 05/05/2009
Est. Priority Date: 11/15/2005
Status: Active Grant



Abstract

A mechanism to classify source documents into one of two categories, either likely to contain desired information or unlikely to contain desired information. Generally, some form of rules-based classification is used in conjunction with deeper analysis that applies advanced techniques to difficult cases. The rules-based classification is generally good for eliminating cases from further consideration and for identifying documents of interest based on generally discernible relationships between data or based on the presence or absence of data. The deeper analysis is used to uncover more complex relationships between data that may identify documents of interest. Portions of the process may use the entire document while other portions of the process may use only a portion of the document.

286 Citations

13 Claims

1. A machine-implemented method for determining whether documents being searched are relevant to a desired piece of information, the method comprising:
- accessing documents stored in electronic form in a memory;
  
  automatically classifying, by a processor, documents in an initial set of source documents of the electronic documents into one of at least three groups, where the classifying is performed by a first classifier comprising an untrained rules-based classifier applying rules to text of the documents to determine presence and absence of classification identifiers in the initial set of source documents, the first classifier comprising software executing on a machine, the at least three groups comprising a first group containing documents determined by the untrained rules-based classifier to be of interest to the desired piece of information, a second group containing documents determined by the untrained rules-based classifier to be not of interest to the desired piece of information, and a third group containing documents that the untrained rules-based classifier did not place in the first group or the second group, the first classifier classifying by the processor;
  
  determining the presence or absence of at least one classification identifier in a source document;
  
  if the at least one classification identifier is absent within the source document then classifying the source document into the second group;
  
  if the at least one classification identifier is present within the source document then extracting, with a snippet extractor a snippet from the document and determining the presence or absence of at least one keyword within the snippet wherein the snippet is selected based on structures of the source document; and
  
  if at least one keyword is present within the snippet, then classifying the source document into the first group, and otherwise classifying the source document into the third group; and
  
  for each document classified into the third group, the processor;
  
  extracting from a document at least one feature vector; and
  
  classifying the document with a second classifier into either the first group or the second group based on the at least one feature vector, wherein the second classifier comprises a support vector machine (SVM) trained, prior to the classifying, with a plurality of labeled training documents, the labeled training documents having been labeled according to analysis of the documents prior to performance of the method, and where the second classifier comprises software executing on the machine.
  - 2. The method of claim 1 wherein the classification identifier is dependent on at least language or locale or both.
  - 3. The method of claim 1 wherein the initial set of source documents is classified into the three groups by:
    - selecting one of the source documents;
      
      examining the source document to determine the presence or absence of the classification identifier;
      
      based on the presence or absence of the classification identifier, classifying the source document either into the second group or identifying the source document as requiring further examination;
      
      if the source document requires further examination, then extracting from the source document a snippet based at least in part on the classification identifier; and
      
      searching the snippet for further information and based on the presence or absence of the information, classifying the source document into one of the groups.
  - 4. The method of claim 3 wherein the further information is dependent on at least language or locale or both.
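
The first-stage flow recited in claims 1 and 3 can be sketched as follows. This is a minimal illustration, not the patented implementation: the identifier pattern, the keyword list, and the fixed character window used for the snippet are all hypothetical choices made for the example (claim 1 selects the snippet based on document structure, which is simplified here to a window around the identifier).

```python
import re
from enum import Enum


class Group(Enum):
    OF_INTEREST = 1      # "first group" in the claims
    NOT_OF_INTEREST = 2  # "second group"
    UNDECIDED = 3        # "third group", left for the second classifier


# Hypothetical rule inputs, chosen only for illustration.
IDENTIFIER = re.compile(r"[$€£]\s?\d")      # assumed classification identifier
KEYWORDS = ("price", "sale", "our price")   # assumed keywords
WINDOW = 40                                 # assumed snippet radius in characters


def extract_snippet(text, match, window=WINDOW):
    """Return the text surrounding the identifier match (simplified snippet)."""
    start = max(0, match.start() - window)
    end = min(len(text), match.end() + window)
    return text[start:end]


def first_classifier(text):
    """Untrained rules-based step: identifier check, then keyword check."""
    match = IDENTIFIER.search(text)
    if match is None:
        # Identifier absent: second group.
        return Group.NOT_OF_INTEREST
    snippet = extract_snippet(text, match).lower()
    if any(keyword in snippet for keyword in KEYWORDS):
        # Identifier present and keyword found in the snippet: first group.
        return Group.OF_INTEREST
    # Identifier present but no keyword: third group.
    return Group.UNDECIDED
```

Only documents that land in `Group.UNDECIDED` would go on to feature extraction and the trained second classifier.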

5. A storage memory having executable instructions stored thereon, when executed by a processor carry out a method comprising:
- a first classifier comprising an untrained rules-based classifier classifying an initial set of electronic source documents into one of at least three groups by applying rules to text of the documents to determine presence and absence of classification identifiers in the initial set of electronic documents, a first group containing documents determined by the first classifier to have a sufficient probability of being of interest for a given piece of information, a second group containing documents determined by the first classifier to have sufficient probability of being not of interest for the given piece of information, and a third group containing documents analyzed by the first classifier but not determined by the first classifier to have sufficient probability of being of interest or not of interest for the given piece of information, the first classifier having input comprising at least one classification identifier used to make its classification determinations, the first classifier classifying by;
  
  determining the presence or absence of the at least one classification identifier in a source document;
  
  if the at least one classification identifier is absent within the source document then classifying the source document into the second group;
  
  if the at least one classification identifier is present within the source document then extracting, with a snippet extractor, a snippet from the document and determining the presence or absence of at least one keyword within the snippet, wherein the snippet is extracted based on structures of the source document; and
  
  if at least one keyword is present within the snippet, then classifying the source document into the first group, and otherwise classifying the document into the third group;
  
  a feature extractor extracts features from a document classified into the third group and to form at least one feature vector corresponding to the document; and
  
  a second classifier receiving input comprising the at least one feature vector and further classifying the document associated with the at least one feature vector into either the first group or the second group based on the at least one feature vector, the second classifier comprising a learning to classify documents by being trained with labeled training documents prior to performing classification.
  - 6. The storage memory of claim 5 wherein the classification identifier comprises a currency identifier.
  - 7. The storage memory medium of claim 6 wherein the at least one keyword comprises information that would indicate the currency identifier is associated with a price.
  - 8. The storage memory of claim 5 wherein the at least one feature vector comprises a ratio metric calculated by taking the ratio of the count at least one type of HTML tag to the total number of HTML tags in a document.
  - 9. The storage memory of claim 5 wherein the at least one feature vector comprises at least one of either:
    - an image element;
      
      a link element;
      
      a text element;
      
      or a ratio metric.
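
The ratio metric of claim 8 (count of at least one type of HTML tag divided by the total number of HTML tags) can be illustrated with a short sketch. It assumes that only opening tags are counted and uses Python's standard `html.parser`; the names `TagCounter` and `tag_ratio` are hypothetical and do not come from the patent.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags by name (an assumed way to count 'HTML tags')."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_ratio(html, tag_types):
    """Ratio of the count of the given tag types to the total tag count."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values())
    if total == 0:
        return 0.0  # no tags at all, so the ratio is defined as zero here
    return sum(parser.counts[tag] for tag in tag_types) / total


# Example: a feature vector made of an image ratio and a link ratio.
page = '<div><img src="a.png"><img src="b.png"><a href="#">x</a></div>'
feature_vector = [tag_ratio(page, {"img"}), tag_ratio(page, {"a"})]
```

A vector like this is one possible input to the second classifier; claim 9 also allows image, link, and text elements as features.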

10. A computing device for determining whether electronic documents being searched are relevant to a given piece of information, the device comprising:
- a first classifier in a memory of the computing device, the first classifier comprising an untrained rules-based classifier classifying documents in an initial set of source documents of the electronic documents into at least three groups by applying rules to text of the documents to determine presence and absence of classification identifiers in the initial set of source documents, a first group containing documents determined by the first classifier to be of interest to the desired piece of information, a second group containing documents determined by the first classifier not to be of interest to the desired piece of information, and a third group containing documents that the first classifier did not classify into the first group or second group, the first classifier using a processor of the computing device to classify by;
  
  applying rules describing text content to determine the presence or absence of at least one classification identifier in a source document from the initial set of source documents;
  
  if the at least one classification identifier is absent from the source document then classifying the source document into the second group;
  
  if the at least one classification identifier is present within the source document then extracting a snippet from the document containing the classification identifier and determining the presence or absence of at least one keyword within the snippet;
  
  if the at least one keyword is present within the snippet, then classifying the source document into the first group, and otherwise classifying the document into the third group;
  
  a feature extractor, in the memory of the computing device, extracting, with the processor, features from the snippet associated with the document classified into the third group and to form at least one feature vector from the snippet; and
  
  a second classifier, in the memory of the computing device, receiving input comprising the at least one feature vector and further classifying the document associated with the at least one feature vector into either the first group or the second group based on the at least one feature vector, wherein the second classifier learns to classify documents by being trained with labeled training documents prior to the classifying such that classification outcomes for the inputted feature vector depend on the training documents.
  - 11. The device of claim 10 wherein the at least one feature vector comprises a ratio metric calculated by taking the ratio of the count at least one type of HTML tag to the total number of HTML tags in a document.
  - 12. The device of claim 10 wherein the at least one feature vector comprises at least one of either:
    - an image element;
      
      a link element;
      
      a text element;
      
      or a ratio metric.
  - 13. The device of claim 10 wherein the classification identifier comprises a currency identifier.
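
The second stage of claim 10 (features extracted from the snippet, then a classifier trained beforehand on labeled documents) might look like the sketch below. Claim 1 names a support vector machine; to stay dependency-free this example swaps in a simple perceptron, which is a different trained classifier. The two snippet features and the four labeled training snippets are invented for illustration.

```python
import re


def snippet_features(snippet):
    """Hypothetical two-element feature vector built from a snippet."""
    has_decimal_amount = 1.0 if re.search(r"\d+\.\d{2}\b", snippet) else 0.0
    has_magnitude_word = (
        1.0 if re.search(r"\b(million|billion)\b", snippet, re.I) else 0.0
    )
    return [has_decimal_amount, has_magnitude_word]


def train_perceptron(samples, labels, epochs=20):
    """Train a perceptron (stand-in for the SVM); labels are +1 or -1."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified: nudge toward the label
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
    return weights, bias


def second_classifier(model, features):
    """Place a third-group document into the first or second group."""
    weights, bias = model
    score = sum(w * xi for w, xi in zip(weights, features)) + bias
    return "first" if score > 0 else "second"


# Labeled training snippets (invented), prepared before classification.
training = [
    ("$19.99 each", 1),
    ("$5 million in funding", -1),
    ("only $4.50 shipped", 1),
    ("raised $2 billion", -1),
]
model = train_perceptron(
    [snippet_features(text) for text, _ in training],
    [label for _, label in training],
)
```

With a model trained this way, `second_classifier(model, snippet_features(snippet))` resolves each undecided document into one of the two final groups, so the outcome depends on the training documents as the claim requires.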


Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Ma, Wei-Ying; Nie, Zaiqing; Wen, Ji-Rong; Jiang, Renkuan; Sun, Yan-Feng
Primary Examiner(s): Chace, Christian P.
Assistant Examiner(s): Vu, Bai D
Application Number: US11/276,818
Publication Number: US 20070112756A1
Time in Patent Office: 1,147 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06F 16/353   into predefined classes
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99937   Sorting
