Method and system for classifying semi-structured documents

US 6,606,620 B1
Filed: 07/24/2000
Issued: 08/12/2003
Est. Priority Date: 07/24/2000
Status: Expired due to Fees

Abstract

A classifier for semi-structured documents and an associated method dynamically and accurately classify documents with an implicit or explicit schema by taking advantage of the term-frequency and term-distribution information inherent in the document. The system uses a structured vector model that allows like terms to be grouped together and dissimilar terms to be segregated based on their frequency and distribution within the sub-vectors of the structured vector, thus achieving context sensitivity. The final decision for assigning the class of a document is based on a mathematical comparison of the similarity of the terms in the structured vector to those of the various class models. The classifier of the present invention is capable of both learning and testing. In the learning phase, the classifier develops models for classes from the composite information gleaned from numerous training documents. Specifically, it develops a structured vector model for each training document. Then, within a given class of documents, it adds and then normalizes the occurrences of terms.
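
The structured vector described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `structured_vector`, the path-keyed dictionary layout, the regular-expression tokenizer, and the sample document are all assumptions made for illustration. Each root-to-node path gets its own term counter, so the same word under different elements is counted separately.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def structured_vector(xml_text):
    """Map each root-to-node path to a Counter of the terms found at that node."""
    root = ET.fromstring(xml_text)
    vector = defaultdict(Counter)

    def visit(node, path):
        here = path + (node.tag,)
        # Collect only the text that belongs directly to this element:
        # its leading text plus the tail text following each child.
        text = node.text or ""
        for child in node:
            visit(child, here)
            text += " " + (child.tail or "")
        # Hypothetical tokenizer: lowercase alphanumeric runs.
        terms = re.findall(r"[a-z0-9]+", text.lower())
        if terms:
            vector["/".join(here)].update(terms)

    visit(root, ())
    return dict(vector)


# Hypothetical sample document, not taken from the patent.
doc = "<resume><name>Ada Lovelace</name><skills>java xml java</skills></resume>"
sv = structured_vector(doc)
```

Keying the counters by full path is one simple way to keep the hierarchy; a nested tree of sub-vectors would carry the same information.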

26 Claims

1. A classifier, for use on a computer readable medium, for dynamically classifying a semi-structured document with a schema, comprising:
- a vectorization module for parsing the document into a structured vector model, wherein the structured vector model is divided into a tree of sub-vectors to reflect a plurality of hierarchical levels beginning with a root and ending with a plurality of leaves;
  
  a sorting module for searching the document and for counting the occurrences of individual terms in the document;
  
  the sorting module further accounting for the frequency of occurrence of the terms at each hierarchical level to achieve a high contextual sensitivity;
  
  a testing module for assigning a class to the document by using a statistical model based on probability calculation to create a classification model; and
  
  wherein the classification model assigns a class to the document that maximizes a posteriori class probability Pr[c|d,F_k] according to the following expression;
  
  $\Pr[c \mid d, F_{k}] = \frac{\pi(c) \prod_{t \in p_{d}(i, j),\, t \in d \cap F_{k}(e_{d}(i, j))} f(c, p_{d}(i, j), t)^{n(d, p_{d}(i, j), t)}}{\sum_{c'} \pi(c') \prod_{t \in p_{d}(i, j),\, t \in d \cap F_{k}(e_{d}(i, j))} f(c', p_{d}(i, j), t)^{n(d, p_{d}(i, j), t)}}$
  
  where d is the document; π(c) is a prior distribution on the class c; c′ is a class in a set of documents; p_d is a path to a structure node e_d from a root; n is a number of occurrences of term t in p_d; f is a maximum likelihood estimation; F_k is a set of selected terms; and F is a Fisher index defined by the following equation, where c₁ and c₂ are children of an internal class c₀, and μ is an average number of occurrences of term t in class c:
  
  $F(t) = \frac{\sum_{c_{1}, c_{2}} (\mu(c_{1}, t) - \mu(c_{2}, t))^{2}}{\sum_{c} \frac{1}{|c|} \sum_{d \in c} (f(t, d, c) - \mu(c, t))^{2}}.$
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
- - 2. The classifier according to claim 1, wherein the sorting module stores the frequency of occurrence of the terms in separate histogram bins.
  - 3. The classifier according to claim 1, further including a modeling module that uses a statistical model to create a classification model.
  - 4. The classifier according to claim 3, wherein the testing module uses the classification model created by the modeling module to assign the class based on probability calculation.
  - 5. The classifier according to claim 3, wherein the modeling module normalizes the frequency of occurrence of the terms at each hierarchical level.
  - 6. The classifier according to claim 1, wherein the document is an XML document.
  - 7. The classifier according to claim 1, further including a training module for classifying documents with known class labels and for developing structured vector models therefrom.
  - 8. The classifier according to claim 7, wherein the testing module classifies documents with unknown class labels, based on the class label structured vector models developed by the training module.
  - 9. The classifier according to claim 1, wherein the structured vector model uses structured information embedded in the document schema and text content to develop the structured vector model.
  - 10. The classifier according to claim 1, wherein the leaves include textual terms.
  - 11. The classifier according to claim 10, wherein the leaves consist exclusively of textual terms.
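
The posterior expression in claim 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `posterior`, the nested-dictionary layout, the `floor` value used for unseen terms, and the toy numbers are all hypothetical and do not come from the patent. The computation is done in log space to avoid underflow, then normalized over all classes c′.

```python
import math


def posterior(doc_vector, priors, likelihood, selected=None, floor=1e-9):
    """Return Pr[c | d, F_k] for every class c.

    doc_vector: {path: {term: count}}              -> n(d, p_d, t)
    priors:     {class: pi(c)}
    likelihood: {class: {path: {term: f(c, p_d, t)}}}
    selected:   optional {path: set_of_terms}      -> F_k per structure node
    floor:      stand-in probability for unseen terms (an assumption)
    """
    log_scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for path, counts in doc_vector.items():
            for term, n in counts.items():
                # Skip terms that are not in the selected set for this path.
                if selected is not None and term not in selected.get(path, ()):
                    continue
                f = likelihood.get(c, {}).get(path, {}).get(term, floor)
                score += n * math.log(f)
        log_scores[c] = score
    # Normalize with the log-sum-exp trick (the denominator of the expression).
    top = max(log_scores.values())
    total = sum(math.exp(s - top) for s in log_scores.values())
    return {c: math.exp(s - top) / total for c, s in log_scores.items()}


# Toy model with made-up numbers.
priors = {"resume": 0.5, "article": 0.5}
likelihood = {
    "resume": {"doc/skills": {"java": 0.6, "xml": 0.4}},
    "article": {"doc/skills": {"java": 0.2, "xml": 0.8}},
}
doc_vector = {"doc/skills": {"java": 2, "xml": 1}}
result = posterior(doc_vector, priors, likelihood)
best = max(result, key=result.get)  # class with the highest posterior
```

Because the likelihood is looked up by both path and term, the same term contributes differently depending on where it occurs in the document tree.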

12. A software program product for dynamically classifying a semi-structured document with a schema, comprising:
- a vectorization module for parsing the document into a structured vector model, wherein the structured vector model is divided into a tree of sub-vectors to reflect a plurality of hierarchical levels beginning with a root and ending with a plurality of leaves;
  
  a sorting module for searching the document and for counting the occurrences of individual terms in the document;
  
  the sorting module further accounting for the frequency of occurrence of the terms at each hierarchical level to achieve a high contextual sensitivity;
  
  a testing module for assigning a class to the document by using a statistical model based on probability calculation to create a classification model; and
  
  wherein the classification model assigns a class to the document that maximizes a posteriori class probability Pr[c|d,F_k] according to the following expression;
  
  $\Pr[c \mid d, F_{k}] = \frac{\pi(c) \prod_{t \in p_{d}(i, j),\, t \in d \cap F_{k}(e_{d}(i, j))} f(c, p_{d}(i, j), t)^{n(d, p_{d}(i, j), t)}}{\sum_{c'} \pi(c') \prod_{t \in p_{d}(i, j),\, t \in d \cap F_{k}(e_{d}(i, j))} f(c', p_{d}(i, j), t)^{n(d, p_{d}(i, j), t)}}$
  
  where d is the document; π(c) is a prior distribution on the class c; c′ is a class in a set of documents; p_d is a path to a structure node e_d from a root; n is a number of occurrences of term t in p_d; f is a maximum likelihood estimation; F_k is a set of selected terms; and F is a Fisher index defined by the following equation, where c₁ and c₂ are children of an internal class c₀, and μ is an average number of occurrences of term t in class c:
  
  $F(t) = \frac{\sum_{c_{1}, c_{2}} (\mu(c_{1}, t) - \mu(c_{2}, t))^{2}}{\sum_{c} \frac{1}{|c|} \sum_{d \in c} (f(t, d, c) - \mu(c, t))^{2}}.$
- View Dependent Claims (13, 14, 15, 16, 17, 18, 19, 20)
- - 13. The software program product according to claim 12, wherein the sorting module stores the frequency of occurrence of the terms in separate histogram bins.
  - 14. The software program product according to claim 12, further including a modeling module that uses a statistical model to create a classification model.
  - 15. The software program product according to claim 14, wherein the testing module uses the classification model created by the modeling module to assign the class based on probability calculation.
  - 16. The software program product according to claim 14, wherein the modeling module normalizes the frequency of occurrence of the terms at each hierarchical level.
  - 17. The software program product according to claim 12, wherein the document is an XML document.
  - 18. The software program product according to claim 12, further including a training module for classifying documents with known class labels and for developing structured vector models therefrom.
  - 19. The software program product according to claim 18, wherein the testing module classifies documents with unknown class labels, based on the class label structured vector models developed by the training module.
  - 20. The software program product according to claim 12, wherein the structured vector model uses structured information embedded in the document schema and text content to develop the structured vector model.
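
The Fisher index used for term selection in the claims can be illustrated with a short Python sketch. The function name `fisher_index`, the input layout, and the sample frequencies are assumptions for illustration only. The numerator sums squared differences of class means over unordered class pairs (summing over ordered pairs would only double every score and leave the ranking unchanged); the denominator sums the per-class average squared deviation.

```python
from itertools import combinations


def fisher_index(freqs_by_class):
    """Score one term t.

    freqs_by_class: {class: [f(t, d, c) for each training document d in c]}
    """
    # mu(c, t): average frequency of the term in each class.
    mu = {c: sum(v) / len(v) for c, v in freqs_by_class.items()}
    # Between-class scatter over unordered class pairs (c1, c2).
    between = sum((mu[a] - mu[b]) ** 2 for a, b in combinations(mu, 2))
    # Within-class scatter: (1 / |c|) * sum of squared deviations, summed over c.
    within = sum(
        sum((x - mu[c]) ** 2 for x in v) / len(v)
        for c, v in freqs_by_class.items()
    )
    return between / within if within > 0 else float("inf")


# Made-up per-document frequencies for two terms in two sibling classes.
score_java = fisher_index({"resume": [0.1, 0.3], "article": [0.6, 0.8]})
score_the = fisher_index({"resume": [0.4, 0.6], "article": [0.5, 0.5]})
```

A term whose mean frequency differs sharply between classes, relative to its spread within each class, scores high and is a candidate for the selected set F_k; a term with equal class means scores zero.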

21. A method for dynamically classifying a semi-structured document, comprising:
- parsing the document into a structured vector model;
  
  dividing the structured vector model into a tree of sub-vectors to reflect a plurality of hierarchical levels beginning with a root and ending with a plurality of leaves;
  
  searching the document and counting the occurrences of individual terms in the document;
  
  accounting for the frequency of occurrence of the terms at each hierarchical level to achieve a high contextual sensitivity;
  
  assigning a class to the document by using a statistical model based on probability calculation to create a classification model; and
  
  wherein the classification model assigns a class to the document that maximizes a posteriori class probability Pr[c|d,F_k] according to the following expression;
  
  $\Pr[c \mid d, F_{k}] = \frac{\pi(c) \prod_{t \in p_{d}(i, j),\, t \in d \cap F_{k}(e_{d}(i, j))} f(c, p_{d}(i, j), t)^{n(d, p_{d}(i, j), t)}}{\sum_{c'} \pi(c') \prod_{t \in p_{d}(i, j),\, t \in d \cap F_{k}(e_{d}(i, j))} f(c', p_{d}(i, j), t)^{n(d, p_{d}(i, j), t)}}$
  
  where d is the document; π(c) is a prior distribution on the class c; c′ is a class in a set of documents; p_d is a path to a structure node e_d from a root; n is a number of occurrences of term t in p_d; f is a maximum likelihood estimation; F_k is a set of selected terms; and F is a Fisher index defined by the following equation, where c₁ and c₂ are children of an internal class c₀, and μ is an average number of occurrences of term t in class c:
  
  $F(t) = \frac{\sum_{c_{1}, c_{2}} (\mu(c_{1}, t) - \mu(c_{2}, t))^{2}}{\sum_{c} \frac{1}{|c|} \sum_{d \in c} (f(t, d, c) - \mu(c, t))^{2}}.$
- View Dependent Claims (22, 23, 24)
- - 22. The method according to claim 21, wherein counting the occurrences includes storing the frequency of occurrence of the terms in separate histogram bins.
  - 23. The method according to claim 21, wherein accounting for the frequency of occurrence of the terms includes storing the frequency of occurrence of the terms in separate histogram bins.
  - 24. The method according to claim 21, further including normalizing the frequency of occurrence of the terms at each hierarchical level.
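
The counting and normalizing steps in claims 22 to 24 can be sketched as a small training routine. This is a hedged illustration, not the patented method: the function name `train`, the dictionary-of-counters "histogram bins", the use of plain relative frequency as the maximum likelihood estimate, and the sample data are all assumptions. Counts are summed per class and per path, then normalized within each path.

```python
from collections import Counter, defaultdict


def train(labelled_vectors):
    """Build priors and per-path term likelihoods.

    labelled_vectors: iterable of (class_label, {path: {term: count}})
    """
    # One histogram (Counter) per class and per path.
    totals = defaultdict(lambda: defaultdict(Counter))
    class_docs = Counter()
    for label, vector in labelled_vectors:
        class_docs[label] += 1
        for path, counts in vector.items():
            totals[label][path].update(counts)  # add occurrences

    n_docs = sum(class_docs.values())
    priors = {c: k / n_docs for c, k in class_docs.items()}
    # Normalize each histogram so the term frequencies at a path sum to 1.
    likelihood = {
        c: {
            path: {t: n / sum(cnt.values()) for t, n in cnt.items()}
            for path, cnt in paths.items()
        }
        for c, paths in totals.items()
    }
    return priors, likelihood


# Made-up training documents, already parsed into path-keyed term counts.
training = [
    ("resume", {"doc/skills": {"java": 2, "xml": 1}}),
    ("resume", {"doc/skills": {"java": 1}}),
    ("article", {"doc/body": {"xml": 4}}),
]
priors, likelihood = train(training)
```

Normalizing per path rather than per document keeps the frequencies at each hierarchical level comparable across classes with different document lengths. A production version would likely add smoothing for unseen terms, which this sketch omits.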

25. A method for dynamically classifying a semi-structured document, comprising:
- parsing the document into a structured vector;
  
  organizing the structured vector into a tree comprised of any of sub-vectors or structured vectors, to reflect a plurality of hierarchical levels in the document, beginning with a root and ending with a plurality of leaves;
  
  searching the document and counting the occurrences of individual terms in the document;
  
  accounting for the frequency of occurrence of the terms at each hierarchical level to achieve a high contextual sensitivity;
  
  assigning a class to the document based on both term frequency and term distribution information and structure within the structured vector of the document, by using a statistical model based on probability calculation to create a classification model; and
  
  wherein the classification model assigns a class to the document that maximizes a posteriori class probability Pr[c|d,F_k] according to the following expression;
  
  $\Pr[c \mid d, F_{k}] = \frac{\pi(c) \prod_{t \in p_{d}(i, j),\, t \in d \cap F_{k}(e_{d}(i, j))} f(c, p_{d}(i, j), t)^{n(d, p_{d}(i, j), t)}}{\sum_{c'} \pi(c') \prod_{t \in p_{d}(i, j),\, t \in d \cap F_{k}(e_{d}(i, j))} f(c', p_{d}(i, j), t)^{n(d, p_{d}(i, j), t)}}$
  
  where d is the document; π(c) is a prior distribution on the class c; c′ is a class in a set of documents; p_d is a path to a structure node e_d from a root; n is a number of occurrences of term t in p_d; f is a maximum likelihood estimation; F_k is a set of selected terms; and F is a Fisher index defined by the following equation, where c₁ and c₂ are children of an internal class c₀, and μ is an average number of occurrences of term t in class c:
  
  $F(t) = \frac{\sum_{c_{1}, c_{2}} (\mu(c_{1}, t) - \mu(c_{2}, t))^{2}}{\sum_{c} \frac{1}{|c|} \sum_{d \in c} (f(t, d, c) - \mu(c, t))^{2}}.$
- View Dependent Claims (26)
- - 26. The method according to claim 25, wherein the leaves include textual terms.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: International Business Machines Corporation
Inventors: Yi, Jeonghee; Sundaresan, Neelakantan
Primary Examiner(s): Popovici, Dov
Assistant Examiner(s): Mahmoudi, Hassan
Application Number: US09/624,616
Time in Patent Office: 1,114 Days
Field of Search: 707/3, 707/5, 707/104.1, 707/500, 707/513, 707/2, 707/4, 707/205, 707/503, 370/503, 382/176, 382/173, 382/270, 709/217, 702/180
US Class Current: 1/1
CPC Class Codes:
G06F 16/30   of unstructured textual dat...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99945   Object-oriented database st...
