GENERATION OF CLASSIFICATION DATA USED FOR CLASSIFYING DOCUMENTS

US 20170351688A1
Filed: 06/07/2016
Published: 12/07/2017
Est. Priority Date: 06/07/2016
Status: Active Grant

Abstract

Systems and methods are provided for generating classification data which is used for classifying documents. The method includes reading documents in a form of a spreadsheet; collecting cell values in each of the documents; finding one or more common cell values among the collected values; counting, for each of the common cell values, a number of the documents having the common cell value; storing, if the number of the documents is equal to or larger than a predetermined number, the common cell value as a candidate header label in a memory; calculating a distance between cell locations of the candidate header labels in each of the documents; choosing, according to the calculated distance, two or more candidate header labels among the candidate header labels for each of the documents; and storing one or more combinations of the chosen two or more candidate header labels as the classification data.
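
The first steps of the abstract (collecting cell values, counting the documents that share a value at a common or near location, and keeping values that reach a predetermined number) can be illustrated with a minimal sketch. The representation of a document as a mapping from `(row, column)` to a cell value, the function name `candidate_header_labels`, and the `radius` parameter used to model "near" locations are assumptions made for illustration; the patent does not prescribe them.

```python
from collections import defaultdict

# Hypothetical representation: one document = {(row, col): cell value}.
Doc = dict[tuple[int, int], str]


def candidate_header_labels(docs: list[Doc], min_docs: int, radius: int = 0) -> dict[str, int]:
    """Return {value: document count} for values found at a common or near
    location in at least `min_docs` documents.

    `radius` = 0 requires the exact same cell; a larger radius also accepts
    cells up to `radius` rows/columns away (an assumed notion of "near").
    """
    seen = defaultdict(set)  # (value, anchor location) -> indices of documents
    for i, doc in enumerate(docs):
        for (r, c), value in doc.items():
            if not value:
                continue  # skip empty cells
            # Register the value at every anchor location within the radius.
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    seen[(value, (r + dr, c + dc))].add(i)

    counts: dict[str, int] = {}
    for (value, _anchor), doc_ids in seen.items():
        counts[value] = max(counts.get(value, 0), len(doc_ids))
    # Keep only values shared by the predetermined number of documents.
    return {v: n for v, n in counts.items() if n >= min_docs}
```

With `radius=0` only values at identical cell locations are counted together; raising it tolerates spreadsheets whose header row is shifted by a line or two.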

20 Claims

1. A computer-implemented method for generating classification data which is used for classifying documents, the method comprising:
- reading, in a memory, documents in a form of a spreadsheet;
  
  collecting cell values in each of the documents;
  
  finding, using a processor, in each of common or near cell locations among all or a part of the documents, one or more common cell values among the collected values;
  
  counting, using the processor, for each of the common cell values, a number of the documents having the common cell value;
  
  storing, if the number of the documents is equal to or larger than a predetermined number, the common cell value as a candidate header label in a memory;
  
  calculating, using the processor, a distance between cell locations of the candidate header labels in each of the documents;
  
  choosing, according to the calculated distance, two or more candidate header labels among the candidate header labels for each of the documents; and
  
  storing, in a storage, one or more combinations of the chosen two or more candidate header labels (hereinafter referred to as “header”) as the classification data.
  - 2. The method as recited in claim 1, further comprising:
    - counting, for each of the headers, the number of the documents having cell values corresponding to the header; and
      
      if the number of the documents is equal to or larger than a predetermined number, replacing the classification data with the header.
  - 3. The method as recited in claim 1, further comprising:
    - calculating a similarity between or among the headers;
      
      choosing, based on the similarity, two or more headers among the headers in each of the documents; and
      
      replacing the classification data with a combination of the chosen two or more headers (hereinafter referred to as “header pattern”).
  - 4. The method as recited in claim 1, further comprising:
    - counting, for each of the headers, the number of the documents having cell values corresponding to the header;
      
      if the number of the documents is equal to or larger than a predetermined number, choosing one or more candidate header labels;
      
      if the number of chosen headers is plural in each of the documents, calculating similarity between or among the headers;
      
      choosing, based on the similarity, two or more headers among the plural headers; and
      
      replacing the classification data with a combination of the chosen two or more headers.
  - 5. The method as recited in claim 1, wherein the near cell locations are cell locations in a row, cell locations in a column, or cell locations in any row and column.
  - 6. The method as recited in claim 1, wherein the distance between cell locations is a distance between cells in a row, a distance between cells in a column, or a distance between cells in any row and column.
  - 7. The method as recited in claim 3, wherein the choice of the two or more candidate header labels is carried out if the similarity is equal to or larger than a predetermined value.
  - 8. The method as recited in claim 3, wherein the similarity is calculated with cosine similarity or edit distance.
  - 9. The method as recited in claim 3, further comprising:
    - finding out an overlapping relation of headers, between or among the header patterns;
      
      setting, after finding out the overlapping relation, a part having the overlapping relation to a new header pattern (hereinafter referred to as “a major header pattern”); and
      
      setting, after finding out the overlapping relation, a header pattern having the part to a derived header pattern for the major header part.
  - 10. The method as recited in claim 9, further comprising:
    - calculating a ratio of the number of documents in which the major header pattern is comprised and the number of documents in which the derived header pattern is comprised; and
      
      if the ratio is equal to or larger than a predetermined value, replacing the derived header pattern with the major header pattern.

11. A system for generating classification data which is used for classifying documents, comprising:
- a memory; and
  
  a processor configured to:
  
  read, in the memory, documents in a form of a spreadsheet and collect cell values in each of the documents;
  
  find, in each of common or near cell locations among all or a part of the documents, one or more common cell values among the collected values;
  
  count, for each of the common cell values, the number of the documents having the common cell value;
  
  store, if the number of the documents is equal to or larger than a predetermined number, the common cell value as a candidate header label in a memory;
  
  calculate a distance between cell locations of the candidate header labels in each of the documents;
  
  choose, according to the calculated distance, two or more candidate header labels among the candidate header labels for each of the documents; and
  
  store one or more combinations of the chosen two or more candidate header labels (hereinafter referred to as “header”) as the classification data in a storage.
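
The distance-based choice recited in claims 1, 6 and 11 can be sketched for the simplest case, a distance between cells in a row. The sketch below groups candidate labels that lie on the same row with at most `max_gap` columns between neighbours and treats each group of two or more as a header. The function name `headers_in_document`, the `max_gap` parameter and the `{(row, col): value}` document representation are illustrative assumptions; column-wise or two-dimensional distances would need a different grouping rule.

```python
def headers_in_document(doc, candidates, max_gap=1):
    """Group candidate header labels by row-wise distance.

    doc:        {(row, col): cell value} (hypothetical representation)
    candidates: set of candidate header labels
    max_gap:    largest column distance allowed between neighbouring labels
    Returns a list of headers, each a tuple of two or more labels.
    """
    by_row = {}
    for (r, c), value in doc.items():
        if value in candidates:
            by_row.setdefault(r, []).append((c, value))

    headers = []
    for r in sorted(by_row):
        cells = sorted(by_row[r])  # order labels by column
        run = [cells[0]]
        for prev, cur in zip(cells, cells[1:]):
            if cur[0] - prev[0] <= max_gap:
                run.append(cur)  # close enough: same header
            else:
                if len(run) >= 2:
                    headers.append(tuple(v for _, v in run))
                run = [cur]  # too far: start a new group
        if len(run) >= 2:
            headers.append(tuple(v for _, v in run))
    return headers
```

A lone candidate label with no neighbour inside `max_gap` is dropped, which mirrors the requirement that a header combine two or more candidate header labels.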

12. The system as recited in claim 11, the processor being further configured to:
- count, for each of the headers, the number of the documents having cell values corresponding to the header; and
  
  replace, if the number of the documents is equal to or larger than a predetermined number, the classification data with the header.

13. The system as recited in claim 11, the processor being further configured to:
- calculate a similarity between or among the headers;
  
  choose, based on the similarity, two or more headers among the headers in each of the documents; and
  
  replace the classification data with a combination of the chosen two or more headers.
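
Claim 8 names cosine similarity and edit distance as ways of computing the similarity between headers. As a rough illustration, assuming each header is a tuple of label strings, the two measures could be computed as follows. The function names and the choice to treat a header as a bag of labels for the cosine measure are assumptions, not details taken from the patent.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two headers viewed as bags of labels."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[k] * cb[k] for k in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def edit_distance(a, b):
    """Levenshtein distance between two sequences (labels or characters)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]
```

Note the opposite orientation of the two measures: a higher cosine value means more similar headers, while a lower edit distance does, so a threshold test such as the one in claim 7 would need to be inverted or normalised when edit distance is used.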

14. The system as recited in claim 11, the processor being further configured to:
- count, for each of the headers, the number of the documents having cell values corresponding to the header;
  
  choose, if the number of the documents is equal to or larger than a predetermined number, one or more headers;
  
  calculate, if the number of chosen headers is plural in each of the documents, a similarity between or among the headers;
  
  choose, based on the similarity, two or more headers among the plural headers; and
  
  replace the classification data with a combination of the chosen two or more headers.

15. A non-transitory computer readable storage medium comprising a computer readable program for generating classification data which is used for classifying documents, wherein the computer readable program when executed on a computer causes the computer to perform the steps of:
- reading, in a memory, documents in a form of a spreadsheet and collecting cell values in each of the documents;
  
  finding, in each of common or near cell locations among all or a part of the documents, one or more common cell values among the collected values;
  
  counting, for each of the common cell values, a number of the documents having the common cell value;
  
  storing, if the number of the documents is equal to or larger than a predetermined number, the common cell value as a candidate header label in a memory;
  
  calculating a distance between cell locations of the candidate header labels in each of the documents;
  
  choosing, according to the calculated distance, two or more candidate header labels among the candidate header labels for each of the documents; and
  
  storing one or more combinations of the chosen two or more candidate header labels (hereinafter referred to as “header”) as the classification data in a storage.
  - 16. The non-transitory computer readable storage medium as recited in claim 15, wherein the computer readable program when executed on the computer causes the computer to further perform the steps of:
    - counting, for each of the headers, the number of the documents having cell values corresponding to the header; and
      
      replacing, if the number of the documents is equal to or larger than a predetermined number, the classification data with the header.
  - 17. The non-transitory computer readable storage medium as recited in claim 15, wherein the computer readable program when executed on the computer causes the computer to further perform the steps of:
    - calculating similarity between or among the headers;
      
      choosing, based on the similarity, two or more headers among the headers in each of the documents; and
      
      replacing the classification data with a combination of the chosen two or more headers (hereinafter referred to as “header pattern”).
  - 18. The non-transitory computer readable storage medium as recited in claim 15, wherein the computer readable program when executed on the computer causes the computer to further perform the steps of:
    - counting, for each of the headers, the number of the documents having cell values corresponding to the header;
      
      choosing, if the number of the documents is equal to or larger than a predetermined number, one or more headers;
      
      calculating, if the number of chosen headers is plural in each of the documents, a similarity between or among the headers;
      
      choosing, based on the similarity, two or more headers among the plural headers; and
      
      replacing the classification data with a combination of the chosen two or more headers.
  - 19. The non-transitory computer readable storage medium as recited in claim 17, wherein the computer readable program when executed on the computer causes the computer to further perform the steps of:
    - finding out an overlapping relation of headers between or among the header patterns;
      
      setting, after finding out the overlapping relation, a part having the overlapping relation to a new header pattern (hereinafter referred to as “a major header pattern”); and
      
      setting, after finding out the overlapping relation, a header pattern having the part to a derived header pattern for the major header part.
  - 20. The non-transitory computer readable storage medium as recited in claim 19, wherein the computer readable program when executed on the computer causes the computer to further perform the steps of:
    - calculating a ratio of the number of documents in which the major header pattern is comprised and the number of documents in which the derived header pattern is comprised; and
      
      replacing, if the ratio is equal to or larger than a predetermined value, the derived header pattern with the major header pattern.
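
The major/derived header pattern steps of claims 9, 10, 19 and 20 leave some room for interpretation, so the following is only one possible reading. It assumes that each header pattern is a `frozenset` of header identifiers, that the overlapping part of two patterns is their set intersection, and that the documents "comprising" the major header pattern are all documents whose pattern contains that overlapping part. The function name `merge_derived_patterns` and the `min_ratio` parameter are hypothetical.

```python
from itertools import combinations


def merge_derived_patterns(doc_counts, min_ratio):
    """Map derived header patterns to the major pattern that replaces them.

    doc_counts: {frozenset of headers: number of documents (> 0)}
    min_ratio:  predetermined value for (major docs / derived docs)
    """
    replacements = {}
    for p, q in combinations(doc_counts, 2):
        major = p & q  # overlapping part of the two patterns
        if not major:
            continue  # no overlapping relation
        # Assumed reading: count every document whose pattern contains the overlap.
        major_docs = sum(n for pat, n in doc_counts.items() if major <= pat)
        for derived in (p, q):
            if derived != major and major_docs / doc_counts[derived] >= min_ratio:
                replacements[derived] = major
    return replacements
```

Under this reading a rarely seen pattern that merely extends a common one is folded back into the common pattern, which keeps the stored classification data small.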

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Yasue, Toshiaki
Granted Patent: US 10,318,568 B2
CPC Class Codes

G06F 16/355   Class or cluster creation o...

G06F 40/18   of spreadsheets form-fillin...

G06F 40/258   Heading extraction; Automat...
