GENERATION OF CLASSIFICATION DATA USED FOR CLASSIFYING DOCUMENTS
First Claim
1. A computer-implemented method for generating classification data which is used for classifying documents, the method comprising:
reading, in a memory, documents in a form of a spreadsheet;
collecting cell values in each of the documents;
finding, using a processor, in each of common or near cell locations among all or a part of the documents, one or more common cell values among the collected values;
counting, using the processor, for each of the common cell values, a number of the documents having the common cell value;
storing, if the number of the documents is equal to or larger than a predetermined number, the common cell value as a candidate header label in a memory;
calculating, using the processor, a distance between cell locations of the candidate header labels in each of the documents;
choosing, according to the calculated distance, two or more candidate header labels among the candidate header labels for each of the documents; and
storing, in a storage, one or more combinations of the chosen two or more candidate header labels (hereinafter referred to as “header”) as the classification data.
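The first steps of the claim (collecting cell values, finding values shared at common or near cell locations, counting documents, and keeping values that meet the predetermined number) can be illustrated with a minimal Python sketch. The data layout (each spreadsheet as a dict from `(row, col)` to a string), the function name `candidate_header_labels`, the `tolerance` parameter, and the reading of "near" as a row and column offset of at most `tolerance` are all assumptions made for illustration; the claim does not prescribe them.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Cell = Tuple[int, int]   # (row, col); hypothetical representation
Sheet = Dict[Cell, str]  # one spreadsheet document


def candidate_header_labels(sheets: List[Sheet], min_docs: int,
                            tolerance: int = 1) -> Dict[str, int]:
    """Return {cell value: document count} for values found at common or
    near locations in at least `min_docs` documents (illustrative only)."""
    # Collect, for every cell value, where it occurs in each document.
    occurrences = defaultdict(lambda: defaultdict(list))
    for doc_id, sheet in enumerate(sheets):
        for loc, value in sheet.items():
            occurrences[value][doc_id].append(loc)

    candidates = {}
    for value, by_doc in occurrences.items():
        best = 0
        # Try each observed location as a reference point and count the
        # documents that hold the same value at or near that location.
        for locs in by_doc.values():
            for r, c in locs:
                n_docs = sum(
                    any(abs(r - r2) <= tolerance and abs(c - c2) <= tolerance
                        for r2, c2 in other_locs)
                    for other_locs in by_doc.values()
                )
                best = max(best, n_docs)
        # Keep the value only if enough documents share it.
        if best >= min_docs:
            candidates[value] = best
    return candidates
```

Under these assumptions, variable data such as invoice numbers drops out because it rarely repeats across documents, while labels such as "Date" survive the threshold.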
Abstract
Systems and methods are provided for generating classification data which is used for classifying documents. The method includes reading documents in a form of a spreadsheet; collecting cell values in each of the documents; finding one or more common cell values among the collected values; counting, for each of the common cell values, a number of the documents having the common cell value; storing, if the number of the documents is equal to or larger than a predetermined number, the common cell value as a candidate header label in a memory; calculating a distance between cell locations of the candidate header labels in each of the documents; choosing, according to the calculated distance, two or more candidate header labels among the candidate header labels for each of the documents; and storing one or more combinations of the chosen two or more candidate header labels as the classification data.
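The distance and selection steps summarized above can be sketched as follows. This is a hedged illustration, not the patented procedure: it assumes Manhattan distance between cell locations, a hypothetical `max_distance` threshold, and a simple grouping in which labels linked by a chain of close pairs form one header. The function name `headers_for_document` is likewise invented for the example.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def headers_for_document(label_locations: Dict[str, Tuple[int, int]],
                         max_distance: int) -> Set[Tuple[str, ...]]:
    """Group the candidate header labels of one document into headers.

    `label_locations` maps each candidate label to its (row, col) in the
    document. Labels whose Manhattan distance is at most `max_distance`
    are linked; each linked group of two or more labels is one header.
    """
    labels = sorted(label_locations)
    parent = {label: label for label in labels}

    def find(x: str) -> str:
        # Union-find lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(labels, 2):
        (r1, c1), (r2, c2) = label_locations[a], label_locations[b]
        if abs(r1 - r2) + abs(c1 - c2) <= max_distance:
            parent[find(a)] = find(b)

    groups: Dict[str, list] = {}
    for label in labels:
        groups.setdefault(find(label), []).append(label)
    # A header needs two or more labels; isolated labels are dropped.
    return {tuple(group) for group in groups.values() if len(group) >= 2}
```

Collecting the returned tuples over all documents would give one possible form of the stored classification data.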
5 Citations
20 Claims
1. A computer-implemented method for generating classification data which is used for classifying documents, the method comprising:
reading, in a memory, documents in a form of a spreadsheet;
collecting cell values in each of the documents;
finding, using a processor, in each of common or near cell locations among all or a part of the documents, one or more common cell values among the collected values;
counting, using the processor, for each of the common cell values, a number of the documents having the common cell value;
storing, if the number of the documents is equal to or larger than a predetermined number, the common cell value as a candidate header label in a memory;
calculating, using the processor, a distance between cell locations of the candidate header labels in each of the documents;
choosing, according to the calculated distance, two or more candidate header labels among the candidate header labels for each of the documents; and
storing, in a storage, one or more combinations of the chosen two or more candidate header labels (hereinafter referred to as “header”) as the classification data.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
12. The system as recited in claim 11, the processor being further configured to:
count, for each of the headers, the number of the documents having cell values corresponding to the header; and
replace, if the number of the documents is equal to or larger than a predetermined number, the classification data with the header.
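A minimal sketch of the counting step in claim 12 follows. It assumes a header is a tuple of label strings and that a document "has cell values corresponding to the header" when every label of the header appears among its cell values; both the interpretation and the name `filter_headers_by_support` are assumptions, not taken from the claim.

```python
from typing import Dict, List, Tuple

Sheet = Dict[Tuple[int, int], str]  # (row, col) -> cell value; hypothetical
Header = Tuple[str, ...]            # a combination of header labels


def filter_headers_by_support(headers: List[Header], sheets: List[Sheet],
                              min_docs: int) -> List[Header]:
    """Keep only headers whose labels all occur in at least `min_docs` documents."""
    kept = []
    for header in headers:
        # Count documents that contain every label of this header.
        support = sum(1 for sheet in sheets
                      if set(header) <= set(sheet.values()))
        if support >= min_docs:
            kept.append(header)
    return kept
```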
13. The system as recited in claim 11, the processor being further configured to:
calculate a similarity between or among the headers;
choose, based on the similarity, two or more headers among the headers in each of the documents; and
replace the classification data with a combination of the chosen two or more headers.
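The claim does not name a similarity measure. As one possible illustration, the sketch below uses Jaccard similarity over the label sets of two headers and selects pairs that reach a hypothetical `min_similarity` threshold. The measure, the threshold, and the function names are all assumptions chosen for the example.

```python
from itertools import combinations
from typing import List, Tuple

Header = Tuple[str, ...]  # a combination of header labels


def jaccard(a: Header, b: Header) -> float:
    """Jaccard similarity of two headers, treated as sets of labels."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def similar_header_pairs(headers: List[Header],
                         min_similarity: float) -> List[Tuple[Header, Header]]:
    """Return pairs of headers whose similarity meets the threshold."""
    return [(a, b) for a, b in combinations(headers, 2)
            if jaccard(a, b) >= min_similarity]
```

Each returned pair is one candidate "combination of the chosen two or more headers"; how such a combination is stored is left open here.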
14. The system as recited in claim 11, the processor being further configured to:
count, for each of the headers, the number of the documents having cell values corresponding to the header;
choose, if the number of the documents is equal to or larger than a predetermined number, one or more headers;
calculate, if the number of chosen headers is plural in each of the documents, a similarity between or among the headers;
choose, based on the similarity, two or more headers among the plural headers; and
replace the classification data with a combination of the chosen two or more headers.
15. A non-transitory computer readable storage medium comprising a computer readable program for generating classification data which is used for classifying documents, wherein the computer readable program when executed on a computer causes the computer to perform the steps of:
reading, in a memory, documents in a form of a spreadsheet and collecting cell values in each of the documents;
finding, in each of common or near cell locations among all or a part of the documents, one or more common cell values among the collected values;
counting, for each of the common cell values, a number of the documents having the common cell value;
storing, if the number of the documents is equal to or larger than a predetermined number, the common cell value as a candidate header label in a memory;
calculating a distance between cell locations of the candidate header labels in each of the documents;
choosing, according to the calculated distance, two or more candidate header labels among the candidate header labels for each of the documents; and
storing one or more combinations of the chosen two or more candidate header labels (hereinafter referred to as “header”) as the classification data in a storage.
Dependent claims: 16, 17, 18, 19, 20
Specification