Document division method and system
First Claim
1. One or more tangible non-transitory computer-readable media storing instructions that, when executed by a processor, perform operations comprising:
receiving a first electronic document;
determining an entropy value for the first electronic document;
determining a first information gain value associated with a first line that divides the first electronic document into a first portion and a second portion, comprising;
a) determining an entropy value for the first portion of the first electronic document and an entropy value for the second portion of the first electronic document,
b) based on the entropy value for the first portion of the first electronic document and the entropy value for the second portion of the first electronic document, determining an entropy value associated with the first line, and
c) determining the first information gain value by determining a difference between i) the entropy value for the first electronic document and ii) the entropy value associated with the first line;
determining a second information gain value associated with a second line that divides the first electronic document into a third portion and a fourth portion, comprising;
a) determining an entropy value for the third portion of the first electronic document and an entropy value for the fourth portion of the first electronic document,
b) based on the entropy value for the third portion of the first electronic document and the entropy value for the fourth portion of the first electronic document, determining an entropy value associated with the second line, and
c) determining the second information gain value by determining a difference between i) the entropy value for the first electronic document and ii) the entropy value associated with the second line;
determining which of the first information gain value and second information gain value is greater;
in response to determining that the first information gain value is greater, generating a second electronic document that includes at least a portion defined by the first line and using the first information gain value to recursively divide the portions defined by the first line;
in response to determining that the second information gain value is greater, generating a third electronic document that includes at least a portion defined by the second line and using the second information gain value to recursively divide the portions defined by the second line,
wherein the entropy value for the first portion of the first electronic document and the entropy value for the second portion of the first electronic document are based at least on a variation in pixel intensity for pixels that the first line intersects in the first electronic document, and the entropy value for the third portion of the first electronic document and the entropy value for the fourth portion of the first electronic document are based at least on a variation in pixel intensity for pixels that the second line intersects in the first electronic document.
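The comparison of two candidate lines in claim 1 can be illustrated with a minimal Python sketch. The claim does not fix a formula, so everything here is an assumption made for illustration: the page is a small grid of grayscale intensities, entropy is Shannon entropy over the intensity histogram of a portion, the entropy associated with a line is the size-weighted mean of the two portion entropies (the usual decision-tree convention), and only horizontal lines are considered. The names `entropy`, `flatten`, and `information_gain` are hypothetical.

```python
import math
from collections import Counter


def entropy(pixels):
    """Shannon entropy in bits of a flat list of pixel intensities."""
    n = len(pixels)
    if n == 0:
        return 0.0
    counts = Counter(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def flatten(rows):
    """Turn a list of pixel rows into one flat list."""
    return [p for row in rows for p in row]


def information_gain(page, row):
    """Gain for a horizontal line drawn just above `row` (assumed convention)."""
    top, bottom = flatten(page[:row]), flatten(page[row:])
    total = len(top) + len(bottom)
    # Assumption: line entropy is the size-weighted mean of the portion entropies.
    line_entropy = (len(top) * entropy(top) + len(bottom) * entropy(bottom)) / total
    # Gain = entropy of the whole document minus entropy associated with the line.
    return entropy(flatten(page)) - line_entropy


# Toy page: two dark rows followed by two light rows.
page = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [255, 255, 255, 255],
    [255, 255, 255, 255],
]

gain_first = information_gain(page, 2)   # line between the dark and light bands: 1.0
gain_second = information_gain(page, 1)  # line inside the dark band: about 0.311
best_row = 2 if gain_first > gain_second else 1
```

On this toy page the first line separates the two bands cleanly, so both portions have zero entropy and the gain equals the full document entropy; the second line leaves a mixed portion and scores lower.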
Abstract
Computer-readable media store instructions that perform operations including receiving a first electronic document; determining a first information gain value associated with a first line that divides the first electronic document into a first portion and a second portion; determining a second information gain value associated with a second line that divides the first electronic document into a third portion and a fourth portion; and determining which of the first information gain value and second information gain value is greater. Information gain values are determined by calculating a difference between an entropy value associated with a line and an entropy value associated with an electronic document. Entropy values associated with lines or electronic documents are determined based at least in part on document objects in the portions created by a line or in the electronic document.
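The difference described in the abstract can be written out as follows. The symbols are assumptions introduced here for illustration: D is the electronic document, ℓ a candidate line, P_1 and P_2 the portions that ℓ creates, and H an entropy measure. The abstract only says the line entropy is based on the portion entropies; the size-weighted mean in the second equation is one common choice, not a formula stated in the source.

```latex
% Information gain of a candidate line \ell over document D
\mathrm{IG}(\ell) = H(D) - H(\ell)

% Assumed combination rule: size-weighted mean of the portion entropies
H(\ell) = \frac{|P_1|}{|D|}\, H(P_1) + \frac{|P_2|}{|D|}\, H(P_2)
```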
15 Claims
6. One or more tangible non-transitory computer-readable media storing instructions that, when executed by a processor, perform operations comprising:
receiving a first electronic document;
dividing the first electronic document along a first line into a first portion and a second portion;
dividing the first electronic document along a second line into a third portion and a fourth portion;
determining a first information gain value based on the first division, comprising;
a) determining an entropy value for the first portion of the first electronic document and an entropy value for the second portion of the first electronic document,
b) based on the entropy value for the first portion of the first electronic document and the entropy value for the second portion of the first electronic document, determining an entropy value associated with the first line, and
c) determining the first information gain value by determining a difference between i) an entropy value for the first electronic document and ii) the entropy value associated with the first line;
determining a second information gain value based on the second division, comprising;
a) determining an entropy value for the third portion of the first electronic document and an entropy value for the fourth portion of the first electronic document,
b) based on the entropy value for the third portion of the first electronic document and the entropy value for the fourth portion of the first electronic document, determining an entropy value associated with the second line, and
c) determining the second information gain value by determining a difference between i) the entropy value for the first electronic document and ii) the entropy value associated with the second line;
selecting the division that produces the higher information gain value;
generating a second electronic document comprising the portions created by the selected division; and
using the highest information gain value to recursively divide the first electronic document into portions,
wherein the entropy value for the first portion of the first electronic document and the entropy value for the second portion of the first electronic document are based at least on a variation in pixel intensity for pixels that the first line intersects in the first electronic document, and the entropy value for the third portion of the first electronic document and the entropy value for the fourth portion of the first electronic document are based at least on a variation in pixel intensity for pixels that the second line intersects in the first electronic document.
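The "select the higher gain, then recursively divide" steps of claim 6 can be sketched in Python as below. This is an illustration under stated assumptions, not the patented implementation: only horizontal cuts between pixel rows are tried, entropy is Shannon entropy over pixel intensities, the line entropy is a size-weighted mean of the portion entropies, and recursion stops when no cut yields a positive gain. The names `best_cut`, `divide`, and `min_gain` are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits of a flat list of intensities (0.0 if empty)."""
    n = len(values)
    counts = Counter(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values()) if n else 0.0


def best_cut(rows):
    """Return (gain, index) of the horizontal cut with the highest gain."""
    flat = [p for r in rows for p in r]
    base = entropy(flat)
    best = (0.0, None)
    for i in range(1, len(rows)):
        top = [p for r in rows[:i] for p in r]
        bottom = [p for r in rows[i:] for p in r]
        # Assumption: size-weighted mean of the two portion entropies.
        weighted = (len(top) * entropy(top) + len(bottom) * entropy(bottom)) / len(flat)
        gain = base - weighted
        if gain > best[0]:
            best = (gain, i)
    return best


def divide(rows, start=0, min_gain=1e-9):
    """Recursively split rows; return (first_row, end_row_exclusive) spans."""
    gain, cut = best_cut(rows)
    if cut is None or gain <= min_gain:
        return [(start, start + len(rows))]  # nothing left worth splitting
    return (divide(rows[:cut], start, min_gain)
            + divide(rows[cut:], start + cut, min_gain))


# Toy page with three bands of uniform intensity.
page = [[0, 0], [0, 0], [255, 255], [128, 128], [128, 128]]
segments = divide(page)  # [(0, 2), (2, 3), (3, 5)]
```

Each returned span is a run of rows with uniform intensity, which is where the gain from any further cut drops to zero. A fuller treatment would also try vertical lines and would need a stopping threshold suited to real scanned pages.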
11. A system comprising:
one or more computer processors; and
one or more non-transitory computer readable devices that include instructions that, when executed by the one or more computer processors, causes the processors to perform operations, the operations comprising;
receiving a first electronic document;
determining an entropy value for the first electronic document;
determining a first information gain value associated with a first line that divides the first electronic document into a first portion and a second portion, comprising;
a) determining an entropy value for the first portion of the first electronic document and an entropy value for the second portion of the first electronic document,
b) based on the entropy value for the first portion of the first electronic document and the entropy value for the second portion of the first electronic document, determining an entropy value associated with the first line, and
c) determining the first information gain value by determining a difference between i) the entropy value for the first electronic document and ii) the entropy value associated with the first line;
determining a second information gain value associated with a second line that divides the first electronic document into a third portion and a fourth portion, comprising;
a) determining an entropy value for the third portion of the first electronic document and an entropy value for the fourth portion of the first electronic document,
b) based on the entropy value for the third portion of the first electronic document and the entropy value for the fourth portion of the first electronic document, determining an entropy value associated with the second line, and
c) determining the second information gain value by determining a difference between i) the entropy value for the first electronic document and ii) the entropy value associated with the second line,
wherein each of the entropy values is based at least in part on document objects in the respective portions of the first electronic document;
determining which of the first information gain value and second information gain value is greater;
in response to determining that the first information gain value is greater, generating a second electronic document that includes at least a portion defined by the first line and using the first information gain value to recursively divide the portions defined by the first line;
in response to determining that the second information gain value is greater, generating a third electronic document that includes at least a portion defined by the second line and using the second information gain value to recursively divide the portions defined by the second line,
wherein the entropy value for the first portion of the first electronic document and the entropy value for the second portion of the first electronic document are based at least on a variation in pixel intensity for pixels that the first line intersects in the first electronic document, and the entropy value for the third portion of the first electronic document and the entropy value for the fourth portion of the first electronic document are based at least on a variation in pixel intensity for pixels that the second line intersects in the first electronic document.
Specification