Document similarity evaluation system, document similarity evaluation method, and computer program
First Claim
1. A computer-implemented apparatus evaluating document similarity comprising:
a processor; and
a memory capable of storing instructions to be executed by the processor by causing the processor to execute;
a segment search unit implemented by hardware including the processor and the memory and which finds common segments in both a first segment string and a second segment string, counts the number of the common segments that are found, and identifies an appearance range within which the common segments appear; and
a similarity index calculation unit implemented by the hardware and which calculates a second sum that is a sum of the numbers of characters of each segment included in the appearance range identified by the segment search unit, calculates a first sum that is a sum of the numbers of characters of each segment identified as the common segments, and calculates the similarity index indicating the similarity between the first segment string and the second segment string by using the following equation,
similarity index = F(NTC)/G(NCC) × NS
(where, in the above-mentioned equation, NTC is the first sum, NCC is the second sum, NS is the number of the common segments, and a function F and a function G are monotonically increasing functions by which a certain integer value is associated with a positive real value).
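As an illustration of the equation in the claim, the following is a minimal Python sketch. The function name `similarity_index`, its parameter names, and the default choice of F and G (plain conversion to a real number, which is monotonically increasing and positive for positive integers) are assumptions made for this example and are not taken from the patent. NTC and NCC are passed in directly, so the sketch takes no position on how the two sums are obtained.

```python
import math
from typing import Callable


def similarity_index(
    ntc: int,
    ncc: int,
    ns: int,
    f: Callable[[int], float] = float,  # assumed F: identity on positive integers
    g: Callable[[int], float] = float,  # assumed G: identity on positive integers
) -> float:
    """Evaluate F(NTC) / G(NCC) * NS for caller-supplied F and G."""
    numerator = f(ntc)
    denominator = g(ncc)
    # The claim requires F and G to yield positive real values.
    if numerator <= 0 or denominator <= 0:
        raise ValueError("F and G must return positive values")
    return numerator / denominator * ns


# Hypothetical inputs: NTC = 30, NCC = 60, NS = 3.
print(similarity_index(30, 60, 3))                        # identity F and G
print(similarity_index(30, 60, 3, math.sqrt, math.sqrt))  # square-root F and G
```

Any other monotonically increasing, positive-valued pair (for example a square root, as in the second call) can be substituted without changing the structure of the calculation.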
Abstract
Disclosed is a document similarity evaluation system or the like which can evaluate the degree of concentration and dispersion of highly similar parts in at least two kinds of documents. The system includes a segment search unit which finds common segments (CS) in first and second segment strings, counts the number of CS, and identifies an appearance range (AR) within which the CS appear; and a similarity index (SI) calculation unit which calculates a first sum that is a sum of the numbers of characters of each segment (NCS) in the AR and a second sum that is a sum of the NCS of the CS, and calculates the SI between the first and second segment strings by the following equation: SI = F(NTC)/G(NCC) × NS (where NTC is the first sum, NCC is the second sum, NS is the number of the CS, and functions F and G are monotonically increasing and take values larger than 0).
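To make the idea of concentration versus dispersion concrete, the sketch below sums the characters of every segment inside the appearance range for two arrangements of the same common segments. It is a minimal illustration under stated assumptions: a segment string is modeled as a Python list of strings, the appearance range is taken to run from the first to the last common segment inclusive, and the name `range_character_count` is hypothetical.

```python
from typing import List, Set


def range_character_count(segments: List[str], common: Set[str]) -> int:
    """Sum the characters of all segments from the first to the last common segment."""
    positions = [i for i, seg in enumerate(segments) if seg in common]
    if not positions:
        return 0  # no common segment, so there is no appearance range
    first, last = positions[0], positions[-1]
    # Every segment inside the range counts, whether it is common or not.
    return sum(len(seg) for seg in segments[first:last + 1])


common = {"aaaa", "bbbb"}
concentrated = ["xx", "aaaa", "bbbb", "yyyyyy", "zz"]  # common segments adjacent
dispersed = ["aaaa", "xx", "yyyyyy", "zz", "bbbb"]     # common segments far apart

print(range_character_count(concentrated, common))  # 8
print(range_character_count(dispersed, common))     # 18
```

Both lists contain the same two common segments, yet the range sum is larger when they are spread apart, which is the kind of difference the index is described as capturing.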
21 Citations
9 Claims
1. A computer-implemented apparatus evaluating document similarity comprising:
a processor; and
a memory capable of storing instructions to be executed by the processor by causing the processor to execute;
a segment search unit implemented by hardware including the processor and the memory and which finds common segments in both a first segment string and a second segment string, counts the number of the common segments that are found, and identifies an appearance range within which the common segments appear; and
a similarity index calculation unit implemented by the hardware and which calculates a second sum that is a sum of the numbers of characters of each segment included in the appearance range identified by the segment search unit, calculates a first sum that is a sum of the numbers of characters of each segment identified as the common segments, and calculates the similarity index indicating the similarity between the first segment string and the second segment string by using the following equation,
similarity index = F(NTC)/G(NCC) × NS
(where, in the above-mentioned equation, NTC is the first sum, NCC is the second sum, NS is the number of the common segments, and a function F and a function G are monotonically increasing functions by which a certain integer value is associated with a positive real value).
Dependent claims: 2, 3, 4
5. A document similarity evaluation method calculating a similarity index indicating a similarity between a first segment string and a second segment string comprising:
finding common segments in both the first segment string and the second segment string, counting the number of the common segments that are found;
identifying an appearance range within which the common segments appear;
calculating a second sum that is a sum of the numbers of characters of each segment included in the appearance range;
calculating a first sum that is a sum of the numbers of characters of each segment identified as the common segments; and
calculating the similarity index by the following equation,
similarity index = F(NTC)/G(NCC) × NS
(where, in the above-mentioned equation, NTC is the first sum, NCC is the second sum, NS is the number of the common segments, and a function F and a function G are monotonically increasing functions by which a certain integer value is associated with a positive real value).
Dependent claims: 6, 7
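The steps of the method claim can be sketched end to end as follows. This is only one possible reading, with every unstated detail filled in by assumption: segments are list elements compared by exact equality, the appearance range is located in the second segment string and runs from the first to the last common segment, each occurrence in the second string is counted, and F and G are both the identity. The name `evaluate_similarity` is hypothetical. The sketch follows the claim wording (NTC is the sum over the common segments, NCC is the sum over the appearance range); the abstract assigns the two sums the other way round.

```python
from typing import List, Tuple


def evaluate_similarity(first: List[str], second: List[str]) -> Tuple[int, int, int, float]:
    """Return (NS, first sum, second sum, similarity index) for two segment strings."""
    # Find the common segments (assumption: exact string match).
    first_set = set(first)
    positions = [i for i, seg in enumerate(second) if seg in first_set]

    # Count the common segments that were found.
    ns = len(positions)
    if ns == 0:
        return 0, 0, 0, 0.0  # nothing in common, index defined as 0 here

    # Identify the appearance range (assumption: first to last hit in `second`).
    start, end = positions[0], positions[-1]

    # Second sum: characters of every segment inside the appearance range.
    second_sum = sum(len(seg) for seg in second[start:end + 1])

    # First sum: characters of the segments identified as common.
    first_sum = sum(len(second[i]) for i in positions)

    # Similarity index with F and G taken as the identity (assumption).
    index = float(first_sum) / float(second_sum) * ns
    return ns, first_sum, second_sum, index


first_doc = ["alpha", "beta", "gamma"]
second_doc = ["beta", "delta", "gamma", "omega"]
print(evaluate_similarity(first_doc, second_doc))
```

Here "beta" and "gamma" are common, the range spans "beta", "delta", and "gamma", and the sums are 9 and 14 characters respectively.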
8. A computer-implemented apparatus evaluating a similarity comprising:
a processor; and
a memory capable of storing instructions to be executed by the processor by causing the processor to execute;
a similarity index calculation unit which calculates a similarity index indicating the similarity between a first segment string and a second segment string on the basis of (i) a ratio of the number of characters included in a first appearance range, which is a range where common segments in the first segment string and the second segment string appear in the second segment string, to a result of multiplying the number of appearances of the common segments in the first appearance range and the number of characters of the common segments, and (ii) the number of the common segments.
Dependent claims: 9
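The ratio in item (i) can be illustrated with a small Python sketch. It assumes, for simplicity, that every common segment has the same number of characters, and the name `range_to_common_ratio` and its parameters are hypothetical. The claim does not state how the ratio is combined with the number of common segments, so the sketch stops at the ratio.

```python
def range_to_common_ratio(
    range_chars: int,
    appearances: int,
    chars_per_common_segment: int,
) -> float:
    """Characters in the appearance range divided by (appearances x characters per common segment)."""
    common_chars = appearances * chars_per_common_segment
    if common_chars <= 0:
        raise ValueError("there must be at least one non-empty common segment")
    return range_chars / common_chars


# Hypothetical numbers: a 40-character range holding 4 appearances of 5-character segments.
print(range_to_common_ratio(40, 4, 5))  # 2.0
```

Under this reading, a ratio of 1.0 would mean the appearance range consists only of common segments, and larger values would mean the common segments are more thinly spread within the range.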
Specification