Document similarity evaluation system, document similarity evaluation method, and computer program

US 9,235,624 B2
Filed: 11/09/2012
Issued: 01/12/2016
Est. Priority Date: 01/19/2012
Status: Active Grant

Abstract

Disclosed is a document similarity evaluation system or the like which can evaluate a degree of concentration and dispersion of parts with high similarity in at least two kinds of documents. The system includes a segment search unit which finds common segments (CS) in first and second segment strings, counts the number of CS, and identifies an appearance range (AR) within CS; and a similarity index (SI) calculation unit which calculates a first sum that is a sum of the numbers of characters of each segment (NCS) in AR and a second sum that is a sum of NCS of CS and calculates SI between the first and second segment strings by the following equation, SI=F(NTC)/G(NCC)×NS (where, NTC is the first sum, NCC is the second sum, NS is the number of the CS, functions F and G monotonically increase at larger than 0).

21 Citations

9 Claims

1. A computer-implemented apparatus evaluating document similarity comprising:
- a processor; and
  
  a memory capable of storing instructions to be executed by the processor by causing the processor to execute;
  
  a segment search unit implemented by hardware including the processor and the memory and which finds common segments in both a first segment string and a second segment string, counts the number of the common segments that are found, and identifies an appearance range within which the common segments appear; and
  
  a similarity index calculation unit implemented by the hardware and which calculates a second sum that is a sum of the numbers of characters of each segment included in the appearance range identified by the segment search unit, calculates a first sum that is a sum of the numbers of characters of each segment identified as the common segments, and calculates the similarity index indicating the similarity between the first segment string and the second segment string by using the following equation,
  similarity index = F(NTC) / G(NCC) × NS
  
  (where, in the above-mentioned equation, NTC is the first sum, NCC is the second sum, NS is the number of the common segments, and a function F and a function G are monotonically increasing functions by which a certain integer value is associated with a positive real value).
- Dependent claims (2, 3, 4)
  - 2. The document similarity evaluation system according to claim 1, wherein the similarity index calculation unit calculates the first sum and the second sum based on a character number information in which each segment included in the appearance range is associated with the number of characters included in the each segment.
  - 3. The document similarity evaluation system according to claim 1, wherein the similarity index calculation unit calculates the similarity index indicating the similarity between the first segment string and the second segment string by using the following equation,
    similarity index = H(NTC / NCC) × NS
    (where, in the above-mentioned equation, NTC is the first sum, NCC is the second sum, NS is the number of the common segments, and a function H is a monotonically increasing function by which a certain integer value is associated with a positive real value).
  - 4. The document similarity evaluation system according to claim 1, wherein the similarity index calculation unit calculates the similarity index indicating the similarity between the first segment string and the second segment string by using the following equation,
    similarity index = NTC / NCC × NS
    (where, in the above-mentioned equation, NTC is the first sum, NCC is the second sum, and NS is the number of the common segments).
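The three equation forms recited in claims 1, 3 and 4 can be compared side by side in a minimal Python sketch. The function names (`index_general`, `index_ratio_fn`, `index_plain`) are hypothetical, and the use of `math.sqrt` as an example of H is only an illustrative assumption; the claims require no more than that F, G and H be monotonically increasing with positive real values.

```python
import math


def index_general(ntc, ncc, ns, F, G):
    """Claim 1 form: F(NTC) / G(NCC) x NS, with caller-supplied F and G."""
    return F(ntc) / G(ncc) * ns


def index_ratio_fn(ntc, ncc, ns, H):
    """Claim 3 form: H(NTC / NCC) x NS, with caller-supplied H."""
    return H(ntc / ncc) * ns


def index_plain(ntc, ncc, ns):
    """Claim 4 form: NTC / NCC x NS."""
    return ntc / ncc * ns


if __name__ == "__main__":
    # Illustrative values only: 5 characters in common segments,
    # 9 characters in the appearance range, 2 common segments.
    ntc, ncc, ns = 5, 9, 2
    print(index_plain(ntc, ncc, ns))
    print(index_general(ntc, ncc, ns, float, float))  # F and G as identity
    print(index_ratio_fn(ntc, ncc, ns, math.sqrt))    # H = sqrt, an assumption
```

With F and G both taken as the identity, the claim 1 form reduces to the claim 4 form.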

5. A document similarity evaluation method calculating a similarity index indicating a similarity between a first segment string and a second segment string comprising:
- finding common segments in both the first segment string and the second segment string, counting the number of the common segments that are found;
  
  identifying an appearance range within which the common segments appear;
  
  calculating a second sum that is a sum of the numbers of characters of each segment included in the appearance range;
  
  calculating a first sum that is a sum of the numbers of characters of each segment identified as the common segments; and
  
  calculating the similarity index by the following equation,
  similarity index = F(NTC) / G(NCC) × NS
  
  (where, in the above-mentioned equation, NTC is the first sum, NCC is the second sum, NS is the number of the common segments, and a function F and a function G are monotonically increasing functions by which a certain integer value is associated with a positive real value).
- Dependent claims (6, 7)
  - 6. The document similarity evaluation method according to claim 5, wherein the calculation of the similarity index indicating the similarity is performed by using the following equation,
    similarity index = H(NTC / NCC) × NS
    (where, in the above-mentioned equation, NTC is the first sum, NCC is the second sum, NS is the number of the common segments, and a function H is a monotonically increasing function by which a certain integer value is associated with a positive real value).
  - 7. The document similarity evaluation method according to claim 5, wherein the calculation of the similarity index indicating the similarity is performed by using the following equation,
    similarity index = NTC / NCC × NS
    (where, in the above-mentioned equation, NTC is the first sum, NCC is the second sum, and NS is the number of the common segments).
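The method steps of claim 5, combined with the equation of claim 7, might be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: each segment string is modelled as a list of strings, the appearance range is assumed to run from the first to the last common segment within the second segment string, and NS is assumed to count occurrences in the second segment string. The names `segment_counts` and `similarity_index` are hypothetical. Following the claim wording, NTC sums characters over the common segments and NCC sums characters over the whole appearance range.

```python
def segment_counts(first, second):
    """Return (NTC, NCC, NS) for two segment strings given as lists of str.

    Assumption: the appearance range spans the first through the last
    position in `second` that holds a segment also present in `first`.
    """
    first_set = set(first)
    # Positions in `second` whose segment also occurs in `first`.
    hits = [i for i, seg in enumerate(second) if seg in first_set]
    if not hits:
        return 0, 0, 0
    start, end = hits[0], hits[-1]
    # First sum (NTC): characters in the common segments.
    ntc = sum(len(second[i]) for i in hits)
    # Second sum (NCC): characters in every segment of the appearance range.
    ncc = sum(len(seg) for seg in second[start:end + 1])
    return ntc, ncc, len(hits)


def similarity_index(first, second):
    """Claim 7 form: NTC / NCC x NS, or 0.0 when nothing is shared."""
    ntc, ncc, ns = segment_counts(first, second)
    if ns == 0 or ncc == 0:
        return 0.0
    return ntc / ncc * ns


if __name__ == "__main__":
    reference = ["aa", "bbb", "c"]
    dispersed = ["x", "aa", "yyyy", "bbb", "z"]  # common segments spread apart
    concentrated = ["x", "aa", "bbb", "z"]       # common segments adjacent
    print(similarity_index(reference, dispersed))
    print(similarity_index(reference, concentrated))
```

In this sketch the concentrated example scores higher than the dispersed one, because the unrelated segment inside the appearance range enlarges NCC without adding to NTC.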

8. A computer-implemented apparatus evaluating a similarity comprising:
- a processor; and
  
  a memory capable of storing instructions to be executed by the processor by causing the processor to execute;
  
  a similarity index calculation unit which calculates a similarity index indicating the similarity between a first segment string and a second segment on the basis of (i) a ratio of the number of characters included in an first appearance range, which is a range where common segments in the first segment string and the second segment string appear in the second segment string, to a result of multiplying the number of appearance of the common segments in the first appearance range and the number of characters of the common segments, and (ii) the number of the common segments.
- Dependent claims (9)
  - 9. The computer-implemented apparatus according to claim 8, wherein, when the ratio calculated in case of the first segment string and the second segment is the same to the ratio calculated in case of the first segment string and a third segment string, the similarity index calculation unit determines which segment string is more similar to the first segment string by comparing the number of common segments in the first segment string and the second segment string with the number of common segments in the first segment string and the third segment string.
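The tie-breaking comparison of claim 9 could be sketched as below. The claim says only that the numbers of common segments are compared; treating the larger count as "more similar" is an assumption here, chosen because the index in claim 1 is multiplied by NS. The `Candidate` type and the `more_similar_on_tie` function are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    ratio: float          # the ratio of claim 8, computed elsewhere
    common_segments: int  # number of common segments with the first string


def more_similar_on_tie(a, b):
    """Pick between two candidates whose ratios are equal.

    Assumption: the candidate with more common segments is the more
    similar one. Returns None when the counts are equal as well.
    """
    if not math.isclose(a.ratio, b.ratio):
        raise ValueError("ratios differ; the tie-break does not apply")
    if a.common_segments == b.common_segments:
        return None
    return a if a.common_segments > b.common_segments else b


if __name__ == "__main__":
    second = Candidate("second segment string", ratio=1.8, common_segments=2)
    third = Candidate("third segment string", ratio=1.8, common_segments=3)
    print(more_similar_on_tie(second, third).name)
```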

Current Assignee: NEC Corporation
Original Assignee: NEC Corporation
Inventors: Zhou, Wenqi
Primary Examiner(s): Beausoliel, Jr., Robert
Assistant Examiner(s): Khakhar, Nirav K

Application Number: US13/672,794
Publication Number: US 20130191410A1
Time in Patent Office: 1,159 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06F 16/24558   Binary matching operations
G06F 16/334   Query execution G06F16/335 ...
G06F 40/194   Calculation of difference b...
