Segmenting a string using similarity values

US 8,081,823 B2
Filed: 11/20/2007
Issued: 12/20/2011
Est. Priority Date: 11/20/2007
Status: Active Grant

Abstract

Disclosed are systems and methods for segmenting a string comprised of one or more string segments using similarity values. In embodiments, each string segment may contain at least a variation of a marker string that may be used to separate string segments in the string. In embodiments, a similarity value representing the result of comparing the marker string to substrings of the string may be computed, and a similarity vector representing the set of comparisons for the locations on the string may be generated. In embodiments, the similarity vector may be used to identify candidate segmentation locations in the string. In embodiments, a set of segmentation locations in the string may be derived from the candidate segmentation locations in the string, and the string may be segmented according to the set of segmentation locations.
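
The similarity vector described in the abstract can be illustrated with a short sketch. It assumes the longest-common-subsequence ratio recited in claim 2 as the similarity measure and a window that slides one character at a time. The function names `lcs_length` and `similarity_vector` are hypothetical and do not come from the patent.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity_vector(text: str, marker: str) -> list[tuple[int, float]]:
    """Return (start_location, similarity) for every marker-length window of text.

    Assumption: the window advances one character at a time.
    """
    m = len(marker)
    if m == 0 or len(text) < m:
        return []
    return [
        (start, lcs_length(marker, text[start:start + m]) / m)
        for start in range(len(text) - m + 1)
    ]
```

For the string `"xxABCxx"` and the marker `"ABC"`, the window starting at location 2 equals the marker, so its similarity value is 1.0, while the neighbouring windows score lower.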

18 Claims

1. A method for segmenting a string comprising one or more segments into discrete segments, wherein each of the one or more segments comprises data that is the same as or similar to a marker string, the method comprising:
- generating a similarity vector comprising a plurality of similarity values and associated locations within the string wherein a similarity value represents a comparison of the marker string and at least a portion of the string and an associated location associated with the similarity value is the location within the string of the start of the at least a portion of the string used in the comparison;
  
  identifying a set of ideal segmentation locations based upon an expected number of discrete segments within the string;
  
  using the similarity vector to identify a set of candidate segmentation locations;
  
  responsive to a candidate segmentation location having a similarity value less than another candidate segmentation location within a local window, removing the candidate segmentation location from the set of candidate segmentation locations;
  
  responsive to a candidate segmentation location and a closest ideal segmentation location being at a distance that is greater than the distance threshold, removing the candidate segmentation location from the set of candidate segmentation locations; and
  
  using the set of candidate segmentation locations and the set of ideal segmentation locations to generate a set of segmentation locations; and
  
  using the set of segmentation locations to segment the string.
- - 2. The method of claim 1 wherein a similarity value from the plurality of similarity values is obtained by performing the steps comprising:
    - selecting a first substring of the string wherein the first substring has a number of characters equal to a number of characters of the marker string;
      
      identifying a longest common subsequence between the marker string and the first substring; and
      
      calculating the similarity value by dividing the number of characters of the longest common subsequence length by the number of characters of the first substring.
  - 3. The method of claim 1 wherein the step of using the similarity vector to identify a set of candidate segmentation locations comprises:
    - generating a smoothed similarity vector by applying a low-pass filter to the similarity vector;
      
      calculating a histogram of similarity values from the smoothed similarity vector;
      
      calculating a similarity value threshold using the histogram of similarity values in the smoothed similarity vector and the expected number of discrete segments; and
      
      responsive to an element of the similarity vector having a similarity value that is greater than or equal to the similarity value threshold, adding the element to the set of candidate segmentation locations.
  - 4. The method of claim 1 wherein the local window is equal to the number of characters of the marker string.
  - 5. The method of claim 1 wherein the step of using the set of candidate segmentation locations and the set of ideal segmentation locations to generate the set of segmentation locations comprises:
    - responsive to the set of candidate segmentation locations being an empty set, defining the set of ideal segmentation locations as a set of segmentation locations; and
      
      responsive to the set of candidate segmentation locations not being an empty set, using at least some of the candidate segmentation locations to define a set of segmentation locations.
  - 6. The method of claim 5 wherein using at least some of the candidate segmentation locations to define a set of segmentation locations comprises:
    - responsive to an ideal segmentation location being within the distance threshold of a candidate segmentation location, adding the candidate segmentation location to the set of segmentation locations;
      
      responsive to an ideal segmentation location that is at the beginning of the set of ideal segmentation locations not having a candidate segmentation location within the distance threshold of the ideal segmentation location, adding the ideal segmentation location to the set of segmentation locations;
      
      responsive to an ideal segmentation location that is at the end of the set of ideal segmentation locations not having a candidate segmentation location within the distance threshold of the ideal segmentation location, adding the ideal segmentation location to the set of segmentation locations; and
      
      responsive to an ideal segmentation location that is not at the beginning or at the end of the set of ideal segmentation locations not being within the distance threshold of a candidate segmentation location, calculating an estimated segmentation location that is added to the set of segmentation locations by:
      
      responsive to a previous segmentation location being within the distance threshold of a previous ideal segmentation location that is adjacent to the ideal segmentation location and a next segmentation location being within the distance threshold of a subsequent ideal segmentation location that is adjacent to the ideal segmentation location, using the previous segmentation location and the next segmentation location to calculate the estimated segmentation location; and
      
      responsive to either the previous segmentation location or the next segmentation location not being within the distance threshold of an ideal segmentation location, using the ideal segmentation location as the estimated segmentation location.
  - 7. A tangible computer readable medium having instructions for performing the method of claim 1.
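
The two candidate-removal steps of claim 1 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: ideal locations are assumed to be evenly spaced segment starts, the local window is assumed to be symmetric around each candidate, and all function names are hypothetical.

```python
def ideal_locations(text_len: int, expected_segments: int) -> list[int]:
    """Assumption: ideal segment starts are evenly spaced across the string."""
    if expected_segments < 1:
        raise ValueError("expected_segments must be at least 1")
    step = text_len / expected_segments
    return [round(k * step) for k in range(expected_segments)]


def prune_local_window(candidates: dict[int, float], window: int) -> dict[int, float]:
    """Drop a candidate when another candidate within `window` has a higher similarity.

    Candidates map location -> similarity value. Ties are kept.
    """
    return {
        loc: sim
        for loc, sim in candidates.items()
        if not any(
            other != loc and abs(other - loc) <= window and other_sim > sim
            for other, other_sim in candidates.items()
        )
    }


def prune_by_distance(
    candidates: dict[int, float], ideals: list[int], distance_threshold: int
) -> dict[int, float]:
    """Drop a candidate whose closest ideal location is farther than the threshold."""
    return {
        loc: sim
        for loc, sim in candidates.items()
        if min(abs(loc - ideal) for ideal in ideals) <= distance_threshold
    }
```

Claim 4 sets the local window to the marker length, so a caller following that claim would pass `window=len(marker)`.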

8. A method for segmenting a string comprising one or more segments into discrete segments, wherein each of the one or more segments comprises data that is at least a variant of a marker string, the method comprising:
- generating a similarity vector comprising a plurality of similarity values and associated locations within the string wherein a similarity value represents a comparison of the marker string and at least a portion of the string and an associated location associated with the similarity value is the location within the string of the start of the at least a portion of the string used in the comparison;
  
  identifying a set of ideal segmentation locations in the string based upon an expected number of discrete segments within the string;
  
  using the similarity vector to generate a set of candidate segmentation locations for segmenting the string based on a comparison of each of a plurality of elements of the similarity vector to a similarity value threshold obtained from a smoothed similarity vector;
  
  using the set of candidate segmentation locations and the set of ideal segmentation locations to generate the set of segmentation locations; and
  
  using the set of segmentation locations to segment the string.
- - 9. The method of claim 8 wherein the step of using the similarity vector to generate a set of candidate segmentation locations for segmenting the string based on a comparison of each of a plurality of elements of the similarity vector to a similarity value threshold obtained from a smoothed similarity vector comprises:
    - generating the smoothed similarity vector by applying a low-pass filter to the similarity vector;
      
      calculating a histogram of similarity values from the smoothed similarity vector;
      
      calculating the similarity value threshold using the histogram of similarity values in the smoothed similarity vector and the expected number of discrete segments; and
      
      responsive to an element of the similarity vector having a similarity value that is greater than or equal to the similarity value threshold, adding the element to the set of candidate segmentation locations.
  - 10. The method of claim 8 wherein the step of using the set of candidate segmentation locations and the set of ideal segmentation locations to generate the set of segmentation locations comprises:
    - responsive to the set of candidate segmentation locations being an empty set, defining the set of ideal segmentation locations as a set of segmentation locations; and
      
      responsive to the set of candidate segmentation locations not being an empty set, using at least some of the candidate segmentation locations to define a set of segmentation locations.
  - 11. The method of claim 10 wherein using at least some of the candidate segmentation locations to define a set of segmentation locations comprises:
    - responsive to an ideal segmentation location being within the distance threshold of a candidate segmentation location, adding the candidate segmentation location to the set of segmentation locations;
      
      responsive to an ideal segmentation location that is at the beginning of the set of ideal segmentation locations not having a candidate segmentation location within the distance threshold of the ideal segmentation location, adding the ideal segmentation location to the set of segmentation locations;
      
      responsive to an ideal segmentation location that is at the end of the set of ideal segmentation locations not having a candidate segmentation location within the distance threshold of the ideal segmentation location, adding the ideal segmentation location to the set of segmentation locations; and
      
      responsive to an ideal segmentation location that is not at the beginning or at the end of the set of ideal segmentation locations not being within the distance threshold of a candidate segmentation location, calculating an estimated segmentation location that is added to the set of segmentation locations by:
      
      responsive to a previous segmentation location being within the distance threshold of a previous ideal segmentation location that is adjacent to the ideal segmentation location and a next segmentation location being within the distance threshold of a subsequent ideal segmentation location that is adjacent to the ideal segmentation location, using the previous segmentation location and the next segmentation location to calculate the estimated segmentation location; and
      
      responsive to either the previous segmentation location or the next segmentation location not being within the distance threshold of an ideal segmentation location, using the ideal segmentation location as the estimated segmentation location.
  - 12. A tangible computer readable medium having instructions for performing the method of claim 8.
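
The thresholding steps of claim 9 (smooth, build a histogram, derive a threshold, compare) can be sketched as below. The claims do not say which low-pass filter or which histogram rule is used. This sketch assumes a centered moving average and a rule that walks the histogram from the highest bin downward until the bins cover at least the expected number of segments. The function names and the `width` and `bins` parameters are hypothetical.

```python
def moving_average(values: list[float], width: int = 3) -> list[float]:
    """Assumed low-pass filter: centered moving average, clipped at both ends."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def histogram_threshold(smoothed: list[float], expected_segments: int, bins: int = 10) -> float:
    """Assumed rule: lower edge of the highest bin at which the bins at or above it
    hold at least `expected_segments` values. Similarity values lie in [0, 1]."""
    counts = [0] * bins
    for v in smoothed:
        counts[min(int(v * bins), bins - 1)] += 1
    covered = 0
    for b in range(bins - 1, -1, -1):
        covered += counts[b]
        if covered >= expected_segments:
            return b / bins
    return 0.0


def candidate_locations(
    similarities: list[float], expected_segments: int, width: int = 3, bins: int = 10
) -> dict[int, float]:
    """Keep locations whose raw similarity is >= the threshold from the smoothed vector."""
    threshold = histogram_threshold(
        moving_average(similarities, width), expected_segments, bins
    )
    return {loc: sim for loc, sim in enumerate(similarities) if sim >= threshold}
```

Note that the threshold is derived from the smoothed vector but, following the wording of claim 9, it is compared against elements of the original similarity vector.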

13. A system for segmenting a string comprising one or more segments into discrete segments, wherein each of the one or more segments comprises data that is the same as or similar to a marker string, the system comprising:
- a similarity vector generator, coupled to receive the string and the marker string, that generates a similarity vector comprising a plurality of similarity values and associated locations within the string wherein a similarity value represents a comparison of the marker string and at least a portion of the string and an associated location associated with the similarity value is the location within the string of the start of the at least a portion of the string used in the comparison;
  
  a segment location set generator, coupled to receive the similarity vector, that identifies a set of ideal segmentation locations based upon an expected number of discrete segments within the string, uses the similarity vector to identify a set of candidate segmentation locations, responsive to a candidate segmentation location having a similarity value less than another candidate segmentation location within a local window, removes the candidate segmentation location from the set of candidate segmentation locations, responsive to a candidate segmentation location and a closest ideal segmentation location being at a distance that is greater than a distance threshold, removes the candidate segmentation location from the set of candidate segmentation locations, and uses the set of candidate segmentation locations and the set of ideal segmentation locations to generate a set of segmentation locations, wherein a segmentation location marks the beginning of a discrete segment in the string; and
  
  a string segmenter, coupled to receive the set of segmentation locations, that uses the set of segmentation locations to segment the string.
- - 14. The system of claim 13 wherein a similarity value from the plurality of similarity values is obtained by performing the steps comprising:
    - selecting a first substring of the string wherein the first substring has a number of characters equal to a number of characters of the marker string;
      
      identifying a longest common subsequence between the marker string and the first substring; and
      
      calculating the similarity value by dividing the number of characters of the longest common subsequence length by the number of characters of the first substring.
  - 15. The system of claim 13 wherein performing the steps to use the similarity vector to identify a set of candidate segmentation locations comprises:
    - generating a smoothed similarity vector by applying a low-pass filter to the similarity vector;
      
      calculating a histogram of similarity values from the smoothed similarity vector;
      
      calculating a similarity value threshold using the histogram of similarity values in the smoothed similarity vector and the expected number of discrete segments; and
      
      responsive to an element of the similarity vector having a similarity value that is greater than or equal to the similarity value threshold, adding the element to the set of candidate segmentation locations.
  - 16. The system of claim 13 wherein the local window is equal to the number of characters of the marker string.
  - 17. The system of claim 13 wherein performing the steps to use the set of candidate segmentation locations and the set of ideal segmentation locations to generate the set of segmentation locations comprises:
    - responsive to the set of candidate segmentation locations being an empty set, defining the set of ideal segmentation locations as a set of segmentation locations; and
      
      responsive to the set of candidate segmentation locations not being an empty set, using at least some of the candidate segmentation locations to define a set of segmentation locations.
  - 18. The system of claim 17 wherein using at least some of the candidate segmentation locations to define a set of segmentation locations comprises:
    - responsive to an ideal segmentation location being within the distance threshold of a candidate segmentation location, adding the candidate segmentation location to the set of segmentation locations;
      
      responsive to an ideal segmentation location that is at the beginning of the set of ideal segmentation locations not having a candidate segmentation location within the distance threshold of the ideal segmentation location, adding the ideal segmentation location to the set of segmentation locations;
      
      responsive to an ideal segmentation location that is at the end of the set of ideal segmentation locations not having a candidate segmentation location within the distance threshold of the ideal segmentation location, adding the ideal segmentation location to the set of segmentation locations; and
      
      responsive to an ideal segmentation location that is not at the beginning or at the end of the set of ideal segmentation locations not being within the distance threshold of a candidate segmentation location, calculating an estimated segmentation location that is added to the set of segmentation locations by:
      
      responsive to a previous segmentation location being within the distance threshold of a previous ideal segmentation location that is adjacent to the ideal segmentation location and a next segmentation location being within the distance threshold of a subsequent ideal segmentation location that is adjacent to the ideal segmentation location, using the previous segmentation location and the next segmentation location to calculate the estimated segmentation location; and
      
      responsive to either the previous segmentation location or the next segmentation location not being within the distance threshold of an ideal segmentation location, using the ideal segmentation location as the estimated segmentation location.
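
The merging rules recited in claims 5 and 6 (and repeated in claims 10, 11, 17 and 18) can be sketched as a two-pass procedure. Several details are assumptions made for illustration: when more than one candidate lies within the distance threshold, the closest one is taken; the estimated location for an interior gap is the midpoint of its neighbours; and text before the first segmentation location is dropped when slicing. The names `merge_locations` and `segment_string` are hypothetical.

```python
def merge_locations(candidates: list[int], ideals: list[int], threshold: int) -> list[int]:
    """Combine candidate and ideal locations into the final segmentation locations."""
    if not candidates:
        return list(ideals)  # empty candidate set: fall back to the ideal locations
    last = len(ideals) - 1
    resolved = []
    # Pass 1: use a nearby candidate where one exists; keep endpoint ideals as-is.
    for i, ideal in enumerate(ideals):
        near = [c for c in candidates if abs(c - ideal) <= threshold]
        if near:
            resolved.append(min(near, key=lambda c: abs(c - ideal)))  # assumption: closest
        elif i in (0, last):
            resolved.append(ideal)
        else:
            resolved.append(None)  # interior gap, filled in pass 2
    # Pass 2: estimate interior gaps from their neighbours, else use the ideal location.
    for i, loc in enumerate(resolved):
        if loc is not None:
            continue
        prev_loc, next_loc = resolved[i - 1], resolved[i + 1]
        if (next_loc is not None
                and abs(prev_loc - ideals[i - 1]) <= threshold
                and abs(next_loc - ideals[i + 1]) <= threshold):
            resolved[i] = (prev_loc + next_loc) // 2  # assumption: midpoint estimate
        else:
            resolved[i] = ideals[i]
    return sorted(set(resolved))  # guard against duplicates from overlapping thresholds


def segment_string(text: str, locations: list[int]) -> list[str]:
    """Each location marks the start of a segment; the last segment runs to the end."""
    bounds = list(locations) + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]
```

With ideal locations `[0, 10, 20]`, candidates `[0, 22]` and a distance threshold of 2, the middle location has no nearby candidate, so it is estimated as the midpoint 11 of its neighbours 0 and 22.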

Current Assignee: Seiko Epson Corporation (Seiko Group)
Original Assignee: Seiko Epson Corporation (Seiko Group)
Inventors: Zandifar, Ali; Xiao, Jing
Primary Examiner(s): Liew, Alex
Application Number: US11/943,285
Publication Number: US 20090129676A1
Time in Patent Office: 1,491 Days
Field of Search: 382/168-231
US Class Current: 382/181
CPC Class Codes:
G06F 18/00 Pattern recognition
G06F 2218/16 by matching signal segments
