Method and apparatus for learning, recognizing and generalizing sequences

US 20070055662A1
Filed: 08/01/2004
Published: 03/08/2007
Est. Priority Date: 08/01/2004
Status: Abandoned Application

Abstract

A method of generalizing a dataset having a plurality of sequences defined over a lexicon of tokens is provided. The method comprises: searching over the dataset for similarity sets, where each similarity set comprises a plurality of segments of size L having L−S common tokens and S uncommon tokens; and defining a plurality of equivalence classes corresponding to uncommon tokens of at least one similarity set. The method may further comprise a step in which a plurality of significant patterns are extracted, where each significant pattern corresponds to a most significant partial overlap between one sequence of the dataset and other sequences of the dataset. In one embodiment, a generalized dataset represented by a graph or a forest is constructed, and can be realized as a context-free grammar. The graph or forest can be used for generating sequences and/or testing grammatical structures.

80 Citations

194 Claims

1-164. (canceled)

165. A method of extracting significant patterns from a dataset having a plurality of sequences defined over a lexicon of tokens, the method comprising, for each sequence of the plurality of sequences:
- searching for partial overlaps between said sequence and other sequences of the dataset, applying a significance test on said partial overlaps, and defining a most significant partial overlap as a significant pattern of said sequence, thereby extracting significant patterns from the dataset.
  - 166. The method of claim 165, wherein said search for partial overlaps is by constructing a graph having a plurality of paths representing the dataset and searching for partial overlaps between paths of said graph.
  - 167. The method of claim 166, wherein said search for partial overlaps between paths of said graph comprises:
    - defining, for each path, a set of sub-paths of variable lengths, thereby defining a plurality of sets of sub-paths; and
    - for each set of sub-paths, comparing each sub-path of said set with sub-paths of other sets.
  - 168. The method of claim 166, wherein said graph comprises a plurality of vertices, each representing one token of the lexicon, and further wherein each path of said plurality of paths comprises a sequence of vertices respectively corresponding to one sequence of the dataset.
  - 169. The method of claim 166, further comprising calculating, for each path, a set of probability functions characterizing said partial overlaps.
  - 170. The method of claim 165, further comprising grouping at least a few tokens of said significant pattern, thereby redefining the dataset.
  - 171. The method of claim 165, wherein the dataset comprises a corpus of text.
  - 172. The method of claim 165, wherein the dataset comprises a protein database.
  - 173. The method of claim 165, wherein the dataset comprises a DNA database.
  - 174. The method of claim 165, wherein the dataset comprises an RNA database.
  - 175. The method of claim 165, wherein the dataset comprises a recorded speech.
  - 176. The method of claim 165, wherein the dataset comprises a corpus of music notes.
  - 177. The method of claim 165, wherein the dataset comprises a weblog database.
  - 178. The method of claim 165, wherein the dataset comprises trajectory records of a transportation network.
  - 179. The method of claim 165, wherein the dataset comprises activity records of a self-active system.
  - 180. The method of claim 165, wherein the dataset comprises records of operational steps in a technical process.
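
The search described in claims 165 to 167 can be illustrated with a minimal Python sketch. It assumes each sequence is a list of string tokens and treats every contiguous run of tokens as a sub-path. The claims do not specify the significance test, so the score used here (number of other sequences sharing the sub-path, multiplied by its length) is only a placeholder, and the names `sub_paths`, `significant_patterns` and `min_len` are hypothetical.

```python
from collections import defaultdict


def sub_paths(path, min_len=2):
    """Return every contiguous sub-path of `path` with at least `min_len` tokens."""
    n = len(path)
    return {tuple(path[i:j]) for i in range(n) for j in range(i + min_len, n + 1)}


def significant_patterns(dataset, min_len=2):
    """For each sequence, pick its best-scoring sub-path shared with other sequences.

    The score is a placeholder significance test (an assumption), not the
    test used in the application.
    """
    sets = [sub_paths(seq, min_len) for seq in dataset]

    # Index: sub-path -> indices of the sequences that contain it.
    owners = defaultdict(set)
    for idx, subs in enumerate(sets):
        for sp in subs:
            owners[sp].add(idx)

    result = []
    for idx, subs in enumerate(sets):
        best, best_score = None, 0
        for sp in sorted(subs):  # sorted for deterministic tie-breaking
            support = len(owners[sp] - {idx})  # other sequences sharing sp
            score = support * len(sp)
            if support > 0 and score > best_score:
                best, best_score = sp, score
        result.append(best)  # None when nothing overlaps
    return result


if __name__ == "__main__":
    corpus = [
        ["the", "cat", "sat", "down"],
        ["a", "cat", "sat", "up"],
        ["the", "dog", "ran"],
    ]
    print(significant_patterns(corpus))
```

Enumerating all sub-paths is quadratic in sequence length, which is acceptable for a small illustration but would need indexing or pruning on a real corpus.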

181. A method of generalizing a dataset having a plurality of sequences defined over a lexicon of tokens, the method comprising:
- searching over the dataset for similarity sets, each similarity set comprising a plurality of segments of size L having L−S common tokens and S uncommon tokens, each of said plurality of segments being a portion of a different sequence of the dataset; and
- defining a plurality of equivalence classes corresponding to uncommon tokens of at least one similarity set, thereby generalizing the dataset.
  - 182. The method of claim 181, wherein said definition of said plurality of equivalence classes comprises, for each segment of each similarity set:
    - extracting a significant pattern corresponding to a most significant partial overlap between said segment and other segments or combination of segments of said similarity set, thereby providing, for each similarity set, a plurality of significant patterns; and
    - using said plurality of significant patterns for classifying tokens of said similarity set into at least one equivalence class;

    thereby defining said plurality of equivalence classes.
  - 183. The method of claim 182, further comprising, prior to said search for said similarity sets:
    - extracting a plurality of significant patterns from the dataset, each significant pattern of said plurality of significant patterns corresponding to a most significant partial overlap between one sequence of the dataset and other sequences of the dataset; and
    - for each significant pattern of said plurality of significant patterns, grouping at least a few tokens of said significant pattern, thereby redefining the dataset.
  - 184. The method of claim 181, further comprising, for each similarity set having at least one equivalence class, grouping at least a few tokens of said similarity set thereby redefining the dataset.
  - 185. The method of claim 181, further comprising for each sequence, searching over said sequence for tokens being identified as members of previously defined equivalence classes, and attributing a respective equivalence class to each identified token, thereby generalizing said sequence, thereby further generalizing the dataset.
  - 186. The method of claim 183, further comprising constructing a graph having a plurality of paths representing the dataset, wherein each extraction of significant pattern is by searching for partial overlaps between paths of said graph.
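
As a rough illustration of claim 181, the sketch below finds similarity sets for the simplest case S = 1, where segments of size L agree on L−1 tokens and differ at one aligned position. Restricting to S = 1, aligning the uncommon position, and using the `SLOT` marker (which must not occur as a token in the data) are all assumptions made for the example; the function names are hypothetical.

```python
from collections import defaultdict

SLOT = "_"  # hypothetical placeholder marking the uncommon position


def similarity_sets(dataset, L):
    """Group length-L segments that agree on all but one aligned position (S = 1)."""
    groups = defaultdict(dict)  # template -> {sequence index: slot token}
    for idx, seq in enumerate(dataset):
        for start in range(len(seq) - L + 1):
            segment = seq[start:start + L]
            for slot in range(L):
                template = tuple(segment[:slot]) + (SLOT,) + tuple(segment[slot + 1:])
                # Keep one segment per sequence, so every member of a set
                # is a portion of a different sequence.
                groups[template].setdefault(idx, segment[slot])
    return groups


def equivalence_classes(dataset, L):
    """Map each template to the set of uncommon tokens that fill its slot."""
    classes = {}
    for template, fillers in similarity_sets(dataset, L).items():
        tokens = set(fillers.values())
        # Require at least two sequences and at least two distinct fillers.
        if len(fillers) >= 2 and len(tokens) >= 2:
            classes[template] = tokens
    return classes


if __name__ == "__main__":
    corpus = [
        ["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["a", "bird", "flew"],
    ]
    print(equivalence_classes(corpus, L=3))
```

Here the template `("the", "_", "sat")` collects the segments from the first two sequences, and the tokens filling its slot form one candidate equivalence class. Claim 182 additionally filters candidates through significant patterns, which this sketch omits.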

187. An apparatus for generalizing a dataset having a plurality of sequences defined over a lexicon of tokens, the apparatus comprising:
- (a) a searcher, for searching over the dataset for similarity sets, each similarity set comprising a plurality of segments of size L having L−S common tokens and S uncommon tokens, each of said plurality of segments being a portion of a different sequence of the dataset; and
- (b) a definition unit, for defining a plurality of equivalence classes corresponding to uncommon tokens of at least one similarity set, thereby generalizing the dataset.
  - 188. The apparatus of claim 187, further comprising an extractor, capable of extracting, for a given set of sequences, a significant pattern corresponding to a most significant partial overlap between one sequence of said set of sequences and other sequences of said set of sequences, thereby providing, for said given set of sequences, a plurality of significant patterns.
  - 189. The apparatus of claim 188, wherein said given set of sequences is a similarity set, hence said plurality of significant patterns corresponds to said similarity set.
  - 190. The apparatus of claim 188, wherein said classifier is designed for selecting a leading significant pattern of said similarity set, and defining uncommon tokens of segments corresponding to said leading significant pattern as an equivalence class.
  - 191. The apparatus of claim 188, wherein said given set of sequences is the dataset, hence said plurality of significant patterns corresponds to the dataset.
  - 192. The apparatus of claim 188, further comprising a first grouper for grouping at least a few tokens of each significant pattern of said plurality of significant patterns.
  - 193. The apparatus of claim 187, further comprising a second definition unit having a second searcher, for searching over each sequence for tokens being identified as members of previously defined equivalence classes, wherein said second definition unit is designed to attribute a respective equivalence class to each identified token.
  - 194. The apparatus of claim 188, further comprising a constructor, for constructing a graph having a plurality of paths representing the dataset.
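
The second definition unit of claim 193 (and the matching method step of claim 185) scans each sequence for tokens that already belong to an equivalence class and attributes the class to them. A minimal sketch follows, assuming classes are stored as a dict from a class label to a set of member tokens; the labels, the function name `generalize_sequence`, and the rule that the first class wins when a token belongs to several are all assumptions for illustration.

```python
def generalize_sequence(sequence, classes):
    """Replace each token that belongs to a known equivalence class with its label."""
    token_to_class = {}
    for label, members in classes.items():
        for token in members:
            # If a token appears in several classes, the first label wins
            # (an assumption; the claims do not address this case).
            token_to_class.setdefault(token, label)
    # Tokens outside every class are kept unchanged.
    return [token_to_class.get(token, token) for token in sequence]


if __name__ == "__main__":
    classes = {"E1": {"cat", "dog"}}
    print(generalize_sequence(["the", "dog", "sat"], classes))
```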

Current Assignee: Cornell Research Foundation Incorporated (Cornell University), Ramot at Tel Aviv University Limited (Tel Aviv University)
Original Assignee: Cornell Research Foundation Incorporated (Cornell University)
Inventors: Horn, David; Ruppin, Eytan; Edelman, Shimon; Solan, Tsach
Application Number: US10/566,480
Publication Number: US 20070055662A1
Field of Search
US Class Current: 1/1
CPC Class Codes: G06F 40/237 (Lexical tools)
