Method, apparatus, and system for clustering and classification

US 20060095521A1
Filed: 11/04/2004
Published: 05/04/2006
Est. Priority Date: 11/04/2004
Status: Active Grant

Abstract

The invention provides a method, apparatus and system for classifying and clustering electronic data streams such as email, images and sound files for identification, sorting and efficient storage. The inventive systems disclose labeling a document as belonging to a predefined class through computer methods that comprise the steps of identifying an electronic data stream using one or more learning machines and comparing the outputs from the machines to determine the label to associate with the data. The method further utilizes learning machines in combination with hashing schemes to cluster and classify documents. In one embodiment, hash apparatuses and methods taxonomize clusters. In yet another embodiment, clusters of documents utilize a geometric hash to contain the documents in a data corpus without the overhead of search and storage.

120 Claims

1. A computer method for labeling an electronic data stream as belonging to a predefined class comprising the steps of identifying an electronic data stream by one or more learning machines, comparing the outputs from the learning machines to determine the label to associate with the electronic data stream.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
- - 2. The method as in claim 1 wherein the electronic data stream includes an ambiguous class.
  - 3. The method as in claim 1 wherein a neural network processing results in identifying and classifying the electronic data stream.
  - 4. The method as in claim 1 wherein a support vector machine processing results in identifying and classifying the electronic data stream.
  - 5. The method as in claim 1 wherein a naïve Bayes processing results in identifying and classifying the electronic data stream.
  - 6. The method as in claim 1 wherein an outlier class is identified by an administrative function.
  - 7. The method as in claim 2 wherein a K-NN processes the ambiguous class providing for placement within a cluster of similar electronic data streams.
  - 8. The method as in claim 1 wherein the electronic data stream is a portion of a document.
  - 9. The method as in claim 8 wherein the document is an email.
  - 10. The method as in claim 1 wherein the electronic data stream is a portion of an image.
  - 11. The method as in claim 1 wherein the electronic data stream is a portion of a sound file.
  - 12. The method as in claim 1 wherein hash technology processing results in classifying the electronic data stream.
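
Claim 1 and its dependents describe several learning machines whose outputs are compared to pick a label. The following is a minimal sketch of that idea, assuming a simple majority vote and an "ambiguous" fallback on ties; the claim does not specify how outputs are compared, and the three toy machines below are hypothetical stand-ins for trained classifiers such as a neural network, a support vector machine, or naïve Bayes.

```python
from collections import Counter
from typing import Callable, Sequence

# Each "learning machine" is modeled as a callable mapping a data stream to a label.
Machine = Callable[[bytes], str]


def label_stream(stream: bytes, machines: Sequence[Machine],
                 ambiguous: str = "ambiguous") -> str:
    """Compare the outputs of several machines and return the agreed label.

    Majority voting is an assumption; the claim only says outputs are compared.
    """
    votes = Counter(machine(stream) for machine in machines)
    ranked = votes.most_common(2)
    if not ranked:
        return ambiguous
    # A tie between the two most common labels is reported as the ambiguous class.
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ambiguous
    return ranked[0][0]


# Toy stand-ins for trained classifiers (hypothetical rules, not real models).
def keyword_machine(s: bytes) -> str:
    return "spam" if b"free" in s.lower() else "ham"


def length_machine(s: bytes) -> str:
    return "spam" if len(s) < 40 else "ham"


def punctuation_machine(s: bytes) -> str:
    return "spam" if s.count(b"!") >= 2 else "ham"


machines = [keyword_machine, length_machine, punctuation_machine]
print(label_stream(b"FREE offer!! Click now", machines))  # prints "spam"
```

A stream that lands in the ambiguous class could then be handed to a second stage, such as the K-NN step named in claim 7.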

13. A computer method for detecting a document having identified attributes comprising:
- (a) converting a binary coded message into numeric values;
  
  (b) computing a hashing vector based upon the numeric values provided to a mathematical function;
  
  (c) comparing a difference between a hashing vector and a stored vector.
- View Dependent Claims (14, 15)
- - 14. The computer method as in claim 13 which further comprises appending a header to the message that indicates said differences.
  - 15. The computer method as in claim 13 wherein the numeric values consist of at least two bytes.
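
The three steps of claim 13, plus the header of claim 14 and the two-byte values of claim 15, can be sketched as follows. This is an illustration under stated assumptions: the bucket-count function stands in for the unspecified "mathematical function", the difference is taken as an L1 distance, and the header name `X-Hash-Difference` is hypothetical.

```python
from typing import List


def to_numeric(message: bytes) -> List[int]:
    """Read the message as 16-bit values built from overlapping two-byte pairs."""
    return [(message[i] << 8) | message[i + 1] for i in range(len(message) - 1)]


def hashing_vector(values: List[int], size: int = 16) -> List[int]:
    """Count values per bucket; an assumed stand-in for the claimed function."""
    vec = [0] * size
    for v in values:
        vec[v % size] += 1
    return vec


def difference(a: List[int], b: List[int]) -> int:
    """L1 distance between two hashing vectors (an assumed difference measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def tag(message: bytes, stored: List[int],
        header_name: str = "X-Hash-Difference") -> bytes:
    """Prepend a header reporting the difference from the stored vector."""
    d = difference(hashing_vector(to_numeric(message)), stored)
    return f"{header_name}: {d}\r\n".encode() + message


stored = hashing_vector(to_numeric(b"hello world"))
print(tag(b"hello there", stored))
```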

16. A computer method for detection of a document having identified attributes received over a communication medium comprising the steps of:
- (a) generating an archive of a document having identified attribute digests;
  
  (b) providing a first means for computing a digest of an email digest;
  
  (c) computing a measure of difference between said email digest and one or more documents having identified attribute digests stored in the archive of documents.

17. A computer method for determining the similarity of a first data object to a second data object, comprising the steps of:
- (a) parsing each data object into a sequence of symbols having numerical value;
  
  (b) computing a set of first digests based upon a mathematical function;
  
  (c) grouping similar sets of first digest in an archive for retrieval;
  
  (d) computing a new digest from a second a set of data object sequence of symbols having numerical value based upon the mathematical function;
  
  (e) comparing the new digest to one or more similar sets of first digest so as to determine the smallest difference between the new digest and a member of the first set of digest to thereby determine data similarity of the objects.

18. An apparatus for detection of a document having identified attributes comprising (a) a means to convert a binary coded message into a set of numeric values;
- (b) a means to compute a hashing vector based upon the numeric values provided to a mathematical function;
  
  (c) a means to compare a difference between the value of the hash vector to a stored vector or digest representing the stored vector;
  
  (d) a means to append a header to a spam message based upon the comparison.

19. An apparatus for detection of a document having identified attributes received over a communication medium comprising:
- (a) a means for generating an archive of a document having identified attributes digests;
  
  (b) a means for providing a first means for computing a digest of an email digest;
  
  (c) a means for computing a measure of difference between said email digest and one or more document having identified attributes digests stored in the archive of document having identified attributes digests.

20. An apparatus for determining the similarity of a first data object to a second data object comprising:
- (a) a means for parsing each data object into a sequence of symbols having numerical value;
  
  (b) a means for computing from the numerical value a set of first digests based upon a mathematical function;
  
  (c) a means for grouping similar sets of first digests in an archive for retrieval;
  
  (d) a means for computing a new digest from the second data object sequence of symbols having numerical value based upon the mathematical functions;
  
  (e) a means for comparing the new digest to one or more similar sets of first digests so as to determine the smallest difference between the new digest and a member of the first set of digests to thereby determine data similarity of the objects.

21. A computer method for comparing a plurality of documents comprising the steps of:
- (a) receiving a first document having coded elements into a random access memory;
  
  (b) converting the coded elements into a number between two limits;
  
  (c) loading a data register serially from the random access memory with at least two adjacent data elements from the document;
  
  (d) computing a vector corresponding to at least two associated adjacent data elements and a uniform filter;
  
  (e) loading the one data register serially from a means for storing with a next adjacent data element from the document;
  
  (f) computing a vector corresponding to at least two associated adjacent data elements and a uniform filter;
  
  (g) repeating the steps (e) through (f) until elements from the first document have a corresponding vector;
  
  (h) summing each associated vector element to form an associated hashing vector elements;
  
  comparing the hashing vector with an archive of hashing vectors to determine similarity.

22. A uniform filter set comprises a function of a random variable and a random matrix, such that input from a first electronic signal and a random function generator to the uniform filter produces an output that has an association to the first electronic signal input.
- View Dependent Claims (23, 24)
- - 23. The uniform filter in claim 22 wherein each filter in the set differs by a related seed used for the initialization of such a function.
  - 24. The uniform filter in claim 22 wherein each filter in the set utilizes a same threshold to determine the measure of statistical identity between signals within the same class.

25. A computer method comprising the step of detecting the presence of a document having identified attributes by utilizing a uniform filter to test whether the document is email within a defined statistical class.
- View Dependent Claims (26, 27, 28, 29)
- - 26. The computer method in claim 25, further comprising the steps of:
    - (a) utilizing the uniform filter to form a feature vector that represents one or more classes of email;
      
      (b) comparing the feature vector to feature vector samples of to determine whether the sample is similar to the received email.
  - 27. The computer method in claim 25, further comprising the steps of tagging an email detected as spam with a measure of its spamicity.
  - 28. The computer method in claim 25, further comprising the steps of utilizing the measure of its spamicity to isolate the email.
  - 29. The computer method in claim 28, further comprising the steps of isolating the email based upon preprogrammed rules.

30. A computer method comprising the steps of:
- (a) receiving a plurality of hashing vectors from a set of documents and storing said sample hashing vectors into a random access memory;
  
  (b) loading a data register with at least two adjacent data elements from a received document;
  
  (d) computing an email hashing vector utilizing a hash means;
  
  (e) and comparing the email hashing vector with the plurality of sampled hashing vectors.

31. A computer method comprising the steps of:
- (a) producing random matrices of numbers;
  
  (b) inputting the numbers into a set of filters;
  
  (c) inputting one or more data into one or more of the filters;
  
  (d) calculating a function of the random number and the data;
  
  (e) and summing the result.
- View Dependent Claims (32, 33, 34, 35, 36, 37, 38, 39, 40)
- - 32. The computer method in claim 31, wherein the matrices comprise separate matrices each having 256×256 elements.
  - 33. The computer method in claim 32, wherein the matrices comprise separate matrices each have 256×256 elements and entries in the range 0 through 255.
  - 34. The computer method in claim 33, wherein the matrices comprise separate matrices each having 256×256 elements and entries in the range 0 through 255, which are produced from a timestamp.
  - 35. The computer method in claim 34, wherein the matrices are formed from a pseudo-random number generator.
  - 36. The computer method in claim 35, wherein the pseudo-random number generator produces a sequence of entries of numbers in the range 0 to 255.
  - 37. The computer method in claim 36, wherein the pseudo-random number generator produces a sequence of entries of numbers in the range 0 to 255 base 10 that exceeds 33,000 places to the left and 33,000 to the right.
  - 38. The computer method in claim 37, wherein the pseudo-random number generator produces a sequence of entries of numbers in the range 0 to 255 base 10 utilizing the timestamp as input to create the sequence of random numbers.
  - 39. The computer method in claim 38, wherein the pseudo-random number generator produces a sequence of entries of numbers in the range 0 to 255 utilizing the timestamp as input to create the sequence of random numbers that have uniform distributions of values.
  - 40. The computer method in claim 31, wherein the sequence of random numbers that have uniform distributions of values are utilized to form the uniform filter.
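
Claims 31 through 40 describe filters built from 256×256 random matrices with entries in the range 0 through 255, generated by a pseudo-random number generator seeded from a timestamp. The sketch below is one possible reading, not the patented implementation: it assumes each adjacent byte pair of the data selects a matrix entry (the "function of the random number and the data") and that related seeds are formed as `timestamp + k`. Both choices are illustrative assumptions.

```python
import random
import time
from typing import List

Matrix = List[List[int]]


def make_matrix(seed: int, n: int = 256) -> Matrix:
    """Build an n-by-n matrix of uniformly distributed entries in 0..255."""
    rng = random.Random(seed)
    return [[rng.randrange(256) for _ in range(n)] for _ in range(n)]


def make_filters(timestamp: int, count: int = 4) -> List[Matrix]:
    """One matrix per filter; seeds are related through the timestamp (assumed scheme)."""
    return [make_matrix(timestamp + k) for k in range(count)]


def apply_filter(matrix: Matrix, data: bytes) -> int:
    """Look up the entry selected by each adjacent byte pair and sum the results."""
    return sum(matrix[a][b] for a, b in zip(data, data[1:]))


def feature_vector(filters: List[Matrix], data: bytes) -> List[int]:
    """One summed value per filter."""
    return [apply_filter(m, data) for m in filters]


filters = make_filters(int(time.time()))
print(feature_vector(filters, b"Buy now, limited offer"))
```

Because the matrices depend only on the seed, two parties that share the timestamp can regenerate identical filters instead of exchanging them.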

41. A computer method for detection of a document having identified attributes received over a communication medium comprising the step of dividing a space of feature vectors by choosing distinguishing points as centers of balls of radius r.
- View Dependent Claims (42)
- - 42. The computer method as in claim 41 wherein an oldest in time feature vector loses eligibility as a center of a cluster and is replaced by a newest feature vector.
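
Claims 41 and 42 can be pictured as a set of cluster centers, each owning a ball of radius r, where the oldest center is retired when a new one arrives. The following minimal sketch assumes Euclidean distance and a fixed cap on the number of centers; the class name `BallClusters` and the cap are hypothetical.

```python
import math
from collections import deque
from typing import Sequence, Tuple


class BallClusters:
    """Divide feature space into balls of a fixed radius around chosen centers."""

    def __init__(self, radius: float, max_centers: int) -> None:
        self.radius = radius
        # A bounded deque drops the oldest center when a new one is appended.
        self.centers = deque(maxlen=max_centers)

    def assign(self, vec: Sequence[float]) -> Tuple[float, ...]:
        """Return the center whose ball contains vec, or make vec a new center."""
        for center in self.centers:
            if math.dist(center, vec) <= self.radius:
                return center
        new_center = tuple(vec)
        self.centers.append(new_center)
        return new_center


clusters = BallClusters(radius=1.0, max_centers=2)
print(clusters.assign((0.0, 0.0)))   # becomes the first center
print(clusters.assign((0.5, 0.0)))   # falls inside the first ball
```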

43. A computer method for detecting a document having identified attributes comprising the steps of:
- (a) inputting numeric values to a means for generating a hash;
  
  (b) inputting the random numbers to the means for generating a hash;
  
  (c) utilizing the means for generating a hash to compute a hashing vector based upon the inputs provided and a mathematical function, wherein the hashing vector elements are tested against one or more threshold.
- View Dependent Claims (44, 45)
- - 44. The computer method as in claim 43 comprising the further step of testing a first feature vector element (a) and if greater than a first preselected number or less than a second preselected number, then setting the state of a first element of an associated hash vector equal to one;
    - (b) or otherwise setting the state of the first element of an associated hash vector equal to zero; and
      
      (c) repeating steps ‘a’ through ‘b’ for each of the elements of the feature vector and associated hash vector, until all feature vector elements are tested.
  - 45. The computer method as in claim 43 comprising the further steps of (a) testing whether the first feature vector element is greater than a quantizing element;
    - (b) and setting an associated bit mask to one otherwise;
      
      (c) or if the element is less than or equal to the quantizing element, then setting the associated bit mask to zero.
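
The threshold tests in claims 44 and 45 can be sketched as two small functions. This is one reading of the claim language, assuming that a hash bit is set when a feature element falls outside a band bounded by two preselected numbers and that a mask bit is set when the element exceeds the quantizing element; the parameter names are hypothetical.

```python
from typing import List, Sequence


def hash_bits(features: Sequence[float], upper: float, lower: float) -> List[int]:
    """Claim 44 reading: bit is 1 if the element is above upper or below lower, else 0."""
    return [1 if (x > upper or x < lower) else 0 for x in features]


def mask_bits(features: Sequence[float], quantizer: float) -> List[int]:
    """Claim 45 reading: bit is 1 if the element exceeds the quantizer, else 0."""
    return [1 if x > quantizer else 0 for x in features]


features = [5, 50, 95]
print(hash_bits(features, upper=90, lower=10))  # [1, 0, 1]
print(mask_bits(features, quantizer=50))        # [0, 0, 1]
```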

46. A process for detecting a pattern in an electronic signal comprising:
- (a) dividing the pattern signal into periods having an interval;
  
  (b) inputting one or more periods of the signal into one or more means for generating a hash;
  
  (c) inputting a random signal having periods with an interval to the one or more filters;
  
  (c) computing a feature signal by utilizing the filter to transform each pattern signal by period by a function of each random signal;
  
  (d) creating a hash pattern by comparing each feature signal time period n to a first selected one or more statistics of the pattern;
  
  (e) creating a mask pattern by comparing each feature signal period to a second selected one or more statistics of the pattern;
  
  (f) combining the hash pattern and the bit mask pattern and comparing the result to one or more patterns based upon the pattern to be detected; and
  
  if a match exists then said pattern is detected.
- View Dependent Claims (47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64)
- - 47. The process as in claim 46, wherein the electronic signal is analog.
  - 48. The process as in claim 46, wherein the electronic signal is digital.
  - 49. The process as in claim 46 wherein the electronic signal is a text message.
  - 50. The process as in claim 46, wherein the electronic signal is a voice message.
  - 51. The process as in claim 46, wherein the electronic signal is of an image.
  - 52. The process as in claim 46, wherein the electronic signal is a composite signal containing one or more of: video, audio and control messages.
  - 53. The process as in claim 46, wherein the random signal is digital.
  - 54. The process as in claim 46, wherein the random signal is analog.
  - 55. The process as in claim 46, wherein the feature signal is digital.
  - 56. The process as in claim 46, wherein the feature signal is analog.
  - 57. The process as in claim 46, wherein the filter to transform is a non linear transformation.
  - 58. The process as in claim 46, wherein the filter to transform is a linear transformation.
  - 59. The process as in claim 46, wherein the hash pattern is analog.
  - 60. The process as in claim 46, wherein the hash pattern is digital.
  - 61. The process as in claim 46, wherein the bit mask pattern is digital.
  - 62. The process as in claim 46, wherein the bit mask pattern is analog.
  - 63. The process as in claim 46, further comprising appending a header to the electronic signal to indicate the match.
  - 64. The process as in claim 46, wherein the electronic signal is a binary coded message converted into numeric values consists of two bytes.

65. A system for detecting a pattern in an electronic signal comprising:
- (a) a means for dividing the pattern signal into periods having an interval;
  
  (b) a means for inputting one or more divided periods of the signal into one or more filters;
  
  (c) a means for inputting a random signal having one or more periods with an interval to the one or more filters;
  
  (c) a means for computing a feature signal by utilizing the filter to transform each pattern signal, by period, as a function of each random signal;
  
  (d) a means for creating a hash pattern by comparing each feature signal period to a first selected one or more statistics of the pattern;
  
  (e) a means for creating a bit mask pattern by comparing each feature signal period to a second selected one or more statistics of the pattern;
  
  (f) a means for combining the hash pattern and the bit mask pattern and comparing the result to one or more patterns based upon the pattern to be detected; and
  
  if a match exists then said pattern is detected.
- View Dependent Claims (66, 67)
- - 66. The system as in claim 65, wherein the signals are analog.
  - 67. The system as in claim 65, wherein the signals are digital.

68. A computer method for detecting transmission of a cluster of email, comprising the steps of:
- (a) receiving one or more email messages;
  
  (b) generating hash values, based on one or more portions of the plurality of email messages;
  
  (c) generating an associated bit mask value based on one or more portions of the plurality of email messages;
  
  (d) determining whether the generated hash values and the associated bit mask values match corresponding hash values and associated bit mask values related to one or more prior email messages in the cluster.
- View Dependent Claims (69, 70)
- - 69. The method of claim 68, further comprising generating a salience score for the plurality of email messages based on a result of the determination of whether the generated hash values and the associated bit mask values match corresponding hash values and the associated bit mask values related to prior email messages of the class.
  - 70. The method of claim 69 comprising the further step of taking remedial action when the one of the plurality of email messages is a potentially unwanted email.
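
Claims 68 and 69 compare hash values and bit masks from incoming email against those of prior messages in a cluster and derive a salience score. The sketch below assumes signatures are `(hash, mask)` integer pairs, that two signatures match when their masks are equal and their hashes agree on the masked bits, and that salience is the fraction of portions that match. All three are illustrative assumptions; the claims do not define the matching rule or the score.

```python
from typing import Iterable, List, Tuple

Signature = Tuple[int, int]  # (hash value, bit mask), both as bit patterns


def matches(sig: Signature, prior: Signature) -> bool:
    """Match when masks are equal and hashes agree wherever the mask is set."""
    (h, m), (ph, pm) = sig, prior
    return m == pm and (h & m) == (ph & pm)


def salience(sigs: List[Signature], cluster: Iterable[Signature]) -> float:
    """Fraction of message portions matching some prior signature in the cluster."""
    if not sigs:
        return 0.0
    priors = list(cluster)
    hits = sum(any(matches(s, p) for p in priors) for s in sigs)
    return hits / len(sigs)


cluster = [(0b1010, 0b1110), (0b0001, 0b0001)]
incoming = [(0b1011, 0b1110), (0b0100, 0b1111)]
print(salience(incoming, cluster))  # 0.5
```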

71. A system for detecting transmission of potentially unwanted e-mails, comprising:
- means for observing a plurality of e-mails;
  
  a means for creating a hashing vector for one or more portions of the plurality of emails, a means to generate hash values and a means to generate bit masks and a means for determining whether the generated hash values and associated bit mask values match hash values and associated bit mask values related to prior emails; and
  
  a means for determining that the plurality of emails are potentially unwanted e-mails.

72. A computer method for improving the accuracy of text classification by operating within an unsure region comprising the steps of:
- utilizing a K-NN processor to determine the document having the greatest similarity to the text.
- View Dependent Claims (73, 74)
- - 73. The computer method in claim 72 for improving the accuracy of text classification by operating within an unsure region further comprising the step of:
    - utilizing a hash generating means to determine the cluster having the greatest similarity to the text.
  - 74. The computer method in claim 72 improving the accuracy of text classification by operating within an unsure region further comprising the step of:
    - utilizing a stackable hash process to determine the cluster having the greatest similarity to the text.
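
Claim 72 applies a K-NN step only inside an "unsure region". A minimal sketch of that control flow follows, assuming the primary classifier emits a score in [0, 1], that the unsure region is the band between two thresholds, and that similarity is Euclidean distance. The thresholds, labels and function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], str]  # (feature vector, label)


def knn_label(vec: Sequence[float], examples: List[Example], k: int = 3) -> str:
    """Label vec by majority vote among its k nearest labeled examples."""
    nearest = sorted(examples, key=lambda ex: math.dist(vec, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def classify(vec: Sequence[float], score: float, examples: List[Example],
             low: float = 0.4, high: float = 0.6) -> str:
    """Trust the primary score outside the unsure band; use K-NN inside it."""
    if score >= high:
        return "spam"
    if score <= low:
        return "ham"
    return knn_label(vec, examples)


examples = [((0, 0), "ham"), ((0, 1), "ham"),
            ((5, 5), "spam"), ((5, 6), "spam"), ((6, 5), "spam")]
print(classify((5, 5.5), score=0.5, examples=examples))  # spam
```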

75. A computer method for storing email messages comprising the steps of utilizing a stackable hash process to determine the cluster wherein said cluster determines a delta-storage of the email.
- View Dependent Claims (76)
- - 76. The computer method in claim 75 for storing email messages further comprising the steps of utilizing a uniform filter process to determine the cluster wherein said cluster enables a delta-storage of the email.

77. A method for retrieving email messages comprising the steps of:
- utilizing a stackable hash process to determine the cluster wherein said cluster determines a location in memory.

78. A method for storing email messages comprising the steps of utilizing hash generating means to determine the cluster wherein said cluster determines a location in memory.

79. A method for creating an accumulation of documents stored as a cluster comprising the steps of utilizing a process to create a hashing vector to determine whether to add a document to a cluster.
- View Dependent Claims (81, 83, 85, 86, 87, 88, 89, 90, 94, 96, 99, 100, 101, 102, 103, 104, 106, 107, 108, 110, 111)
- - 81. The computer method for creating an accumulation of documents stored as a cluster as in claim 79 adjusted by an aging function.
  - 83. The computer method for creating an accumulation of documents stored as a cluster as in claim 79 further including a mask to identify document clusters.
  - 85. The computer method as in claim 79 further comprising auto-labeling emails according to instantiation of labels on the basis of whether these labels are pre-defined or user-defined.
  - 86. The computer method as in claim 79 further comprising determining whether text inputs are related or not related, by assigning a score relating to this determination.
  - 87. The computer method as in claim 79 further comprising forming email clustering and displaying in graphical form long hash and small hash.
  - 88. The computer method as in claim 79 further comprising forming email clustering utilizing a small hash threshold.
  - 89. The computer method as in claim 79 further comprising forming email clustering utilizing small hash length.
  - 90. The computer method as in claim 79 further comprising forming email clustering utilizing small hash average.
  - 94. The computer method as in claim 79 further to include the step of ascribing retention rates to clusters of email based upon clustering.
  - 96. The computer method as in claim 79 further to include the step of creating hash methods to minimize computation time for clustering and maximize accuracy of classification.
  - 99. The computer method as in claim 79 further to include the step of using a clustering method for the purpose of identifying emails containing one of a malicious code, a phishing, a virus or a worm.
  - 100. The computer method as in claim 79 further to include the step of using a clustering method for the purpose of identifying an email for purposes of delta storage.
  - 101. The computer method as in claim 79 further to include the step of using a mask on a time frame for the purpose of identifying a cluster member.
  - 102. The computer method as in claim 79 further to include the step of using an aging function to identify a cluster member.
  - 103. The computer method as in claim 79 further to include the step of using a cluster to populate a central repository of hash data.
  - 104. The computer method as in claim 79 further to include the step of populating a hash table for anonymous sharing of information between organizations.
  - 106. The computer method as in claim 79 further to include the step of selecting a hash function with small collisions.
  - 107. The computer method as in claim 79 further to include the step of sorting by class an indexing technique to route queries.
  - 108. The computer method as in claim 79 further to include the step of using classes to route queries as part of a cluster.
  - 110. The computer method as in claim 79 further to include the step of choosing a prime number P for the purpose of determining equivalence classes (mod P) in a stackable hash method.
  - 111. The computer method as in claim 79 further to include the step of using a brightness and an aging function in a hash method for creating a cluster.

80. A computer method for creating an accumulation of documents stored as a set of clusters comprising the steps of utilizing a stackable hash to determine whether to add a document to the set of clusters.
- View Dependent Claims (82, 84)
- - 82. The computer method for creating an accumulation of documents stored as a cluster as in claim 80 adjusted by an aging function.
  - 84. The computer method for creating an accumulation of documents stored as a cluster as in claim 80 further including a mask to identify document clusters.

91. A computer method of combining SVM, NB and NN processes to optimize the machine-learning utility of text-classification.
- View Dependent Claims (92, 95)
- - 92. The computer method as in claim 91 further to include the step of combining K-NN processes to optimize the machine-learning utility of text-classification.
  - 95. The computer method as in claim 91 further to include the step of using user-defined instantiation of labels to initialize or train a text-classifier.

93. A computer method of combining naïve-Bayes and K-NN processes to optimize the machine-learning utility of text-classification.

97. A computer method of using the delta storage method comprising the steps of:
- (i) creating clusters;
  
  (ii) sorting clusters;
  
  (iii) labeling to clusters;
  
  (iv) identifying a shortest email of each cluster as representative;
  
  (v) calculating a binary differential function on all other members of cluster;
  
  (vi) tagging compressed emails within the clusters.
- View Dependent Claims (98, 105, 109)
- - 98. The computer method as in claim 97 further to include the step of using the delta storage method to compress files.
  - 105. The computer method as in claim 97 further to include the step of compressing email corpus by deleting redundant information.
  - 109. The computer method as in claim 97 further to include the step of applying a classification to one or more objects contained in a data corpus for the purpose of information retrieval.
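
Steps (iv) and (v) of claim 97 pick the shortest email of a cluster as its representative and store the other members as binary differences against it. The sketch below assumes the cluster is already formed and uses Python's `difflib.SequenceMatcher` as a stand-in for the unspecified "binary differential function"; the copy/data operation format is hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def make_delta(base: bytes, target: bytes) -> List[tuple]:
    """Encode target as copy-from-base ranges plus literal data."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:  # "replace" or "insert": keep the new bytes
            ops.append(("data", target[j1:j2]))
        # "delete" contributes nothing to the target
    return ops


def apply_delta(base: bytes, ops: List[tuple]) -> bytes:
    """Rebuild the original message from the representative and its delta."""
    out = bytearray()
    for op in ops:
        out += base[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)


def store_cluster(emails: List[bytes]) -> Tuple[int, Dict[int, List[tuple]]]:
    """Pick the shortest email as representative; delta-encode the rest."""
    rep_index = min(range(len(emails)), key=lambda i: len(emails[i]))
    rep = emails[rep_index]
    deltas = {i: make_delta(rep, e) for i, e in enumerate(emails) if i != rep_index}
    return rep_index, deltas


emails = [b"Dear Bob, you won a prize! Click here.",
          b"Dear Al, you won a prize!",
          b"Dear Carol, you won a prize! Click here now."]
rep_index, deltas = store_cluster(emails)
print(rep_index, all(apply_delta(emails[rep_index], d) == emails[i]
                     for i, d in deltas.items()))
```

Only the representative is stored in full; each other member is recovered by replaying its delta against it.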

112. A computer method for labeling an electronic data stream as belonging to a predefined class comprising the steps of identifying an electronic data stream by one or more learning machines, comparing the outputs from the learning machines to determine the label to associate with the electronic data stream, pre-defining a label for email users by processing and analyzing aggregate data compiled from an email content and label.
- View Dependent Claims (116, 117, 118, 119, 120)
- - 116. The computer method as in claim 112 further to include the step of choosing whether to use K-NN or neural networks based on one of: a pre-processing time, an execution time, a level of accuracy or level of stability.
  - 117. The computer method as in claim 112 further to include the step of deciding to use a uniform filter method or a stackable hash method to cluster an email in the unsure region.
  - 118. The computer method as in claim 112 further to include the step of using the stackable hash to compare electronic documents.
  - 119. The computer method as in claim 112 further to include the step of pre-processing text for clustering so as to map text inputs into a vector space for analysis by a learning machine.
  - 120. The computer method as in claim 112 further to include the step of using mixed distributions to evaluate importance and simultaneous optimization of pre-processing time, execution time, accuracy and stability for choice of classification methods from one of: SVM, NN, and NB.

113. A computer method for labeling an electronic data stream as belonging to a predefined class comprising the steps of identifying an electronic data stream by one or more learning machines, comparing the outputs from the learning machines to determine the label to associate with the electronic data stream, deciding whether to use a uniform filter or a stackable hash to determine a cluster for the electronic data stream.

114. A computer method for labeling an electronic data stream as belonging to a predefined class comprising the steps of identifying an electronic data stream by one or more learning machines, comparing the outputs from the learning machines to determine the label to associate with the electronic data stream, deciding whether to use a uniform filter or a stackable hash to determine a cluster for a document having identified attributes email.

115. A computer method for labeling an electronic data stream as belonging to a predefined class comprising the steps of identifying an electronic data stream by one or more learning machines, comparing the outputs from the learning machines to determine the label to associate with the electronic data stream, determining an acceptable level of accuracy after use of a K-NN methods to divide space into one or more classes.

Current Assignee: Sysxnet Ltd.
Original Assignee: Trustwave Holdings Incorporated (Singapore Telecommunications Limited)
Inventors: Patinkin, Seth
Granted Patent: US 7,574,409 B2
US Class Current: 709/206
CPC Class Codes:
G06Q 10/107 Computer-aided management o...
H04L 51/212 using filtering or selectiv...
