Method, apparatus, and system for clustering and classification
Abstract
The invention provides a method, apparatus and system for classifying and clustering electronic data streams such as email, images and sound files for identification, sorting and efficient storage. The inventive systems disclose labeling a document as belonging to a predefined class through computer methods that comprise the steps of identifying an electronic data stream using one or more learning machines and comparing the outputs from the machines to determine the label to associate with the data. The method further utilizes learning machines in combination with hashing schemes to cluster and classify documents. In one embodiment hash apparatuses and methods taxonomize clusters. In yet another embodiment, clusters of documents utilize a geometric hash to contain the documents in a data corpus without the overhead of search and storage.
120 Claims
- 1. A computer method for labeling an electronic data stream as belonging to a predefined class comprising the steps of identifying an electronic data stream by one or more learning machines, comparing the outputs from the learning machines to determine the label to associate with the electronic data stream.
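As a rough illustration of claim 1, the sketch below labels a stream by asking several learning machines for a label and comparing their outputs. Majority voting is an assumption made here for illustration (the claim only says the outputs are compared), and the three toy machines and the `label_stream` name are hypothetical stand-ins for trained models.

```python
from collections import Counter

def label_stream(stream, machines):
    """Collect one label per learning machine and return the most common one."""
    votes = Counter(machine(stream) for machine in machines)
    # most_common(1) returns [(label, count)]; ties go to the label seen first.
    label, _count = votes.most_common(1)[0]
    return label

# Hypothetical toy machines; real ones would be trained classifiers.
def keyword_machine(text):
    return "spam" if "free money" in text.lower() else "ham"

def length_machine(text):
    return "spam" if len(text) > 200 else "ham"

def shouting_machine(text):
    return "spam" if text.isupper() else "ham"

MACHINES = [keyword_machine, length_machine, shouting_machine]
```

Calling `label_stream("FREE MONEY NOW", MACHINES)` collects two "spam" votes and one "ham" vote, so the stream is labeled "spam". Other comparison rules, such as weighted votes or a veto, would fit the same structure.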
- 13. A computer method for detecting a document having identified attributes comprising:
  (a) converting a binary coded message into numeric values;
  (b) computing a hashing vector based upon the numeric values provided to a mathematical function;
  (c) comparing a difference between a hashing vector and a stored vector.
  Dependent claims: 14, 15.
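The three steps of claim 13 can be sketched as follows. The claim does not name the mathematical function, so this example assumes one: each byte value selects a fixed pseudo-random weight vector, and the hashing vector is the average of the selected vectors. The names `make_table`, `hashing_vector` and `difference`, the dimension of 8, and the use of Euclidean distance are all illustrative assumptions.

```python
import math
import random

def to_numeric(message: bytes):
    """Step (a): a bytes object already iterates as integers in 0..255."""
    return list(message)

def make_table(dim=8, seed=1):
    """Hypothetical fixed table: one pseudo-random weight vector per byte value."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(256)]

def hashing_vector(values, table):
    """Step (b): average the weight vectors selected by the numeric values."""
    dim = len(table[0])
    if not values:
        return [0.0] * dim
    return [sum(table[v][j] for v in values) / len(values) for j in range(dim)]

def difference(vector, stored):
    """Step (c): Euclidean distance between a new vector and a stored one."""
    return math.dist(vector, stored)
```

With this choice an identical message reproduces the stored vector exactly, and a message that differs by a single byte stays close to it, so a small distance suggests a near-duplicate.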
- 16. A computer method for detection of a document having identified attributes received over a communication medium comprising the steps of:
  (a) generating an archive of a document having identified attribute digests;
  (b) providing a first means for computing a digest of an email digest;
  (c) computing a measure of difference between said email digest and one or more documents having identified attribute digests stored in the archive of documents.
- 17. A computer method for determining the similarity of a first data object to a second data object, comprising the steps of:
  (a) parsing each data object into a sequence of symbols having numerical value;
  (b) computing a set of first digests based upon a mathematical function;
  (c) grouping similar sets of first digest in an archive for retrieval;
  (d) computing a new digest from a second a set of data object sequence of symbols having numerical value based upon the mathematical function;
  (e) comparing the new digest to one or more similar sets of first digest so as to determine the smallest difference between the new digest and a member of the first set of digest to thereby determine data similarity of the objects.
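A minimal sketch of the flow in claim 17, under stated assumptions: the digest is a normalized histogram of symbol values folded into 16 buckets, the archive is a dictionary from a group name to its digests, and the difference is the L1 distance. None of these choices comes from the claim, which leaves the mathematical function open; `parse`, `digest` and `closest` are hypothetical names.

```python
def parse(obj: str):
    """Step (a): turn a data object into a sequence of numeric symbols."""
    return [ord(ch) for ch in obj]

def digest(symbols, buckets=16):
    """Steps (b) and (d): normalized histogram of symbol values modulo `buckets`."""
    counts = [0] * buckets
    for s in symbols:
        counts[s % buckets] += 1
    total = max(len(symbols), 1)
    return tuple(c / total for c in counts)

def l1(a, b):
    """Sum of absolute component differences between two digests."""
    return sum(abs(x - y) for x, y in zip(a, b))

def closest(new_digest, archive):
    """Step (e): return (group, distance) for the nearest archived digest."""
    candidates = (
        (group, l1(new_digest, stored))
        for group, digests in archive.items()
        for stored in digests
    )
    return min(candidates, key=lambda pair: pair[1])

# Step (c): similar first digests grouped in an archive for retrieval.
ARCHIVE = {
    "invoice": [digest(parse("invoice number 1234 due now"))],
    "greeting": [digest(parse("hello hello how are you"))],
}
```

A new object such as "invoice number 9876 due now" differs from the archived invoice only in its digits, so `closest` returns the "invoice" group with a small distance.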
- 18. An apparatus for detection of a document having identified attributes comprising
  (a) a means to convert a binary coded message into a set of numeric values;
  (b) a means to compute a hashing vector based upon the numeric values provided to a mathematical function;
  (c) a means to compare a difference between the value of the hash vector to a stored vector or digest representing the stored vector;
  (d) a means to append a header to a spam message based upon the comparison.
- 19. An apparatus for detection of a document having identified attributes received over a communication medium comprising:
  (a) a means for generating an archive of a document having identified attributes digests;
  (b) a means for providing a first means for computing a digest of an email digest;
  (c) a means for computing a measure of difference between said email digest and one or more document having identified attributes digests stored in the archive of document having identified attributes digests.
- 20. An apparatus for determining the similarity of a first data object to a second data object comprising:
  (a) a means for parsing each data object into a sequence of symbols having numerical value;
  (b) a means for computing from the numerical value a set of first digests based upon a mathematical function;
  (c) a means for grouping similar sets of first digests in an archive for retrieval;
  (d) a means for computing a new digest from the second data object sequence of symbols having numerical value based upon the mathematical functions;
  (e) a means for comparing the new digest to one or more similar sets of first digests so as to determine the smallest difference between the new digest and a member of the first set of digests to thereby determine data similarity of the objects.
- 21. A computer method for comparing a plurality of documents comprising the steps of:
  (a) receiving a first document having coded elements into a random access memory;
  (b) converting the coded elements into a number between two limits;
  (c) loading a data register serially from the random access memory with at least two adjacent data elements from the document;
  (d) computing a vector corresponding to at least two associated adjacent data elements and a uniform filter;
  (e) loading the one data register serially from a means for storing with a next adjacent data element from the document;
  (f) computing a vector corresponding to at least two associated adjacent data elements and a uniform filter;
  (g) repeating the steps (e) through (f) until elements from the first document have a corresponding vector;
  (h) summing each associated vector element to form an associated hashing vector elements;
  comparing the hashing vector with an archive of hashing vectors to determine similarity.
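The sliding-window procedure of claim 21 might look like the sketch below. The "uniform filter" is modeled here as a fixed random matrix followed by a sine nonlinearity; that is an assumption for illustration, not the filter defined in the specification. The window width of two elements, the vector dimension of 8, and the function names are likewise hypothetical.

```python
import math
import random

def normalise(document):
    """Step (b): convert each coded element to a number in [0, 1)."""
    return [(ord(ch) % 256) / 256.0 for ch in document]

def make_filter(dim=8, window=2, seed=42):
    """Hypothetical 'uniform filter': a fixed random matrix of shape dim x window."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(window)] for _ in range(dim)]

def hashing_vector(document, filt):
    """Steps (c)-(h): slide over adjacent elements, filter each window, sum the vectors."""
    values = normalise(document)
    window = len(filt[0])
    total = [0.0] * len(filt)
    for start in range(len(values) - window + 1):
        chunk = values[start:start + window]
        for j, row in enumerate(filt):
            projection = sum(w * x for w, x in zip(row, chunk))
            # Assumed nonlinearity, so each window contributes more than a plain sum.
            total[j] += math.sin(2.0 * math.pi * projection)
    return total

def nearest(vector, archive):
    """Final step: index and distance of the closest archived hashing vector."""
    distances = [math.dist(vector, stored) for stored in archive]
    best = min(range(len(archive)), key=distances.__getitem__)
    return best, distances[best]
```

Because every window is summed into the same vector, the result has a fixed length regardless of document size, which is what makes comparison against an archive of stored vectors straightforward.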
- 22. A uniform filter set comprises a function of a random variable and a random matrix, such that input from a first electronic signal and a random function generator to the uniform filter produces an output that has an association to the first electronic signal input.
- 25. A computer method comprising the step of detecting the presence of a document having identified attributes by utilizing a uniform filter to test whether the document is email within a defined statistical class.
- 30. A computer method comprising the steps of:
  (a) receiving a plurality of hashing vectors from a set of documents and storing said sample hashing vectors into a random access memory;
  (b) loading a data register with at least two adjacent data elements from a received document;
  (d) computing an email hashing vector utilizing a hash means;
  (e) and comparing the email hashing vector with the plurality of sampled hashing vectors.
- 31. A computer method comprising the steps of:
  (a) producing random matrices of numbers;
  (b) inputting the numbers into a set of filters;
  (c) inputting one or more data into one or more of the filters;
  (d) calculating a function of the random number and the data;
  (e) and summing the result.
  Dependent claims: 32, 33, 34, 35, 36, 37, 38, 39, 40.
- 41. A computer method for detection of a document having identified attributes received over a communication medium comprising the step of dividing a space of feature vectors by choosing distinguishing points as centers of balls of radius r.
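One way to picture claim 41 is a greedy cover of the feature space: a point that lies within radius r of an existing center joins that ball, and otherwise becomes a new center. How the distinguishing points are chosen is not stated in the claim, so the first-come rule below is an assumption, as is the name `cover_with_balls`.

```python
import math

def cover_with_balls(points, r):
    """Assign each feature vector to a ball of radius r, creating centers as needed."""
    centers = []
    assignment = []
    for p in points:
        for idx, c in enumerate(centers):
            if math.dist(p, c) <= r:
                assignment.append(idx)
                break
        else:
            # No existing ball contains p, so p becomes a new center.
            centers.append(p)
            assignment.append(len(centers) - 1)
    return centers, assignment
```

For the points (0, 0), (0.1, 0), (5, 5), (5.2, 5.1) and (0, 0.2) with r = 1, this yields two centers, (0, 0) and (5, 5), and the assignment [0, 0, 1, 1, 0].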
- 43. A computer method for detecting a document having identified attributes comprising the steps of:
  (a) inputting numeric values to a means for generating a hash;
  (b) inputting the random numbers to the means for generating a hash;
  (c) utilizing the means for generating a hash to compute a hashing vector based upon the inputs provided and a mathematical function, wherein the hashing vector elements are tested against one or more threshold.
  Dependent claims: 44, 45.
- 46. A process for detecting a pattern in an electronic signal comprising:
  (a) dividing the pattern signal into periods having an interval;
  (b) inputting one or more periods of the signal into one or more means for generating a hash;
  (c) inputting a random signal having periods with an interval to the one or more filters;
  (c) computing a feature signal by utilizing the filter to transform each pattern signal by period by a function of each random signal;
  (d) creating a hash pattern by comparing each feature signal time period n to a first selected one or more statistics of the pattern;
  (e) creating a mask pattern by comparing each feature signal period to a second selected one or more statistics of the pattern;
  (f) combining the hash pattern and the bit mask pattern and comparing the result to one or more patterns based upon the pattern to be detected; and
  if a match exists then said pattern is detected.
  Dependent claims: 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64.
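The hash-plus-mask idea in claim 46 can be sketched as below. Several choices are assumptions made for illustration: the feature of each period is a dot product with a seeded pseudo-random signal, the first statistic is the mean of the features, the second is a margin of half a standard deviation around that mean, and a match means the hash bits agree wherever both masks are set. The function names are hypothetical.

```python
import random
import statistics

def feature_signal(signal, period, seed=7):
    """Split the signal into periods and combine each with a pseudo-random signal."""
    rng = random.Random(seed)
    features = []
    for start in range(0, len(signal) - period + 1, period):
        chunk = signal[start:start + period]
        noise = [rng.uniform(-1.0, 1.0) for _ in chunk]
        features.append(sum(x * n for x, n in zip(chunk, noise)))
    return features

def hash_and_mask(features, margin=0.5):
    """Hash bit: feature above the mean. Mask bit: feature far enough from the mean."""
    mean = statistics.fmean(features)
    spread = statistics.pstdev(features)
    hash_bits = [1 if f > mean else 0 for f in features]
    mask_bits = [1 if abs(f - mean) > margin * spread else 0 for f in features]
    return hash_bits, mask_bits

def matches(hash_bits, mask_bits, stored_hash, stored_mask):
    """True if the hashes agree on every position that both masks mark as reliable."""
    return all(
        h == s
        for h, m, s, sm in zip(hash_bits, mask_bits, stored_hash, stored_mask)
        if m and sm
    )
```

The mask lets the comparison ignore bits whose feature value sits close to the threshold, since small perturbations of the signal are most likely to flip exactly those bits.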
- 65. A system for detecting a pattern in an electronic signal comprising:
  (a) a means for dividing the pattern signal into periods having an interval;
  (b) a means for inputting one or more divided periods of the signal into one or more filters;
  (c) a means for inputting a random signal having one or more periods with an interval to the one or more filters;
  (c) a means for computing a feature signal by utilizing the filter to transform each pattern signal, by period, as a function of each random signal;
  (d) a means for creating a hash pattern by comparing each feature signal period to a first selected one or more statistics of the pattern;
  (e) a means for creating a bit mask pattern by comparing each feature signal period to a second selected one or more statistics of the pattern;
  (f) a means for combining the hash pattern and the bit mask pattern and comparing the result to one or more patterns based upon the pattern to be detected; and
  if a match exists then said pattern is detected.
  Dependent claims: 66, 67.
- 68. A computer method for detecting transmission of a cluster of email, comprising the steps of:
  (a) receiving one or more email messages;
  (b) generating hash values, based on one or more portions of the plurality of email messages;
  (c) generating an associated bit mask value based on one or more portions of the plurality of email messages;
  (d) determining whether the generated hash values and the associated bit mask values match corresponding hash values and associated bit mask values related to one or more prior email messages in the cluster.
  Dependent claims: 69, 70.
- 71. A system for detecting transmission of potentially unwanted e-mails, comprising:
  means for observing a plurality of e-mails;
  a means for creating a hashing vector for one or more portions of the plurality of emails, a means to generate hash values and a means to generate bit masks and a means for determining whether the generated hash values and associated bit mask values match hash values and associated bit mask values related to prior emails; and
  a means for determining that the plurality of emails are potentially unwanted e-mails.
- 72. A computer method for improving the accuracy of text classification by operating within an unsure region comprising the steps of:
  utilizing a K-NN processor to determine the document having the greatest similarity to the text.
  Dependent claims: 73, 74.
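A hedged sketch of claim 72: a primary classifier supplies a score, and only when that score falls inside an unsure band does a K-NN step consult the most similar labeled documents. The band limits of 0.4 and 0.6, the Jaccard word-overlap similarity, and the "spam"/"ham" labels are assumptions chosen for illustration.

```python
from collections import Counter

def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def knn_label(text, labelled, k=3):
    """Vote among the k labeled documents most similar to the text."""
    ranked = sorted(labelled, key=lambda pair: jaccard(text, pair[0]), reverse=True)
    votes = Counter(label for _doc, label in ranked[:k])
    return votes.most_common(1)[0][0]

def classify(text, score, labelled, low=0.4, high=0.6, k=3):
    """Trust the primary score outside the unsure band; use K-NN inside it."""
    if score < low:
        return "ham"
    if score > high:
        return "spam"
    return knn_label(text, labelled, k)

# Hypothetical labeled corpus.
LABELLED = [
    ("win a free prize today", "spam"),
    ("free prize waiting claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday?", "ham"),
    ("claim your free cash", "spam"),
]
```

With k=1 the K-NN step reduces to picking the single document of greatest similarity, which is the wording of the claim; larger k trades that for a vote among neighbors.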
- 75. A computer method for storing email messages comprising the steps of utilizing a stackable hash process to determine the cluster wherein said cluster determines a delta-storage of the email.
- 77. A method for retrieving email messages comprising the steps of:
  utilizing a stackable hash process to determine the cluster wherein said cluster determines a location in memory.
- 78. A method for storing email messages comprising the steps of utilizing hash generating means to determine the cluster wherein said cluster determines a location in memory.
- 79. A method for creating an accumulation of documents stored as a cluster comprising the steps of utilizing a process to create a hashing vector to determine whether to add a document to a cluster.
- 80. A computer method for creating an accumulation of documents stored as a set of clusters comprising the steps of utilizing a stackable hash to determine whether to add a document to the set of clusters.
- 91. A computer method of combining SVM, NB and NN processes to optimize the machine-learning utility of text-classification.
- 93. A computer method of combining naïve-bayes and K-NN processes to optimize the machine-learning utility of text-classification.
- 97. A computer method of using the delta storage method comprising the steps of:
  (i) creating clusters;
  (ii) sorting clusters;
  (iii) labeling to clusters;
  (iv) identifying a shortest email of each cluster as representative;
  (v) calculating a binary differential function on all other members of cluster;
  (vi) tagging compressed emails within the clusters.
  Dependent claims: 98, 105, 109.
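Steps (iv) and (v) of claim 97 can be illustrated for a single, already formed cluster. The sketch keeps the shortest email as the representative and stores every other member as copy/insert operations against it. Python's `difflib.SequenceMatcher` stands in for the claim's unspecified binary differential function, and the operation format and function names are hypothetical; clustering, sorting, labeling and tagging are outside this sketch.

```python
from difflib import SequenceMatcher

def make_delta(base: str, target: str):
    """Encode target as copy/insert operations against base."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no operation: that span of base is simply not copied.
    return ops

def apply_delta(base: str, ops):
    """Rebuild the original email from the representative and its delta."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(base[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)

def compress_cluster(emails):
    """Step (iv): shortest email is the representative. Step (v): delta for the rest."""
    representative = min(emails, key=len)
    deltas = [make_delta(representative, e) for e in emails if e is not representative]
    return representative, deltas
```

When cluster members share most of their text, each delta holds only a few short inserted strings plus index pairs, which is where the storage saving would come from.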
- 112. A computer method for labeling an electronic data stream as belonging to a predefined class comprising the steps of identifying an electronic data stream by one or more learning machines, comparing the outputs from the learning machines to determine the label to associate with the electronic data stream, pre-defining a label for email users by processing and analyzing aggregate data compiled from an email content and label.
- 113. A computer method for labeling an electronic data stream as belonging to a predefined class comprising the steps of identifying an electronic data stream by one or more learning machines, comparing the outputs from the learning machines to determine the label to associate with the electronic data stream, deciding whether to use a uniform filter or a stackable hash to determine a cluster for the electronic data stream.
- 114. A computer method for labeling an electronic data stream as belonging to a predefined class comprising the steps of identifying an electronic data stream by one or more learning machines, comparing the outputs from the learning machines to determine the label to associate with the electronic data stream, deciding whether to use a uniform filter or a stackable hash to determine a cluster for a document having identified attributes email.
- 115. A computer method for labeling an electronic data stream as belonging to a predefined class comprising the steps of identifying an electronic data stream by one or more learning machines, comparing the outputs from the learning machines to determine the label to associate with the electronic data stream, determining an acceptable level of accuracy after use of a K-NN methods to divide space into one or more classes.
Specification