Electronic document classification using composite hyperspace distances
Abstract
In some embodiments, a layout-based electronic communication classification (e.g. spam filtering) method includes generating a layout vector characterizing a layout of a message, assigning the message to a selected cluster according to a hyperspace distance between the layout vector and a central vector of the selected cluster, and classifying the message (e.g. labeling as spam or non-spam) according to the selected cluster. The layout vector is a message representation characterizing a set of relative positions of metaword substructures of the message, as well as metaword substructure counts. Examples of metaword substructures include MIME parts and text lines. For example, a layout vector may have a first component having scalar axes defined by numerical layout feature counts (e.g. numbers of lines, blank lines, links, email addresses), and a second vector component including a line-structure list and a formatting part (e.g. MIME part) list.
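The layout vector described above can be illustrated with a minimal sketch. It assumes four count features (lines, blank lines, links, email addresses) taken from the examples in the abstract, and a line-structure list built from one label per text line. The label names, the regular expressions, and the `LayoutVector` and `layout_vector` names are hypothetical and are not taken from the source; the formatting part (e.g. MIME part) list is omitted for brevity.

```python
import re
from dataclasses import dataclass

# Hypothetical line-type labels; the source does not enumerate them.
BLANK, LINK, ADDRESS, TEXT = "B", "L", "A", "T"

# Simplified patterns, assumed for illustration only.
LINK_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass(frozen=True)
class LayoutVector:
    counts: tuple          # (lines, blank lines, links, email addresses)
    line_structure: tuple  # ordered list of line-type labels


def line_type(line: str) -> str:
    """Map one text line to a line-type label."""
    if not line.strip():
        return BLANK
    if LINK_RE.search(line):
        return LINK
    if EMAIL_RE.search(line):
        return ADDRESS
    return TEXT


def layout_vector(body: str) -> LayoutVector:
    """Build the count component and the line-structure list for a message body."""
    lines = body.splitlines()
    # The position of each label mirrors the position of the line it describes.
    structure = tuple(line_type(line) for line in lines)
    counts = (
        len(lines),
        structure.count(BLANK),
        sum(len(LINK_RE.findall(line)) for line in lines),
        sum(len(EMAIL_RE.findall(line)) for line in lines),
    )
    return LayoutVector(counts, structure)
```

In this sketch the count component has one scalar axis per layout feature, while the line-structure list preserves the order in which line types occur, so two messages with equal counts but different layouts remain distinguishable.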
18 Claims
1. A non-transitory computer-readable medium encoding instructions which, when executed by a computer system, cause the computer system to:
- parse an electronic text document to generate a document vector for the electronic text document, wherein the document vector includes a feature count component and a feature position component, wherein the feature count component includes a plurality of feature count indicators for the electronic text document, wherein the feature position component includes a data structure selected from a group consisting of an ordered list and a tree of document substructure indicators, each document substructure indicator denoting a type of substructure in the electronic text document, and wherein a position of said each document substructure indicator in the data structure characterizes a position of a corresponding substructure in the electronic text document;
- determine a plurality of composite hyperspace distances between the document vector and a plurality of reference vectors, each composite hyperspace distance being defined between the document vector and a reference vector of the plurality of reference vectors, wherein each composite hyperspace distance is a function of a Euclidean-space distance dependent on the feature count component of the document vector and of an edit distance dependent on the feature position component of the document vector; and
- classify the electronic text document according to at least one of the plurality of composite hyperspace distances.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A method comprising employing a computer system to:
- parse an electronic text document to generate a document vector for the electronic text document, wherein the document vector includes a feature count component and a feature position component, wherein the feature count component includes a plurality of feature count indicators for the electronic text document, wherein the feature position component includes a data structure selected from a group consisting of an ordered list and a tree of document substructure indicators, each document substructure indicator denoting a type of substructure in the electronic text document, and wherein a position of said each document substructure indicator in the data structure characterizes a position of a corresponding substructure in the electronic text document;
- determine a plurality of composite hyperspace distances between the document vector and a plurality of reference vectors, each composite hyperspace distance being defined between the document vector and a reference vector of the plurality of reference vectors, wherein each composite hyperspace distance is a function of a Euclidean-space distance dependent on the feature count component of the document vector and of an edit distance dependent on the feature position component of the document vector; and
- classify the electronic text document according to at least one of the plurality of composite hyperspace distances.
10. A non-transitory computer-readable medium encoding instructions which, when executed by a computer system, cause the computer system to:
- generate an electronic communication feature count indicator comprising a plurality of feature counts for the electronic communication;
- generate a text line structure indicator characterizing a line structure of the electronic communication, the text line structure indicator comprising a data structure selected from a group consisting of an ordered list and a tree of text line type indicators, a position of each text line type indicator in the data structure being indicative of a position of a corresponding text line type in the electronic communication;
- determine a composite hyperspace distance between a vector representing the electronic communication and a predetermined vector, the hyperspace distance being a function of a Euclidean-space distance dependent on the feature count indicator, and of an edit distance dependent on the text line structure indicator; and
- classify the electronic communication according to the composite hyperspace distance.
11. A method comprising employing a computer system to:
- generate an electronic communication feature count indicator comprising a plurality of feature counts for the electronic communication;
- generate a text line structure indicator characterizing a line structure of the electronic communication, the text line structure indicator comprising a data structure selected from a group consisting of an ordered list and a tree of text line type indicators, a position of each text line type indicator in the data structure being indicative of a position of a corresponding text line type in the electronic communication;
- determine a composite hyperspace distance between a vector representing the electronic communication and a predetermined vector, the hyperspace distance being a function of a Euclidean-space distance dependent on the feature count indicator, and of an edit distance dependent on the text line structure indicator; and
- classify the electronic communication according to the composite hyperspace distance.
Dependent claims: 12, 13, 14, 15, 16, 17, 18
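The composite hyperspace distance recited in the claims can be sketched as follows. The claims state only that the distance is a function of a Euclidean-space distance on the feature counts and of an edit distance on the position component; the weighted sum, the `alpha` weight, the use of the Levenshtein distance as the edit distance, and the nearest-reference classification rule shown here are assumptions made for illustration, not the claimed method. All function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length count tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def edit_distance(s, t):
    """Levenshtein distance between two ordered lists of indicators."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1,             # delete a from s
                cur[j - 1] + 1,          # insert b into s
                prev[j - 1] + (a != b),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def composite_distance(doc, ref, alpha=0.5):
    """Combine both distances; doc and ref are (counts, structure) pairs.

    ASSUMPTION: a weighted sum with weight alpha. The source requires only
    that the result be a function of both component distances.
    """
    return (alpha * euclidean(doc[0], ref[0])
            + (1 - alpha) * edit_distance(doc[1], ref[1]))


def classify(doc, references, alpha=0.5):
    """Return the label of the nearest reference vector (assumed rule)."""
    return min(
        references,
        key=lambda label: composite_distance(doc, references[label], alpha),
    )
```

A possible use, with made-up reference vectors:

```python
doc = ((4, 1, 1, 1), "TBLA")
references = {
    "spam": ((5, 1, 2, 1), "TBLLA"),
    "non-spam": ((20, 3, 0, 0), "TTTBTTT"),
}
label = classify(doc, references)
```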
Specification