Extracting product purchase information from electronic messages
First Claim
1. A method, comprising,by computer apparatus:
- grouping electronic messages into respective clusters based on similarities between the electronic messages;
for each of one or more of the clusters, extracting a respective grammar defining a respective arrangement of structural elements of the electronic messages in the cluster, wherein the extracting comprises, for each of the one or more clusters,in non-transitory computer-readable memory, building a respective generalized suffix tree representation from a concatenation of tokenized contents of electronic messages in the cluster into a single respective string, wherein the generalized suffix tree representation maintains the order of suffixes from the single respective string in a hierarchical tree structure of nodes that are linearly interconnected from root to leaf node and, for each suffix, identifies the electronic messages in which the suffix appears and a respective count of times the suffix appears in each electronic message in the cluster, andtraversing the respective generalized suffix tree, wherein the traversing comprises ascertaining the respective arrangement of structural elements for the respective grammar based on appearance frequencies of substrings in the electronic messages in the cluster;
based on training data comprising training data fields, building one or more classifiers that classify field tokens extracted from selected electronic messages with respective product purchase relevant labels based on correspondences between tokens extracted from the selected electronic messages and the structural elements of the grammars respectively matched to the selected electronic messages; and
in non-transitory computer-readable memory, storing the grammars and the one or more classifiers in one or more data structures associated with a parser executable by a processor to parse product purchase information from electronic messages.
5 Assignments
0 Petitions
Accused Products
Abstract
Improved systems and methods for extracting product purchase information from electronic messages transmitted between physical network nodes to convey product purchase information to designated recipients. These examples provide a product purchase information extraction service that is able to extract product purchase information from electronic messages with high precision across a wide variety of electronic message formats and thereby solve the practical problems that have arisen as a result of the proliferation of different electronic message formats used by individual merchants and across different merchants and different languages. In this regard, these examples are able to automatically learn the structures and semantics of different message formats, which accelerates the ability to support new message sources, new markets, and different languages.
-
Citations
20 Claims
-
1. A method, comprising,
by computer apparatus: -
grouping electronic messages into respective clusters based on similarities between the electronic messages; for each of one or more of the clusters, extracting a respective grammar defining a respective arrangement of structural elements of the electronic messages in the cluster, wherein the extracting comprises, for each of the one or more clusters, in non-transitory computer-readable memory, building a respective generalized suffix tree representation from a concatenation of tokenized contents of electronic messages in the cluster into a single respective string, wherein the generalized suffix tree representation maintains the order of suffixes from the single respective string in a hierarchical tree structure of nodes that are linearly interconnected from root to leaf node and, for each suffix, identifies the electronic messages in which the suffix appears and a respective count of times the suffix appears in each electronic message in the cluster, and traversing the respective generalized suffix tree, wherein the traversing comprises ascertaining the respective arrangement of structural elements for the respective grammar based on appearance frequencies of substrings in the electronic messages in the cluster; based on training data comprising training data fields, building one or more classifiers that classify field tokens extracted from selected electronic messages with respective product purchase relevant labels based on correspondences between tokens extracted from the selected electronic messages and the structural elements of the grammars respectively matched to the selected electronic messages; and in non-transitory computer-readable memory, storing the grammars and the one or more classifiers in one or more data structures associated with a parser executable by a processor to parse product purchase information from electronic messages. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
-
-
18. Apparatus, comprising a memory storing processor-readable instructions, and a processor coupled to the memory, operable to execute the instructions, and based at least in part on the execution of the instructions operable to perform operations comprising:
-
grouping electronic messages into respective clusters based on similarities between the electronic messages; for each of one or more of the clusters, extracting a respective grammar defining a respective arrangement of structural elements of the electronic messages in the cluster, wherein the extracting comprises, for each of the one or more clusters, in non-transitory computer-readable memory, building a respective generalized suffix tree representation from a concatenation of tokenized contents of electronic messages in the cluster into a single respective string, wherein the generalized suffix tree representation maintains the order of suffixes from the single respective string in a hierarchical tree structure of nodes that are lineally interconnected from root to leaf node and, for each suffix, identifies the electronic messages in which the suffix appears and a respective count of times the suffix appears in each electronic message in the cluster, and traversing the respective generalized suffix tree, wherein the traversing comprises ascertaining the respective arrangement of structural elements for the respective grammar based on appearance frequencies of substrings in the electronic messages in the cluster; based on training data comprising training data fields, building one or more classifiers that classify field tokens extracted from selected electronic messages with respective product purchase relevant labels based on correspondences between tokens extracted from the selected electronic messages and the structural elements of the grammars respectively matched to the selected electronic messages; and in non-transitory computer-readable memory, storing the grammars and the at least one classifier in one or more data structures associated with a parser executable by a processor to parse product purchase information from electronic messages.
-
-
19. At least one non-transitory computer-readable medium having processor-readable program code embodied therein, the processor-readable program code adapted to be executed by a processor to implement a method comprising:
-
grouping electronic messages into respective clusters based on similarities between the electronic messages; for each of one or more of the clusters, extracting a respective grammar defining a respective arrangement of structural elements of the electronic messages in the cluster, wherein the extracting comprises, for each of the one or more clusters, in non-transitory computer-readable memory, building a respective generalized suffix tree representation from a concatenation of tokenized contents of electronic messages in the cluster into a single respective string, wherein the generalized suffix tree representation maintains the order of suffixes from the single respective string in a hierarchical tree structure of nodes that are linearly interconnected from root to leaf node and, for each suffix, identifies the electronic messages in which the suffix appears and a respective count of times the suffix appears in each electronic message in the cluster, and traversing the respective generalized suffix tree, wherein the traversing comprises ascertaining the respective arrangement of structural elements for the respective grammar based on appearance frequencies of substrings in the electronic messages in the cluster; based on training data comprising training data fields, building one or more classifiers that classify field tokens extracted from selected electronic messages with respective product purchase relevant labels based on correspondences between tokens extracted from the selected electronic messages and the structural elements of the grammars respectively matched to the selected electronic messages; and in non-transitory computer-readable memory, storing the grammars and the one or more classifiers in one or more data structures associated with a parser executable by a processor to parse product purchase information from electronic messages.
-
-
20. Apparatus, comprising at least one non-transitory memory storing processor-readable instructions, and at least one processor coupled to the memory, operable to execute the instructions, and based at least in part on the execution of the instructions operable to implement:
-
a product purchase information grammar extractor operable by a processor to group electronic messages into respective clusters based on similarities between the electronic messages, and for each of one or more of the clusters, extract a respective grammar defining a respective arrangement of structural elements of the electronic messages in the cluster wherein, for each of the one or more clusters, a respective generalized suffix tree representation is built in non-transitory computer-readable memory from a concatenation of tokenized contents of electronic messages in the cluster into a single respective string, wherein the generalized suffix tree representation maintains the order of suffixes from the single respective string in a hierarchical tree structure of nodes that are linearly interconnected from root to leaf node and, for each suffix, identities the electronic messages in which the suffix appears and a respective count of times the suffix appears in each electronic message in the cluster, and the respective generalized suffix tree is traversed to ascertain the respective arrangement of structural elements for the respective grammar based on appearance frequencies of substrings in the electronic message in the cluster; a product purchase information token classifier trainer operable by a processor to build, based on training data comprising training data fields, one or more classifiers that classify field tokens extracted from selected electronic messages with respective product purchase relevant labels based on correspondences between tokens extracted from the selected electronic messages and the structural elements of the grammars respectively matched to the selected electronic messages; and a processor-executable electronic message parser stored in non-transitory computer-readable memory and executable by a processor to extract purchase-related information from purchase related electronic messages and, according to at least one of the one or more classifiers, ascertain respective product purchase relevant labels for the extracted purchase-related information based on associations between the clusters and the respective grammars that are stored in non-transitory computer-readable memory.
-
Specification