Generating and applying event data extraction templates
First Claim
1. A computer-implemented method, comprising:
identifying, from a corpus of structured communications, a set of markup language paths, wherein each markup language path specifies a node that underlies a segment of text within a body of a structured communication from the corpus of structured communications;
classifying at least one markup language path of the set of markup language paths, associated with a first segment of text, as a transient markup language path in response to a determination that a frequency of occurrences of the first segment of text across the corpus of structured communications fails to satisfy a threshold;
applying one or more event heuristics to the structured communications of the corpus;
determining, based on the applying, an event data type for the transient markup language path;
assigning the event data type to the transient markup language path based on the determining;
generating an event data extraction template to extract, from one or more subsequent structured communications, one or more event-related segments of text associated with the transient markup language path associated with the assigned event data type;
extracting at least one event-related segment of text from a subsequent structured communication of the one or more subsequent structured communications by applying the event data extraction template to the subsequent structured communication, wherein the subsequent structured communication is addressed to a user; and
providing the extracted at least one event-related segment of text to the user using one or more computing devices operated by the user.
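The identifying and classifying steps of claim 1 can be sketched as follows. This is a minimal illustration and not the patented implementation. It assumes the communications are well-formed markup that Python's `xml.etree.ElementTree` can parse. The names `text_paths` and `classify_paths`, the sample corpus, and the 0.5 threshold are all hypothetical. A path is treated as transient when even its most common text appears in less than the threshold share of the corpus.

```python
from collections import Counter, defaultdict
import xml.etree.ElementTree as ET


def text_paths(markup):
    """Yield (path, text) for each element that directly contains text."""
    root = ET.fromstring(markup)
    stack = [(root, f"/{root.tag}")]
    while stack:
        node, path = stack.pop()
        text = (node.text or "").strip()
        if text:
            yield path, text
        # Number same-tag siblings so that each path names exactly one node.
        seen = Counter()
        for child in node:
            seen[child.tag] += 1
            stack.append((child, f"{path}/{child.tag}[{seen[child.tag]}]"))


def classify_paths(corpus, threshold=0.5):
    """Label each path 'fixed' or 'transient' by how often its text repeats."""
    counts = defaultdict(Counter)  # path -> text -> number of occurrences
    for markup in corpus:
        for path, text in text_paths(markup):
            counts[path][text] += 1
    labels = {}
    for path, texts in counts.items():
        # Share of the corpus that carries the most common text at this path.
        share = texts.most_common(1)[0][1] / len(corpus)
        labels[path] = "transient" if share < threshold else "fixed"
    return labels


# Hypothetical corpus: one shared template, different event details.
corpus = [
    "<html><body><p>Your event details</p><p>Jazz Night</p><p>June 3, 2015</p></body></html>",
    "<html><body><p>Your event details</p><p>Book Fair</p><p>July 9, 2015</p></body></html>",
    "<html><body><p>Your event details</p><p>Film Gala</p><p>May 21, 2015</p></body></html>",
]
labels = classify_paths(corpus)
```

In this sample the boilerplate heading at `/html/body[1]/p[1]` repeats in every message and is labeled fixed, while the paths holding the event name and date vary per message and are labeled transient.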
Abstract
Methods and apparatus are described herein for generating and applying event data extraction templates. In various implementations, a set of structural paths may be identified from a corpus of communications. A first structural path of the set of structural paths, associated with a first segment of text, may be classified as transient in response to a determination that a frequency of occurrences of the first segment of text across the corpus satisfies a criterion. Event heuristics may be applied to the communications of the corpus. A determination may be made, based on the applying, that the communications of the corpus are event-related. An event data type may be assigned to the transient structural path based on the applying. An event data extraction template may be generated to extract, from one or more subsequent communications, one or more event-related segments of text associated with the transient structural path.
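The later steps of the abstract (applying event heuristics, assigning an event data type, generating a template, and extracting from a subsequent communication) might look like the sketch below. It rests on several assumptions: the heuristics are simple regular expressions, a path receives a type only when every sample text matches the same heuristic, and paths are written in the limited XPath subset that `xml.etree.ElementTree` supports, relative to the root element. The names `EVENT_HEURISTICS`, `assign_event_types`, and `apply_template`, and all sample data, are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical event heuristics: event data type -> pattern it must match.
EVENT_HEURISTICS = {
    "event_date": re.compile(
        r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2}, \d{4}\b"
    ),
    "event_time": re.compile(r"\b\d{1,2}:\d{2}\s?(?:AM|PM)\b", re.IGNORECASE),
}


def assign_event_types(samples_by_path):
    """Build a template mapping each transient path to an event data type.

    A path gets a type only if all of its sample texts match one heuristic.
    """
    template = {}
    for path, samples in samples_by_path.items():
        for data_type, pattern in EVENT_HEURISTICS.items():
            if samples and all(pattern.search(s) for s in samples):
                template[path] = data_type
                break
    return template


def apply_template(template, markup):
    """Extract typed segments of text from one subsequent communication."""
    root = ET.fromstring(markup)
    extracted = {}
    for path, data_type in template.items():
        node = root.find(path)
        if node is not None and node.text:
            extracted[data_type] = node.text.strip()
    return extracted


# Hypothetical transient paths with the texts observed across the corpus.
transient_samples = {
    "body/p[2]": ["Jazz Night", "Book Fair", "Film Gala"],
    "body/p[3]": ["June 3, 2015", "July 9, 2015", "May 21, 2015"],
    "body/p[4]": ["7:30 PM", "10:00 AM", "8:15 PM"],
}
template = assign_event_types(transient_samples)

new_message = (
    "<html><body><p>Your event details</p><p>Poetry Slam</p>"
    "<p>August 14, 2015</p><p>6:45 PM</p></body></html>"
)
extracted = apply_template(template, new_message)
```

Here the path holding event names matches neither heuristic and is left out of the template, so only the date and time are extracted from the new message. A fuller system would need additional heuristics for names, venues, and similar fields.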
16 Claims
15. At least one non-transitory computer-readable medium comprising instructions that, in response to execution of the instructions by a computing device, cause the computing device to perform the following operations:
grouping a corpus of communications into a plurality of clusters based on one or more similarities between communications of the corpus, the plurality of clusters including at least one cluster of structured communications;
identifying, from the at least one cluster of structured communications, a set of markup language paths, wherein each markup language path specifies a node that underlies a segment of text within a body of a structured communication from the at least one cluster of structured communications;
classifying at least one markup language structural path of the set of markup language paths, associated with a first segment of text, as a transient markup language path in response to a determination that a count of occurrences of the first segment of text across the at least one cluster of structured communications fails to satisfy a threshold;
applying one or more event heuristics to the structured communications of the at least one cluster of structured communications;
determining, based on the applying, an event data type for the transient markup language path;
assigning the event data type to the transient markup language path based on the determining;
generating a data extraction template to extract, from one or more subsequent structured communications, one or more segments of text associated with the transient markup language path associated with the assigned event data type;
extracting at least one event-related segment of text from a subsequent structured communication of the one or more subsequent structured communications by applying the event data extraction template to the subsequent structured communication, wherein the subsequent structured communication is addressed to a user; and
providing the extracted at least one event-related segment of text to the user using one or more computing devices operated by the user.
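Claim 15 adds a grouping step ahead of path identification. The claim does not say which similarities are used, so the sketch below assumes one simple choice: two communications are similar when they share the same sequence of element tags. Messages that do not parse as markup fall into a separate unstructured cluster. The function names `structure_signature` and `cluster_by_structure` and the sample messages are hypothetical.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def structure_signature(markup):
    """Return the tag sequence in document order, or None if not parseable."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None  # treated as an unstructured communication
    return tuple(node.tag for node in root.iter())


def cluster_by_structure(corpus):
    """Group communications whose markup has an identical tag sequence."""
    clusters = defaultdict(list)
    for message in corpus:
        clusters[structure_signature(message)].append(message)
    return dict(clusters)


# Hypothetical corpus: two messages from one template, one from another,
# and one plain-text message with no markup.
corpus = [
    "<html><body><p>Your event details</p><p>Jazz Night</p></body></html>",
    "<html><body><p>Your event details</p><p>Book Fair</p></body></html>",
    "<html><body><h1>Receipt</h1><p>Total: $20</p></body></html>",
    "Hi, lunch tomorrow?",
]
clusters = cluster_by_structure(corpus)
```

Each structured cluster can then be handled as its own corpus for the path identification and classification steps. An exact tag-sequence match is deliberately strict; a real system would likely tolerate small structural differences or also use metadata such as the sender.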
Specification