Method and system relating to salient content extraction for electronic content
First Claim
1. A method comprising:
a) receiving an item of content;
b) identifying within the item of content using a microprocessor a set of lexical pattern cues for core content of the item of content and selecting a segment of the item of content having a highest likelihood as being the core content based upon a structural analysis of the item of content in dependence upon at least the set of lexical pattern cues;
c) parsing the item of content to generate a hierarchy of content within the item of content;
d) ranking the hierarchy of content in dependence upon at least the lexical pattern cues and sorting the resulting ranking;
e) identifying a gap when searching down the ranking meeting a predetermined threshold and removing those portions of the hierarchy of content below the gap to generate truncated content;
f) finding all occurrences for portions of the hierarchy of content with closest match to the lexical pattern cues closest to the start of the item of content;
g) determining whether multiple matches to the lexical pattern cues exist and establishing an action in dependence upon at least whether multiple matches exist or not;
h) performing the action, wherein the action is at least one of;
establishing the occurrence for the portion of the hierarchy of content as the core content of the item of content when the determination of multiple matches is negative; and
establishing the occurrence for the portion of the hierarchy of content that at least one of contains the largest portion of the item of content and is the first occurrence as the core content of the item of content when the determination of multiple matches is positive.
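As an illustration only, steps d) through h) of the claim can be sketched in Python. This is one possible reading and not the patented implementation. The `Segment` type, the `cue_score` function (a plain count of cue occurrences), and the `gap_threshold` parameter are all hypothetical names introduced for this sketch. Where multiple matches exist, the sketch picks the largest segment and breaks ties by earliest position.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    path: str    # hypothetical label for the node in the parsed hierarchy
    offset: int  # character offset of the segment within the item of content
    text: str


def cue_score(segment, cues):
    """Assumed scoring: count case-insensitive occurrences of each lexical cue."""
    lowered = segment.text.lower()
    return sum(lowered.count(cue.lower()) for cue in cues)


def truncate_at_gap(ranked, gap_threshold):
    """Step e): walk down the sorted ranking and cut at the first score drop
    that meets the threshold. `ranked` is a list of (score, segment) pairs
    sorted by descending score."""
    for i in range(1, len(ranked)):
        if ranked[i - 1][0] - ranked[i][0] >= gap_threshold:
            return ranked[:i]
    return ranked


def select_core(segments, cues, gap_threshold):
    # Step d): rank the hierarchy by cue score and sort the ranking.
    ranked = sorted(
        ((cue_score(s, cues), s) for s in segments),
        key=lambda pair: pair[0],
        reverse=True,
    )
    kept = truncate_at_gap(ranked, gap_threshold)
    if not kept:
        return None
    # Step f): collect every segment that shares the closest (highest) match.
    best = kept[0][0]
    matches = [seg for score, seg in kept if score == best]
    # Steps g) and h): a single match is the core content; with multiple
    # matches, prefer the largest segment, then the earliest one.
    if len(matches) == 1:
        return matches[0]
    return max(matches, key=lambda s: (len(s.text), -s.offset))
```

A real system would derive the hierarchy from a document parser (for example an HTML DOM) and would likely use a richer score than a raw cue count; both are simplified here to keep the control flow of the claim visible.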
Abstract
Individuals receive an overwhelming barrage of information, which must be filtered, processed, analyzed, reviewed, consolidated, and distributed or acted upon. Automatic approaches to “scraping” salient content from sources of content are provided, allowing the salient content to be provided to the user or subjected to further processing such as clustering or sentiment analysis.
Embodiments of the invention provide for:
- automated scraper induction based on document and/or contextual semantic cues and document structure analysis;
- identifying salient text, removing boiler-plate text, off-topic content and other non-salient content;
- deriving reusable descriptive extraction patterns for subsequent documents;
- applying descriptive extraction patterns for extraction from subsequent documents from the same source;
- intelligent identification of extraction success confidence score, using historical success scores; and
- employing confidence scores to automatically trigger new extraction pattern identification if extracted confidence is below an acceptable confidence threshold.
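The last three items describe a feedback loop: reuse a stored extraction pattern, score each extraction against history, and induce a new pattern when confidence falls below a threshold. A minimal Python sketch of that loop follows. It is an assumption-laden illustration, not the patent's method: the `SourceExtractor` class, the `induce` callback, and the confidence formula (extracted length relative to the mean of previous extraction lengths) are all hypothetical stand-ins.

```python
from statistics import mean


class SourceExtractor:
    """Reuses one extraction pattern per source and re-induces it on low confidence."""

    def __init__(self, induce, threshold=0.5):
        # `induce` is a hypothetical callable: document -> pattern, where a
        # pattern is itself a callable: document -> extracted text.
        self.induce = induce
        self.threshold = threshold
        self.pattern = None
        self.history = []    # lengths of past successful extractions
        self.inductions = 0  # how many times a pattern has been induced

    def confidence(self, extracted):
        """Assumed score: extracted length relative to the historical mean, capped at 1."""
        if not extracted:
            return 0.0
        if not self.history:
            return 1.0
        return min(len(extracted) / mean(self.history), 1.0)

    def _reinduce(self, document):
        self.pattern = self.induce(document)
        self.history.clear()  # old scores describe the old pattern
        self.inductions += 1

    def extract(self, document):
        if self.pattern is None:
            self._reinduce(document)
        text = self.pattern(document)
        # Low confidence automatically triggers new pattern identification.
        if self.confidence(text) < self.threshold:
            self._reinduce(document)
            text = self.pattern(document)
        if text:
            self.history.append(len(text))
        return text
```

Clearing the history after re-induction is a design choice of this sketch: scores gathered under the old pattern are treated as no longer comparable once the source layout has changed.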
9 Citations
8 Claims
Specification