METHOD AND SYSTEM RELATING TO SALIENT CONTENT EXTRACTION FOR ELECTRONIC CONTENT
Abstract
Individuals receive an overwhelming barrage of information, which must be filtered, processed, analysed, reviewed, consolidated and distributed or acted upon. Automatic approaches to “scraping” salient content from sources of content are provided, allowing the salient content to be provided to the user or subjected to further processing, such as clustering or sentiment analysis.
Embodiments of the invention provide for:
- automated scraper induction based on document and/or contextual semantic cues and document structure analysis;
- identifying salient text and removing boilerplate text, off-topic content and other non-salient content;
- deriving reusable descriptive extraction patterns for subsequent documents;
- applying descriptive extraction patterns for extraction from subsequent documents from the same source;
- intelligent identification of an extraction success confidence score, using historical success scores; and
- employing confidence scores to automatically trigger new extraction pattern identification if the extraction confidence is below an acceptable confidence threshold.
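
The last two items can be illustrated with a minimal sketch. The abstract does not say how historical success scores are combined with the current score, so the equal-weight blend, the window size, the threshold value and the function name `needs_new_pattern` are all assumptions made for illustration only.

```python
from statistics import mean


def needs_new_pattern(history, current, threshold=0.6, window=5):
    """Return True when extraction confidence falls below the threshold.

    history:   earlier confidence scores for this source, oldest first
    current:   confidence score of the extraction just performed
    threshold: hypothetical "acceptable confidence threshold"
    window:    hypothetical number of recent scores to consider
    """
    recent = history[-window:]
    if recent:
        # Assumption: blend the current score with the recent average, equally weighted.
        blended = 0.5 * current + 0.5 * mean(recent)
    else:
        # No history yet: rely on the current score alone.
        blended = current
    return blended < threshold
```

A caller would re-run extraction pattern identification for the source whenever this function returns True, and otherwise keep using the stored pattern.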
20 Claims
1. A method comprising:
a) receiving an item of content;
b) identifying within the item of content using a microprocessor a set of lexical pattern cues for core content of the item of content and selecting a segment of the item of content having a highest likelihood as being the core content based upon a structural analysis of the item of content in dependence upon at least the set of lexical pattern cues;
c) parsing the item of content to generate a hierarchy of content within the item of content;
d) ranking the hierarchy of content in dependence upon at least the lexical pattern cues and sorting the resulting ranking;
e) identifying a gap when searching down the ranking meeting a predetermined threshold and removing those portions of the hierarchy of content below the gap to generate truncated content;
f) find all occurrences for portions of the hierarchy of content with closest match to the lexical pattern cues closest to the start of the item of content;
g) determining whether multiple matches to the lexical pattern cues exist and establishing an action in dependence upon at least whether multiple matches exist or not;
h) performing the action, wherein the action is at least one of;
establishing the occurrence for the portion of the hierarchy of content as the core content of the item of content when the determination of multiple matches is negative; and
establishing the occurrence for the portion of the hierarchy of content that at least one of contains the largest portion of the item of content and is the first occurrence as the core content of the item of content when the determination of multiple matches is positive.
Dependent claims: 2, 3, 4, 5, 6, 7, 8.
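
Steps d), e) and h) of claim 1 can be sketched as follows. This is one possible reading, not the patented implementation: the claim does not define how scores are computed or how the gap is measured, so the sketch assumes each portion of the hierarchy already carries a numeric score, and that a "gap" is a drop between neighbouring scores in the sorted ranking. The names `truncate_at_gap` and `pick_core` are hypothetical.

```python
def truncate_at_gap(scored_nodes, gap_threshold):
    """Sort (name, score) pairs by score, highest first, and cut at the first large drop.

    Everything ranked below the first drop of at least gap_threshold is removed.
    If no such drop exists, the full ranking is returned.
    """
    ranked = sorted(scored_nodes, key=lambda pair: pair[1], reverse=True)
    for i in range(len(ranked) - 1):
        if ranked[i][1] - ranked[i + 1][1] >= gap_threshold:
            return ranked[: i + 1]
    return ranked


def pick_core(matches):
    """Choose the core content from (offset, text) matches.

    One match: that match is the core content.
    Several matches: prefer the largest text, breaking ties by the earliest
    offset (an assumed reading of "largest portion" and "first occurrence").
    """
    if not matches:
        return None
    if len(matches) == 1:
        return matches[0]
    return max(matches, key=lambda m: (len(m[1]), -m[0]))
```

For instance, with scores of 0.90, 0.80, 0.10 and 0.05 and a gap threshold of 0.5, the cut falls between 0.80 and 0.10, so only the two highest-ranked portions survive.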
9. A method comprising:
a) receiving an item of content;
b) identifying within the item of content using a microprocessor a set of lexical pattern cues for core content of the item of content;
c) parsing the item of content to generate a hierarchy of content within the item of content;
d) searching within a first database for a match to a predetermined portion of the hierarchy of content of an entry within the database, the first database comprising entries relating to hierarchies of content previously established for other items of content together with associations to the items of content they relate to;
e) where a match is determined calculating a density factor in dependence upon at least the contents of the identified hierarchy of content within the database and the set of lexical pattern cues;
f) if the calculated density factor exceeds a predetermined threshold adding a predetermined count to a counter associated with the identified hierarchy of content stored within a second database;
g) extracting from the item of content using the identified hierarchy of content truncated content of the item of content.
Dependent claims: 10, 11, 12, 13.
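
Steps d) through g) of claim 9 can be sketched under stated assumptions. The claim does not define the density factor, so the sketch assumes it is the fraction of whitespace-separated tokens that match a lexical cue. The "first database" is stood in for by a set of known hierarchy paths and the "second database" by a `Counter`; the path strings, function names and default values are all hypothetical.

```python
from collections import Counter


def density_factor(text, cues):
    """Fraction of tokens in text that are lexical cues (an assumed definition)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    cue_set = {cue.lower() for cue in cues}
    return sum(token in cue_set for token in tokens) / len(tokens)


def extract_with_known_hierarchy(sections, cues, known_paths, counters,
                                 threshold=0.2, increment=1):
    """Extract text using a hierarchy path already seen for other items.

    sections:    dict mapping a hierarchy path to its text, from parsing the item
    known_paths: set of paths previously established (stand-in for the first database)
    counters:    Counter of successful uses per path (stand-in for the second database)
    """
    for path, text in sections.items():
        if path in known_paths:
            # Credit the stored hierarchy only when the cue density is high enough.
            if density_factor(text, cues) > threshold:
                counters[path] += increment
            return text
    return None  # no stored hierarchy matches this item
```

Keeping the counter separate from the lookup table mirrors the two databases named in the claim, and lets the counts be used later to judge how reliable a stored hierarchy is.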
14. A method comprising:
a) establishing on a computer system comprising at least a microprocessor at least one lexical pattern cue of a plurality of lexical pattern cues;
b) receiving on the computer system an item of content;
c) processing on the computer system the item of content to establish a set of rankings, each ranking established in dependence upon at least the plurality of lexical pattern cues for a portion of the item of content; and
d) generating a new item of content in dependence upon at least the item of content and the set of rankings of the plurality of lexical pattern cues when a ranking within the set of rankings exceeds a predetermined threshold.
Dependent claims: 15, 16, 17.
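
A minimal sketch of steps c) and d) of claim 14 follows. It assumes, purely for illustration, that lexical pattern cues are regular expressions and that a portion's ranking is the total number of cue matches it contains; the claim itself fixes neither choice. The function names are hypothetical.

```python
import re


def rank_portions(portions, cue_patterns):
    """Return one ranking per portion: the total count of cue matches in its text."""
    compiled = [re.compile(pattern, re.IGNORECASE) for pattern in cue_patterns]
    return [sum(len(rx.findall(text)) for rx in compiled) for text in portions]


def build_new_item(portions, cue_patterns, threshold):
    """Build a new item from the portions whose ranking exceeds the threshold.

    Returns None when no ranking exceeds the threshold, so no new item is generated.
    """
    rankings = rank_portions(portions, cue_patterns)
    kept = [text for text, rank in zip(portions, rankings) if rank > threshold]
    return "\n".join(kept) if kept else None
```

With cues such as `\bcourt\b`, `\bruled\b` and `\bmerger\b`, a news paragraph mentioning all three would rank 3, while a newsletter sign-up line would rank 0 and be left out of the new item.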
18. A method comprising:
receiving on a computer system an item of content accessed from a remote computer server to which the computer is connected via a network;
executing a lookup mechanism to identify the existence of one or more descriptive extraction patterns associated with the remote computer server;
parsing the item of content to generate a hierarchy of content within the item of content;
applying a descriptive extraction pattern to extract one or more portions of the hierarchy of content; and
extracting the final text based on the extracted portions of the hierarchy of content.
Dependent claims: 19, 20.
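
The flow of claim 18 can be sketched with the Python standard library. The sketch assumes the item of content is HTML, that the lookup is keyed by the server's host name, and that a "descriptive extraction pattern" is simply a slash-separated tag path; none of these is stated in the claim. `PathTextParser`, `extract_final_text` and the example host are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Elements that never have an end tag; they are not pushed onto the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathTextParser(HTMLParser):
    """Collect text fragments keyed by the tag path under which they appear."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.text_by_path = {}

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to, and including, the nearest matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.text_by_path.setdefault("/".join(self.stack), []).append(text)


def extract_final_text(url, html, patterns_by_host):
    """Look up patterns for the server, parse the item, and join the matching text."""
    patterns = patterns_by_host.get(urlparse(url).hostname)
    if not patterns:
        return None  # no stored pattern for this server
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    parts = []
    for path in patterns:
        parts.extend(parser.text_by_path.get(path, []))
    return " ".join(parts)
```

Returning None when no pattern is stored gives the caller a clear signal to fall back to identifying a new extraction pattern for that server.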
Specification