Systems and methods for facilitating open source intelligence gathering
First Claim
1. A website content extraction system, comprising:
a processor; and
a memory logically connected to the processor and comprising a set of computer readable instructions executable by the processor to:
obtain source code used to generate the website on a display, wherein the source code includes a plurality of elements and each element includes at least one tag comprising at least one tag type;
parse the source code to obtain a node tree including a plurality of nodes arranged in a hierarchical structure, wherein each node comprises one of the elements, and wherein one of the plurality of nodes comprises a root node;
determine a tag type of a node under the root node;
assign a heuristic score to the node based at least in part on the tag type of the node;
continue to determine and assign for one or more additional nodes of the node tree, wherein the node under the root node comprises a parent node, and wherein the computer readable instructions that continue to determine and assign include instructions executable by the processor to:
determine, for a child node of the parent node, a tag type of the at least one tag of the child node; and
assign a heuristic score to the child node based at least in part of the tag type of the child node, wherein the computer readable instructions that assign the heuristic score to the child node include instructions executable by the processor to:
assign a first heuristic score to the child node without regard to the heuristic scores of other nodes in the node tree; and
add the first heuristic score to a heuristic score of the parent node to obtain a child node heuristic score; and
generate an object that includes content associated with nodes of the node tree having heuristic scores indicating that such content is of interest.
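The scoring steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the tag weights in `TAG_SCORES`, the `threshold` cutoff, the synthetic `#root` node, and all class and function names are hypothetical choices made for illustration, since the claim does not specify any of them. The sketch parses source code into a node tree, gives each node a first score from its tag type alone, adds the parent's score to obtain the node's heuristic score, and collects the text of nodes whose score meets the cutoff.

```python
from html.parser import HTMLParser

# Hypothetical per-tag weights; the claim does not give concrete values.
TAG_SCORES = {"article": 3, "p": 2, "nav": -5, "footer": -5, "script": -10}
# Elements that never have children, so the parser must not descend into them.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []
        self.score = 0


class TreeBuilder(HTMLParser):
    """Parse source code into a node tree under a synthetic root node."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def score_tree(node, parent_score=0):
    """Assign each child a first score from its tag type, then add the parent's score."""
    for child in node.children:
        first_score = TAG_SCORES.get(child.tag, 0)  # independent of other nodes
        child.score = first_score + parent_score    # child node heuristic score
        score_tree(child, child.score)


def extract(source, threshold=2):
    """Return an object holding text from nodes whose score meets the assumed cutoff."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    score_tree(builder.root)

    content = []

    def collect(node):
        if node.score >= threshold and node.text:
            content.append(" ".join(node.text))
        for child in node.children:
            collect(child)

    collect(builder.root)
    return {"content": content}


page = (
    "<html><body><nav><p>Menu</p></nav>"
    "<article><p>Story text</p></article></body></html>"
)
result = extract(page)
```

With these assumed weights, the paragraph inside `nav` inherits the negative score of its parent and is dropped, while the paragraph inside `article` inherits a positive score and is kept. This is the effect of adding the parent's score: the same tag type can be of interest in one part of the tree and not in another.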
Abstract
Systems and methods (e.g., utilities) for use in providing automated, lightweight collection of online, open source data which may be content-based to reduce website source bias. In one aspect, a utility is disclosed for use in extracting content of interest from at least one website or other online data source (e.g., where the extracted content can be used in a subsequent search query). In other aspects, utilities are disclosed that are operable to perform various types of analyses on such extracted content and present graphical representations of such analyses on a display of a client device.
17 Claims
1. A website content extraction system, comprising:
a processor; and
a memory logically connected to the processor and comprising a set of computer readable instructions executable by the processor to:
obtain source code used to generate the website on a display, wherein the source code includes a plurality of elements and each element includes at least one tag comprising at least one tag type;
parse the source code to obtain a node tree including a plurality of nodes arranged in a hierarchical structure, wherein each node comprises one of the elements, and wherein one of the plurality of nodes comprises a root node;
determine a tag type of a node under the root node;
assign a heuristic score to the node based at least in part on the tag type of the node;
continue to determine and assign for one or more additional nodes of the node tree, wherein the node under the root node comprises a parent node, and wherein the computer readable instructions that continue to determine and assign include instructions executable by the processor to:
determine, for a child node of the parent node, a tag type of the at least one tag of the child node; and
assign a heuristic score to the child node based at least in part of the tag type of the child node, wherein the computer readable instructions that assign the heuristic score to the child node include instructions executable by the processor to:
assign a first heuristic score to the child node without regard to the heuristic scores of other nodes in the node tree; and
add the first heuristic score to a heuristic score of the parent node to obtain a child node heuristic score; and
generate an object that includes content associated with nodes of the node tree having heuristic scores indicating that such content is of interest.
Dependent claims: 2, 3, 4, 5, 6
7. A method for extracting content of interest from at least one website, the method comprising:
obtaining source code used to generate the at least one website on a display, wherein the source code includes a plurality of elements and each element includes at least one tag comprising at least one tag type;
parsing the source code using a processor to obtain a node tree including a plurality of nodes arranged in a hierarchical structure, wherein each node comprises one of the elements, and wherein one of the plurality of nodes comprises a root node;
determining a tag type of a node under the root node;
assigning a heuristic score to the node based at least in part on the tag type of the node;
repeating the determining and assigning for one or more additional nodes of the node tree, wherein the node under the root node comprises a parent node, and wherein the repeating comprises:
determining, for a child node of the parent node, a tag type of the at least one tag of the child node; and
assigning a heuristic score to the child node based at least in part of the tag type of the child node, wherein the assigning a heuristic score to the child node comprises:
assigning a first heuristic score to the child node without regard to the heuristic scores of other nodes in the node tree; and
adding the first heuristic score to a heuristic score of the parent node to obtain a child node heuristic score; and
generating, using the processor, an object that includes content associated with nodes of the node tree having heuristic scores indicating that such content is of interest.
Dependent claims: 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
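The additive step recited in claim 7 can be written compactly. The notation is illustrative and does not appear in the patent: $h(t)$ is assumed to be the first heuristic score assigned to tag type $t$ without regard to other nodes, $s(n)$ the resulting heuristic score of node $n$, $p$ the parent of child node $c$, and $n_1$ a node directly under the root.

```latex
s(n_1) = h\bigl(\operatorname{tag}(n_1)\bigr),
\qquad
s(c) = h\bigl(\operatorname{tag}(c)\bigr) + s(p)
```

If the same rule is applied at every level of the tree, which the claim permits but does not require, the score of a node is the sum of the first scores along its path from the node under the root. For example, with assumed values $h(\text{article}) = 3$ and $h(\text{p}) = 2$, a paragraph inside an article directly under the root would receive $s = 2 + 3 = 5$.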
Specification