Systems and methods for facilitating open source intelligence gathering

US 8,620,849 B2
Filed: 03/10/2011
Issued: 12/31/2013
Est. Priority Date: 03/10/2010
Status: Active Grant

Abstract

Systems and methods (e.g., utilities) for use in providing automated, lightweight collection of online, open source data which may be content-based to reduce website source bias. In one aspect, a utility is disclosed for use in extracting content of interest from at least one website or other online data source (e.g., where the extracted content can be used in a subsequent search query). In other aspects, utilities are disclosed that are operable to perform various types of analyses on such extracted content and present graphical representations of such analyses on a display of a client device.

17 Claims

1. A website content extraction system, comprising:
- a processor; and
  
  a memory logically connected to the processor and comprising a set of computer readable instructions executable by the processor to:
  
  obtain source code used to generate the website on a display, wherein the source code includes a plurality of elements and each element includes at least one tag comprising at least one tag type;
  
  parse the source code to obtain a node tree including a plurality of nodes arranged in a hierarchical structure, wherein each node comprises one of the elements, and wherein one of the plurality of nodes comprises a root node;
  
  determine a tag type of a node under the root node;
  
  assign a heuristic score to the node based at least in part on the tag type of the node;
  
  continue to determine and assign for one or more additional nodes of the node tree, wherein the node under the root node comprises a parent node, and wherein the computer readable instructions that continue to determine and assign include instructions executable by the processor to:
  
  determine, for a child node of the parent node, a tag type of the at least one tag of the child node; and
  
  assign a heuristic score to the child node based at least in part of the tag type of the child node, wherein the computer readable instructions that assign the heuristic score to the child node include instructions executable by the processor to:
  
  assign a first heuristic score to the child node without regard to the heuristic scores of other nodes in the node tree; and
  
  add the first heuristic score to a heuristic score of the parent node to obtain a child node heuristic score; and
  
  generate an object that includes content associated with nodes of the node tree having heuristic scores indicating that such content is of interest.
  - 2. The system of claim 1, wherein the computer readable instructions that assign the heuristic score include instructions executable by the processor to:
    - allocate a first heuristic score to a first node responsive to the tag type for the first node being an “HTML a” tag, and
      
      allocate a second heuristic score to the first node responsive to the tag type for the first node being other than an HTML a tag, the second heuristic score being different than the first heuristic score.
  - 3. The system of claim 1, wherein the tag type of the node is determined to be an “HTML list” tag, and wherein the computer readable instructions further comprise instructions executable by the processor to:
    - delete the node and any corresponding child nodes responsive to the assigned score being greater than a first heuristic score;
      
      otherwise, continue to determine and assign on a subsequent node.
  - 4. The system of claim 1, wherein the node includes at least one child node, and wherein the computer readable instructions further comprise instructions executable by the processor to:
    - delete the node and the at least one child node responsive to the assigned score being greater than a first heuristic score;
      
      otherwise, continue to determine and assign on a subsequent node.
  - 5. The system of claim 2, wherein the computer readable instructions that allocate the first heuristic score include instructions executable by the processor to:
    - allocate a third heuristic score to the first node responsive to the HTML a tag lacking an href attribute or including an href attribute starting with #;
      
      otherwise allocate a fourth heuristic score to the first node, the fourth heuristic score being less than the third heuristic score and greater than the second heuristic score.
  - 6. The system of claim 2, wherein the computer readable instructions that allocate the second heuristic score include instructions executable by the processor to:
    - allocate a third heuristic score to the first node responsive to the tag type for the first node being an “HTML text” tag;
      
      otherwise allocate a fourth heuristic score to the first node, the fourth heuristic score being less than the first heuristic score and greater than the third heuristic score.
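
The scoring scheme recited in claims 1 through 6 can be illustrated with a minimal Python sketch. This is not the patented implementation: the class and function names (`Node`, `TreeBuilder`, `own_score`, `extract`, `extract_content`), the numeric score values, and the `THRESHOLD` cutoff are all assumptions chosen only to satisfy the orderings the claims state (text below other tags, other tags below a linked `a` tag, and a linked `a` tag below an `a` tag with no `href` or a `#` fragment). Each child score is its own tag-based score added to the parent's score, and a node whose score exceeds the cutoff is dropped together with its children.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # elements with no end tag

@dataclass
class Node:
    tag: str                                   # tag type; "#text" marks a text node
    attrs: dict = field(default_factory=dict)
    text: str = ""
    children: list = field(default_factory=list)
    score: int = 0

class TreeBuilder(HTMLParser):
    """Parses source code into a node tree under a synthetic root node."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs))
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node("#text", text=data.strip()))

# Hypothetical values; the claims fix only their relative order.
SCORE_TEXT, SCORE_OTHER, SCORE_LINK, SCORE_DEAD_LINK = 0, 1, 2, 3
THRESHOLD = 4  # hypothetical cutoff above which a subtree is deleted

def own_score(node):
    """Score from the tag type alone, without regard to other nodes."""
    if node.tag == "a":
        href = node.attrs.get("href")
        if href is None or href.startswith("#"):
            return SCORE_DEAD_LINK
        return SCORE_LINK
    return SCORE_TEXT if node.tag == "#text" else SCORE_OTHER

def extract(node, parent_score=0, out=None):
    out = [] if out is None else out
    for child in node.children:
        child.score = own_score(child) + parent_score  # add the parent's score
        if child.score > THRESHOLD:
            continue  # drop node and children, move on to the next sibling
        if child.tag == "#text":
            out.append(child.text)
        extract(child, child.score, out)
    return out

def extract_content(html):
    builder = TreeBuilder()
    builder.feed(html)
    return {"text": " ".join(extract(builder.root))}

page = "<body><ul><li><a href='#top'>Top</a></li></ul><p>Real story text.</p></body>"
print(extract_content(page))  # {'text': 'Real story text.'}
```

In this example the navigation link accumulates a score of 6 (body 1, ul 1, li 1, a 3) and is deleted, while the paragraph text stays at 2 and is kept. Real pages would need tuned scores and handling for `script` and `style` content, which this sketch ignores.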

7. A method for extracting content of interest from at least one website, the method comprising:
- obtaining source code used to generate the at least one website on a display, wherein the source code includes a plurality of elements and each element includes at least one tag comprising at least one tag type;
  
  parsing the source code using a processor to obtain a node tree including a plurality of nodes arranged in a hierarchical structure, wherein each node comprises one of the elements, and wherein one of the plurality of nodes comprises a root node;
  
  determining a tag type of a node under the root node;
  
  assigning a heuristic score to the node based at least in part on the tag type of the node;
  
  repeating the determining and assigning for one or more additional nodes of the node tree, wherein the node under the root node comprises a parent node, and wherein the repeating comprises:
  
  determining, for a child node of the parent node, a tag type of the at least one tag of the child node; and
  
  assigning a heuristic score to the child node based at least in part of the tag type of the child node, wherein the assigning a heuristic score to the child node comprises:
  
  assigning a first heuristic score to the child node without regard to the heuristic scores of other nodes in the node tree; and
  
  adding the first heuristic score to a heuristic score of the parent node to obtain a child node heuristic score; and
  
  generating, using the processor, an object that includes content associated with nodes of the node tree having heuristic scores indicating that such content is of interest.
  - 8. The method of claim 7, wherein the assigning comprises:
    - allocating a first heuristic score to a first node responsive to the tag type for the first node being an “HTML a” tag, and
      
      allocating a second heuristic score to the first node responsive to the tag type for the first node being other than an HTML a tag, the second heuristic score being different than the first heuristic score.
  - 9. The method of claim 8, wherein the allocating a first heuristic score comprises:
    - allocating a third heuristic score to the first node responsive to the HTML a tag lacking an href attribute or including an href attribute starting with #;
      
      otherwise allocating a fourth heuristic score to the first node, the fourth heuristic score being less than the third heuristic score and greater than the second heuristic score.
  - 10. The method of claim 8, wherein the allocating a second heuristic score comprises:
    - allocating a third heuristic score to the first node responsive to the tag type for the first node being an “HTML text” tag;
      
      otherwise allocating a fourth heuristic score to the first node, the fourth heuristic score being less than the first heuristic score and greater than the third heuristic score.
  - 11. The method of claim 7, wherein the tag type of the node is determined to be an “HTML list” tag, and wherein the method further comprises:
    - deleting the node and any corresponding child nodes responsive to the assigned score being greater than a first heuristic score;
      
      otherwise, performing the repeating on a subsequent node.
  - 12. The method of claim 11, wherein the subsequent node is a sibling node.
  - 13. The method of claim 7, wherein the node includes at least one child node, and wherein the method further comprises:
    - deleting the node and the at least one child node responsive to the assigned score being greater than a first heuristic score;
      
      otherwise, performing the repeating on a subsequent node.
  - 14. The method of claim 7, further comprising:
    - performing each of the obtaining, parsing, determining, assigning, repeating and generating steps for additional websites to obtain objects for each of the websites including content of interest.
  - 15. The method of claim 14, further comprising:
    - receiving the x most frequently disclosed terms in the objects of the additional websites during a time period, wherein x is a positive number; and
      
      presenting, on a display, a first graphical representation illustrating a sentiment of each of the x most frequently disclosed terms during the time period.
  - 16. The method of claim 14, further comprising:
    - identifying at least one textual hierarchy including at least first and second levels, wherein the first level comprises at least one textual category and the second level comprises at least one term that describes the at least one textual category;
      
      determining a number of occurrences of the at least one term from the objects of the websites during a time period;
      
      first obtaining, using a processing engine, hierarchical signatures of the at least one term that represent a prevalence of the at least one term on the websites;
      
      second obtaining, from the first obtaining step, hierarchical signature of the at least one textual category that represent a prevalence of the at least one textual category on the websites;
      
      establishing hierarchical signatures for the websites utilizing the hierarchical signatures of the at least one term and/or at least one textual category; and
      
      presenting, on a display, graphical representations of the hierarchical signatures of the websites, wherein the graphical representations illustrate the prevalence of the at least one term and/or at least one textual category on the websites.
  - 17. The method of claim 14, wherein each of the websites is identified by a uniform resource locator (URL), and wherein the method further comprises:
    - obtaining additional URLs from the objects of the websites; and
      
      presenting, on a display, a representation of an online information flow network that includes a graphical representation of information flows from the additional URLs to the URLs of the websites.
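
The multi-website steps in claims 14, 15 and 17 can be sketched in Python as follows. This is only an illustration under stated assumptions: the shape of the extracted objects (a dict keyed by site URL with a `text` field), the function names `top_terms` and `flow_edges`, and the regular expressions used for tokenizing and URL matching are all hypothetical, and the sentiment and display steps of the claims are not covered.

```python
import re
from collections import Counter

def top_terms(objects, x):
    """Return the x most frequent terms across all extracted objects."""
    counts = Counter()
    for obj in objects.values():
        counts.update(re.findall(r"[a-z']+", obj["text"].lower()))
    return counts.most_common(x)

def flow_edges(objects):
    """Return (additional_url, site_url) pairs: one edge per URL found in an object."""
    edges = []
    for site_url, obj in objects.items():
        for found in re.findall(r"https?://[^\s\"'<>]+", obj["text"]):
            if found != site_url:
                edges.append((found, site_url))
    return edges

# Hypothetical objects, as if produced by running the extraction on two websites.
objects = {
    "http://a.example/post": {"text": "Flood warning issued. Flood maps at http://b.example/maps"},
    "http://c.example/news": {"text": "Flood relief update"},
}

print(top_terms(objects, 1))  # [('flood', 3)]
print(flow_edges(objects))    # [('http://b.example/maps', 'http://a.example/post')]
```

The edge direction follows claim 17, which describes information flowing from the additional URLs to the URLs of the websites that mention them.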

Current Assignee: Leidos Innovations Technology, Inc.; Lockheed Martin Corporation (Martin Marietta Corporation)
Original Assignee: Lockheed Martin Corporation (Martin Marietta Corporation)
Inventors: Moitra, Abha; Bracewell, David Brian; Gustafson, Steven Matt; Baylor, T. Michael; Chau, Tina H.
Primary Examiner(s): Gaffin, Jeffrey A
Assistant Examiner(s): Chubb, Mikayla
Application Number: US13/045,128
Publication Number: US 20110225115A1
Time in Patent Office: 1,027 Days
Field of Search: None
US Class Current: 706/50
CPC Class Codes:
- G06F 16/951 Indexing; Web crawling tech...
- G06F 16/958 Organisation or management ...
- G06F 3/04817 using icons graphical or vi...
- G06F 3/0482 Interaction with lists of s...
- G06N 5/02 Knowledge representation; S...
