Systems and methods for content extraction from mark-up language text accessible at an internet domain

US 9,372,838 B2
Filed: 05/23/2013
Issued: 06/21/2016
Est. Priority Date: 03/30/2005
Status: Active Grant

Abstract

Systems and methods are presented for content extraction from markup language text. The content extraction process may parse markup language text into a hierarchical data model and then apply one or more filters. Output filters may be used to make the process more versatile. The operation of the content extraction process and the one or more filters may be controlled by one or more settings set by a user, or automatically by a classifier. The classifier may automatically enter settings by classifying markup language text and entering settings based on this classification. Automatic classification may be performed by clustering unclassified markup language texts with previously classified markup language texts.
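
The pipeline the abstract describes (parse markup into a hierarchical model, then apply filters whose settings depend on a classification) can be illustrated roughly as follows. This is a minimal sketch, not the patented implementation: the `Node` and `TreeBuilder` classes, the `SETTINGS_BY_CLASS` table, the class labels, and the choice of a tag-removal filter are all hypothetical, and only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

class Node:
    """One element of the hierarchical model; children are Nodes or text strings."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

class TreeBuilder(HTMLParser):
    """Builds a simple tree of Node objects from markup text."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Only close the element if it matches the one currently open.
        if tag not in VOID_TAGS and self.current.tag == tag and self.current.parent:
            self.current = self.current.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())

def remove_tags(node, tags):
    """Filter: drop every subtree whose tag is in `tags`."""
    node.children = [c for c in node.children
                     if isinstance(c, str) or c.tag not in tags]
    for child in node.children:
        if isinstance(child, Node):
            remove_tags(child, tags)

def text_of(node):
    """Collect the remaining text in document order."""
    parts = [c if isinstance(c, str) else text_of(c) for c in node.children]
    return " ".join(p for p in parts if p)

# Hypothetical per-classification filter settings (assumption, not from the patent).
SETTINGS_BY_CLASS = {
    "news": {"remove": {"script", "iframe", "form"}},
    "default": {"remove": {"script"}},
}

def extract(markup, classification="default"):
    settings = SETTINGS_BY_CLASS.get(classification, SETTINGS_BY_CLASS["default"])
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    remove_tags(builder.root, settings["remove"])
    return text_of(builder.root)

page = ("<div><h1>Title</h1><script>var x=1;</script>"
        "<form>Search<input name='q'></form><p>Body text</p></div>")
print(extract(page, "news"))
print(extract(page, "default"))
```

In this sketch the classification only selects which tags the filter removes; a user-supplied settings dictionary could be passed in the same way.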

57 Citations

8 Claims

1. A method performed by a computer processor for automatically classifying a markup language text that is accessible at an Internet domain comprising:
- retrieving from one or more data repositories, data associated with the Internet domain;
  
  computing a first identifier for the Internet domain based on at least the data associated with the Internet domain and the markup language text;
  
  computing a measure of similarity between the computed first identifier and each of a first plurality of previously classified identifiers; and
  
  assigning the markup language text a classification based on the computed measure of similarity between the computed first identifier and each of the first plurality of previously classified identifiers,
  
  wherein computing the first identifier comprises computing, for each of a plurality of words in a predetermined set of words, a frequency of each word in the markup language text and the search result, and
  
  wherein the predetermined set of words are generated by:
  
  retrieving the markup language text from the Internet domain;
  
  retrieving search results associated with the Internet domain from one or more search engines;
  
  computing a frequency for each of a plurality of words in the search results and the markup language text; and
  
  adding to the predetermined set of words each of the plurality of words whose frequency is greater than a threshold.
  - 2. The method of claim 1, wherein the classification assigned to the markup language text is the same classification as that of the previously classified identifier with the best measure of similarity to the computed first identifier.
  - 3. The method of claim 1, wherein the classification assigned to the markup language text is a new classification.
  - 4. The method of claim 1, wherein the one or more data repositories are search engines and the data associated with the Internet domain is a search result.
  - 5. The method of claim 1, further comprising adding to the predetermined set of words each of the plurality of words whose frequency is one.
  - 6. The method of claim 1, wherein computing the measure of similarity comprises computing the Manhattan distance between the computed first identifiers and each of the first plurality of previously classified and previously computed identifiers.
  - 7. The method of claim 1, further comprising retrieving settings for a filter based on the classification assigned to the markup language text.
  - 8. The method of claim 1, further comprising:
    - (a) computing a second identifier for the markup language text based on the layout of the markup language text;
      
      (b) computing a measure of similarity between the second identifier and each of a second plurality of previously classified identifiers; and
      
      (c) assigning the markup language text a classification based on both the first identifier and the second identifier.
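
The steps recited in claims 1, 2, 3, and 6 can be sketched as follows: build the word set from words whose frequency exceeds a threshold, compute a word-frequency identifier, and assign the classification of the nearest previously classified identifier by Manhattan distance. This is an illustrative sketch under stated assumptions, not the claimed implementation: the tokenizer, the tag-stripping regex, the function names, the example labels, and the `new_class_distance` cutoff used to trigger a new classification are all hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Strip markup tags (crudely) and split into lowercase words. Assumption."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"[a-z]+", text.lower())

def word_counts(markup_text, search_results):
    """Word frequencies over the markup text plus the search results."""
    counts = Counter(tokenize(markup_text))
    for result in search_results:
        counts.update(tokenize(result))
    return counts

def build_word_set(markup_text, search_results, threshold):
    """Keep each word whose frequency is greater than the threshold."""
    counts = word_counts(markup_text, search_results)
    return sorted(w for w, c in counts.items() if c > threshold)

def compute_identifier(word_set, markup_text, search_results):
    """Identifier = frequency of each word of the word set, in a fixed order."""
    counts = word_counts(markup_text, search_results)
    return [counts[w] for w in word_set]

def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length identifiers."""
    return sum(abs(x - y) for x, y in zip(a, b))

def classify(identifier, classified, new_class_distance=None):
    """Return (label, distance) of the closest previously classified identifier.

    If `new_class_distance` is given and even the best match is farther than
    that, return the label "new" instead (hypothetical rule for claim 3).
    """
    label, dist = min(((lbl, manhattan(identifier, vec)) for lbl, vec in classified),
                      key=lambda pair: pair[1])
    if new_class_distance is not None and dist > new_class_distance:
        return "new", dist
    return label, dist

# Toy data, invented for illustration only.
word_set = build_word_set("<p>news news sports</p>", ["news sports weather"], threshold=1)
identifier = compute_identifier(word_set, "<h1>news</h1> news sports", ["sports scores"])
classified = [("news", [5, 1]), ("sports", [1, 4])]
print(word_set, identifier, classify(identifier, classified))
```

Here a smaller distance is treated as the "best measure of similarity"; any other similarity measure could be substituted in `classify` without changing the rest of the sketch.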

Current Assignee: Trustees Of Columbia University In The City Of New York (Columbia University)
Original Assignee: Trustees Of Columbia University In The City Of New York (Columbia University)
Inventors: Gupta, Suhit; Kaiser, Gail; Stolfo, Salvatore J
Primary Examiner(s): Nguyen, Maikhanh
Application Number: US13/900,912
Publication Number: US 20130326332A1
Time in Patent Office: 1,125 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06F 16/80   of semi-structured data, e....
G06F 16/84   Mapping; Conversion
G06F 16/951   Indexing; Web crawling tech...
G06F 40/143   Markup, e.g. Standard Gener...
