Method for scanning, analyzing and handling various kinds of digital information content
Abstract
Computer-implemented methods are described for, first, characterizing a specific category of information content (pornography, for example) and then accurately identifying instances of that category of content within a real-time media stream, such as a web page, e-mail or other digital dataset. This content-recognition technology enables a new class of highly scalable applications to manage such content, including filtering, classifying, prioritizing, tracking, etc. An illustrative application of the invention is a software product for use in conjunction with web-browser client software for screening access to web pages that contain pornography or other potentially harmful or offensive content. A target attribute set of regular expressions, such as natural language words and/or phrases, is formed by statistical analysis of a number of samples of datasets characterized as “containing,” and another set of samples characterized as “not containing,” the selected category of information content. This list of expressions is refined by applying correlation analysis to the samples or “training data.” Neural-network feed-forward techniques are then applied, again using a substantial training dataset, for adaptively assigning relative weights to each of the expressions in the target attribute set, thereby forming a weighted list that is highly predictive of the information content category of interest.
24 Claims
1. A computer-readable medium storing a computer program for use in conjunction with a web browser client program to rate a web page relative to a selected characteristic, the program comprising:
first means for identifying natural language textual portions of the web page and forming a list of words that appear in the identified natural language textual portions of the web page;
a database of predetermined words that are associated with the selected characteristic;
second means for querying the database to determine which of the list of words has a match in the database;
third means for acquiring a corresponding weight from the database for each such word having a match in the database so as to form a weighted set of terms; and
fourth means for calculating a rating for the web page responsive to the weighted set of terms, the calculating means including means for determining and taking into account a total number of natural language words that appear in the identified natural language textual portions of the web page.
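
For illustration only, the elements of claim 1 can be sketched in Python as follows. The claim does not fix a rating formula, so dividing the summed weights by the total word count is an assumption made here to show one way of "taking into account" that total. The names `WEIGHTS`, `word_list` and `rate_page`, and the weight values, are hypothetical.

```python
import re

# Hypothetical stand-in for the database of predetermined words and weights.
WEIGHTS = {"casino": 2.0, "jackpot": 1.5, "poker": 1.0}


def word_list(text):
    """Form the list of natural language words in the text (lowercased)."""
    return re.findall(r"[a-z']+", text.lower())


def rate_page(text, weights=WEIGHTS):
    """Rate a page as (sum of matched word weights) / (total word count).

    The normalization by total word count is an assumed formula, not one
    taken from the claim.
    """
    words = word_list(text)
    total = len(words)
    if total == 0:
        return 0.0
    # Query the database for each word and collect the weights of matches.
    weighted_terms = [weights[w] for w in words if w in weights]
    return sum(weighted_terms) / total
```

Under these assumptions, `rate_page("Poker night at the casino")` sums the weights 1.0 and 2.0 over five words, so a long page with a few matching words rates lower than a short page with the same matches.
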
5. A method of analyzing content of a digital data set, the method comprising the steps of:
identifying natural language textual portions of the data set;
forming a word list including all natural language words that appear in the textual portions of the data set;
for each word in the word list, querying a preexisting database of selected words to determine whether or not a match exists in the database;
for each word having a match in the database, reading a corresponding weight from the database so as to form a weighted set of terms; and
calculating a rating for the data set responsive to the weighted set of terms.
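
The first two steps of claim 5 (identifying the natural language textual portions and forming the word list) might look like the sketch below when the data set is an HTML page. HTML input is an assumption; the claim covers any digital data set. The class `TextExtractor` and function `extract_words` are hypothetical names, built on the standard library `html.parser` module.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text that lies outside <script> and <style> elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not script or style content.
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_words(html):
    """Return the lowercase words found in the textual portions of the page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z']+", " ".join(parser.chunks).lower())
```

The resulting word list is what the later steps would query against the database of selected words.
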
10. A method of building a target attribute set for use in analyzing content of a digital data set, the method comprising the steps of:
acquiring a plurality of sample data sets for use as training data sets;
designating each of the training data sets as “yes” or “no” with respect to a predetermined content characteristic;
parsing through the content of all of the training data sets to form a list of regular expressions that appear in the training data sets;
forming data reflecting a frequency of occurrence of each regular expression in the training data sets;
analyzing the frequency of occurrence data in view of the “yes” or “no” designation of each data set, to identify and select a set of regular expressions that are indicative of either a “yes” designation or a “no” designation of a data set with respect to the predetermined characteristic; and
storing the selected set of regular expressions to form a target attribute set based on the downloaded training pages, whereby the target attribute set provides a set of regular expressions that are useful in discriminating data set content relative to the predetermined content characteristic.
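
A minimal sketch of the steps of claim 10 follows, under two stated assumptions: single words stand in for the "regular expressions", and the selection criterion is the difference in document frequency between the "yes" and "no" samples. That criterion is a simple substitute chosen for illustration, not the statistical or correlation analysis the patent itself describes. The function names and the `top_n` parameter are hypothetical.

```python
import re
from collections import Counter


def tokens(text):
    """Distinct lowercase words in one training data set."""
    return set(re.findall(r"[a-z']+", text.lower()))


def build_target_attribute_set(samples, top_n=3):
    """samples: list of (text, label) pairs; label is True for "yes", False for "no".

    Returns the top_n words whose document frequency differs most between
    the "yes" and "no" samples (an assumed selection criterion).
    """
    yes_docs = [text for text, label in samples if label]
    no_docs = [text for text, label in samples if not label]

    # Frequency of occurrence: number of samples of each kind containing the word.
    yes_freq = Counter(w for text in yes_docs for w in tokens(text))
    no_freq = Counter(w for text in no_docs for w in tokens(text))

    scores = {}
    for word in set(yes_freq) | set(no_freq):
        p_yes = yes_freq[word] / max(len(yes_docs), 1)
        p_no = no_freq[word] / max(len(no_docs), 1)
        scores[word] = p_yes - p_no  # positive: indicates "yes"; negative: "no"

    # Largest absolute difference first; ties broken alphabetically.
    ranked = sorted(scores, key=lambda w: (-abs(scores[w]), w))
    return ranked[:top_n]
```

Words that appear equally often in both groups score near zero and are dropped, which matches the claim's goal of keeping only expressions that discriminate between the two designations.
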
14. A method of assigning weights to a list of regular expressions for use in analyzing content of a digital data set, the method comprising:
providing a predetermined target attribute set associated with a predetermined group of training data sets, the target attribute set including a list of regular expressions that are deemed useful for discriminating data set content relative to a predetermined content characteristic;
assigning an initial weight to each of the regular expressions in the target attribute set, thereby forming a weight database;
designating each of the group of training data sets as either “yes” or “no” relative to whether it exhibits the predetermined content characteristic;
examining one of the group of training data sets to identify all regular expressions within the data set that also appear in the target attribute set, thereby forming a match list for said data set;
in a neural network system, rating the examined data set using the weightings in the weight database;
comparing the rating of the examined data set to the corresponding “yes” or “no” designation to form a first error term;
repeating said examining, rating and comparing steps for each of the remaining data sets in the group of training data sets to form additional error terms; and
adjusting the weights in the weight database in response to the first and the additional error terms.
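
The examine, rate, compare and adjust loop of claim 14 can be illustrated with the smallest possible network: a single sigmoid unit trained by batch gradient updates. This is an assumed stand-in for the "neural network system" of the claim, whose architecture is not specified here. The functions `rate` and `train`, the bias term, and the `epochs` and `lr` parameters are hypothetical.

```python
import math


def rate(match_list, weights, bias):
    """Rating in (0, 1): sigmoid of the summed weights of matched expressions."""
    z = bias + sum(weights[e] for e in match_list)
    return 1.0 / (1.0 + math.exp(-z))


def train(training_sets, attribute_set, epochs=200, lr=0.5):
    """training_sets: list of (set_of_expressions, label); label 1 = "yes", 0 = "no"."""
    weights = {e: 0.0 for e in attribute_set}  # initial weight per expression
    bias = 0.0
    for _ in range(epochs):
        grad = {e: 0.0 for e in attribute_set}
        grad_bias = 0.0
        for expressions, label in training_sets:
            # Match list: expressions in this data set that are in the attribute set.
            match_list = [e for e in expressions if e in weights]
            # Error term: designation minus the current rating.
            error = label - rate(match_list, weights, bias)
            for e in match_list:
                grad[e] += error
            grad_bias += error
        # Adjust the weights once all error terms have been collected.
        for e in weights:
            weights[e] += lr * grad[e] / len(training_sets)
        bias += lr * grad_bias / len(training_sets)
    return weights, bias
```

Collecting every error term before updating mirrors the claim's order of steps; an online variant that adjusts after each data set would be an equally plausible reading.
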
16. A method for controlling access to potentially offensive or harmful web pages comprising the steps of:
in conjunction with a web browser client program executing on a digital computer, examining a downloaded web page before the web page is displayed to the user;
said examining step including analyzing the web page natural language content relative to a predetermined database of words to form a rating, the database including words previously associated with potentially offensive or harmful web pages, and the database further including a relative weighting associated with each word in the database for use in forming the rating;
comparing the rating of the downloaded web page to a predetermined threshold rating; and
if the rating indicates that the downloaded web page is more likely to be offensive or harmful than a web page having the threshold rating, blocking the downloaded web page from being displayed to the user.
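
The comparing and blocking steps of claim 16 reduce to a threshold test applied before display. In the sketch below, `screen_page` is a hypothetical function, the rating function is passed in as a parameter, and a higher rating is assumed to mean "more likely offensive or harmful"; the default threshold is an arbitrary example value.

```python
def screen_page(page_text, rate_fn, threshold=0.25):
    """Return the page text if it may be displayed, or None if it is blocked.

    rate_fn is any callable that maps page text to a numeric rating; a higher
    rating is assumed to mean more likely offensive or harmful.
    """
    rating = rate_fn(page_text)
    if rating > threshold:
        return None  # blocked before the page is shown to the user
    return page_text
```

A browser-side caller would run this check on each downloaded page and render only a non-`None` result.
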
21. A computer-readable medium storing a web search engine server program, the program comprising:
a data acquisition component for acquiring meta-content from target web sites into an internal database; and
an inquiry component for selecting and presenting meta-content from the internal database in response to an end-user request;
the data acquisition component including an analysis component that analyzes the content of web pages corresponding to the meta-content stored in the internal database, and returns a rating for each such web page; and
means for adding said returned ratings into the internal database as additional meta-content in association with the corresponding web pages.
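
As a rough sketch of the components of claim 21, the class below keeps meta-content in an in-memory dictionary, stores the rating returned by an analysis callable alongside it, and answers queries from that store. The class name `SearchIndex`, its fields, and the optional `max_rating` filter are all hypothetical; the independent claim does not itself say that results are filtered by rating.

```python
from dataclasses import dataclass, field


@dataclass
class SearchIndex:
    # Internal database: URL -> meta-content for that page.
    records: dict = field(default_factory=dict)

    def acquire(self, url, title, page_text, analyze):
        """Data acquisition: store meta-content plus the rating from analyze()."""
        self.records[url] = {"title": title, "rating": analyze(page_text)}

    def query(self, term, max_rating=None):
        """Inquiry: return (url, rating) pairs whose title contains the term.

        max_rating is an assumed option that hides pages rated above it.
        """
        hits = []
        for url, meta in self.records.items():
            if term.lower() not in meta["title"].lower():
                continue
            if max_rating is not None and meta["rating"] > max_rating:
                continue
            hits.append((url, meta["rating"]))
        return sorted(hits)
```

Because the rating is stored at acquisition time, queries can use it without re-analyzing page content on each request.
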
Specification