Method for scanning, analyzing and rating digital information content
Abstract
Computer-implemented methods are described for, first, characterizing a specific category of information content (pornography, for example) and then accurately identifying instances of that category of content within a real-time media stream, such as a web page, e-mail or other digital dataset. This content-recognition technology enables a new class of highly scalable applications to manage such content, including filtering, classifying, prioritizing, tracking, etc. An illustrative application of the invention is a software product for use in conjunction with web-browser client software for screening access to web pages that contain pornography or other potentially harmful or offensive content. A target attribute set of regular expressions, such as natural language words and/or phrases, is formed by statistical analysis of a number of samples of datasets characterized as “containing,” and another set of samples characterized as “not containing,” the selected category of information content. This list of expressions is refined by applying correlation analysis to the samples or “training data.” Neural-network feed-forward techniques are then applied, again using a substantial training dataset, for adaptively assigning relative weights to each of the expressions in the target attribute set, thereby forming a weighted list that is highly predictive of the information content category of interest.
30 Claims
1. A computer-readable medium storing a computer program for use in conjunction with a web browser client program to rate a web page relative to a selected characteristic, the program comprising:
first means for identifying natural language textual portions of the web page and forming a list of words that appear in the identified natural language textual portions of the web page;
a database of predetermined words that are associated with the selected characteristic;
second means for acquiring a corresponding weight from the database for each such word having a match in the database so as to form a weighted set of terms; and
neural network means for calculating a rating for the web page responsive to the weighted set of terms, the neural network means including means for determining and taking into account a total number of natural language words that appear in the identified natural language textual portions of the web page.
2. A computer-readable medium storing a computer program for use in conjunction with a web browser client program to rate a web page according to claim 1 wherein the selected characteristic is pornographic content; and
the database includes a predetermined list of words and phrases that are associated with web pages having pornographic content.
3. A computer-readable medium storing a computer program for use in conjunction with a web browser client program to rate a web page according to claim 1 and further comprising means for storing a predetermined threshold rating, and means for comparing the calculated rating to the threshold rating to determine whether the web page likely has the selected characteristic.
4. A computer-readable medium storing a computer program for use in conjunction with a web browser client program to rate a web page according to claim 1 wherein the selected characteristic is hate-mongering content; and
the database includes a predetermined list of words and phrases that are associated with web pages having hate-mongering content.
5. A computer-readable medium storing a computer program for use in conjunction with a web browser client program to rate a web page according to claim 1 wherein the selected characteristic is racist content; and
the database includes a predetermined list of words and phrases that are associated with web pages having racist content.
6. A computer-readable medium storing a computer program for use in conjunction with a web browser client program to rate a web page according to claim 1 wherein the selected characteristic is terrorist content; and
the database includes a predetermined list of words and phrases that are associated with web pages having terrorist content.
7. A computer-readable medium storing a computer program for use in conjunction with a web browser client program to rate a web page according to claim 1 wherein the selected characteristic is neo-Nazi content; and
the database includes a predetermined list of words and phrases that are associated with web pages having neo-Nazi content.
8. A computer-readable medium storing a computer program for use in conjunction with a web browser client program to rate a web page according to claim 1 wherein the selected characteristic is illicit drugs content; and
the database includes a predetermined list of words and phrases that are associated with web pages having content pertaining to illicit drugs.
9. A computer-readable medium storing a computer program for use in conjunction with a web browser client program to rate a web page according to claim 1 wherein the selected characteristic is content selected as presenting a liability risk to persons having managerial responsibility for the web page material accessed by others; and
the database includes a predetermined list of words and phrases that are associated with web pages having content likely to present a liability risk to persons having managerial responsibility for the web page material accessed by others.
10. A method of analyzing content of a web page, the method comprising:
identifying natural language textual portions of the web page;
forming a word listing including all natural language words that appear in the textual portion of the web page;
for each word in the word list, querying a preexisting database of selected words to determine whether or not a match exists in the database;
for each word having a match in the database, reading a corresponding weight from the database so as to form a weighted set of terms; and
in a neural network system, calculating a rating for the web page responsive to the weighted set of terms.
11. A method according to claim 10 further comprising:
identifying meta-content in the web page; and
identifying words from the meta-content of the web page in the word list so that the meta-content is taken into account in calculating the rating for the web page.
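The word-list steps of claims 10 and 11 can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `PageTextExtractor`, the function `word_list`, the choice to skip `script` and `style` elements, and the decision to treat only `keywords` and `description` meta tags as meta-content are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser


class PageTextExtractor(HTMLParser):
    """Collects visible text and selected <meta> content from an HTML page."""

    # Assumption: these elements never hold natural language text.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "meta":
            attr_map = dict(attrs)
            name = (attr_map.get("name") or "").lower()
            # Assumption: only keywords/description count as meta-content.
            if name in {"keywords", "description"}:
                self.chunks.append(attr_map.get("content") or "")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def word_list(html):
    """Return every natural language word on the page, in order, lowercased."""
    parser = PageTextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z']+", " ".join(parser.chunks).lower())
```

Because the meta-content words land in the same list as the body words, any later rating step takes them into account without special handling.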
12. A method according to claim 10 wherein said calculating step includes:
summing the weighted set of terms together to form a sum;
multiplying the sum by a predetermined modifier to scale the sum;
determining a total number of words on the web page; and
dividing the scaled sum by the total number of words on the web page to form the rating.
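The calculation in claim 12 amounts to rating = (sum of matched weights × modifier) / total word count. A minimal sketch, assuming the weight database is an in-memory dictionary, that every occurrence of a matched word contributes its weight, and that the default `modifier` value is arbitrary (the claim only calls it "predetermined"):

```python
def rate_page(words, weights, modifier=1000.0):
    """Rate a page from its word list and a word-to-weight mapping.

    words    -- all natural language words found on the page
    weights  -- dict standing in for the weight database (assumption)
    modifier -- predetermined scaling factor (default is illustrative only)
    """
    total = len(words)
    if total == 0:
        return 0.0  # assumption: an empty page gets a neutral rating
    # Sum the weights of the words that have a match in the database.
    weighted_sum = sum(weights[w] for w in words if w in weights)
    # Scale the sum, then normalize by the total number of words.
    return weighted_sum * modifier / total
```

Dividing by the total word count keeps a long page with a few matched words from outscoring a short page made up mostly of matched words.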
13. A method according to claim 10 wherein the preexisting database comprises words selected as indicative of pornographic content.
14. A method according to claim 10 wherein the preexisting database comprises words selected as indicative of hate-mongering content.
15. A method according to claim 10 wherein the preexisting database comprises words selected as indicative of racist content.
16. A method according to claim 10 wherein the preexisting database comprises words selected as indicative of terrorist content.
17. A method according to claim 10 wherein the preexisting database comprises words selected as indicative of neo-Nazi content.
18. A method according to claim 10 wherein the preexisting database comprises words selected as indicative of content pertaining to illicit drugs.
19. A method of building a target attribute set for use in analyzing content of a web page, the method comprising:
acquiring a plurality of sample web pages for use as training web pages;
designating each of the training data sets as “yes” or “no” with respect to a predetermined content characteristic;
parsing through the content of all the training web pages to form a list of regular expressions that appear in the training web pages;
forming data reflecting a frequency of occurrence of each regular expression in the training web pages;
analyzing the frequency of occurrence data, in view of the “yes” or “no” designation of each web page, to identify and select a set of regular expressions that are indicative of either a “yes” designation or a “no” designation of a web page with respect to the predetermined characteristic; and
storing the selected set of regular expressions to form a target attribute set based on the acquired training web pages, whereby the target attribute set provides a set of regular expressions that are useful in a neural network system in discriminating web page content relative to the predetermined content characteristic.
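One way to picture the selection step of claim 19 is sketched below. The claim does not fix a selection statistic, so the criterion used here (the gap between the fraction of "yes" pages and the fraction of "no" pages containing a term) is an assumption, as are the function name `build_target_attribute_set`, the `min_gap` parameter, and the simplification of treating single lowercase words as the regular expressions.

```python
import re
from collections import Counter


def build_target_attribute_set(pages, min_gap=0.5):
    """Select terms whose frequency differs sharply between yes and no pages.

    pages   -- list of (text, label); label is True for "yes", False for "no"
    min_gap -- hypothetical cut-off on the difference in page frequency
    Returns a dict mapping each selected term to its signed frequency gap.
    """
    yes_total = sum(1 for _, label in pages if label)
    no_total = len(pages) - yes_total
    if yes_total == 0 or no_total == 0:
        raise ValueError("need at least one 'yes' and one 'no' training page")

    # Count, per class, how many pages contain each term.
    yes_df, no_df = Counter(), Counter()
    for text, label in pages:
        terms = set(re.findall(r"[a-z']+", text.lower()))
        (yes_df if label else no_df).update(terms)

    selected = {}
    for term in set(yes_df) | set(no_df):
        gap = yes_df[term] / yes_total - no_df[term] / no_total
        if abs(gap) >= min_gap:  # indicative of either "yes" or "no"
            selected[term] = gap
    return selected
```

Terms with a positive gap point toward a "yes" designation and terms with a negative gap toward "no"; terms that are equally common in both groups are dropped.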
20. A method of assigning weights to a list of regular expressions for use in analyzing content of a web page, the method comprising:
providing a predetermined target attribute set associated with a predetermined group of training web pages, the target attribute set including a list of regular expressions that are deemed useful in a neural network system for discriminating web page content relative to a predetermined content characteristic;
assigning an initial weight to each of the regular expressions in the target attribute set, thereby forming a weight database;
designating each of the group of training web pages as either “yes” or “no” relative to whether it exhibits the predetermined content characteristic;
examining one of the group of training web pages to identify all regular expressions within the web page that also appear in the target attribute set, thereby forming a match list for said web page;
in a neural network system, rating the examined web page using the weightings in the weight database;
comparing the rating of the examined web page to the corresponding “yes” or “no” designation to form a first error term;
repeating said examining, rating and comparing operations for each of the remaining web pages in the group of training web pages to form additional error terms; and
adjusting the weights in the weight database in response to the first and the additional error terms.
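The examine, rate, compare and adjust loop of claim 20 can be illustrated with a single logistic unit trained by a batch delta rule. This is a sketch under stated assumptions, not the patent's network: the logistic rating function, the `learning_rate`, `epochs` and `initial_weight` values, and the representation of each training page as a set of terms are all choices made for the example.

```python
import math


def rate(matches, weights):
    """Logistic output of one unit over the matched expressions (assumption)."""
    activation = sum(weights[t] for t in matches)
    return 1.0 / (1.0 + math.exp(-activation))


def train(pages, attribute_set, epochs=200, learning_rate=0.5, initial_weight=0.0):
    """Adjust weights from the error terms of all training pages.

    pages         -- list of (set_of_terms_on_page, label); 1.0 = yes, 0.0 = no
    attribute_set -- iterable of terms from the target attribute set
    """
    if not pages:
        raise ValueError("need at least one training page")
    # Assign an initial weight to each expression: the weight database.
    weights = {term: initial_weight for term in attribute_set}
    for _ in range(epochs):
        adjustment = {term: 0.0 for term in weights}
        for terms, label in pages:
            matches = [t for t in terms if t in weights]  # match list
            error = label - rate(matches, weights)         # error term
            for t in matches:
                adjustment[t] += error
        # Adjust the weights only after every page has produced its error term.
        for t in weights:
            weights[t] += learning_rate * adjustment[t] / len(pages)
    return weights
```

Accumulating the error terms over the whole group before updating mirrors the order of operations in the claim; an online variant that updates after each page would be an equally plausible reading of the general technique.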
22. A method of controlling access to potentially offensive or harmful web pages comprising:
in conjunction with a web browser client program executing on a digital computer, examining a downloaded web page before the web page is displayed to the user;
said examining operation including analyzing the web page natural language content relative to a predetermined database of regular expressions, and using a neural network system to form a rating, the database including regular expressions previously associated with potentially offensive or harmful web pages; and
the database further including a relative weighting associated with each regular expression in the database for use in forming the rating;
comparing the rating of the downloaded web page to a predetermined threshold rating; and
if the rating indicates that the downloaded web page is more likely to be offensive or harmful than a web page having the threshold rating, blocking the downloaded web page from being displayed to the user.
23. A method according to claim 22 further comprising, if the downloaded web page is blocked, displaying an alternative web page to the user.
24. A method according to claim 23 wherein said displaying an alternative web page includes generating or selecting the alternative web page responsive to a predetermined categorization of the user.
25. A method according to claim 23 wherein the alternative web page includes an indication of the reason that the downloaded web page was blocked.
26. A method according to claim 22 wherein the alternative web page includes one or more links to other web pages selected as age-appropriate in view of a predetermined categorization of the user.
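The blocking behavior of claims 22 through 26 might look like the sketch below. Everything named here is hypothetical: the `screen_page` function, the `SAFE_LINKS` table and its example.org URLs, the user categories, and the convention that a higher rating means more likely offensive.

```python
from html import escape

# Hypothetical table of age-appropriate links per user category.
SAFE_LINKS = {
    "child": ["https://example.org/kids"],
    "teen": ["https://example.org/teens"],
}


def screen_page(page_html, rating, threshold, user_category="child"):
    """Return the page to display: the original, or an alternative if blocked.

    Assumption: a rating above the threshold means the page is more likely
    to be offensive or harmful than a page at the threshold.
    """
    if rating <= threshold:
        return page_html
    # Blocked: build an alternative page that states the reason and offers
    # links chosen for the user's category.
    links = SAFE_LINKS.get(user_category, [])
    items = "".join(
        f'<li><a href="{escape(url)}">{escape(url)}</a></li>' for url in links
    )
    return (
        "<html><body>"
        f"<p>Page blocked: rating {rating:.2f} exceeds threshold {threshold:.2f}.</p>"
        f"<ul>{items}</ul>"
        "</body></html>"
    )
```

The check runs before anything is handed to the display layer, so a blocked page is never rendered, which matches the "before the web page is displayed" wording of claim 22.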
27. A computer-readable medium storing a web search engine server program, the program comprising:
a data acquisition component for acquiring meta-content from target web sites into an internal database; and
an inquiry component for selecting and presenting meta-content from the internal database in response to an end-user request;
the data acquisition component including an analysis component that analyzes the content of web pages corresponding to the meta-content stored in the internal database, and a neural network subsystem that returns a rating for each such web page based on the result of said analysis; and
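means for adding said returned ratings into the internal database as additional meta-content in association with the corresponding web pages.
28. A computer-readable medium storing a web search engine server program according to claim 27 wherein the analysis component includes: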
first means for identifying natural language textual portions of the web page and forming a list of words that appear in the identified natural language textual portions of the web page;
a second internal database of predetermined words that are associated with the selected characteristic;
second means for querying the second internal database to determine which of the list of words has a match in the database;
third means for acquiring a corresponding weight from the second internal database for each such word having a match in the second internal database so as to form a weighted set of terms; and
fourth means for calculating a rating for the web page responsive to the weighted set of terms, the calculating means including means for determining and taking into account a total number of natural language words that appear in the identified natural language textual portions of the web page.
29. A computer-readable medium storing a web search engine server program according to claim 27, and further comprising means for including the additional meta-content in said presenting meta-content from the internal database in response to an end-user request.
30. A computer-readable medium storing a web search engine server program according to claim 27, and further comprising means for modifying the meta-content results presented in response to an end-user request based upon the said ratings.
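The search engine arrangement of claims 27, 29 and 30 can be sketched as a small in-memory index that stores each page's rating alongside its other meta-content and uses it when answering queries. The `Entry` and `SearchIndex` names, the title-substring matching, and the `max_rating` filter are illustrative assumptions; the rating function is passed in rather than defined here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Entry:
    url: str
    title: str
    rating: float  # stored as additional meta-content for the page


class SearchIndex:
    """Toy internal database with an acquisition and an inquiry component."""

    def __init__(self, rate_fn: Callable[[str], float]):
        self._rate_fn = rate_fn  # stands in for the neural network subsystem
        self._entries: List[Entry] = []

    def acquire(self, url: str, title: str, page_text: str) -> None:
        # Analyze the page content and store the rating with the meta-content.
        self._entries.append(Entry(url, title, self._rate_fn(page_text)))

    def query(self, term: str, max_rating: Optional[float] = None) -> List[Entry]:
        # Select matching entries, then modify the results based on the ratings:
        # drop entries above max_rating and list the lowest-rated pages first.
        hits = [e for e in self._entries if term.lower() in e.title.lower()]
        if max_rating is not None:
            hits = [e for e in hits if e.rating <= max_rating]
        return sorted(hits, key=lambda e: e.rating)
```

Returning `Entry` objects with the rating attached corresponds to presenting the rating as part of the meta-content (claim 29), while the filter and sort correspond to modifying the presented results (claim 30).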
Specification