Identifying corrupted text segments
First Claim
1. A computer-implemented method comprising:
selecting a set of web pages containing text associated with a single language grouping;
determining a set of text segments within the set of web pages;
determining a language affinity indicator corresponding to each text segment in the set of text segments, the language affinity indicator being a comparison value of a text segment with a set of predefined rules corresponding to the single language grouping;
responsive to each language affinity indicator indicating an affinity to the single language grouping, identifying a set of text artefacts within the text segments;
generating an indexing repository based on the set of text artefacts;
creating an occurrence table from the indexing repository;
determining a compliance threshold value for the occurrence table;
identifying an individual occurrence value for each unique text artefact in the set of text artefacts, the individual occurrence value being the probability that a text artefact occurs within the occurrence table based on the single language grouping; and
determining a compliance value for the set of text segments by, for each text segment in the set of text segments:
computing a compliance sum value for a first text segment in the set of text segments;
adjusting the compliance sum value according to the individual occurrence values of a subset of text artefacts occurring in the first text segment;
determining a segment length value associated with the first text segment; and
adjusting the compliance sum value according to the segment length value;
responsive to computing a set of compliance sum values for each text segment in the set of text segments, computing the compliance value based on an average value of the set of compliance sum values;
computing a compliance indicator for the set of text segments by comparing the compliance value and the compliance threshold; and
responsive to the compliance indicator indicating that the compliance value is less than the compliance threshold, taking a corrective action;
wherein:
the corrective action is an action selected from the group consisting of:
notifying a user of a corrupted set of text segments in the selected set of web pages; and
preventing the selected set of web pages from being accessed again.
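The scoring steps of the claim can be sketched as follows. The claim does not say how the compliance sum value is "adjusted", so this sketch makes several assumptions: text artefacts are lowercase whitespace-delimited tokens, the adjustment by occurrence values is a plain sum of probabilities, the adjustment by segment length is a division by the token count, and unseen artefacts score zero. All function names are hypothetical.

```python
from statistics import mean


def tokenize(segment):
    """Split a segment into artefacts (assumption: lowercase whitespace tokens)."""
    return segment.lower().split()


def compliance_sum(segment, probabilities):
    """Score one segment from the occurrence values of its artefacts."""
    artefacts = tokenize(segment)
    if not artefacts:
        return 0.0
    total = 0.0
    for artefact in artefacts:
        # Assumed adjustment: add the artefact's occurrence probability;
        # artefacts missing from the occurrence table contribute nothing.
        total += probabilities.get(artefact, 0.0)
    # Assumed length adjustment: normalise by the number of artefacts.
    return total / len(artefacts)


def compliance_value(segments, probabilities):
    """Average the per-segment compliance sums over the whole set."""
    if not segments:
        return 0.0
    return mean(compliance_sum(s, probabilities) for s in segments)


def corrective_action(segments, probabilities, threshold):
    """Return an action label when the set scores below the threshold."""
    if compliance_value(segments, probabilities) < threshold:
        # The claim also allows blocking further access to the pages.
        return "notify_user"
    return None


# Illustrative occurrence values only; not taken from the patent.
example_probabilities = {"the": 0.4, "cat": 0.2, "sat": 0.2, "dog": 0.2}
```

With these illustrative values, a set such as ["the cat sat", "xq zzv kpl"] averages well below a threshold of 0.2, so corrective_action would return "notify_user"; a set containing only "the cat sat" would not.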
Abstract
A computer-implemented method for taking a corrective action upon determination of an existence of a corrupted text segment within a set of web pages. Determination includes: determining a language affinity indicator corresponding to text segments within the set of web pages; generating an indexing repository based on a set of text artefacts within the text segments; creating an occurrence table for the set of text artefacts; and determining compliance of the text artefacts and text segments based on the single language grouping on which the set of text segments are based.
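The earlier stages named in the abstract (language affinity, indexing, occurrence table) might look like the sketch below. It is only an illustration: the "predefined rules" are reduced to a single hypothetical rule (the share of letters drawn from the basic Latin alphabet), the indexing repository and occurrence table are merged into one counter, and the 0.9 cut-off is an arbitrary assumption.

```python
from collections import Counter

# Hypothetical rule set for one language grouping: basic Latin letters.
LATIN_LETTERS = frozenset("abcdefghijklmnopqrstuvwxyz")


def language_affinity(segment, allowed=LATIN_LETTERS):
    """Fraction of the segment's letters that belong to the allowed set."""
    letters = [ch for ch in segment.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in allowed for ch in letters) / len(letters)


def build_occurrence_table(segments, min_affinity=0.9):
    """Count artefacts, using only segments that pass the affinity check."""
    table = Counter()
    for segment in segments:
        if language_affinity(segment) >= min_affinity:
            # Assumption: artefacts are lowercase whitespace tokens.
            table.update(segment.lower().split())
    return table


def occurrence_probabilities(table):
    """Relative frequency of each unique artefact in the table."""
    total = sum(table.values())
    if total == 0:
        return {}
    return {artefact: count / total for artefact, count in table.items()}
```

For example, build_occurrence_table(["the cat sat", "the dog", "кошка сидела"]) skips the Cyrillic segment and counts "the" twice out of five artefacts, giving it an occurrence probability of 0.4.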
Specification