Method and apparatus for correcting a uniform resource identifier
Abstract
A computer program which causes a computer to correct a uniform resource identifier (URI) in a noisy source document. The program finds and corrects potential errors within a URI before turning the URI into a hyperlink. Testing the corrected URI is done by seeking the resource described by the corrected URI. Testing the URI also includes parsing the URI, identifying potential syntax errors within each portion of the URI, creating alternative URI combinations, and prioritizing the alternative URI combinations. Syntax errors corrected include incorrect protocol, incorrect or missing component separator characters, incorrect spacing, incorrect or missing dot character, and alphanumeric character replacement.
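The abstract says a corrected URI is tested by seeking the resource it describes. A minimal sketch of such a test follows, assuming Python's standard `urllib` as the lookup mechanism; the function name `resource_exists` and the timeout value are illustrative and do not come from the patent.

```python
import http.client
import urllib.request


def resource_exists(uri: str, timeout: float = 5.0) -> bool:
    """Return True if the resource named by `uri` can be opened.

    Assumption: "seeking the resource" is modelled as a plain open attempt;
    any failure (bad syntax, unknown host, HTTP error) counts as "not found".
    """
    try:
        with urllib.request.urlopen(uri, timeout=timeout):
            return True
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and HTTPError are subclasses of OSError; ValueError covers
        # strings that are not recognizable as URLs at all.
        return False
```

A caller would try the URI as detected first and fall back to correction only when this test fails.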
12 Claims
1. A program storage medium readable by a computer, tangibly embodying a program of instructions executable by the computer to perform method steps for correcting a uniform resource identifier (URI) in a noisy source document, said method steps comprising:
detecting the URI within the noisy source document;
attempting to find a first resource identified by the URI; and
correcting the URI when the first resource is not found, including:
  identifying a potential-separator-confused character within the URI,
  testing for validity a beginning portion of the URI, the beginning portion starting with a first character of the URI and ending with an alphanumeric character immediately preceding the identified potential-separator-confused character, and
  replacing the identified potential-separator-confused character with a component separator character when the beginning portion tests as valid.
(Dependent claims: 2, 3, 6)
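The correction step of claim 1 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the set of "potential-separator-confused" characters is a guess (the claim does not enumerate them), and the validity test is passed in as a hypothetical `is_valid` callable that stands in for both the resource lookup and the test of the beginning portion.

```python
from typing import Callable, Optional

# Assumption: characters that noisy text (for example OCR output) may show
# where a "/" was intended. The claim does not list them.
SEPARATOR_CONFUSED = frozenset("lI1|\\")
SEPARATOR = "/"


def replace_confused_separator(
    uri: str, is_valid: Callable[[str], bool]
) -> Optional[str]:
    """Replace the first suspect character whose preceding portion is valid."""
    for i in range(1, len(uri)):
        if uri[i] not in SEPARATOR_CONFUSED:
            continue
        # The beginning portion must end with an alphanumeric character
        # immediately preceding the suspect character.
        if not uri[i - 1].isalnum():
            continue
        if is_valid(uri[:i]):
            return uri[:i] + SEPARATOR + uri[i + 1:]
    return None


def correct_uri(uri: str, is_valid: Callable[[str], bool]) -> str:
    """Correct the URI only when the resource it names is not found."""
    if is_valid(uri):
        return uri
    corrected = replace_confused_separator(uri, is_valid)
    return corrected if corrected is not None else uri
```

With a lookup that recognizes only `http://example.com`, the input `http://example.comldocs` is rewritten to `http://example.com/docs`: the `l` inside `example` is skipped because `http://examp` does not test as valid.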
4. A program storage medium readable by a computer, tangibly embodying a program of instructions executable by the computer to perform method steps for correcting a uniform resource identifier (URI) in a noisy source document, said method steps comprising:
detecting the URI within the noisy source document;
attempting to find a first resource identified by the URI; and
correcting the URI when the first resource is not found, including:
  testing for validity a combination, the combination including the URI and a word immediately following the URI in the source document, and
  substituting the combination for the URI when the combination tests as valid.
(Dependent claims: 5)
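Claim 4 covers the case where stray whitespace has split a URI in two. A rough sketch follows, assuming a deliberately simple regular expression as the detector and a hypothetical `is_valid` callable as the validity test; neither is specified by the claim.

```python
import re
from typing import Callable, List

# Assumption: a deliberately simple detector for http/https URIs.
URI_PATTERN = re.compile(r"https?://\S+")


def join_with_next_word(text: str, is_valid: Callable[[str], bool]) -> List[str]:
    """Detect URIs in `text`; if one is not found, try appending the next word."""
    results = []
    for match in URI_PATTERN.finditer(text):
        uri = match.group()
        if not is_valid(uri):
            # The word immediately following the URI in the source document.
            following = text[match.end():].split(maxsplit=1)
            if following:
                combination = uri + following[0]
                if is_valid(combination):
                    uri = combination
        results.append(uri)
    return results
```

The original URI is kept whenever the combination does not test as valid, so the function never makes a working link worse.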
7. A program storage medium readable by a computer, tangibly embodying a program of instructions executable by the computer to perform method steps for correcting a corrupted uniform resource identifier in noisy text, said method steps comprising:
identifying a potential mistake within the uniform resource identifier (URI);
testing for validity a portion of the URI prior to the potential mistake;
creating a first list, the first list containing a plurality of alternative URI combinations derived from the portion;
adding the first list to a first queue when the portion tests as invalid;
adding the first list to a second queue when the portion tests as valid;
taking a first alternative URI combination from the first queue when the second queue is empty;
taking the first alternative URI combination from the second queue when the second queue is not empty; and
testing for validity the first alternative URI combination.
(Dependent claims: 8)
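The two queues in claim 7 act as a priority scheme: alternatives derived from a portion that tested valid are tried before those derived from a portion that tested invalid. A minimal sketch follows, assuming FIFO queues and assuming the members of a list are enqueued individually; the claim fixes neither detail, and the function names are illustrative.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Optional


def enqueue_alternatives(
    portion: str,
    alternatives: Iterable[str],
    is_valid: Callable[[str], bool],
    first_queue: Deque[str],
    second_queue: Deque[str],
) -> None:
    """Route the list to the second queue if the portion is valid, else the first."""
    target = second_queue if is_valid(portion) else first_queue
    target.extend(alternatives)


def take_next(first_queue: Deque[str], second_queue: Deque[str]) -> Optional[str]:
    """Take from the second queue when it is not empty, otherwise from the first."""
    if second_queue:
        return second_queue.popleft()
    if first_queue:
        return first_queue.popleft()
    return None
```

Claim 12 describes the same flow with a single queue, which corresponds to using only `first_queue` here.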
9. A method for correcting a uniform resource identifier (URI) in a noisy source document comprising:
detecting the URI within the noisy source document;
attempting to find a first resource identified by the URI; and
correcting the URI when the first resource is not found, including:
  identifying a potential-separator-confused character within the URI,
  testing for validity a beginning portion of the URI, the beginning portion starting with a first character of the URI and ending with an alphanumeric character immediately preceding the identified potential-separator-confused character, and
  replacing the identified potential-separator-confused character with a component separator character when the beginning portion tests as valid.
(Dependent claims: 10)
11. A method for correcting a uniform resource identifier (URI) in a noisy source document comprising:
detecting the URI within the noisy source document;
attempting to find a first resource identified by the URI; and
correcting the URI when the first resource is not found, including:
  identifying a potential mistake within the uniform resource identifier (URI),
  testing for validity a portion of the URI prior to the potential mistake,
  creating a first list, the first list containing a plurality of alternative URI combinations derived from the portion,
  adding the first list to a first queue when the portion tests as invalid,
  adding the first list to a second queue when the portion tests as valid,
  taking a first alternative URI combination from the first queue when the second queue is empty,
  taking the first alternative URI combination from the second queue when the second queue is not empty,
  testing for validity the first alternative URI combination,
  creating a second list when the first alternative URI combination does not test as invalid and does not end in a separator character, the second list containing a plurality of alternative URI combinations derived from the first alternative URI combination,
  adding the second list to the first queue when the first alternative URI combination tests as invalid,
  adding the second list to the second queue when the first alternative URI combination tests as valid,
  taking a second alternative URI combination from the first queue when the second queue is empty,
  taking a second alternative URI combination from the second queue when the second queue is not empty, and
  testing for validity the second alternative URI combination.
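Claim 11 repeats the take, test, and expand cycle for a second alternative, which suggests a loop. The sketch below is a loose reading and simplifies the claim: it expands every candidate that does not end in a separator, routes the new list by the candidate's validity, and stops at the first candidate whose resource is found. The callables `expand`, `is_valid`, and `resource_found`, the `/` separator, and the `max_steps` bound are all assumptions introduced for illustration.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Optional


def search_corrections(
    start: str,
    expand: Callable[[str], Iterable[str]],
    is_valid: Callable[[str], bool],
    resource_found: Callable[[str], bool],
    max_steps: int = 100,
) -> Optional[str]:
    """Search alternative URI combinations, preferring valid-derived ones."""
    if resource_found(start):
        return start
    first_queue: Deque[str] = deque()   # lists derived from invalid portions
    second_queue: Deque[str] = deque()  # lists derived from valid portions
    (second_queue if is_valid(start) else first_queue).extend(expand(start))
    for _ in range(max_steps):
        # Take from the second queue when it is not empty, else from the first.
        if second_queue:
            candidate = second_queue.popleft()
        elif first_queue:
            candidate = first_queue.popleft()
        else:
            return None
        if resource_found(candidate):
            return candidate
        # Assumption: "/" is the separator; such candidates are not expanded.
        if not candidate.endswith("/"):
            target = second_queue if is_valid(candidate) else first_queue
            target.extend(expand(candidate))
    return None
```

The `max_steps` bound is a practical guard, not something the claim requires; without it a generous `expand` function could keep the search running indefinitely.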
12. A program storage medium readable by a computer, tangibly embodying a program of instructions executable by the computer to perform method steps for correcting a corrupted uniform resource identifier in noisy text, said method steps comprising:
identifying a potential mistake within the uniform resource identifier (URI);
testing for validity a portion of the URI prior to the potential mistake;
creating a list containing a plurality of alternative URI combinations derived from the portion;
adding the list to a queue when the portion tests as invalid;
taking a first alternative URI combination from the queue; and
testing for validity the first alternative URI combination.
Specification