Record boundary identification and extraction through pattern mining
Abstract
Techniques for identifying discrete records within a multi-record document are provided. According to one technique, a document is encoded based on some combination of visual tag encoding, text category encoding, and text content encoding that produces hash values based on the contents of portions of the document. According to one technique, repeating candidate patterns are identified in a document so encoded. The candidate patterns may be identified in a “fuzzy” manner that allows for some inconsistencies in the individual pattern instances. According to one technique, the identified candidate patterns are validated based on specified factors to determine a “best” pattern. According to one technique, the boundaries of discrete records in a multi-record document are marked based on the portions of the document that correspond to an identified repeating pattern.
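The last technique in the abstract, marking record boundaries from an identified repeating pattern, can be illustrated with a minimal sketch. It assumes the document has already been encoded as a list of symbols and that a single pattern has already been chosen. The function name `record_spans`, the exact (non-fuzzy) matching, and the rule that a record runs from one pattern occurrence to the next are assumptions made for illustration, not details taken from the patent.

```python
def record_spans(symbols, pattern):
    """Return (start, end) index pairs, one per record.

    Assumption: a record starts where `pattern` occurs and extends up to
    the next occurrence (or to the end of the sequence).
    """
    pattern = tuple(pattern)
    n = len(pattern)
    starts = []
    i = 0
    while i + n <= len(symbols):
        if tuple(symbols[i:i + n]) == pattern:
            starts.append(i)
            i += n  # skip past this occurrence so matches do not overlap
        else:
            i += 1
    ends = starts[1:] + [len(symbols)]
    return list(zip(starts, ends))


# Hypothetical encoded document: a header symbol followed by two records.
encoded = ["H", "R", "D", "T0", "d", "D", "T1", "d", "r",
           "R", "D", "T0", "d", "D", "T2", "d", "r"]
spans = record_spans(encoded, ("R", "D", "T0"))
```

The fuzzy matching and pattern validation that the abstract mentions are not attempted here; an exact match keeps the sketch short.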
26 Claims
1. A method of encoding data, the method comprising:
locating, by a computer in a set of data, one or more primary data items that match one or more patterns in a set of specified patterns, wherein the patterns in the set of specified patterns correspond to symbols;
for each primary data item of the one or more primary data items, performing steps comprising:
determining the symbol that is associated with the pattern that the primary data item matches, and
replacing the primary data item in the set of data with the symbol;
after performing the replacing, locating, in the set of data, one or more secondary data items that existed in the set of data prior to the replacing;
for each secondary data item of the one or more secondary data items, performing particular steps comprising:
generating a hash value based on the secondary data item; and
replacing the secondary data item in the set of data with a symbol that is associated with the hash value;
wherein the set of data corresponds to a first document that contains multiple records; and
determining boundaries for each record in the first document based on symbols that have replaced the secondary data items in the set of data.
Dependent claims: 2, 3, 5, 14, 15, 16, 18
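The two replacement passes recited in claim 1 can be sketched as follows. This is an illustrative reading only: the pattern table `TAG_SYMBOLS`, the symbol names, the choice of SHA-1, and the `T0`, `T1`, ... naming for hash-derived symbols are all assumptions, and the claim does not prescribe any of them.

```python
import hashlib
import re

# Hypothetical "set of specified patterns", each mapped to a symbol.
TAG_SYMBOLS = {
    r"<tr\b[^>]*>": "R",
    r"</tr>": "r",
    r"<td\b[^>]*>": "D",
    r"</td>": "d",
}
_TAG_RE = re.compile("|".join(f"({p})" for p in TAG_SYMBOLS), re.IGNORECASE)
_SYMBOLS = list(TAG_SYMBOLS.values())


def replace_primary(document):
    """Pass 1: swap each pattern match for its symbol; keep other text raw."""
    items, pos = [], 0
    for m in _TAG_RE.finditer(document):
        raw = document[pos:m.start()].strip()
        if raw:
            items.append(("raw", raw))
        # m.lastindex identifies which alternative (pattern) matched.
        items.append(("sym", _SYMBOLS[m.lastindex - 1]))
        pos = m.end()
    tail = document[pos:].strip()
    if tail:
        items.append(("raw", tail))
    return items


def replace_secondary(items):
    """Pass 2: hash each leftover text run and swap it for a symbol."""
    by_hash = {}  # hash value -> symbol, assigned in first-seen order
    out = []
    for kind, value in items:
        if kind == "raw":
            digest = hashlib.sha1(value.encode("utf-8")).hexdigest()
            value = by_hash.setdefault(digest, f"T{len(by_hash)}")
        out.append(value)
    return out


doc = "<tr><td>Name:</td><td>Ann</td></tr><tr><td>Name:</td><td>Bob</td></tr>"
encoded = replace_secondary(replace_primary(doc))
```

In this sample, the label "Name:" hashes to the same symbol in both rows, while "Ann" and "Bob" receive different symbols. Repeated hash-derived symbols of this kind are what a later boundary-determination step could rely on.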
4. A method of encoding data, the method comprising:
locating, by a computer in a set of data, one or more primary data items that match one or more patterns in a set of specified patterns, wherein the patterns in the set of specified patterns correspond to symbols;
for each primary data item of the one or more primary data items, performing steps comprising:
determining the symbol that is associated with the pattern that the primary data item matches, and
replacing the primary data item in the set of data with the symbol;
after performing the replacing, locating, in the set of data, one or more secondary data items that existed in the set of data prior to the replacing;
for each secondary data item of the one or more secondary data items, generating a hash value based on the secondary data item;
determining whether a particular primary data item is an HTML tag;
in response to determining that the particular primary data item is an HTML tag, determining whether a type of the HTML tag is a type that is included in a specified set of HTML tag types; and
in response to determining that the type is not included in the specified set of HTML tag types, removing the particular primary data item from the set of data.
Dependent claims: 17
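The tag-filtering steps at the end of claim 4 might look like the sketch below. The contents of `KEPT_TAG_TYPES` and the simple regular expression used to recognise tags are assumptions chosen for illustration; the claim only requires that some specified set of HTML tag types exists.

```python
import re

# Hypothetical "specified set of HTML tag types" that are retained.
KEPT_TAG_TYPES = {"table", "tr", "td", "div", "li"}

# Simplified tag matcher: opening or closing tag, capturing the tag type.
_TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)\b[^>]*>")


def drop_unlisted_tags(html):
    """Remove every HTML tag whose type is not in KEPT_TAG_TYPES."""
    def keep_or_remove(match):
        tag_type = match.group(1).lower()
        return match.group(0) if tag_type in KEPT_TAG_TYPES else ""
    return _TAG.sub(keep_or_remove, html)


cleaned = drop_unlisted_tags(
    "<tr><td><b>Ann</b> <font color='red'>x</font></td></tr>"
)
```

Here the `b` and `font` tags are removed because their types are not in the set, while the text they enclosed is left in place.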
6. A method of identifying patterns, the method comprising:
locating, by a computer in a primary sequence of symbols, one or more secondary sequences, each of which consists of two or more consecutive occurrences of a particular symbol;
for each secondary sequence of the one or more secondary sequences, replacing the secondary sequence in the primary sequence with a single occurrence of a symbol that occurs within the secondary sequence;
after performing the replacing for each secondary sequence of the one or more secondary sequences, locating, in the primary sequence, one or more tertiary sequences, each of which consists of two or more consecutive occurrences of a particular two-symbol sequence; and
for each tertiary sequence of the one or more tertiary sequences, replacing the tertiary sequence in the primary sequence with a single occurrence of a two-symbol sequence that occurs within the tertiary sequence;
for each different secondary sequence of the one or more secondary sequences, adding, to a list of candidate repeating patterns, a single occurrence of a symbol that occurs within the different secondary sequence; and
for each different tertiary sequence of the one or more tertiary sequences, adding, to the list of candidate repeating patterns, a single occurrence of a two-symbol sequence that occurs within the different tertiary sequence;
determining boundaries for each record in the first document based on one or more candidate repeating patterns in the list of candidate repeating patterns;
extracting the records from the first document based on the boundaries, and compiling the records with additional records that are contained in a second document to form a third document that contains records from the first document and records from the second document; and
wherein the primary sequence of symbols corresponds to a first document that contains multiple records.
Dependent claims: 7, 8, 9, 10, 11, 12, 13, 19, 20, 21, 22, 23, 24, 25, 26
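The run-collapsing steps of claim 6 can be sketched with one helper applied twice, first for single symbols and then for two-symbol sequences. The function names `collapse_repeats` and `find_candidates` and the left-to-right greedy scan are assumptions made for illustration; the claim does not specify a scanning order.

```python
def collapse_repeats(seq, n, candidates):
    """Collapse runs of two or more consecutive copies of an n-symbol unit.

    Each collapsed unit is appended once to `candidates` (the list of
    candidate repeating patterns) if it is not already there.
    """
    out, i = [], 0
    while i < len(seq):
        unit = tuple(seq[i:i + n])
        count = 1
        # Count how many times the unit repeats back to back.
        while len(unit) == n and tuple(seq[i + n * count:i + n * (count + 1)]) == unit:
            count += 1
        if count >= 2:
            out.extend(unit)  # keep a single occurrence
            if unit not in candidates:
                candidates.append(unit)
            i += n * count
        else:
            out.append(seq[i])
            i += 1
    return out


def find_candidates(seq):
    """Apply the single-symbol pass, then the two-symbol pass."""
    candidates = []
    reduced = collapse_repeats(seq, 1, candidates)
    reduced = collapse_repeats(reduced, 2, candidates)
    return reduced, candidates


reduced, candidates = find_candidates(list("ABBBCDCDCDE"))
```

For the sample sequence, the run `BBB` collapses to `B` in the first pass and the run `CDCDCD` collapses to `CD` in the second, so both `B` and `CD` end up in the candidate list.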
Specification