Document alignment systems for legacy document conversions
Abstract
A method for aligning documents which may be in different XML formats includes inputting source and target leaves of source and target documents in first and second tree structured formats and assigning a cost to each of a plurality of matches. Each match may include a source leaf and a target leaf or be an unmatched source or target leaf. Matches are identified for which a total cost is minimal, wherein each of the leaves is in at least one of the identified matches. From the identified matches, groups of two or more matches are identified which have a leaf in common. From the groups, probable matches are identified in which more than one target leaf is matched with at least one source leaf or more than one source leaf is matched with a target leaf. An alignment between leaves of the target document and leaves of the source document is output which includes the probable matches.
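The steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the cost function (one minus a `difflib` similarity ratio), the `UNMATCHED_COST` penalty, and the function names are all assumptions made for illustration, and the cover is built greedily per leaf rather than by minimizing the total cost globally. Leaves are represented simply as strings of text content, and a match is a pair of indices in which `None` stands for an unmatched leaf.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical penalty for leaving a leaf unmatched (not a value from the source).
UNMATCHED_COST = 0.6


def text_cost(a: str, b: str) -> float:
    """Cost of pairing two leaves from text content only; 0.0 means identical."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def greedy_cover(source: list, target: list) -> set:
    """Give every leaf its cheapest option: best partner, or unmatched (None).

    Greedy stand-in for the minimal-total-cost step; every leaf ends up
    in at least one match, but the total cost is not globally minimized.
    """
    matches = set()
    for i, s in enumerate(source):
        j = min(range(len(target)), key=lambda k: text_cost(s, target[k]), default=None)
        paired = j is not None and text_cost(s, target[j]) < UNMATCHED_COST
        matches.add((i, j) if paired else (i, None))
    for j, t in enumerate(target):
        i = min(range(len(source)), key=lambda k: text_cost(source[k], t), default=None)
        paired = i is not None and text_cost(source[i], t) < UNMATCHED_COST
        matches.add((i, j) if paired else (None, j))
    return matches


def probable_matches(matches: set) -> tuple:
    """Group matches sharing a leaf and keep the groups with several partners."""
    by_source, by_target = defaultdict(list), defaultdict(list)
    for i, j in matches:
        if i is not None and j is not None:
            by_source[i].append(j)
            by_target[j].append(i)
    # One source leaf matched with more than one target leaf.
    one_to_many = {i: sorted(js) for i, js in by_source.items() if len(js) > 1}
    # One target leaf matched with more than one source leaf.
    many_to_one = {j: sorted(srcs) for j, srcs in by_target.items() if len(srcs) > 1}
    return one_to_many, many_to_one


# A single source leaf whose text is split across two target leaves.
source = ["Chapter 1 Introduction"]
target = ["Chapter 1", "Introduction"]
one_to_many, many_to_one = probable_matches(greedy_cover(source, target))
```

In this toy case the source leaf is cheaper to pair with each target leaf than to leave either unmatched, so both pairs share the source leaf and form a group. Only leaf text enters the cost; the tree structure of either document is never consulted.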
26 Claims
1. A document alignment method comprising:
inputting source leaves of a source document in first tree structured format, the first tree structured format comprising nodes which are ultimately connected with the source leaves by paths, text content of the source document being distributed among the source leaves;
inputting target leaves of a target document in second tree structured format, the second tree structured format comprising nodes which are ultimately connected with the target leaves by paths, text content of the target document being distributed among the target leaves;
assigning a cost to each of a plurality of matches based on text content of the leaves, each match comprising elements selected from the group consisting of:
a source leaf and a target leaf, an unmatched source leaf, and an unmatched target leaf;
identifying a set of matches for which a total cost is minimal, wherein each of the input source and target leaves is in at least one of the identified matches;
identifying, from the set of identified matches, groups of matches wherein each match in the group has a leaf in common;
identifying, from the groups, probable matches in which more than one target leaf is matched with at least one source leaf and probable matches where more than one source leaf is matched with a target leaf;
outputting an alignment between leaves of the target document and leaves of the source document which includes the probable matches.
Dependent claims: 2, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 21, 22, 23, 24
5. A document alignment method comprising:
inputting source leaves of a source document in first tree structured format, the first tree structured format comprising nodes which are ultimately connected with the source leaves by paths, each source leaf comprising text content;
inputting target leaves of a target document in second tree structured format, the second tree structured format comprising nodes which are ultimately connected with the target leaves by paths, each target leaf comprising text content;
subdividing the leaves of the source document and the leaves of the target document into blocks, each block including a set of the source leaves and a set of the target leaves;
for each block, assigning a cost to each of a plurality of matches, each match comprising a pair of elements selected from the group consisting of a source leaf and a target leaf, an unmatched source leaf, and an unmatched target leaf from the same block;
identifying a set of matches for which a total cost is minimal, wherein each of the input source and target leaves is in at least one of the identified matches;
identifying, from the set of identified matches, groups of matches wherein each match in the group has a leaf in common;
identifying, from the groups, probable matches in which more than one target leaf is matched with at least one source leaf and probable matches where more than one source leaf is matched with a target leaf; and
outputting an alignment between leaves of the target document and leaves of the source document which includes the probable matches.
Dependent claims: 6, 25, 26
15. A supervised learning method comprising:
learning a transformation which converts a source document to a target document using source and target documents aligned by a method comprising:
inputting source leaves of a source document in first tree structured format, each source leaf comprising text content;
inputting target leaves of a target document in second tree structured format, each target leaf comprising text content;
assigning a cost to each of a plurality of matches based on leaf text content, each match consisting of one of the group consisting of:
a source leaf and a target leaf, an unmatched source leaf, and an unmatched target leaf;
identifying matches for which a total cost is minimal, wherein each of the leaves is in at least one of the identified matches;
identifying, from the identified matches, groups of matches wherein each match in the group has a leaf in common;
identifying, from the groups, probable matches in which more than one target leaf is matched with at least one source leaf and probable matches where more than one source leaf is matched with a target leaf;
outputting an alignment between leaves of the target document and leaves of the source document which includes the probable matches.
18. A document alignment apparatus comprising:
an input device for inputting source leaves of a source document in first tree structured format, the first tree structured format comprising nodes which are ultimately connected with source leaves by paths, and inputting target leaves of a target document in second tree structured format, the second tree structured format comprising nodes which are ultimately connected with the target leaves by paths, text content of the source document being distributed among the source leaves and text content of the target document being distributed among the target leaves;
memory for storing the input source and target leaves;
a processing module which assigns a cost to each of a plurality of matches based on leaf text content and not on the tree structure of the source and target documents, each match being selected from the group consisting of:
a source leaf and a target leaf, an unmatched source leaf, and an unmatched target leaf;
a processing module which identifies matches for which a total cost is minimal, wherein each of the leaves is in at least one of the identified matches;
a processing module which identifies, from the identified matches, groups of matches wherein each match in the group has a leaf in common;
a processing module which identifies, from the groups, probable matches in which more than one target leaf is matched with at least one source leaf and probable matches where more than one source leaf is matched with a target leaf; and
an output device for outputting an alignment between leaves of the target document and leaves of the source document which includes the identified probable matches.
Dependent claims: 19, 20
Specification