Methods for obtaining improved text similarity measures which replace similar characters with a string pattern representation by using a semantic data tree
Abstract
The embodiments of the invention provide methods for obtaining improved text similarity measures. More specifically, a method of measuring similarity between at least two electronic documents begins by identifying similar terms between the electronic documents. Similarity between terms is based on patterns, which can include word patterns, letter patterns, numeric patterns, and/or alphanumeric patterns. Identifying the similar terms also includes identifying multiple pattern types between the electronic documents. Moreover, the pattern-based similarity identifies terms within the electronic documents that fall within a category of a hierarchy. Specifically, the identification reviews a hierarchical data tree whose nodes represent terms within the electronic documents: lower nodes of the tree hold specific terms, and higher nodes hold general terms.
12 Claims
1. A computer-implemented method for data integration between electronic documents, said method comprising:
identifying, by a computer, first similar terms between said electronic documents, each of said first similar terms referring to a same kind of entity and comprising an alphanumeric string, including a first number of letter characters followed by a second number of numeric characters;
replacing, by said computer, each of said first similar terms with a single equivalent term that replaces one or more of said second number of numeric characters with a common string pattern representation to create transformed electronic documents;
identifying, by said computer, second similar terms between said two documents, each of said second similar terms belonging to a different level of a same hierarchical semantic data tree, wherein a lower node of said hierarchical semantic data tree represent a specific term and a higher node represents a general term;
replacing, by said computer, said specific term of said lower node of said two documents with said general term of said higher node to create transformed electronic documents;
performing, by said computer, a similarity comparison on said transformed electronic documents; and
outputting, by said computer, a unified view to a user, such that said unified view comprises said transformed electronic documents.
Dependent claims: 2, 3, 4, 5
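The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `#` placeholder, the `PARENT` hierarchy map, the sample documents, and the use of Jaccard token overlap as the "similarity comparison" are all assumptions made for illustration.

```python
import re

# Hypothetical two-level semantic tree: specific (lower) term -> general (higher) term.
PARENT = {"laptop": "computer", "desktop": "computer"}

def replace_numeric_suffix(text):
    """Replace the digits of letters-then-digits terms with a common pattern ('#')."""
    return re.sub(r"\b([A-Za-z]+)[0-9]+\b", r"\1#", text)

def generalize(text):
    """Replace each specific term with its general parent term, if one is known."""
    return " ".join(PARENT.get(tok.lower(), tok) for tok in text.split())

def transform(text):
    """Apply both replacement steps to produce a transformed document."""
    return generalize(replace_numeric_suffix(text))

def jaccard(a, b):
    """Token-set Jaccard similarity (an assumed choice of similarity measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

# Illustrative documents (made up for this sketch).
doc_a = "laptop model TP100 shipped"
doc_b = "desktop model TP200 shipped"

if __name__ == "__main__":
    print(jaccard(doc_a, doc_b))                        # similarity before transformation
    print(transform(doc_a))                             # transformed document
    print(jaccard(transform(doc_a), transform(doc_b)))  # similarity after transformation
```

In this toy case the two documents share only half of their tokens before transformation and become identical afterwards, which is the effect the claim language describes.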
6. A computer-implemented method for data integration between electronic documents, said method comprising:
identifying, by a computer, similar terms between said electronic documents, each of said similar terms referring to a same kind of entity and comprising an alphanumeric string, including a first number of letter characters and a second number of numeric characters;
replacing, by said computer, each of said similar terms with a single equivalent term that replaces one or more of said first number of letter characters and one or more of said second number of numeric characters with a common string pattern representation to create transformed electronic documents;
performing, by said computer, a similarity comparison on said transformed electronic documents; and
outputting, by said computer, a unified view to a user, such that said unified view comprises said transformed electronic documents.
Dependent claims: 7, 8
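Claim 6 differs from claim 1 in that both letter and numeric characters are replaced by a common pattern. One possible reading, sketched below, maps every letter to `A` and every digit to `9` in terms that mix the two; the function names and the `A`/`9` encoding are assumptions for illustration, not taken from the patent.

```python
def shape(term):
    """Map each letter to 'A' and each digit to '9'; keep other characters as-is."""
    out = []
    for ch in term:
        if ch.isalpha():
            out.append("A")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)

def normalize(text):
    """Rewrite only the tokens that contain both letters and digits."""
    result = []
    for tok in text.split():
        has_letter = any(c.isalpha() for c in tok)
        has_digit = any(c.isdigit() for c in tok)
        result.append(shape(tok) if has_letter and has_digit else tok)
    return " ".join(result)

if __name__ == "__main__":
    # Two made-up identifiers of the same kind collapse to one equivalent term.
    print(normalize("invoice QX4471 paid"))
    print(normalize("invoice ZR0032 paid"))
```

Under this encoding, identifiers with the same layout of letters and digits become the same token, so a later similarity comparison treats them as matching.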
9. A computer-implemented method for data integration between electronic documents, said method comprising:
identifying, by a computer, similar terms between said electronic documents, said identifying comprising basing similarity between said similar terms on string patterns of letter or numeric characters, said basing of said similarity on string patterns comprising identifying terms within said electronic documents, wherein said string patterns comprise at least one of letter patterns, numeric patterns, and alphanumeric patterns including letter and numeric characters;
replacing, by said computer, each of said similar terms with a single equivalent term that replaces said string patterns with a common string pattern representation to create transformed electronic documents;
performing, by said computer, a similarity comparison on said transformed electronic documents; and
outputting, by said computer, a unified view to a user, such that said unified view comprises said transformed electronic documents comprising said equivalent terms.
Dependent claims: 10, 11
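Claim 9 names three pattern types: letter, numeric, and alphanumeric. As a rough sketch, a term's pattern type could be detected with Python's built-in string predicates and each matching term swapped for a per-type placeholder. The placeholder strings and the choice to leave letter-only terms untouched are assumptions of this sketch, not details from the patent.

```python
def pattern_type(term):
    """Classify a term as 'letter', 'numeric', 'alphanumeric', or None."""
    if term.isalpha():
        return "letter"
    if term.isdigit():
        return "numeric"
    if term.isalnum():
        return "alphanumeric"
    return None

# Hypothetical common representations; letter-only terms are left unchanged here.
PLACEHOLDER = {"numeric": "<NUM>", "alphanumeric": "<ALNUM>"}

def replace_patterns(text):
    """Replace numeric and alphanumeric terms with a common placeholder per type."""
    return " ".join(PLACEHOLDER.get(pattern_type(tok), tok) for tok in text.split())

if __name__ == "__main__":
    print(replace_patterns("order 4471 ref QX17 closed"))
```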
12. A non-transitory computer program storage device, storing computer readable instructions executable by a computer for performing a method of data integration between electronic documents, said method comprising:
identifying first similar terms between said electronic documents, each of said first similar terms referring to a same kind of entity and comprising an alphanumeric string, including a first number of letter characters followed by a second number of numeric characters;
replacing each of said first similar terms with a single equivalent term that replaces one or more of said second number of numeric characters with a common string pattern representation to create transformed electronic documents;
identifying second similar terms between said two documents, each of said second similar terms belonging to a different level of a same hierarchical semantic data tree, wherein a lower node of said hierarchical semantic data tree represent a specific term and a higher node represents a general term;
replacing said specific term of said lower node of said two documents with said general term of said higher node to create transformed electronic documents;
performing a similarity comparison on said transformed electronic documents; and
outputting a unified view to a user, such that said unified view comprises said transformed electronic documents.
Specification