Entity normalization via name normalization
First Claim
1. A computer-implemented method of identifying duplicate objects in a plurality of objects, wherein each object in the plurality of objects is associated with one or more facts, and each of the one or more facts has an attribute and a value, the method comprising, using a computer processor to perform:
associating facts extracted from web documents with a plurality of objects; and
for each of the plurality of objects, normalizing the value of a name fact by applying at least one normalization rule from a group of normalization rules to the value of the name fact, the name fact being among one or more facts associated with the object;
based on the normalized values of the name facts, grouping the plurality of objects into a plurality of buckets, each object in a bucket having the same normalized value of a name fact; and
processing the plurality of objects in a bucket in accordance with the normalized value of the name fact of the plurality of objects to identify at least one pair of duplicate objects in the plurality of objects in the bucket, based on a similarity of values of facts other than the name fact for the objects in the bucket; and
removing one of the duplicate objects from a memory repository, wherein the group of normalization rules includes at least one rule selected from the group of:
removing social titles;
removing predefined adjective words;
removing single letter words;
removing punctuation marks;
removing stop words; and
converting uppercase characters into lowercase.
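The six normalization rules listed in the claim can be illustrated with a minimal sketch. The claim names the rule categories but not their contents, so the word lists `SOCIAL_TITLES`, `ADJECTIVES`, and `STOP_WORDS`, the function name `normalize_name`, and the order in which the rules are applied are all assumptions made for illustration, not the patented implementation.

```python
import string

# Hypothetical word lists: the claim does not enumerate them.
SOCIAL_TITLES = {"mr", "mrs", "ms", "dr", "prof"}
ADJECTIVES = {"honorable", "late"}
STOP_WORDS = {"the", "of", "and", "a", "an"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_name(value: str) -> str:
    """Apply the six claimed rule categories to the value of a name fact."""
    # Rule: convert uppercase characters into lowercase.
    text = value.lower()
    # Rule: remove punctuation marks.
    text = text.translate(_PUNCT_TABLE)
    kept = []
    for word in text.split():
        if word in SOCIAL_TITLES:   # Rule: remove social titles.
            continue
        if word in ADJECTIVES:      # Rule: remove predefined adjective words.
            continue
        if word in STOP_WORDS:      # Rule: remove stop words.
            continue
        if len(word) == 1:          # Rule: remove single letter words.
            continue
        kept.append(word)
    return " ".join(kept)


print(normalize_name("Dr. John F. Kennedy"))         # john kennedy
print(normalize_name("The Honorable JOHN KENNEDY"))  # john kennedy
```

Under these assumed lists, both spellings reduce to the same normalized value, which is what lets the later grouping step place the two objects in the same bucket.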
Abstract
Systems and methods for normalizing entities via name normalization are disclosed. In some implementations, a computer-implemented method of identifying duplicate objects in a plurality of objects is provided. Each object in the plurality of objects is associated with one or more facts, and each of the one or more facts has a value. The method includes using a computer processor to perform: associating facts extracted from web documents with a plurality of objects; for each of the plurality of objects, normalizing the value of a name fact, the name fact being among one or more facts associated with the object; and processing the plurality of objects in accordance with the normalized values of the name facts of the plurality of objects. In some implementations, normalizing the value of the name fact is optionally carried out by applying a group of normalization rules to the value of the name fact.
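The overall flow (normalize each name fact, group objects into buckets by normalized name, compare the remaining facts within each bucket, and remove one object of each duplicate pair) can be sketched as follows. This is only an illustration under stated assumptions: objects are modeled as dictionaries of attribute to value, the stand-in `normalize_name` applies just two of the rules, and the Jaccard measure and the `threshold` of 0.5 are hypothetical choices, since the text does not specify how similarity is computed.

```python
import string
from collections import defaultdict
from itertools import combinations

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_name(value):
    """Stand-in normalizer (assumption): lowercase and strip punctuation."""
    return " ".join(value.lower().translate(_PUNCT_TABLE).split())


def fact_similarity(a, b):
    """Jaccard overlap of (attribute, value) pairs, ignoring the name fact.

    Jaccard is an assumed measure chosen for illustration only.
    """
    fa = {(k, v) for k, v in a.items() if k != "name"}
    fb = {(k, v) for k, v in b.items() if k != "name"}
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0


def find_duplicates(objects, threshold=0.5):
    """Return pairs of object ids judged duplicates within each bucket."""
    # Group objects into buckets keyed by the normalized name value.
    buckets = defaultdict(list)
    for obj_id, facts in objects.items():
        buckets[normalize_name(facts["name"])].append(obj_id)
    # Compare only objects that share a bucket, using the non-name facts.
    pairs = []
    for ids in buckets.values():
        for x, y in combinations(ids, 2):
            if fact_similarity(objects[x], objects[y]) >= threshold:
                pairs.append((x, y))
    return pairs


def remove_duplicates(objects, threshold=0.5):
    """Return a copy of the repository with one object of each pair removed."""
    repo = dict(objects)
    for keep, drop in find_duplicates(objects, threshold):
        if keep in repo:
            repo.pop(drop, None)
    return repo


objects = {
    "o1": {"name": "John F. Kennedy", "born": "1917", "place": "Brookline"},
    "o2": {"name": "john f kennedy", "born": "1917", "place": "Brookline"},
    "o3": {"name": "John F Kennedy", "born": "1960", "place": "Washington"},
}
print(find_duplicates(objects))          # [('o1', 'o2')]
print(sorted(remove_duplicates(objects)))  # ['o1', 'o3']
```

All three sample objects land in the same bucket, but only `o1` and `o2` agree on their other facts, so `o3` is kept as a distinct entity that happens to share the name.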
19 Claims
1. A computer-implemented method of identifying duplicate objects in a plurality of objects, wherein each object in the plurality of objects is associated with one or more facts, and each of the one or more facts has an attribute and a value, the method comprising, using a computer processor to perform:
associating facts extracted from web documents with a plurality of objects; and
for each of the plurality of objects, normalizing the value of a name fact by applying at least one normalization rule from a group of normalization rules to the value of the name fact, the name fact being among one or more facts associated with the object;
based on the normalized values of the name facts, grouping the plurality of objects into a plurality of buckets, each object in a bucket having the same normalized value of a name fact; and
processing the plurality of objects in a bucket in accordance with the normalized value of the name fact of the plurality of objects to identify at least one pair of duplicate objects in the plurality of objects in the bucket, based on a similarity of values of facts other than the name fact for the objects in the bucket; and
removing one of the duplicate objects from a memory repository, wherein the group of normalization rules includes at least one rule selected from the group of:
removing social titles;
removing predefined adjective words;
removing single letter words;
removing punctuation marks;
removing stop words; and
converting uppercase characters into lowercase.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A system for identifying duplicate objects in a plurality of objects, wherein each object in the plurality of objects is associated with one or more facts, and each of the one or more facts has a value, the system comprising:
memory;
one or more processors; and
one or more programs stored in the memory and configured for execution by the one or more processors, the one or more programs including instructions for:
associating facts extracted from web documents with a plurality of objects; and
for each of the plurality of objects, normalizing the value of a name fact, the name fact being among one or more facts associated with the object by applying at least one normalization rule from a group of normalization rules to the value of the name fact;
based on the normalized values of the name facts, grouping the plurality of objects into a plurality of buckets, each object in a bucket having the same normalized value of a name fact; and
processing the plurality of objects in a bucket in accordance with the normalized value of the name fact of the plurality of objects to identify at least one pair of duplicate objects in the plurality of objects in the bucket, based on a similarity of values of facts other than the name fact for the objects in the bucket; and
removing one of the duplicate objects from a memory repository, wherein the group of normalization rules includes at least one rule selected from the group of:
removing social titles;
removing predefined adjective words;
removing single letter words;
removing punctuation marks;
removing stop words; and
converting uppercase characters into lowercase.
Dependent claims: 10, 11, 12, 13, 14
15. A non-transitory computer readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to identify duplicate objects in a plurality of objects, wherein each object in the plurality of objects is associated with one or more facts, and each of the one or more facts has a value, the one or more programs comprising instructions for:
associating facts extracted from web documents with a plurality of objects; and
for each of the plurality of objects, normalizing the value of a name fact, the name fact being among one or more facts associated with the object, wherein normalizing the value of the name fact includes normalizing the value of the name fact by applying a group of normalization rules to the value of the name fact;
processing the plurality of objects in accordance with the normalized value of the name facts of the plurality of objects to identify at least one duplicate object in the plurality of objects,
based on the normalized values of the name facts, grouping the plurality of objects into a plurality of buckets, each object in a bucket having the same normalized value of a name fact; and
processing the plurality of objects in a bucket in accordance with the normalized value of the name fact of the plurality of objects to identify at least one pair of duplicate objects in the plurality of objects in the bucket, based on a similarity of values of facts other than the name fact for the objects in the bucket; and
removing one of the duplicate objects from a memory repository, wherein the group of normalization rules comprises at least one rule selected from the group of:
removing social titles;
removing predefined adjective words;
removing single letter words;
removing punctuation marks;
removing stop words; and
converting uppercase characters into lowercase.
Dependent claims: 16, 17, 18, 19
Specification