Entity normalization via name normalization

US 10,223,406 B2
Filed: 06/29/2017
Issued: 03/05/2019
Est. Priority Date: 02/17/2006
Status: Active Grant

Abstract

Systems and methods for normalizing entities via name normalization are disclosed. In some implementations, a computer-implemented method of identifying duplicate objects in a plurality of objects is provided. Each object in the plurality of objects is associated with one or more facts, and each of the one or more facts has a value. The method includes using a computer processor to perform: associating facts extracted from web documents with a plurality of objects; for each of the plurality of objects, normalizing the value of a name fact, the name fact being among one or more facts associated with the object; and processing the plurality of objects in accordance with the normalized values of the name facts of the plurality of objects. In some implementations, normalizing the value of the name fact is optionally carried out by applying a group of normalization rules to the value of the name fact.
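
As a rough illustration of the name normalization step, the following Python sketch applies the rule categories recited in the claims (social titles, predefined adjective words, single letter words, punctuation marks, stop words, lowercasing). The word lists and the function name `normalize_name` are hypothetical; the patent names the rule categories only and does not prescribe this implementation.

```python
import string

# Hypothetical word lists chosen for illustration only.
SOCIAL_TITLES = {"mr", "mrs", "ms", "dr", "prof"}
ADJECTIVE_WORDS = {"honorable", "late"}
STOP_WORDS = {"the", "of", "and", "an"}

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize_name(value: str) -> str:
    """Return a normalized form of a name fact's value."""
    # Rule: convert uppercase characters into lowercase.
    lowered = value.lower()
    # Rule: remove punctuation marks.
    words = lowered.translate(_STRIP_PUNCTUATION).split()
    kept = [
        w for w in words
        if w not in SOCIAL_TITLES       # rule: remove social titles
        and w not in ADJECTIVE_WORDS    # rule: remove predefined adjective words
        and w not in STOP_WORDS         # rule: remove stop words
        and len(w) > 1                  # rule: remove single letter words
    ]
    return " ".join(kept)


print(normalize_name("Dr. John F. Kennedy"))
print(normalize_name("The Honorable JOHN KENNEDY"))
```

Under these assumed word lists, both example names reduce to the same string, which is what lets differently written names of one entity be compared.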

186 Citations

19 Claims

1. A computer-implemented method of identifying duplicate objects in a plurality of objects, wherein each object in the plurality of objects is associated with one or more facts, and each of the one or more facts has an attribute and a value, the method comprising, using a computer processor to perform:
- associating facts extracted from web documents with a plurality of objects; and
- for each of the plurality of objects, normalizing the value of a name fact by applying at least one normalization rule from a group of normalization rules to the value of the name fact, the name fact being among one or more facts associated with the object;
- based on the normalized values of the name facts, grouping the plurality of objects into a plurality of buckets, each object in a bucket having the same normalized value of a name fact; and
- processing the plurality of objects in a bucket in accordance with the normalized value of the name fact of the plurality of objects to identify at least one pair of duplicate objects in the plurality of objects in the bucket, based on a similarity of values of facts other than the name fact for the objects in the bucket; and
- removing one of the duplicate objects from a memory repository, wherein the group of normalization rules includes at least one rule selected from the group of:
  - removing social titles;
  - removing predefined adjective words;
  - removing single letter words;
  - removing punctuation marks;
  - removing stop words; and
  - converting uppercase characters into lowercase.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method of claim 1, wherein the group of normalization rules includes sorting the value of the name fact for each of the plurality of objects in alphabetic order.
  - 3. The method of claim 1, wherein the grouping comprises:
    - generating a signature for each of the plurality of objects based at least in part on the normalized value of the name fact of each of the plurality of objects; and
      
      responsive to an identifier of an existing bucket being the same as the signature of an object, adding the object to the existing bucket, otherwise establishing a new bucket including the object, an identifier of the new bucket being same as the signature of the object.
  - 4. The method of claim 1, wherein processing the plurality of objects in accordance with the normalized value of the name facts of the plurality of objects includes applying a matcher to a pair of objects in the bucket.
  - 5. The method of claim 4, wherein applying the matcher to a pair of objects in the bucket includes:
    - for each common fact of the pair of objects, determining a similarity of the values of the common fact based on a similarity measure; and
      
      determining that the pair of objects are duplicates based on the similarity.
  - 6. The method of claim 5, wherein determining that the pair of objects are duplicates includes determining that the pair of objects are duplicates based on the number of the common facts with similar values and the number of common facts.
  - 7. The method of claim 4, wherein applying the matcher includes applying the matcher to each pair of objects in the bucket to determine if the pair of objects are duplicates.
  - 8. The method of claim 4, further comprising:
    - selecting the matcher from a collection of matchers, wherein applying the matcher includes applying the selected matcher to a pair of objects in the bucket to determine if the pair of objects are duplicates.
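
The grouping recited in claims 1 to 3 can be sketched as follows. This is a minimal illustration under stated assumptions: objects are plain dictionaries with hypothetical `id` and `normalized_name` keys, the signature is just the normalized name with its words sorted alphabetically (one possible reading of claim 2), and the bucket identifier is the signature itself.

```python
def signature(normalized_name: str) -> str:
    """Build a signature from a normalized name.

    Sorting the words alphabetically is an assumed interpretation of the
    sorting rule in claim 2; it makes the signature independent of word order.
    """
    return " ".join(sorted(normalized_name.split()))


def group_into_buckets(objects):
    """Group objects into buckets keyed by signature."""
    buckets = {}
    for obj in objects:
        sig = signature(obj["normalized_name"])
        if sig in buckets:
            # An existing bucket has an identifier equal to the signature.
            buckets[sig].append(obj)
        else:
            # Otherwise establish a new bucket identified by the signature.
            buckets[sig] = [obj]
    return buckets


objects = [
    {"id": 1, "normalized_name": "john kennedy"},
    {"id": 2, "normalized_name": "kennedy john"},
    {"id": 3, "normalized_name": "jane doe"},
]
for bucket_id, members in group_into_buckets(objects).items():
    print(bucket_id, [o["id"] for o in members])
```

Only objects that land in the same bucket are later compared pairwise, so the bucketing step limits how many pairs the matcher has to examine.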

9. A system for identifying duplicate objects in a plurality of objects, wherein each object in the plurality of objects is associated with one or more facts, and each of the one or more facts has a value, the system comprising:
- memory;
- one or more processors; and
- one or more programs stored in the memory and configured for execution by the one or more processors, the one or more programs including instructions for:
  - associating facts extracted from web documents with a plurality of objects; and
  - for each of the plurality of objects, normalizing the value of a name fact, the name fact being among one or more facts associated with the object by applying at least one normalization rule from a group of normalization rules to the value of the name fact;
  - based on the normalized values of the name facts, grouping the plurality of objects into a plurality of buckets, each object in a bucket having the same normalized value of a name fact; and
  - processing the plurality of objects in a bucket in accordance with the normalized value of the name fact of the plurality of objects to identify at least one pair of duplicate objects in the plurality of objects in the bucket, based on a similarity of values of facts other than the name fact for the objects in the bucket; and
  - removing one of the duplicate objects from a memory repository, wherein the group of normalization rules includes at least one rule selected from the group of:
    - removing social titles;
    - removing predefined adjective words;
    - removing single letter words;
    - removing punctuation marks;
    - removing stop words; and
    - converting uppercase characters into lowercase.
- Dependent claims (10, 11, 12, 13, 14):
  - 10. The system of claim 9, wherein the group of normalization rules includes sorting the value in alphabetic order.
  - 11. The system of claim 9, wherein the grouping includes:
    - generating a signature for each of the plurality of objects based at least in part on the normalized value of the name fact of each of the plurality of objects; and
      
      responsive to an identifier of an existing bucket being the same as the signature of an object, adding the object to the existing bucket, otherwise establishing a new bucket including the object, an identifier of the new bucket being same as the signature of the object.
  - 12. The system of claim 9, wherein processing the plurality of objects in accordance with the normalized value of the name facts of the plurality of objects includes applying a matcher to a pair of objects in the bucket.
  - 13. The system of claim 12, wherein applying the matcher to a pair of objects in the bucket includes:
    - for each common fact of the pair of objects, determining a similarity of the values of the common fact based on a similarity measure; and
      
      determining that the pair of objects are duplicates based on the similarity.
  - 14. The system of claim 13, wherein determining that the pair of objects are duplicates includes determining that the pair of objects are duplicates based on the number of the common facts with similar values and the number of common facts.
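
The matcher described in claims 12 to 14 compares the facts that two objects in a bucket have in common and decides from the count of similar values relative to the count of common facts. The sketch below is one hypothetical way to do that: it uses `difflib.SequenceMatcher` from the Python standard library as the similarity measure and two made-up thresholds, none of which are specified by the patent. Each object is modeled as a dictionary from attribute to string value.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical thresholds; the claims do not give numeric values.
VALUE_THRESHOLD = 0.9   # minimum ratio for two values to count as similar
MATCH_RATIO = 0.5       # minimum share of common facts that must be similar


def values_similar(a: str, b: str) -> bool:
    """Assumed similarity measure: case-insensitive character-sequence ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= VALUE_THRESHOLD


def is_duplicate(facts_a: dict, facts_b: dict) -> bool:
    """Decide whether two objects are duplicates from their common facts."""
    # Compare only facts other than the name fact, as in the independent claims.
    common = (facts_a.keys() & facts_b.keys()) - {"name"}
    if not common:
        return False
    similar = sum(1 for attr in common if values_similar(facts_a[attr], facts_b[attr]))
    return similar / len(common) >= MATCH_RATIO


def duplicate_pairs(bucket: list) -> list:
    """Apply the matcher to each pair of objects in a bucket."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(bucket), 2)
        if is_duplicate(a, b)
    ]


bucket = [
    {"name": "John Kennedy", "born": "May 29, 1917", "occupation": "politician"},
    {"name": "JOHN KENNEDY", "born": "may 29, 1917", "spouse": "Jacqueline"},
    {"name": "John Kennedy", "born": "June 3, 1950"},
]
print(duplicate_pairs(bucket))
```

Returning index pairs rather than deleting anything keeps the matching decision separate from whatever removal policy is applied afterwards.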

15. A non-transitory computer readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to identify duplicate objects in a plurality of objects, wherein each object in the plurality of objects is associated with one or more facts, and each of the one or more facts has a value, the one or more programs comprising instructions for:
- associating facts extracted from web documents with a plurality of objects; and
- for each of the plurality of objects, normalizing the value of a name fact, the name fact being among one or more facts associated with the object, wherein normalizing the value of the name fact includes normalizing the value of the name fact by applying a group of normalization rules to the value of the name fact;
- processing the plurality of objects in accordance with the normalized value of the name facts of the plurality of objects to identify at least one duplicate object in the plurality of objects, based on the normalized values of the name facts, grouping the plurality of objects into a plurality of buckets, each object in a bucket having the same normalized value of a name fact; and
- processing the plurality of objects in a bucket in accordance with the normalized value of the name fact of the plurality of objects to identify at least one pair of duplicate objects in the plurality of objects in the bucket, based on a similarity of values of facts other than the name fact for the objects in the bucket; and
- removing one of the duplicate objects from a memory repository, wherein the group of normalization rules comprises at least one rule selected from the group of:
  - removing social titles;
  - removing predefined adjective words;
  - removing single letter words;
  - removing punctuation marks;
  - removing stop words; and
  - converting uppercase characters into lowercase.
- Dependent claims (16, 17, 18, 19):
  - 16. The non-transitory computer readable storage medium of claim 15, wherein the group of normalization rules comprises sorting the value in alphabetic order.
  - 17. The non-transitory computer readable storage medium of claim 15, wherein the grouping comprises:
    - generating a signature for each of the plurality of objects based at least in part on the normalized value of the name fact of each of the plurality of objects; and
      
      responsive to an identifier of an existing bucket being the same as the signature of an object, adding the object to the existing bucket, otherwise establishing a new bucket including the object, an identifier of the new bucket being same as the signature of the object.
  - 18. The non-transitory computer readable storage medium of claim 15, wherein processing the plurality of objects in accordance with the normalized value of the name facts of the plurality of objects includes applying a matcher to a pair of objects in the bucket.
  - 19. The non-transitory computer readable storage medium of claim 15, wherein applying the matcher to a pair of objects in the bucket includes:
    - for each common fact of the pair of objects, determining a similarity of the values of the common fact based on a similarity measure; and
      
      determining that the pair of objects are duplicates based on the similarity.
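
The final step of each independent claim, removing one of the duplicate objects from a memory repository, might look like the sketch below. The repository is modeled as a dictionary from a hypothetical object id to its facts, and the policy of keeping the first member of each pair is an arbitrary assumption; the claims only require that one of the two be removed.

```python
def remove_duplicates(repository: dict, pairs: list) -> list:
    """Remove one object of each duplicate pair from the repository.

    `pairs` holds (keep_id, drop_id) tuples. Keeping the first id is an
    illustrative choice, not something the claims specify.
    """
    removed = []
    for keep_id, drop_id in pairs:
        # Skip pairs where either member was already removed by an earlier pair.
        if keep_id in repository and drop_id in repository:
            del repository[drop_id]
            removed.append(drop_id)
    return removed


repository = {
    1: {"name": "John Kennedy"},
    2: {"name": "JOHN KENNEDY"},
    3: {"name": "Kennedy, John"},
}
print(remove_duplicates(repository, [(1, 2), (2, 3), (1, 3)]))
print(sorted(repository))
```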

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google LLC (Alphabet Inc.)
Inventors: Betz, Jonathan T.
Primary Examiner(s): Alam, Shahid A
Application Number: US15/637,438
Publication Number: US 20170300524A1
Time in Patent Office: 614 Days
Field of Search: 707612

CPC Class Codes

G06F 16/1748   De-duplication implemented ...
G06F 16/215   Improving data quality; Dat...
G06F 16/2365   Ensuring data consistency a...
G06F 16/2379   Updates performed during on...
G06F 16/2458   Special types of queries, e...
G06F 16/273   Asynchronous replication or...
G06F 16/338   Presentation of query results
