Address extraction from a communication

US 10,013,672 B2
Filed: 11/02/2012
Issued: 07/03/2018
Est. Priority Date: 11/02/2012
Status: Active Grant

7 Assignments

0 Petitions

Abstract

Systems and methods to extract a string from a communication. A method includes: receiving a communication comprising a plurality of strings; assigning a score to each of the strings, wherein the score assigned to each of the strings corresponds to a frequency of usage of the respective string for a first function relative to an overall frequency of usage of the respective string; determining a respective total sum for each of a plurality of sequences in the communication, the respective total sum determined as a sum of the scores for each string in the respective sequence; and extracting a first sequence of the sequences from the communication based on the total sum for the first sequence. In one embodiment, the total sum includes an additional score for each of a starting word and an ending word of the first word sequence, wherein each respective additional score is associated with a probability that the starting (or ending) word is used as the first (or last) word of an address.
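
The scoring step in the abstract can be illustrated with a minimal sketch. It assumes two small word lists stand in for the address corpus and the general corpus. The log of the frequency ratio and the add-one smoothing are illustrative assumptions, as are the names `normalize` and `build_scores`; the source only states that the score is based on a ratio of the two frequencies.

```python
import math
from collections import Counter

def normalize(word: str) -> str:
    """Map every all-digit word to a placeholder keyed by its length, so that
    any digit sequence of a given length is counted as the same word."""
    if word.isascii() and word.isdigit():
        return f"<NUM{len(word)}>"
    return word.lower()

def build_scores(address_words, general_words, smoothing=1.0):
    """Score each word by the log ratio of its relative frequency in the
    address set to its relative frequency in the general set.
    The log and the smoothing constant are assumptions of this sketch."""
    addr = Counter(normalize(w) for w in address_words)
    gen = Counter(normalize(w) for w in general_words)
    addr_total = sum(addr.values())   # first total (address usage)
    gen_total = sum(gen.values())     # second total (general usage)
    vocab = set(addr) | set(gen)
    scores = {}
    for w in vocab:
        f_addr = (addr[w] + smoothing) / (addr_total + smoothing * len(vocab))
        f_gen = (gen[w] + smoothing) / (gen_total + smoothing * len(vocab))
        scores[w] = math.log(f_addr / f_gen)
    return scores

# Toy corpora, invented for illustration only.
address_words = ("123 Main Street Springfield IL 62704 "
                 "456 Oak Street Portland OR 97205").split()
general_words = ("please call me at the office on Main Street tomorrow about "
                 "the 2012 budget and the 62704 order thanks so much for "
                 "the help").split()
scores = build_scores(address_words, general_words)
```

With these toy corpora, words that appear mostly in addresses, such as "street", get a positive score, while common words such as "the" get a negative one.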

451 Citations

13 Claims

1. A method, comprising:
- receiving a communication from a sender comprising a plurality of words, wherein at least one of the words is a zip code comprising five numerical digits;
  
  assigning, via a computing apparatus, a score to each of the words, wherein the score assigned to each of the words is based on a ratio of a first frequency of usage of the respective word in a language relative to a second frequency of usage of the respective word in the language, and wherein a first set of words comprises a first total number of words used as an address, a second set of words comprises a second total number of words including words used other than as an address, the first frequency is determined by counting occurrence of the respective word in the first set of words relative to the first total, the second frequency is determined by counting occurrence of the respective word in the second set relative to the second total, and the first total is less than the second total, and wherein the assigning the score further comprises determining a score for a numerical digit sequence based on treating any numerical digit sequence of a given digit length as being the same word;
  
  determining, via the computing apparatus, a respective total sum for each of a plurality of word sequences in the communication, the respective total sum determined as a sum of the scores for each word in the respective word sequence;
  
  identifying a first word sequence of the word sequences having a total sum that is greater than a threshold value;
  
  applying at least one filter to the first word sequence, the at least one filter comprising determining a ratio of number tokens to character tokens in the first word sequence, and comparing the ratio to a predetermined value to determine whether the first word sequence passes the at least one filter, and the at least one filter further comprising determining whether the first word sequence includes a token that scores below a predetermined threshold, wherein determining that the first word sequence includes a token that scores below the predetermined threshold disqualifies the first word sequence from being identified as an address;
  
  in response to determining that the first word sequence passes the at least one filter, extracting the first word sequence from the plurality of words of the received communication as a first address of the sender, wherein the first word sequence contains the zip code; and
  
  storing, in a data repository, the first address in a first person profile of the sender, wherein the data repository stores a plurality of person profiles including the first person profile.
- Dependent Claims (2, 3, 4, 5, 13)
- - 2. The method of claim 1, wherein the assigning the score further comprises forming a plurality of tokens from the communication, each token corresponding to one of the words, and assigning a score to each of the tokens.
  - 3. The method of claim 2, wherein a first word of the words solely comprises numerical digits, and the forming the tokens comprises forming a single token corresponding to the first word.
  - 4. The method of claim 1, wherein each of the word sequences has a length of between 2 and 20 words.
  - 5. The method of claim 1, wherein the determining the total sum for the first word sequence comprises:
    - determining an additional score for a starting word of the first word sequence, wherein the additional score for the starting word is associated with a probability that the starting word is part of an address;
      
      determining an additional score for an ending word of the first word sequence, wherein the additional score for the ending word is associated with a probability that the ending word is part of an address; and
      
      adding the additional score for the starting word and the additional score for the ending word to the sum of scores for the first word sequence to obtain the total sum.
  - 13. The method of claim 1, wherein the determining the total sum for the first word sequence comprises:
    - determining an additional score for a starting word of the first word sequence;
      
      determining an additional score for an ending word of the first word sequence; and
      
      adding the additional score for the starting word and the additional score for the ending word to the sum of scores for the first word sequence to obtain the total sum.
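
The steps of claim 1, together with the 2-to-20-word window of claim 4, can be sketched roughly as follows. The score table, the thresholds, and the function names (`key`, `score`, `passes_filters`, `extract_address`) are all hypothetical values chosen for illustration. Returning the highest-scoring candidate is likewise an assumption of this sketch; the claim only requires a sequence whose total sum exceeds the threshold.

```python
import re

# Hypothetical per-token scores (assumed values, not taken from the patent).
SCORES = {"<NUM3>": 1.0, "<NUM5>": 2.0, "main": 1.5, "street": 2.5,
          "springfield": 1.5, "il": 2.0}
DEFAULT_SCORE = -1.0     # assumed score for tokens missing from the table
SUM_THRESHOLD = 8.0      # assumed threshold on the total sum
MAX_NUM_RATIO = 1.0      # assumed limit on number tokens / character tokens
MIN_TOKEN_SCORE = -0.5   # assumed per-token disqualification threshold
ZIP_RE = re.compile(r"\d{5}")

def key(tok):
    # Any digit sequence of a given length is treated as the same word.
    return f"<NUM{len(tok)}>" if tok.isdigit() else tok.lower()

def score(tok):
    return SCORES.get(key(tok), DEFAULT_SCORE)

def passes_filters(seq):
    nums = sum(t.isdigit() for t in seq)
    chars = len(seq) - nums
    # Filter 1: ratio of number tokens to character tokens.
    if chars == 0 or nums / chars > MAX_NUM_RATIO:
        return False
    # Filter 2: a single low-scoring token disqualifies the sequence.
    return all(score(t) >= MIN_TOKEN_SCORE for t in seq)

def extract_address(text, min_len=2, max_len=20):
    tokens = re.findall(r"[A-Za-z]+|\d+", text)
    best = None
    for i in range(len(tokens)):
        for j in range(i + min_len, min(i + max_len, len(tokens)) + 1):
            seq = tokens[i:j]
            total = sum(score(t) for t in seq)
            if total <= SUM_THRESHOLD or not passes_filters(seq):
                continue
            # The extracted sequence must contain a five-digit zip code.
            if not any(ZIP_RE.fullmatch(t) for t in seq):
                continue
            if best is None or total > best[0]:
                best = (total, seq)
    return " ".join(best[1]) if best else None

message = "Hi, please send the invoice to 123 Main Street Springfield IL 62704 thanks"
print(extract_address(message))  # 123 Main Street Springfield IL 62704
```

Storing the result in a person profile is omitted here; in this sketch it would amount to writing the returned string into whatever record the repository keeps for the sender.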

6. A system, comprising:
- at least one processor;
  
  memory storing instructions that, when executed by the processor, cause the system to:
  
  receive a communication from a sender comprising a plurality of words, wherein at least one of the words is a zip code comprising five numerical digits;
  
  assign a score to each of the words, wherein the score assigned to each of the words is based on a ratio of a first frequency of usage of the respective word in a language relative to a second frequency of usage of the respective word in the language, and wherein a first set of words comprises a first total number of words used as an address, a second set of words comprises a second total number of words including words used other than as an address, the first frequency is determined by counting occurrence of the respective word in the first set of words relative to the first total, the second frequency is determined by counting occurrence of the respective word in the second set relative to the second total, and the first total is less than the second total, and wherein the assigning the score further comprises determining a score for a numerical digit sequence based on treating any numerical digit sequence of a given digit length as being the same word;
  
  determine a respective total sum for each of a plurality of contiguous word sequences in the communication, the respective total sum determined as a sum of the scores for each word in the respective contiguous word sequence;
  
  identify a first word sequence of the word sequences having a total sum that is greater than a threshold value;
  
  apply at least one filter to the first word sequence, the at least one filter comprising determining a ratio of number tokens to character tokens in the first word sequence, and comparing the ratio to a predetermined value to determine whether the first word sequence passes the at least one filter, and the at least one filter further comprising determining whether the first word sequence includes a token that scores below a predetermined threshold, wherein determining that the first word sequence includes a token that scores below the predetermined threshold disqualifies the first word sequence from being identified as an address;
  
  in response to determining that the first word sequence passes the at least one filter, extract the first word sequence from the plurality of words of the received communication as a first address of the sender, wherein the first word sequence contains the zip code; and
  
  store, in a data repository, the first address in a first person profile of the sender, wherein the data repository stores a plurality of person profiles including the first person profile.
- Dependent Claims (7, 8, 9)
- - 7. The system of claim 6, wherein the determining the total sum for the first word sequence comprises:
    - determining an additional score for a starting word of the first word sequence, wherein the additional score for the starting word is associated with a probability that the starting word is part of an address;
      
      determining an additional score for an ending word of the first word sequence, wherein the additional score for the ending word is associated with a probability that the ending word is part of an address; and
      
      adding the additional score for the starting word and the additional score for the ending word to the sum of scores for the first word sequence to obtain the total sum.
  - 8. The system of claim 6, wherein the assigning the score further comprises forming a plurality of tokens from the communication, each token corresponding to one of the words, and assigning a score to each of the tokens.
  - 9. The system of claim 8, wherein a first word of the words solely comprises numerical digits, and the forming the tokens comprises forming a single token corresponding to the first word.
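
Claims 7 through 9 add two details: a word made up solely of digits becomes a single token, and the total sum includes additional scores for the starting and ending words. A minimal sketch of both follows, assuming hypothetical score tables (`WORD_SCORE`, `START_BONUS`, `END_BONUS`) whose values are invented for illustration.

```python
import re

# Hypothetical tables (assumed values, not taken from the patent).
WORD_SCORE = {"<NUM3>": 1.0, "main": 1.5, "st": 2.0,
              "springfield": 1.5, "il": 2.0, "<NUM5>": 2.0}
START_BONUS = {"<NUM3>": 1.0}   # a 3-digit number often starts an address
END_BONUS = {"<NUM5>": 1.5}     # a 5-digit number often ends one

def tokenize(text):
    # A run of digits such as "62704" yields one token, not five.
    return re.findall(r"\d+|[A-Za-z]+", text)

def key(tok):
    return f"<NUM{len(tok)}>" if tok.isdigit() else tok.lower()

def total_sum(seq):
    """Sum of per-token scores plus the additional scores for the
    starting word and the ending word of the sequence."""
    base = sum(WORD_SCORE.get(key(t), 0.0) for t in seq)
    start = START_BONUS.get(key(seq[0]), 0.0)
    end = END_BONUS.get(key(seq[-1]), 0.0)
    return base + start + end

tokens = tokenize("123 Main St, Springfield IL 62704")
print(tokens)             # ['123', 'Main', 'St', 'Springfield', 'IL', '62704']
print(total_sum(tokens))  # 12.5
```

The additional scores let a sequence that begins and ends on typical address boundaries outrank an overlapping sequence with a similar base sum.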

10. A non-transitory computer-readable storage medium storing computer-readable instructions, which when executed, cause a system to:
- receive a communication from a sender comprising a plurality of words, wherein at least one of the words is a zip code comprising five numerical digits;
  
  assign, via at least one processor, a score to each of the words, wherein the score assigned to each of the words is based on a ratio of a first frequency of usage of the respective word in a language relative to a second frequency of usage of the respective word in the language, and wherein a first set of words comprises a first total number of words used as an address, a second set of words comprises a second total number of words including words used other than as an address, the first frequency is determined by counting occurrence of the respective word in the first set of words relative to the first total, the second frequency is determined by counting occurrence of the respective word in the second set relative to the second total, and the first total is less than the second total, and wherein the assigning the score further comprises determining a score for a numerical digit sequence based on treating any numerical digit sequence of a given digit length as being the same word;
  
  determine a respective total sum for each of a plurality of word sequences in the communication, the respective total sum determined as a sum of the scores for each word in the respective word sequence;
  
  identify a first word sequence of the word sequences having a total sum that is greater than a threshold value;
  
  apply at least one filter to the first word sequence, the at least one filter comprising determining a ratio of number tokens to character tokens in the first word sequence, and comparing the ratio to a predetermined value to determine whether the first word sequence passes the at least one filter, and the at least one filter further comprising determining whether the first word sequence includes a token that scores below a predetermined threshold, wherein determining that the first word sequence includes a token that scores below the predetermined threshold disqualifies the first word sequence from being identified as an address;
  
  in response to determining that the first word sequence passes the at least one filter, extract the first word sequence from the plurality of words of the received communication as a first address of the sender, wherein the first word sequence contains the zip code; and
  
  store, in a data repository, the first address in a first person profile of the sender, wherein the data repository stores a plurality of person profiles including the first person profile.
- Dependent Claims (11, 12)
- - 11. The non-transitory computer-readable storage medium of claim 10, wherein the determining the total sum for the first word sequence comprises:
    - determining an additional score for a starting word of the first word sequence, wherein the additional score for the starting word is associated with a probability that the starting word is part of an address;
      
      determining an additional score for an ending word of the first word sequence, wherein the additional score for the ending word is associated with a probability that the ending word is part of an address; and
      
      adding the additional score for the starting word and the additional score for the ending word to the sum of scores for the first word sequence to obtain the total sum.
  - 12. The non-transitory computer-readable storage medium of claim 10, wherein the assigning the score further comprises forming a plurality of tokens from the communication, each token corresponding to one of the words, and assigning a score to each of the tokens.

Current Assignee
Yahoo Assets LLC
Original Assignee
Oath Inc. (Verizon Communications Inc.)
Inventors
Seeger, III, Richard Earle; Monaco, Peter
Primary Examiner(s)
Reyes, Mariela
Assistant Examiner(s)
Harmon, Courtney N

Application Number

US13/667,205
Publication Number

US 20140129569A1
Time in Patent Office

2,069 Days
CPC Class Codes

G06F 16/9535 Search customisation based ...

G06Q 10/101 Collaborative creation, e.g...
