Probabilistic tree-structured learning system for extracting contact data from quotes

US 9,619,534 B2
Filed: 02/24/2011
Issued: 04/11/2017
Est. Priority Date: 09/10/2010
Status: Active Grant

Abstract

Systems and methods for updating data stored in a database, such as contact information. An input string is obtained through a search for timely material associated with the stored contact. The input string is parsed using probabilistic tendencies to extract entities corresponding to those stored with the contact. Secondary entities are used to assist in the identification of the primary entities. The contact is then updated (or added if new) using the extracted primary entities.
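
The last step of the abstract, updating the contact or adding it if new, amounts to an upsert. The following is a minimal sketch, assuming an in-memory dictionary stands in for the database and that a contact is keyed by name. `upsert_contact`, the key, and the field names are hypothetical and do not come from the patent.

```python
from typing import Dict

# Hypothetical in-memory stand-in for the contact database (not from the patent).
ContactDB = Dict[str, Dict[str, str]]


def upsert_contact(db: ContactDB, key: str, extracted: Dict[str, str]) -> str:
    """Create the record if none exists, otherwise update it in place.

    Only non-empty extracted entity values overwrite stored fields.
    Returns "created" or "updated".
    """
    fresh = {field: value for field, value in extracted.items() if value}
    if key not in db:
        db[key] = fresh
        return "created"
    db[key].update(fresh)
    return "updated"


# Illustrative data only.
db: ContactDB = {"Jane Doe": {"title": "VP Sales", "company": "Acme"}}
status_existing = upsert_contact(db, "Jane Doe", {"title": "CEO", "company": ""})
status_new = upsert_contact(db, "John Roe", {"title": "CTO", "company": "Initech"})
```

Skipping empty values is a design choice of this sketch: a failed extraction for one entity should not erase a value that is already stored.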

19 Claims

1. A method for creating or updating a data set stored as a record in a database, wherein a plurality of data sets are stored in the database, wherein each data set in the plurality of data sets is defined to include a plurality of fields corresponding to a plurality of predefined entities, the method comprising:
- searching through a plurality of documents for current information about the data set;
  
  upon locating a search result document, in the plurality of documents, containing the current information about the data set, copying and storing a data string having a plurality of tokens from content of the search result document containing the current information about the data set;
  
  extracting a sequence of tokens corresponding to the data string;
  
  recognizing a first set of tokens in the sequence of tokens as a first entity based on entity recognition probabilistic scoring derived from a machine evaluation of a training set of entities;
  
  recognizing a second set of tokens in the sequence of tokens as a second entity based on identifying the first entity as a first node in a tree-like structure and identifying the second entity as a second node in the tree-like structure, the first node connected to the second node by an arc representing a probability that the first entity is followed by the second entity in a probable entity sequence, the first node connected to another node by another arc representing another probability that the first entity is followed by another entity in another probable entity sequence, the tree-like structure created by a machine evaluation of a training set of input strings;
  
  aligning one or more tokens of the first set of tokens as one of a plurality of probable entities using the probabilistic scoring of the first set of tokens and grammatical rules;
  
  assigning the aligned one or more tokens to one entity field of the plurality of predefined entity fields of the data set; and
  
  creating and storing a new record for the data set if none exists, or updating an existing record for the data set, using the assigned aligned one or more tokens.
  - 2. The method of claim 1, wherein extracting the sequence of tokens includes:
    - aligning the plurality of probable entities into a sequence;
      
      and wherein the creating and storing the new record includes:
      
      creating the new record or updating the existing record using the plurality of probable entities.
  - 3. The method of claim 2, wherein aligning one or more tokens of the first set of tokens includes emulating the linguistic rules to obtain an alignment of tokens representing a defined entity sequence.
  - 4. The method of claim 1, wherein the plurality of data sets store contact information including one or more entities, and wherein searching through the plurality of documents includes:
    - searching through the plurality of documents for current information instances of the one or more entities.
  - 5. The method of claim 1, wherein each data set is a corresponding contact, wherein the corresponding contact is configured to have one or more defined entity fields having stored values associated with the corresponding contact, wherein extracting the sequence of tokens includes aligning the sequence of tokens with a sequence of the defined entity fields using the probabilistic scoring.
  - 6. The method of claim 1, wherein extracting the sequence of tokens corresponding to the data string includes:
    - removing a trailing period if present;
      
      adding a space before an apostrophe or a comma; and
      
      splitting the data string into a plurality of entities at each space added before the apostrophe or the comma.
  - 7. The method of claim 1, wherein extracting the sequence of tokens further includes:
    - identifying entities in the sequence of tokens;
      
      evaluating the alignment of the sequence of tokens using the identified entities; and
      
      providing the entity values for the identified entities to the updating of the existing record.
  - 8. The method of claim 1, wherein the probabilistic scoring for aligning the sequence of tokens is determined by analyzing the plurality of training sets of input strings to extract the accurate alignment of the entities.
  - 9. The method of claim 8, wherein the analyzing the plurality of training sets of input strings includes:
    - emulating linguistic rules using the entities.
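
The tree-like structure recited in claim 1 can be pictured as entity nodes joined by arcs that carry follow-on probabilities. The following is a minimal sketch, assuming each arc is estimated as the relative frequency of adjacent entity labels in the training sequences. The claims do not specify this estimator, and the labels (`NAME`, `TITLE`, `COMPANY`), the toy training data, and the helper names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional

# arcs[first][second] = estimated probability that `first` is followed by `second`
Arcs = Dict[str, Dict[str, float]]


def learn_arcs(training_sequences: List[List[str]]) -> Arcs:
    """Estimate arc probabilities from labeled entity sequences."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for seq in training_sequences:
        # Count each adjacent (first, second) pair of entity labels.
        for first, second in zip(seq, seq[1:]):
            counts[first][second] += 1
    return {
        first: {second: n / sum(nexts.values()) for second, n in nexts.items()}
        for first, nexts in counts.items()
    }


def most_probable_next(arcs: Arcs, first: str) -> Optional[str]:
    """Follow the highest-probability arc out of `first`, if any exists."""
    nexts = arcs.get(first)
    if not nexts:
        return None
    return max(nexts, key=nexts.get)


# Toy training set of entity sequences (illustrative only).
training = [
    ["NAME", "TITLE", "COMPANY"],
    ["NAME", "TITLE", "COMPANY"],
    ["NAME", "COMPANY"],
    ["NAME", "TITLE"],
]
arcs = learn_arcs(training)
next_after_name = most_probable_next(arcs, "NAME")
```

Here the `NAME` node has two outgoing arcs (to `TITLE` and to `COMPANY`), mirroring the claim's first node that connects both to the second node and to another node.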

10. A non-transitory machine-readable medium carrying one or more sequences of instructions for updating information associated with a contact stored in a multi-tenant database system, which instructions, when executed by one or more processors, cause the one or more processors to:
- obtain and store a data string having a plurality of tokens in content of a search result from a search for quoted material associated with the contact;
  
  extract a sequence of tokens corresponding to the data string;
  
  recognize a first set of tokens in the sequence of tokens as a first entity based on entity recognition probabilistic scoring derived from machine evaluation of a training set of entities;
  
  recognize a second set of tokens in the sequence of tokens as a second entity based on identifying the first entity as a first node in a tree-like structure and identifying the second entity as a second node in the tree-like structure, the first node connected to the second node by an arc representing a probability that the first entity is followed by the second entity in a probable entity sequence, the first node connected to another node by another arc representing another probability that the first entity is followed by another entity in another probable entity sequence, the tree-like structure created by a machine evaluation of a training set of input strings;
  
  align one or more tokens of the first set of tokens as one of a plurality of probable entities using the probabilistic scoring of the first set of tokens and grammatical rules;
  
  assign the aligned one or more tokens to one entity field of corresponding predefined entity fields of the contact based on the probabilistic scoring and the linguistic cues of the probable secondary entities; and
  
  create and store a new record for the contact if none exists, or update an existing record for the contact, using the assigned aligned one or more tokens.
  - 11. The machine-readable medium of claim 10, wherein the instructions for extracting the sequence of tokens include:
    - parsing the data string using probabilistic scoring to extract the sequence of tokens corresponding to the data string, wherein the sequence of tokens represents entity values; and
      
      aligning the sequence of tokens with a sequence of predefined entity fields using probabilistic scoring.
  - 12. The machine-readable medium of claim 11, wherein the instructions for extracting the sequence of tokens include:
    - identifying entities in the sequence of tokens;
      
      evaluating the alignment of the sequence of tokens using the identified entities; and
      
      providing the entity values for the identified entities to the updating of the existing record for the data set.
  - 13. The machine-readable medium of claim 11, wherein the instructions for parsing the data string include:
    - removing a trailing period if present;
      
      adding a space before an apostrophe or a comma; and
      
      splitting the data string into a plurality of entities at each space before the apostrophe or the comma.
  - 14. The machine-readable medium of claim 11, wherein the instructions for aligning the one or more tokens of the first set of tokens include emulating linguistic rules learned from training sets of input strings.
  - 15. The machine-readable medium of claim 13, wherein the instructions for aligning the one or more tokens of the first set of tokens include applying stored probabilities, said stored probabilities learned from training sets of input strings.
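
The parsing steps in claims 6 and 13 (drop a trailing period, add a space before an apostrophe or comma, then split) can be sketched as follows. This is one plausible reading rather than the patented implementation: it assumes the split happens on all whitespace once the extra spaces are in place, and the function name `tokenize` and the sample quote attribution are hypothetical.

```python
import re
from typing import List


def tokenize(data_string: str) -> List[str]:
    """Split a data string into tokens following the three claimed steps."""
    text = data_string.strip()
    # Step 1: remove a trailing period if present.
    if text.endswith("."):
        text = text[:-1]
    # Step 2: add a space before each apostrophe or comma.
    text = re.sub(r"([',])", r" \1", text)
    # Step 3 (assumption): split on whitespace, including the added spaces.
    return text.split()


tokens = tokenize("said Jane Doe, Acme's chief executive.")
# tokens == ["said", "Jane", "Doe", ",", "Acme", "'s", "chief", "executive"]
```

Separating the comma and the possessive into their own tokens keeps punctuation available as a cue for the later alignment step without fusing it to a name or company token.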

16. An apparatus for extracting contact data from quotes, wherein a plurality of contacts are stored in a multi-tenant database, the apparatus comprising:
- a processor; and
  
  one or more stored sequences of instructions which, when executed by the processor, cause the processor to:
  
  obtain and store a data string having a plurality of tokens in content of a search result from a search for quoted material associated with a contact;
  
  extract a sequence of tokens corresponding to the data string;
  
  recognize a first set of tokens in the sequence of tokens as a first entity based on entity recognition probabilistic scoring derived from a machine evaluation of a training set of entities;
  
  recognize a second set of tokens in the sequence of tokens as a second entity based on identifying the first entity as a first node in a tree-like structure and identifying the second entity as a second node in the tree-like structure, the first node connected to the second node by an arc representing a probability that the first entity is followed by the second entity in a probable entity sequence, the first node connected to another node by another arc representing another probability that the first entity is followed by another entity in another probable entity sequence, the tree-like structure created by a machine evaluation of a training set of input strings;
  
  align one or more tokens of the first set of tokens as one of a plurality of probable entities using the probabilistic scoring of the first set of tokens and grammatical rules;
  
  assign the aligned one or more tokens to one entity field of corresponding predefined entity fields of the contact based on the probabilistic scoring and the linguistic cues of the probable secondary entities; and
  
  create and store a new record for the contact if none exists, or update an existing record for the contact, using the assigned aligned one or more tokens.
  - 17. The apparatus of claim 16, wherein the probabilistic scoring is learned from a plurality of training sets of input strings.
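
Claim 17 states only that the probabilistic scoring is learned from training sets. As a rough illustration of what such learning could look like, the sketch below estimates an entity recognition score P(entity | token) as a relative frequency over labeled tokens. The estimator, the labels, the training pairs, and the helper names are all assumptions made for illustration and are not taken from the patent.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Scores = Dict[str, Dict[str, float]]


def learn_token_scores(examples: List[Tuple[str, str]]) -> Scores:
    """Estimate P(entity | token) as a relative frequency over labeled tokens."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for token, entity in examples:
        counts[token.lower()][entity] += 1
    return {
        token: {entity: n / sum(by_entity.values()) for entity, n in by_entity.items()}
        for token, by_entity in counts.items()
    }


def score(scores: Scores, token: str, entity: str) -> float:
    """Look up the learned score; unseen tokens or entities score 0.0."""
    return scores.get(token.lower(), {}).get(entity, 0.0)


# Hypothetical labeled training pairs (token, entity).
examples = [
    ("chief", "TITLE"),
    ("executive", "TITLE"),
    ("Acme", "COMPANY"),
    ("Jordan", "NAME"),
    ("Jordan", "NAME"),
    ("Jordan", "NAME"),
    ("Jordan", "COMPANY"),
]
scores = learn_token_scores(examples)
jordan_as_name = score(scores, "Jordan", "NAME")
```

An ambiguous token such as "Jordan" ends up with scores spread over several entities, which is where the arc probabilities between entity nodes can help settle the choice.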

18. A method for transmitting code for extracting contact data from quotes in a multi-tenant database system on a transmission medium, the method comprising:
- transmitting code to obtain and store a data string having a plurality of tokens in content of a search result from a search for quoted material associated with a contact;
  
  transmitting code to extract a sequence of tokens corresponding to the data string;
  
  transmitting code to recognize a first set of tokens in the sequence of tokens as a first entity based on entity recognition probabilistic scoring derived from a machine evaluation of a training set of entities;
  
  transmitting code to recognize a second set of tokens in the sequence of tokens as a second entity based on identifying the first entity as a first node in a tree-like structure and identifying the second entity as a second node in the tree-like structure, the first node connected to the second node by an arc representing a probability that the first entity is followed by the second entity in a probable entity sequence, the first node connected to another node by another arc representing another probability that the first entity is followed by another entity in another probable entity sequence, the tree-like structure created by a machine evaluation of a training set of input strings;
  
  transmitting code to align one or more tokens of the first set of tokens as one of a plurality of probable entities using the probabilistic scoring of the first set of tokens and grammatical rules;
  
  transmitting code to assign the aligned one or more tokens to one entity field of the plurality of predetermined entity fields of the data set; and
  
  transmitting code to create and store a new record for the data set if none exists, or update an existing record for the data set, using the assigned aligned one or more tokens.
  - 19. The method of claim 18, wherein the probabilistic scoring is learned from training sets of input strings.

Current Assignee: Salesforce.com, Inc.
Original Assignee: Salesforce.com, Inc.
Inventors: Jagota, Arun Kumar
Primary Examiner(s): Spieler, William
Application Number: US13/034,463
Publication Number: US 20120066160A1
Time in Patent Office: 2,238 Days
Field of Search: 707602

CPC Class Codes
G06F 16/23   Updating
G06F 16/254   Extract, transform and load...
G06N 7/01   Probabilistic graphical mod...
G06Q 30/02   Marketing; Price estimation...