Automatic document source identification systems

US 10,331,950 B1
Filed: 06/19/2018
Issued: 06/25/2019
Est. Priority Date: 06/19/2018
Status: Active Grant

Abstract

A document source identification system includes one or more memory devices storing instructions, and one or more processors configured to execute the instructions to cause the system to receive uploaded document(s) having at least one extractable data entry. The system may categorize the document, and extract at least one data entry from the document. The system may normalize each extracted data entry and execute a deterministic ID search to determine that the normalized data entry matches zero, one, or more than one account data entries associated with user accounts. Responsive to an exact match, the system may link the uploaded document to a user account associated with the matching data entry. Responsive to zero or multiple matches, the system may execute a probabilistic ID search identifying a highest ranked user account data entry and link the document to a user account associated with the highest ranked user account data entry.
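
The flow described in the abstract (normalize, run a deterministic search, fall back to a ranked search on zero or multiple matches) can be sketched as follows. This is a minimal illustration, not the patented implementation: `AccountEntry`, `normalize`, `link_document`, and `similarity` are hypothetical names, the normalization rule is a stand-in for the business ruleset, and a plain string-similarity ratio is swapped in where the patent calls for a machine learning trained probabilistic model.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class AccountEntry:
    """Hypothetical user account data entry; field names are assumptions."""
    account_id: str
    entry_type: str  # e.g. "ssn" or "full_name"
    value: str       # assumed to be stored in normalized form


def normalize(raw: str) -> str:
    """Stand-in for the business ruleset: collapse whitespace, uppercase."""
    return " ".join(raw.split()).upper()


def similarity(normalized: str, entry: AccountEntry) -> float:
    """Placeholder scorer; the patent uses a trained model instead."""
    return SequenceMatcher(None, normalized, entry.value).ratio()


def link_document(
    raw_entry: str,
    entry_type: str,
    accounts: Sequence[AccountEntry],
    score: Callable[[str, AccountEntry], float] = similarity,
) -> Optional[str]:
    """Return the account id to link, or None if the filtered subset is empty."""
    normalized = normalize(raw_entry)
    # Deterministic search, step 1: filter the database by data entry type.
    subset = [a for a in accounts if a.entry_type == entry_type]
    # Deterministic search, step 2: look for exact matches inside the subset.
    matches = [a for a in subset if a.value == normalized]
    if len(matches) == 1:
        return matches[0].account_id
    if not subset:
        return None
    # Zero or several matches: rank the subset and take the top-scoring entry.
    return max(subset, key=lambda a: score(normalized, a)).account_id


accounts = [
    AccountEntry("A1", "full_name", "JANE DOE"),
    AccountEntry("A2", "full_name", "JOHN SMITH"),
    AccountEntry("A3", "ssn", "123456789"),
]
exact = link_document("  jane   doe ", "full_name", accounts)
fuzzy = link_document("Jon Smith", "full_name", accounts)
```

Here the first call resolves through the exact match, while the second has no exact match and is resolved by ranking the filtered subset.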

20 Claims

1. A document source identification system comprising:
- one or more processors; and
  
  a memory in communication with the one or more processors and storing instructions that, when executed by the one or more processors, are configured to cause the system to:
  
  receive a first uploaded document having a first document type of a plurality of predetermined document types each associated with a document category;
  
  categorize the first uploaded document in a first document category based on the first document type;
  
  extract at least one first data entry from the first uploaded document based on the first document category, the at least one first data entry comprising a first sensitive data entry comprising personally identifiable information;
  
  identify a data entry type associated with the at least one first data entry at least in part based on one or more of a first proximate location of the at least one first data entry in the first uploaded document, a presence of one or more additional extractable data entries, and a second proximate location of the one or more additional extractable data entries;
  
  normalize the first sensitive data entry according to a business ruleset to produce a normalized first data entry;
  
  tokenize the normalized first sensitive data entry;
  
  execute a deterministic identification search, comprising:
  
  filtering a plurality of the user account data entries in a user account database based on one or more of the normalized first data entry or the data entry type associated with the normalized first data entry to identify a first subset of user account data entries, each user account data entry in the first subset of user account data entries being associated with an existing user account; and
  
  determining that the normalized first data entry matches zero, one, or more than one user account data entries in the first subset of user account data entries;
  
  execute, in response to determining that the normalized first data entry matches zero or more than one user account data entries in the first subset of user account data entries, a probabilistic identification search using a machine learning trained probabilistic model to identify a highest ranked user account data entry in the first subset of user account data entries;
  
  link the first uploaded document to a first existing user account associated with either the one matching user account data entry in the first subset of user account data entries or the highest ranked user account data entry in the first subset of user account data entries.
- - 2. The system of claim 1, wherein executing the probabilistic identification search further comprises:
    - scoring each user account data entry in the first subset of user account data entries using the machine learning trained probabilistic model; and
      
      identifying the highest ranked user account data entry in the first subset of user account data entries.
  - 3. The system of claim 2, wherein executing the probabilistic identification search further comprises using a random forest machine learning classifier.
  - 4. The system of claim 1, wherein the at least one first data entry comprises a verifiable data entry corresponding to a verified data source, and the system is further configured to verify, based on communication with the verified data source, the verifiable data entry.
  - 5. The system of claim 4, wherein the verifiable data entry comprises a user's street address.
  - 6. The system of claim 1, wherein the system is further configured to:
    - receive a second uploaded document having a second document type of the plurality of predetermined document types;
      
      categorize the second uploaded document in a second document category based on the second document type, the second document category indicating the presence of at least a second sensitive data entry comprising additional personally identifiable information, the personally identifiable information of the first sensitive data entry and the additional personally identifiable information of the second sensitive data entry each comprising a bank statement entry, a tax return entry, a social security number, or a driver's license entry;
      
      extract at least the second sensitive data entry from the second uploaded document; and
      
      tokenize the extracted second sensitive data entry.
  - 7. The system of claim 1, wherein categorizing the uploaded document further comprises using a deep learning machine library.
  - 8. The system of claim 1, wherein the plurality of predetermined document types comprises bank statements, tax returns, social security cards, and driver's licenses.
  - 9. The system of claim 1, wherein extracting the at least one first data entry from the first uploaded document further comprises utilizing optical character recognition.
  - 10. The system of claim 1, wherein the normalized first data entry used to identify the first subset of user account data entries comprises one or more of a full name associated with an existing user account, a social security number associated with the existing user account, and a soundex of the full name.
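
Claim 10 names a soundex of the full name as one of the values used to narrow the account subset. The sketch below implements the standard American Soundex code and a per-word key for a full name. The helper names `soundex` and `name_key` are hypothetical, and encoding each name part separately is an assumption, since the claim does not say how the full name is encoded.

```python
# Letter-to-digit table of the standard American Soundex code.
_CODES = {}
for _digit, _letters in {
    "1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R",
}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code of a single name part."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeat across it is kept
    return (first + "".join(digits) + "000")[:4]


def name_key(full_name: str) -> tuple:
    """Hypothetical filter key: one Soundex code per whitespace-separated part."""
    return tuple(soundex(part) for part in full_name.split())


# Spelling variants collapse to the same key, so either spelling
# selects the same candidate accounts during filtering.
same_bucket = name_key("Jon Smyth") == name_key("John Smith")
```

A phonetic key of this kind keeps likely candidates in the filtered subset even when extraction or data entry has altered the spelling of a name.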

11. A document source identification system comprising:
- one or more processors; and
  
  a memory in communication with the one or more processors and storing instructions that, when executed by the one or more processors, are configured to cause the system to:
  
  receive a first uploaded document having a first document type of a plurality of predetermined document types each associated with a document category;
  
  categorize the first uploaded document in a first document category based on the first document type;
  
  extract at least one first data entry from the first uploaded document based on the first document category;
  
  identify a data entry type associated with the at least one first data entry at least in part based on one or more of a proximate location of the at least one first data entry in the first uploaded document, a presence of one or more additional extractable data entries, and a proximate location of the one or more additional extractable data entries;
  
  normalize the at least one first data entry according to a business ruleset to produce a normalized first data entry;
  
  filter, based on one or more of the normalized first data entry or the data entry type associated with the normalized first data entry, a user account database having a plurality of existing user accounts data entries each corresponding with an existing user account to identify a first subset of existing user account data entries;
  
  determine whether the normalized first data entry matches a first existing user account data entry in the first subset of existing user account data entries;
  
  when the normalized first data entry is determined to match the first existing user account data entry, link the first uploaded document to a first existing user account associated with the first user account data entry; and
  
  when the normalized first data entry is determined to not match the first existing user account data entry:
  
  score each existing user account data entry in the first subset of existing user account data entries using a machine learning trained probabilistic model;
  
  identify, via the machine learning trained probabilistic model, a first probabilistic existing user account data entry having the highest score; and
  
  link the first uploaded document to a second existing user account associated with the first probabilistic existing user account data entry.
- - 12. The system of claim 11, wherein the at least one first data entry comprises a sensitive data entry comprising personally identifiable information comprising a bank statement entry, a tax return entry, a social security number, or a driver's license entry, the instructions being further configured to cause the system to, in response to extracting the sensitive data entry, normalize the sensitive data entry and tokenize the normalized sensitive data entry.
  - 13. The system of claim 11, wherein categorizing the uploaded document further comprises using a deep learning machine library.
  - 14. The system of claim 11, wherein the plurality of predetermined document types comprises bank statements, tax returns, social security cards, and driver's licenses.
  - 15. The system of claim 11, wherein extracting the at least one first data entry from the first uploaded document further comprises utilizing optical character recognition.
  - 16. The system of claim 11, wherein the normalized first data entry used to identify the first subset of user account data entries comprises one or more of a full name associated with an existing user account, a social security number associated with the existing user account, and a soundex of the full name.
  - 17. The system of claim 11, wherein the machine learning trained probabilistic model further comprises using a random forest machine learning classifier.

18. A document source identification method comprising:
- receiving a first uploaded document;
  
  extracting at least one first data entry from the first uploaded document;
  
  identifying a data entry type associated with the at least one first data entry at least in part based on one or more of a proximate location of the at least one first data entry in the first uploaded document, a presence of one or more additional extractable data entries, and a proximate location of the one or more additional extractable data entries;
  
  normalizing the extracted at least one first data entry according to a business ruleset to produce a normalized first data entry;
  
  executing a deterministic identification search, comprising:
  
  filtering a plurality of the user account data entries in a user account database based on one or more of the normalized first data entry or the data entry type associated with the normalized first data entry to identify a first subset of user account data entries, each user account data entry in the first subset of user account data entries being associated with an existing user account; and
  
  determining that the normalized first data entry matches zero, one, or more than one user account data entries in the first subset of user account data entries;
  
  executing, in response to determining that the normalized first data entry matches zero or more than one user account data entries in the first subset of user account data entries, a probabilistic identification search using a machine learning trained probabilistic model to identify a highest ranked user account data entry in the first subset of user account data entries;
  
  linking the first uploaded document to a first existing user account associated with either the one matching user account data entry in the first subset of user account data entries or the highest ranked user account data entry in the first subset of user account data entries.
- - 19. The method of claim 18, wherein the first uploaded document has a first document type of a plurality of predetermined document types each associated with a corresponding document category, the method further comprising:
    - determining a first document category corresponding to the first document type; and
      
      extracting the at least one first data entry from the first uploaded document based on the first document type.
  - 20. The method of claim 18, wherein the at least one first data entry comprises a sensitive data entry and normalizing the extracted at least one first data entry comprises normalizing the sensitive data entry, the method further comprising tokenizing the normalized sensitive data entry before executing the deterministic identification search.
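
Claim 20 orders the steps as normalize, tokenize, then search. One way this ordering can work is sketched below, under the assumption that tokenization is a deterministic keyed hash, so that equal normalized values yield equal tokens and exact matching can run on tokens alone. The claims do not specify a tokenization scheme; `normalize_ssn`, `tokenize`, and the demo key are hypothetical.

```python
import hashlib
import hmac
import re


def normalize_ssn(raw: str) -> str:
    """Assumed ruleset for a social security number: keep the nine digits only."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 9:
        raise ValueError("expected nine digits")
    return digits


def tokenize(normalized: str, key: bytes) -> str:
    """Replace a sensitive value with a keyed, deterministic token (HMAC-SHA256)."""
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Demo key only; a real deployment would load the key from a secret store.
DEMO_KEY = b"demo-key-not-for-production"

# Two differently formatted uploads of the same number yield the same token.
token_a = tokenize(normalize_ssn("123-45-6789"), DEMO_KEY)
token_b = tokenize(normalize_ssn("123 45 6789"), DEMO_KEY)
tokens_match = hmac.compare_digest(token_a, token_b)
```

Normalizing before tokenizing matters in this sketch: without it, formatting differences alone would produce different tokens and defeat the exact match.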

Current Assignee
Capital One Services LLC (Capital One Financial Corporation)
Original Assignee
Capital One Services LLC (Capital One Financial Corporation)
Inventors
Suriyanarayanan, Vikram; Parker, Joel; Shin, Myoungho; Shine, Deepak; Lam, Nelson; Mokha, Jaideep; Sabbineni, Ravali
Primary Examiner(s)
Cese, Kenny A

Application Number

US16/011,717
Time in Patent Office

371 Days
CPC Class Codes

G06F 21/60   Protecting data

G06F 40/284   Lexical analysis, e.g. toke...

G06F 40/295   Named entity recognition

G06N 20/20   Ensemble learning

G06N 5/01   Dynamic search techniques; ...

G06N 7/01   Probabilistic graphical mod...

G06V 10/96   Management of image or vide...

G06V 30/40   Document-oriented image-bas...

G06V 30/416   Extracting the logical stru...

H04L 67/306   User profiles
