Method and system for discovering suspicious account groups

US 9,684,649 B2
Filed: 12/28/2012
Issued: 06/20/2017
Est. Priority Date: 08/21/2012
Status: Active Grant

Abstract

In one exemplary embodiment, a system for discovering suspicious account groups establishes a language model for each account of a first group of accounts, according to the post contents from that account during a first time interval, to describe the speech of the account, and compares the similarity among the language models of the first group of accounts to cluster the first group of accounts. For newly added data collected during a second time interval, the system discovers near-synonyms of at least one monitored vocabulary set and updates a plurality of language models of a second group of accounts with the near-synonyms. The system further integrates the first and the second groups of accounts and re-clusters the integrated group of accounts.
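
The first stage described in the abstract (one language model per account, then clustering by model similarity) can be sketched as follows. This is a minimal illustration, not the patented implementation: the whitespace tokenizer, the cosine similarity measure, the greedy single-pass clustering, the 0.8 threshold, and all account names and posts are assumptions introduced here. The only element taken from the source is that a model is expressed as the probability of occurrence of monitored vocabulary elements in an account.

```python
from collections import Counter
from math import sqrt


def language_model(posts, vocabulary):
    """Relative frequency of each monitored term among an account's tokens."""
    tokens = [tok for post in posts for tok in post.lower().split()]
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for accounts with no text
    return {term: counts[term] / total for term in vocabulary}


def cosine(a, b):
    """Cosine similarity of two models that share the same vocabulary keys."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(models, threshold=0.8):
    """Greedy clustering (an assumption): join the first cluster whose
    first member is at least `threshold` similar, else start a new one."""
    clusters = []
    for account, model in models.items():
        for group in clusters:
            if cosine(models[group[0]], model) >= threshold:
                group.append(account)
                break
        else:
            clusters.append([account])
    return clusters


# Hypothetical monitored vocabulary and downloaded posts.
vocabulary = ["discount", "casino", "free"]
posts_by_account = {
    "acct_a": ["free casino bonus", "casino free spins"],
    "acct_b": ["casino free chips today"],
    "acct_c": ["big discount on shoes"],
}
models = {acct: language_model(posts, vocabulary)
          for acct, posts in posts_by_account.items()}
clusters = cluster(models)
print(clusters)
```

Here `acct_a` and `acct_b` use the monitored terms in the same proportions and end up in one cluster, while `acct_c` forms its own. A production system would likely replace the greedy pass with a proper clustering algorithm.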

20 Citations

17 Claims

1. A method for discovering suspicious account groups, comprising:
- under a control of at least one hardware processor, receiving a monitoring website table and at least one monitored vocabulary set containing a plurality of elements;
  
  downloading a first group of accounts and one or more post contents corresponding to each account of the first group of accounts from the monitoring website during a first time interval;
  
  establishing a language model, for each account of the first group of accounts, according to the one or more post contents from each account of the first group of accounts during the first time interval, to describe a linguistic fashion for each account, the language model being expressed at least partly as a probability of an occurrence of at least one element of the at least one monitored vocabulary set in an account;
  
  comparing a similarity among a first group of language models of the first group of accounts to cluster the first group of accounts;
  
  downloading newly added data including a second group of accounts and one or more post contents corresponding to each account of the second group of accounts from the monitoring website during a second time interval;
  
  obtaining one or more homonyms synonyms in the newly added data of at least one element of the at least one monitored vocabulary set corresponding to the first group of accounts, comprising the sub-steps of fetching one or more features through a previous feature window and a next feature window of each monitored vocabulary in the at least one monitored vocabulary set; and
  
  converting a weight of an original word of the at least one monitored vocabulary set into a corresponding weight of a homonym synonym;
  
  updating the first group of language models with the one or more homonyms synonyms;
  
  integrating the first and the second groups of accounts to create an integrated group of accounts;
  
  rebuilding a language model for each of the integrated group of accounts to create a second group of language models based on the step of updating the first group of language models with the one or more homonyms synonyms;
  
  clustering the integrated group of accounts according to the determined similarity among the integrated group of accounts based on the second group of language models;
  
  determining at least one suspicious account group after the step of clustering according to a level of homogeneity among at least account groups of the integrated group of accounts; and
  
  determining interaction connection among accounts of the integrated group of accounts based on a result of the step of identifying at least one suspicious account group.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
  - 2. The method as claimed in claim 1, said method further includes:
    - continuously determining at least one suspicious account group by downloading newly added data during a plurality of updated time intervals.
  - 3. The method as claimed in claim 2, said method further includes:
    - for said each updated time interval, updating said plurality of homonyms synonyms to at least one existing language model, and for each new account different from the previous group of accounts during said each update time interval, re-establishing a language model of said new account to describe its post contents.
  - 4. The method as claimed in claim 1, wherein the discovering the one or more homonyms synonyms of the at least a monitored vocabulary set is through the previous and the next feature windows of each monitored vocabulary of said at least a monitored vocabulary set to capture at least one feature, to determine whether one or more new words of the plurality of new added data belong to at least one homonym synonym of said monitored vocabulary.
  - 5. The method as claimed in claim 4, wherein said at least one feature is one or more features chosen from a group of features consisting of a keyword pattern, a part of speech pattern, a concept pattern, and a word string similarity.
  - 6. The method as claimed in claim 1, wherein establishing said language model of said account further includes:
    - training said language model by performing a word segmentation processing and a feature capturing of a linguistic fashion on a post contents corresponding to said account.
  - 7. The method as claimed in claim 1, said method further includes:
    - establishing a word pair table, wherein each word pair of said word pair table includes a first word and a second word, said first word is a monitored vocabulary of said at least one monitored vocabulary set, and said second word is a candidate homonym synonym.
  - 8. The method as claimed in claim 7, said method further includes:
    - establishing a target window and a candidate window of said word pair, and capturing one or more features from said target window and said candidate window; and
      
      integrating a word distance between said first word and said second word and one or more distances of said one or more different features to calculate the similarity of said first word and said second word according to an integrated distance.
  - 9. The method as claimed in claim 1, said method further includes:
    - converting a first weight of each word of said at least a monitored vocabulary set to obtain a second weight of each homonym synonym of the one or more homonyms synonyms to update the one or more homonyms synonyms to said second group of language models of said second group of accounts.
  - 10. The method as claimed in claim 1, said method further includes:
    - according to said second group of language models of said second group of accounts, re-clustering said integrated group of accounts by an increment clustering algorithm, to discover one or more new group of accounts.
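
Claims 4 and 7 to 9 describe a word pair table, feature windows around each word, an integrated distance, and a weight conversion. The sketch below illustrates one way these pieces could fit together. Everything specific is an assumption made for illustration: the window size of two tokens, Jaccard distance over window words as the "keyword pattern" feature, `difflib.SequenceMatcher` as the word string similarity, the equal mixing factor `alpha`, the 0.2 acceptance threshold, the linear weight conversion, and the sample posts. The claims do not fix any of these choices.

```python
from difflib import SequenceMatcher


def windows(tokens, index, size=2):
    """Previous and next feature windows around tokens[index]."""
    return tokens[max(0, index - size):index], tokens[index + 1:index + 1 + size]


def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| over the words in two windows."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def integrated_distance(target_post, target, cand_post, cand, alpha=0.5):
    """Mix a word distance with a context (window) distance."""
    t_tokens, c_tokens = target_post.split(), cand_post.split()
    t_prev, t_next = windows(t_tokens, t_tokens.index(target))  # target window
    c_prev, c_next = windows(c_tokens, c_tokens.index(cand))    # candidate window
    context = (jaccard_distance(t_prev, c_prev)
               + jaccard_distance(t_next, c_next)) / 2
    word = 1.0 - SequenceMatcher(None, target, cand).ratio()
    return alpha * word + (1 - alpha) * context


def converted_weight(original_weight, distance):
    """Assumed conversion: scale the original word's weight by closeness."""
    return original_weight * (1.0 - distance)


# Hypothetical word pair table: (monitored word, candidate word).
word_pairs = [("casino", "cas1no"), ("casino", "lottery")]
# Hypothetical posts in which each word occurs.
posts = {
    "casino": "win big at the casino tonight with bonus",
    "cas1no": "win big at the cas1no tonight with bonus",
    "lottery": "win big at the lottery tomorrow for charity",
}
original_weights = {"casino": 0.6}

distances = {
    (t, c): integrated_distance(posts[t], t, posts[c], c) for t, c in word_pairs
}
near_synonyms = [c for (t, c), d in distances.items() if d <= 0.2]
new_weights = {
    c: converted_weight(original_weights[t], d)
    for (t, c), d in distances.items() if d <= 0.2
}
print(near_synonyms, new_weights)
```

With these sample posts, the obfuscated spelling `cas1no` shares both windows with `casino` and differs by one character, so it is accepted as a candidate; `lottery` differs in both spelling and following context and is rejected.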

11. A system for discovering suspicious account groups, comprising:
- a language model training device receiving a monitoring website table and at least one monitored vocabulary set containing a plurality of elements, receiving a first group of accounts and one or more post contents corresponding to each account of the first group of accounts downloaded from the monitoring website during a first time interval, and establishing a language model, for each account of the first group of accounts, according to the one or more post contents from each account of the first group of accounts during the first time interval, to describe a linguistic fashion for each account, the language model being expressed at least partly as a probability of an occurrence of at least one element of the at least one monitored vocabulary set in an account, the language model training device further receiving newly added data including a second group of accounts and one or more post contents corresponding to each account of the second group of accounts downloaded from the monitoring website during a second time interval;
  
  an account clustering device clustering the first group of accounts according to a similarity of a first group of language models of the first group of accounts;
  
  a near-synonym identification device discovering one or more near-synonyms of at least one element of the at least one monitored vocabulary set in the newly added data during a second time interval, and updating the one or more near-synonyms to a second group of language models of a second group of accounts; and
  
  an incremental account clustering device updating the first group of language models with the one or more homonyms synonyms, integrating the first and the second groups of accounts to create an integrated group of accounts, rebuilding a language model for each of the integrated group of accounts to create a second group of language models based on the step of updating the first group of language models with the one or more homonyms synonyms and re-clustering the integrated group of accounts according to the determined similarity among the integrated group of accounts based on the second group of language models;
  
  wherein to discover the one or more synonyms in the newly added data the system is configured to fetch one or more features through a previous feature window and a next feature window of each monitored vocabulary in the at least one monitored vocabulary set; and
  
  convert a weight of an original word of the at least one monitored vocabulary set into a corresponding weight of a homonym synonym; and
  
  wherein the system is further configured to determine at least one suspicious account group after the step of clustering according to a level of homogeneity among at least account groups of the integrated group of accounts, and determine interaction connection among accounts of the integrated group of accounts based on a result of the step of identifying at least one suspicious account group.
- Dependent claims (12, 13, 14, 15, 16, 17)
  - 12. The system as claimed in claim 11, wherein for each updated time interval of a plurality of updated time intervals, said homonym synonym identification device updates said one or more homonyms synonyms on at least one existing language model, and for each new account of at least one new account different from a previous group of accounts during the each updated time interval, re-establishes a language model of said new account to describe its post contents.
  - 13. The system as claimed in claim 11, wherein said homonym synonym identification device captures at least one feature from the previous and the next feature windows of each monitored vocabulary of said at least a monitored vocabulary set, to determine whether one or more new words of the plurality of new added data belong to at least one homonym synonym of said monitored vocabulary.
  - 14. The system as claimed in claim 13, wherein said at least one feature is one or more features chosen from a group of features consisting of a keyword pattern, a part of speech pattern, a concept pattern, and a word string similarity.
  - 15. The system as claimed in claim 11, said system further includes:
    - a word pair table, wherein each word pair in said word pair table includes a monitored vocabulary in said at least one monitored vocabulary set, and a candidate homonym synonym of said monitored vocabulary.
  - 16. The system as claimed in claim 15, wherein for the monitored vocabulary and a candidate homonym synonym of each pair in the word pair table, said homonym synonym identification device fetches one or more partial words from a corresponding post respectively, and saves the one or more partial words respectively corresponding to the monitored vocabulary and the candidate homonym synonym as a target window and a candidate window, respectively.
  - 17. The system as claimed in claim 16, wherein said homonym synonym identification device fetches one or more features from said target window and said candidate window.
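
The final steps of claims 1 and 11 (determining suspicious account groups from a level of homogeneity, then determining interaction connections among accounts) could be sketched as below. The claims do not define how homogeneity is measured, so this example assumes the mean pairwise cosine similarity of the language models within a group, a minimum group size of two, and a 0.9 threshold. The interaction list of (source, destination) pairs and all account names are likewise hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two models that share the same vocabulary keys."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def homogeneity(group, models):
    """Assumed measure: mean pairwise similarity inside one account group."""
    pairs = list(combinations(group, 2))
    if not pairs:
        return 0.0  # a single account has no pairwise similarity
    return sum(cosine(models[x], models[y]) for x, y in pairs) / len(pairs)


def suspicious_groups(clusters, models, min_size=2, threshold=0.9):
    """Groups that are large enough and whose members post alike."""
    return [g for g in clusters
            if len(g) >= min_size and homogeneity(g, models) >= threshold]


def interaction_edges(group, interactions):
    """Keep only interactions whose two endpoints are both in the group."""
    members = set(group)
    return [(src, dst) for src, dst in interactions
            if src in members and dst in members]


# Hypothetical language models and clusters for an integrated account group.
models = {
    "acct_a": {"casino": 0.30, "free": 0.30},
    "acct_b": {"casino": 0.25, "free": 0.25},
    "acct_c": {"casino": 0.00, "free": 0.10},
}
clusters = [["acct_a", "acct_b"], ["acct_c"]]
# Hypothetical reply/mention records as (source, destination).
interactions = [("acct_a", "acct_b"), ("acct_b", "acct_c"), ("acct_b", "acct_a")]

flagged = suspicious_groups(clusters, models)
edges = [interaction_edges(g, interactions) for g in flagged]
print(flagged, edges)
```

In this sample, only the two-account group is flagged, and the interaction to `acct_c` is dropped because that account lies outside the flagged group.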

Current Assignee: Industrial Technology Research Institute
Original Assignee: Industrial Technology Research Institute
Inventors: Shen, Min-Hsin; Li, Ching-Hsien; Chiu, Chung-Jen
Primary Examiner(s): Armstrong, Angela A
Application Number: US13/729,681
Publication Number: US 20140058723A1
Time in Patent Office: 1,635 Days
CPC Class Codes:
G06F 40/247   Thesauruses; Synonyms
G06F 40/279   Recognition of textual enti...
G06F 40/30   Semantic analysis
