Identifying terms

US 9,123,046 B1
Filed: 04/27/2012
Issued: 09/01/2015
Est. Priority Date: 04/29/2011
Status: Active Grant

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, are described for identifying target terms, e.g., spam terms within a collection of documents. In one aspect, methods can include identifying spam terms by calculating a blacklist term frequency-inverse document frequency (BTF-IDF) score for multiple terms, and by selecting, as the spam terms, the terms that have scores above or below a threshold score. The multiple terms may be derived from documents that are associated with accounts that have been designated as spam accounts.
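The scoring described in the abstract can be illustrated with a minimal sketch. The claims only say that the IDF is "based on" the count of non-spam accounts and the count of non-spam documents containing the term, so the smoothed logarithm used here is an assumption, as are the whitespace tokenizer, the function names, and the "score at or above threshold" reading of "satisfies a threshold".

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace.
    return text.lower().split()


def btf_idf_scores(documents, spam_accounts):
    """documents: dict of account id -> document text.
    spam_accounts: set of account ids designated as spam."""
    # Merge all spam-account documents into one term sequence and
    # count how often each term occurs there (the BTF).
    merged = []
    for account, text in documents.items():
        if account in spam_accounts:
            merged.extend(tokenize(text))
    btf = Counter(merged)

    # One term set per non-spam document, for document-frequency counts.
    non_spam_docs = [
        set(tokenize(text))
        for account, text in documents.items()
        if account not in spam_accounts
    ]
    n_non_spam = len(non_spam_docs)

    scores = {}
    for term, freq in btf.items():
        df = sum(1 for doc in non_spam_docs if term in doc)
        # Assumed IDF form (add-one smoothing); the source does not fix it.
        idf = math.log((1 + n_non_spam) / (1 + df))
        scores[term] = freq * idf
    return scores


def select_spam_terms(scores, threshold):
    # Assumption: a score "satisfies" the threshold when it is >= threshold.
    return {term for term, score in scores.items() if score >= threshold}


documents = {
    "a": "cheap keys cheap locksmith",
    "b": "cheap cheap deals",
    "c": "family bakery fresh bread",
    "d": "locksmith and bakery",
}
scores = btf_idf_scores(documents, spam_accounts={"a", "b"})
spam_terms = select_spam_terms(scores, threshold=2.0)
```

In this toy data, "cheap" occurs four times in the merged spam document and in no non-spam document, so it is the only term selected at a threshold of 2.0; "locksmith" also appears in a non-spam document and is discounted by its lower IDF.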

18 Claims

1. A method comprising:
- receiving, for each of multiple accounts, a document associated with the account;
  
  identifying the accounts that have been designated as spam accounts;
  
  merging the documents that are associated with the accounts that have been designated as spam accounts, into a single, merged document;
  
  determining, for each of one or more terms that occur in the merged document, a blacklist term frequency (BTF) that represents a number of times that the term occurs in the merged document;
  
  determining a number of accounts that have not been designated as spam accounts;
  
  determining, for each of the terms, a number of the documents that are associated with accounts that have not been designated as spam accounts, in which the term occurs;
  
  determining, for each of the terms, an inverse document frequency (IDF) for the term based on the number of accounts that have not been designated as spam accounts, and the number of documents that are associated with the accounts that have not been designated as spam accounts, in which the term occurs;
  
  determining, for each of the one or more terms, a blacklist term frequency-inverse document frequency (BTF-IDF) score by multiplying the blacklist term frequency for the term by the inverse document frequency for the term;
  
  selecting, as spam terms, one or more of the terms whose respective BTF-IDF score satisfies a threshold; and
  
  automatically determining whether to designate a new account as a spam account based at least on identifying an occurrence of one or more of the spam terms in a document associated with the new account.
- 2. The method of claim 1, wherein determining the IDF for each of the terms is based only on the number of accounts that have not been designated as spam accounts and on the number of documents that are associated with the accounts that have not been designated as spam accounts, in which the term occurs.
- 3. The method of claim 1, wherein the determined BTF-IDF score associated with a term in the merged document satisfies
- 4. The method of claim 1 comprising associating spam likelihood to each of the one or more terms in the merged document, the spam likelihood associated with a term being proportional to the determined BTF-IDF score associated with the term.
- 5. The method of claim 1 comprising:
  - generating a data structure of spam terms based on sorting the terms included in the merged document based on the respective determined BTF-IDF scores; and
    
    providing the generated data structure for assessing whether a given account is a spam account or a non-spam account.
- 6. The method of claim 1, wherein an account from among the multiple accounts is associated with an entity that requested to present entity-related information at a geo-location represented on an online map, and wherein the entity-related information is stored in the document associated with the account and includes the geo-location, the entity's identifier, description of products or services offered by the entity, and one or more of the entity's contact info and the entity's website information.
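
The sorted data structure of claim 5 and the final step of claim 1 (checking a new account's document for spam terms) might look like the sketch below. It assumes the per-term BTF-IDF scores are already available as a dictionary; the function names, the whitespace tokenization, and the `min_hits` parameter are hypothetical and not taken from the claims.

```python
def build_spam_term_list(scores, threshold):
    # Sort terms by descending BTF-IDF score and keep those that meet
    # the threshold (assumed here to mean score >= threshold).
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(term, score) for term, score in ranked if score >= threshold]


def looks_like_spam(document_text, spam_term_list, min_hits=1):
    # Hypothetical decision rule: flag the account when at least
    # `min_hits` spam terms occur in its document.
    tokens = set(document_text.lower().split())
    hits = [term for term, _ in spam_term_list if term in tokens]
    return len(hits) >= min_hits, hits


example_scores = {"cheap": 4.39, "keys": 1.10, "locksmith": 0.41}
spam_term_list = build_spam_term_list(example_scores, threshold=1.0)
flagged, matched = looks_like_spam("Cheap keys cut fast", spam_term_list)
```

Keeping the scores alongside the terms leaves room for a weighting such as the proportional spam likelihood of claim 4, although this sketch only counts matches.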

7. A nonvolatile computer storage medium encoded with a computer program, the program comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
- receiving, for each of multiple accounts, a document associated with the account;
  
  identifying the accounts that have been designated as spam accounts;
  
  merging the documents that are associated with the accounts that have been designated as spam accounts, into a single, merged document;
  
  determining, for each of one or more terms that occur in the merged document, a blacklist term frequency (BTF) that represents a number of times that the term occurs in the merged document;
  
  determining a number of accounts that have not been designated as spam accounts;
  
  determining, for each of the terms, a number of the documents that are associated with accounts that have not been designated as spam accounts, in which the term occurs;
  
  determining, for each of the terms, an inverse document frequency (IDF) for the term based on the number of accounts that have not been designated as spam accounts, and the number of documents that are associated with the accounts that have not been designated as spam accounts, in which the term occurs;
  
  determining, for each of the one or more terms, a blacklist term frequency-inverse document frequency (BTF-IDF) score by multiplying the blacklist term frequency for the term by the inverse document frequency for the term;
  
  selecting, as spam terms, one or more of the terms whose respective BTF-IDF score satisfies a threshold; and
  
  automatically determining whether to designate a new account as a spam account based at least on identifying an occurrence of one or more of the spam terms in a document associated with the new account.
- 8. The nonvolatile computer storage medium of claim 7, wherein determining the IDF for each of the terms is based only on the number of accounts that have not been designated as spam accounts and on the number of documents that are associated with the accounts that have not been designated as spam accounts, in which the term occurs.
- 9. The nonvolatile computer storage medium of claim 7, wherein the determined BTF-IDF score associated with a term in the merged document satisfies
- 10. The nonvolatile computer storage medium of claim 7, wherein the operations further comprise associating spam likelihood to each of the one or more terms in the merged document, the spam likelihood associated with a term being proportional to the determined BTF-IDF score associated with the term.
- 11. The nonvolatile computer storage medium of claim 7, wherein the operations further comprise:
  - generating a data structure of spam terms based on sorting the terms included in the merged document based on the respective determined BTF-IDF scores; and
    
    providing the generated data structure for assessing whether a given account is a spam account or a non-spam account.
- 12. The nonvolatile computer storage medium of claim 7, wherein an account from among the multiple accounts is associated with an entity that requested to present entity-related information at a geo-location represented on an online map, and wherein the entity-related information is stored in the document associated with the account and includes the geo-location, the entity's identifier, description of products or services offered by the entity, and one or more of the entity's contact info and the entity's website information.

13. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  receiving, for each of multiple accounts, a document associated with the account;
  
  identifying the accounts that have been designated as spam accounts;
  
  merging the documents that are associated with the accounts that have been designated as spam accounts, into a single, merged document;
  
  determining, for each of one or more terms that occur in the merged document, a blacklist term frequency (BTF) that represents a number of times that the term occurs in the merged document;
  
  determining a number of accounts that have not been designated as spam accounts;
  
  determining, for each of the terms, a number of the documents that are associated with accounts that have not been designated as spam accounts, in which the term occurs;
  
  determining, for each of the terms, an inverse document frequency (IDF) for the term based on the number of accounts that have not been designated as spam accounts, and the number of documents that are associated with the accounts that have not been designated as spam accounts, in which the term occurs;
  
  determining, for each of the one or more terms, a blacklist term frequency-inverse document frequency (BTF-IDF) score by multiplying the blacklist term frequency for the term by the inverse document frequency for the term;
  
  selecting, as spam terms, one or more of the terms whose respective BTF-IDF score satisfies a threshold; and
  
  automatically determining whether to designate a new account as a spam account based at least on identifying an occurrence of one or more of the spam terms in a document associated with the new account.
- 14. The system of claim 13, wherein determining the IDF for each of the terms is based only on the number of accounts that have not been designated as spam accounts and on the number of documents that are associated with the accounts that have not been designated as spam accounts, in which the term occurs.
- 15. The system of claim 13, wherein the determined BTF-IDF score associated with a term in the merged document satisfies
- 16. The system of claim 13, wherein the operations further comprise associating spam likelihood to each of the one or more terms in the merged document, the spam likelihood associated with a term being proportional to the determined BTF-IDF score associated with the term.
- 17. The system of claim 13, wherein the operations further comprise:
  - generating a data structure of spam terms based on sorting the terms included in the merged document based on the respective determined BTF-IDF scores; and
    
    providing the generated data structure for assessing whether a given account is a spam account or a non-spam account.
- 18. The system of claim 13, wherein an account from among the multiple accounts is associated with an entity that requested to present entity-related information at a geo-location represented on an online map, and wherein the entity-related information is stored in the document associated with the account and includes the geo-location, the entity's identifier, description of products or services offered by the entity, and one or more of the entity's contact info and the entity's website information.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Yuksel, Baris; Krulec, Ana
Primary Examiner(s): OUELLETTE, JONATHAN P
Application Number: US13/459,018
Time in Patent Office: 1,222 Days
Field of Search: 705 11-912, 706/45
US Class Current: 1/1
CPC Class Codes

G06F 21/561   Virus type analysis

G06F 2221/2117   User registration

G06N 5/00   Computing arrangements usin...

G06Q 30/018   Certifying business or prod...

H04L 51/212   using filtering or selectiv...

H04L 63/12   Applying verification of th...

H04L 63/1441   Countermeasures against mal...
