Identifying terms
First Claim
1. A method comprising:
receiving, for each of multiple accounts, a document associated with the account;
identifying the accounts that have been designated as spam accounts;
merging the documents that are associated with the accounts that have been designated as spam accounts, into a single, merged document;
determining, for each of one or more terms that occur in the merged document, a blacklist term frequency (BTF) that represents a number of times that the term occurs in the merged document;
determining a number of accounts that have not been designated as spam accounts;
determining, for each of the terms, a number of the documents that are associated with accounts that have not been designated as spam accounts, in which the term occurs;
determining, for each of the terms, an inverse document frequency (IDF) for the term based on the number of accounts that have not been designated as spam accounts, and the number of documents that are associated with the accounts that have not been designated as spam accounts, in which the term occurs;
determining, for each of the one or more terms, a blacklist term frequency-inverse document frequency (BTF-IDF) score by multiplying the blacklist term frequency for the term by the inverse document frequency for the term;
selecting, as spam terms, one or more of the terms whose respective BTF-IDF score satisfies a threshold; and
automatically determining whether to designate a new account as a spam account based at least on identifying an occurrence of one or more of the spam terms in a document associated with the new account.
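The steps of claim 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the whitespace tokenizer, the smoothed logarithmic IDF, the "greater than or equal" reading of "satisfies a threshold", and all function names are hypothetical choices that the claim does not specify.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: terms are lowercase whitespace-separated tokens.
    return text.lower().split()


def btf_idf_scores(documents, spam_accounts):
    """documents maps account id -> document text; spam_accounts is a set of ids."""
    # Merge the documents of all spam-designated accounts into one document.
    merged = " ".join(text for acct, text in documents.items() if acct in spam_accounts)
    btf = Counter(tokenize(merged))  # blacklist term frequency per term

    # One term set per document of an account not designated as spam.
    non_spam_docs = [
        set(tokenize(text)) for acct, text in documents.items() if acct not in spam_accounts
    ]
    n_non_spam = len(non_spam_docs)

    scores = {}
    for term, freq in btf.items():
        # Number of non-spam documents in which the term occurs.
        df = sum(1 for doc in non_spam_docs if term in doc)
        # Assumed IDF form (smoothed to avoid division by zero).
        idf = math.log((1 + n_non_spam) / (1 + df))
        scores[term] = freq * idf  # BTF-IDF = BTF multiplied by IDF
    return scores


def select_spam_terms(scores, threshold):
    # Assumption: a score "satisfies" the threshold when it is >= threshold.
    return {term for term, score in scores.items() if score >= threshold}


def contains_spam_term(new_document, spam_terms):
    # Signal for a new account: at least one spam term occurs in its document.
    return any(token in spam_terms for token in tokenize(new_document))
```

A term that is frequent in the merged spam document but rare among non-spam documents receives a high score, while a term that is common everywhere is discounted by its low IDF.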
Abstract
Methods, systems, and apparatus, including computer programs encoded on computer storage media, are described for identifying target terms, e.g., spam terms within a collection of documents. In one aspect, methods can include identifying spam terms by calculating a blacklist term frequency-inverse document frequency (BTF-IDF) score for multiple terms, and by selecting, as the spam terms, the terms that have scores above or below a threshold score. The multiple terms may be derived from documents that are associated with accounts that have been designated as spam accounts.
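As a worked formula, the score described above can be written as below. The claims only state that the IDF is "based on" the two non-spam counts, so the smoothed logarithmic form shown here is an assumption borrowed from conventional TF-IDF, and the symbols $N$, $n_t$, and $\tau$ are introduced purely for illustration.

```latex
% N    : number of accounts not designated as spam
% n_t  : number of non-spam documents in which term t occurs
% BTF(t): number of occurrences of t in the merged spam document
\mathrm{IDF}(t) = \log\frac{1 + N}{1 + n_t},
\qquad
\mathrm{BTF\text{-}IDF}(t) = \mathrm{BTF}(t)\cdot\mathrm{IDF}(t),
\qquad
t \text{ is selected as a spam term if } \mathrm{BTF\text{-}IDF}(t) \ge \tau .
```

For example, with $N = 2$, $n_t = 0$, and $\mathrm{BTF}(t) = 2$, the score is $2\log 3 \approx 2.20$.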
18 Claims
4. The method of claim 1 comprising associating spam likelihood to each of the one or more terms in the merged document, the spam likelihood associated with a term being proportional to the determined BTF-IDF score associated with the term.
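One way to read claim 4 is as a rescaling of the scores. The sketch below assumes non-negative BTF-IDF scores and picks the constant of proportionality as one over the largest score, so likelihoods fall in the range 0 to 1; both the function name and that normalization are hypothetical.

```python
def spam_likelihoods(scores):
    """Map term -> likelihood proportional to its BTF-IDF score."""
    top = max(scores.values(), default=0.0)
    if top <= 0:
        # No positive scores: nothing looks spam-like.
        return {term: 0.0 for term in scores}
    # Assumed proportionality constant: 1 / (largest score).
    return {term: score / top for term, score in scores.items()}
```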
5. The method of claim 1 comprising:
generating a data structure of spam terms based on sorting the terms included in the merged document based on the respective determined BTF-IDF scores; and
providing the generated data structure for assessing whether a given account is a spam account or a non-spam account.
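The sorted data structure of claim 5 could be as simple as a ranked list of (term, score) pairs. The sketch below is one possible shape under that assumption; the function name and the optional `top_k` cutoff are hypothetical and are not taken from the claim.

```python
def build_spam_term_list(scores, top_k=None):
    """Return (term, score) pairs sorted by descending BTF-IDF score."""
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    # Slicing with None keeps the full ranking.
    return ranked[:top_k]
```

A downstream classifier could then consume only the highest-ranked entries when assessing a given account.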
6. The method of claim 1, wherein an account from among the multiple accounts is associated with an entity that requested to present entity-related information at a geo-location represented on an online map, and wherein the entity-related information is stored in the document associated with the account and includes the geo-location, the entity's identifier, description of products or services offered by the entity, and one or more of the entity's contact info and the entity's website information.
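The document described in claim 6 could be modeled as a small record. The sketch below is illustrative only: the class name, field names, and the `to_text` helper are hypothetical, and the validation reflects one reading of "one or more of" the contact info and website information.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ListingDocument:
    geo_location: Tuple[float, float]  # assumed (latitude, longitude)
    identifier: str                    # the entity's identifier, e.g. a business name
    description: str                   # products or services offered by the entity
    contact_info: Optional[str] = None
    website: Optional[str] = None

    def __post_init__(self):
        # The claim requires at least one of contact info or website information.
        if self.contact_info is None and self.website is None:
            raise ValueError("contact_info or website is required")

    def to_text(self):
        # Flatten the textual fields so the document can be tokenized into terms.
        parts = [self.identifier, self.description, self.contact_info, self.website]
        return " ".join(part for part in parts if part)
```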
7. A nonvolatile computer storage medium encoded with a computer program, the program comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
receiving, for each of multiple accounts, a document associated with the account;
identifying the accounts that have been designated as spam accounts;
merging the documents that are associated with the accounts that have been designated as spam accounts, into a single, merged document;
determining, for each of one or more terms that occur in the merged document, a blacklist term frequency (BTF) that represents a number of times that the term occurs in the merged document;
determining a number of accounts that have not been designated as spam accounts;
determining, for each of the terms, a number of the documents that are associated with accounts that have not been designated as spam accounts, in which the term occurs;
determining, for each of the terms, an inverse document frequency (IDF) for the term based on the number of accounts that have not been designated as spam accounts, and the number of documents that are associated with the accounts that have not been designated as spam accounts, in which the term occurs;
determining, for each of the one or more terms, a blacklist term frequency-inverse document frequency (BTF-IDF) score by multiplying the blacklist term frequency for the term by the inverse document frequency for the term;
selecting, as spam terms, one or more of the terms whose respective BTF-IDF score satisfies a threshold; and
automatically determining whether to designate a new account as a spam account based at least on identifying an occurrence of one or more of the spam terms in a document associated with the new account.
10. The nonvolatile computer storage medium of claim 7, wherein the operations further comprise associating spam likelihood to each of the one or more terms in the merged document, the spam likelihood associated with a term being proportional to the determined BTF-IDF score associated with the term.
11. The nonvolatile computer storage medium of claim 7, wherein the operations further comprise:
generating a data structure of spam terms based on sorting the terms included in the merged document based on the respective determined BTF-IDF scores; and
providing the generated data structure for assessing whether a given account is a spam account or a non-spam account.
12. The nonvolatile computer storage medium of claim 7, wherein an account from among the multiple accounts is associated with an entity that requested to present entity-related information at a geo-location represented on an online map, and wherein the entity-related information is stored in the document associated with the account and includes the geo-location, the entity's identifier, description of products or services offered by the entity, and one or more of the entity's contact info and the entity's website information.
13. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
receiving, for each of multiple accounts, a document associated with the account;
identifying the accounts that have been designated as spam accounts;
merging the documents that are associated with the accounts that have been designated as spam accounts, into a single, merged document;
determining, for each of one or more terms that occur in the merged document, a blacklist term frequency (BTF) that represents a number of times that the term occurs in the merged document;
determining a number of accounts that have not been designated as spam accounts;
determining, for each of the terms, a number of the documents that are associated with accounts that have not been designated as spam accounts, in which the term occurs;
determining, for each of the terms, an inverse document frequency (IDF) for the term based on the number of accounts that have not been designated as spam accounts, and the number of documents that are associated with the accounts that have not been designated as spam accounts, in which the term occurs;
determining, for each of the one or more terms, a blacklist term frequency-inverse document frequency (BTF-IDF) score by multiplying the blacklist term frequency for the term by the inverse document frequency for the term;
selecting, as spam terms, one or more of the terms whose respective BTF-IDF score satisfies a threshold; and
automatically determining whether to designate a new account as a spam account based at least on identifying an occurrence of one or more of the spam terms in a document associated with the new account.
16. The system of claim 13, wherein the operations further comprise associating spam likelihood to each of the one or more terms in the merged document, the spam likelihood associated with a term being proportional to the determined BTF-IDF score associated with the term.
17. The system of claim 13, wherein the operations further comprise:
generating a data structure of spam terms based on sorting the terms included in the merged document based on the respective determined BTF-IDF scores; and
providing the generated data structure for assessing whether a given account is a spam account or a non-spam account.
18. The system of claim 13, wherein an account from among the multiple accounts is associated with an entity that requested to present entity-related information at a geo-location represented on an online map, and wherein the entity-related information is stored in the document associated with the account and includes the geo-location, the entity's identifier, description of products or services offered by the entity, and one or more of the entity's contact info and the entity's website information.
Specification