Semantic analysis of documents to rank terms
First Claim
1. A computer-implemented method comprising:
extracting, by a processor, text from a document;
identifying, by the processor, terms within the extracted text, each of the terms comprising a contiguous grouping of two or more tokens, each token comprising a word;
determining, by the processor, a token value representing a total number of times each token occurs in the document;
determining, by the processor, a token frequency for each of the terms as a function of the token values of tokens in each of the terms; and
ranking, by the processor, the terms using the token frequency determined for each of the terms.
Abstract
A method, apparatus, and computer program product provide a semantic analyzer that produces and ranks semantic terms to reflect their relationship to the theme and topics of a document. The text and the document need not have any relationship to pre-selected keywords before the semantic analyzer performs text extraction. The semantic analyzer extracts text from a document and performs semantic analysis on the extracted text. As a result of the semantic analysis, the semantic analyzer provides a plurality of ranked semantic terms and associates the semantic terms with the document as semantic keywords. The semantic terms define content to be presented with the document, where the content is an advertisement, a link to a remote information resource, or a second document.
19 Claims
1. A computer-implemented method comprising:
extracting, by a processor, text from a document;
identifying, by the processor, terms within the extracted text, each of the terms comprising a contiguous grouping of two or more tokens, each token comprising a word;
determining, by the processor, a token value representing a total number of times each token occurs in the document;
determining, by the processor, a token frequency for each of the terms as a function of the token values of tokens in each of the terms; and
ranking, by the processor, the terms using the token frequency determined for each of the terms.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
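As an illustration only, the steps of claim 1 can be sketched in Python. The claim does not say how a term's token frequency is computed from its token values, so this sketch assumes the arithmetic mean; the function name `rank_terms_by_token_frequency`, the regex tokenizer, and the fixed term length `n` are likewise hypothetical choices made for the example, not details taken from the patent.

```python
import re
from collections import Counter


def rank_terms_by_token_frequency(text, n=2):
    """Rank n-token terms by the mean occurrence count of their tokens.

    Assumptions (not from the claim): lowercase regex tokenization,
    fixed-length terms of n tokens, and the mean as the combining function.
    """
    # Tokens are words; terms may span sentence boundaries in this simple sketch.
    tokens = re.findall(r"[a-z0-9']+", text.lower())

    # Token value: total number of times each token occurs in the document.
    token_values = Counter(tokens)

    # Terms: every contiguous grouping of n tokens (n must be 2 or more).
    terms = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    # Token frequency of a term: a function of its tokens' values (mean assumed).
    scores = {
        term: sum(token_values[tok] for tok in term) / len(term)
        for term in terms
    }

    # Highest token frequency first; ties broken alphabetically for stability.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Calling `rank_terms_by_token_frequency("solar panel cost and solar panel output")` places the term `("solar", "panel")` first, because both of its tokens occur twice while every other token occurs once.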
12. A computer-implemented method comprising:
extracting, by a processor, text from a document;
identifying, by the processor, terms within the extracted text, each of the terms comprising a contiguous grouping of two or more tokens, each token comprising a word;
determining, by the processor, a standard deviation of offset or gap for each of the terms using positions of individual occurrences of each of the terms in the document; and
ranking the terms using the standard deviation of offset or gap for each of the terms.
(Dependent claims: 13, 14, 15, 16, 17)
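The position-based ranking of claim 12 can be sketched in a similar way. This is a minimal example under stated assumptions: an "offset" is taken to be the token index at which a term occurrence starts, a "gap" is the distance between consecutive occurrences, the population standard deviation is used, and terms with a larger spread are ranked first. The claim fixes none of these choices, and the name `rank_terms_by_spread` is hypothetical.

```python
import re
from collections import defaultdict
from statistics import pstdev


def rank_terms_by_spread(text, n=2, use_gap=False):
    """Rank n-token terms by the standard deviation of their offsets or gaps.

    Assumptions (not from the claim): offsets are token indices, the
    population standard deviation is used, and larger spread ranks higher.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())

    # Record the starting position of every occurrence of every term.
    positions = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        positions[tuple(tokens[i:i + n])].append(i)

    scores = {}
    for term, offsets in positions.items():
        if use_gap:
            # Gap: distance between consecutive occurrences of the term.
            values = [b - a for a, b in zip(offsets, offsets[1:])]
        else:
            values = offsets
        # A term with no gaps (a single occurrence) gets a spread of zero.
        scores[term] = pstdev(values) if values else 0.0

    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

With `use_gap=False` a term that appears near both the start and the end of the document scores highest, while `use_gap=True` measures how irregularly its occurrences are spaced. Which of the two measures to use, and in which direction to sort, is left open here.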
18. A non-transitory computer-readable medium on which is encoded program code, the program code comprising:
program code for extracting text from a document;
program code for identifying terms within the extracted text, each of the terms comprising a contiguous grouping of two or more tokens, each token comprising a word;
program code for determining a token value representing a total number of times each token occurs in the document;
program code for determining a token frequency for each of the terms as a function of the token values of tokens in each of the terms; and
program code for ranking the terms using the token frequency determined for each of the terms.
19. A non-transitory computer-readable medium on which is encoded program code, the program code comprising:
program code for extracting text from a document;
program code for identifying terms within the extracted text, each of the terms comprising a contiguous grouping of two or more tokens, each token comprising a word;
program code for determining a standard deviation of offset or gap for each of the terms using positions of individual occurrences of each of the terms in the document; and
program code for ranking the terms using the standard deviation of offset or gap for each of the terms.
Specification