System and methods for quantitative assessment of information in natural language contents and for determining relevance using association data
First Claim
1. A method implemented on a computer comprising a processor, and for determining relevance between a text content and an object or a topic, the method comprising:
receiving a text content comprising one or more words or phrases or sentences as terms, and tokenizing the text content into one or more tokens, each being an instance of a term in the text content;
identifying a grammatical attribute, or a semantic attribute, or an external term frequency associated with the one or more tokens or terms in the text content, wherein the grammatical attribute includes at least a subject, a predicate or part of a predicate, a modifier in a phrase, a head of a phrase, a sub-phrase of a phrase, an object, a noun, a verb, an adjective, or an adverb, wherein the semantic attribute includes at least semantic roles and attribute values, wherein the external term frequency is obtained from text contents other than the received text content;
determining an importance measure for each token or term based on the grammatical attribute, or the semantic attribute, or the external term frequency;
receiving one or more datasets, wherein each dataset is associated with a name or description representing an object, wherein the object comprises a physical or conceptual object, a topic, or a pre-defined attribute, and wherein each dataset comprises one or more words or phrases as names of properties associated with the corresponding object, wherein the names of properties represent other objects or concepts or topics or attributes related to the object, wherein the names of properties collectively represent a type of definition or representation of the object;
matching at least two tokens or terms in the text content with at least two property names in each of the one or more datasets;
for each of the one or more datasets, producing a score based at least on the importance measure of the token or term that matches a property name in the dataset, when the importance measure is in the form of a term importance score that is calculated based on the external frequency, or based on the grammatical attribute, or based on the semantic attribute or attribute value, and when the score based on the importance measure is in the form of a relevance score, the relevance score is produced as a function of the term importance score; and
marking or selecting one or more of the names or descriptions representing the one or more objects as being relevant to the text content if the corresponding score is above a predefined threshold.
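The steps recited in the claim can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the frequency table, the log-inverse-frequency importance formula, the summation used for the relevance score, and the names `EXTERNAL_FREQ`, `term_importance`, `relevance_score`, and `relevant_objects` are all hypothetical. Only the external-term-frequency branch of the importance measure is shown, and property names are limited to single words for brevity.

```python
import math
import re

# Hypothetical external term frequencies (occurrences per million words in a
# reference corpus other than the input text). The values are invented.
EXTERNAL_FREQ = {"the": 60000.0, "has": 4000.0, "a": 25000.0, "and": 28000.0,
                 "computer": 120.0, "cpu": 8.0, "memory": 90.0, "fast": 150.0}
DEFAULT_FREQ = 50.0  # assumed frequency for terms missing from the table


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_importance(term):
    """Assumed formula: rarer terms in the external corpus score higher."""
    freq = EXTERNAL_FREQ.get(term, DEFAULT_FREQ)
    return math.log(1_000_000 / freq)


def relevance_score(text, property_names):
    """Sum the importance of distinct terms that match property names.

    Returns 0.0 unless at least two distinct terms match, mirroring the
    claim's "at least two tokens or terms" condition.
    """
    matched = set(tokenize(text)) & set(property_names)
    if len(matched) < 2:
        return 0.0
    return sum(term_importance(t) for t in matched)


def relevant_objects(text, datasets, threshold):
    """Return the names of datasets whose score exceeds the threshold."""
    scores = {name: relevance_score(text, props)
              for name, props in datasets.items()}
    return [name for name, score in scores.items() if score > threshold]


# Each dataset maps an object name to the property names that describe it.
datasets = {
    "computer": {"cpu", "memory", "keyboard", "motherboard"},
    "kitchen": {"stove", "sink", "refrigerator"},
}
text = "The computer has a fast CPU and plenty of memory."
print(relevant_objects(text, datasets, threshold=5.0))
```

In this sketch the text is marked as relevant to "computer" because two of its terms ("cpu" and "memory") match property names in that dataset and their combined importance exceeds the threshold; the "kitchen" dataset has no matches and scores zero.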
Abstract
System and methods are disclosed for quantitatively assessing information in natural language contents related to an object name, or a concept or topic name, and for determining the relevance between a text content and one or more concepts or topics. The methods include identifying the grammatical or semantic attributes and other contextual information of terms in the text content, and retrieving an object-specific dataset related to the object name, or an equivalent to a concept or a topic name. The data set includes property names and association-strength values. The methods further include matching the terms in the text content with the property terms in the dataset, and calculating a score as a quantitative measure of the relevance between the text content and the concept or topic.
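One way to picture how association-strength values might enter the score is sketched below. The abstract does not give a formula, so the product of a role weight and an association strength, the weights in `ROLE_WEIGHT`, and the function name `weighted_relevance` are assumptions made for illustration. Terms are assumed to arrive already tagged with a grammatical role, since parsing is outside the scope of the sketch.

```python
# Hypothetical weights per grammatical role; the abstract gives no values.
ROLE_WEIGHT = {"subject": 1.0, "predicate": 0.8, "modifier": 0.5}
DEFAULT_WEIGHT = 0.3  # assumed weight for any role not listed above


def weighted_relevance(tagged_terms, dataset):
    """Sum role weight times association strength over matched terms.

    tagged_terms: list of (term, grammatical_role) pairs.
    dataset: dict mapping property name -> association strength in [0, 1].
    """
    score = 0.0
    for term, role in tagged_terms:
        strength = dataset.get(term)
        if strength is not None:  # the term matches a property name
            score += ROLE_WEIGHT.get(role, DEFAULT_WEIGHT) * strength
    return score


# Object-specific dataset for the object name "computer" (invented values).
computer = {"cpu": 0.99, "memory": 0.95, "keyboard": 0.6}
tagged = [("cpu", "subject"), ("memory", "modifier"), ("desk", "object")]
print(round(weighted_relevance(tagged, computer), 3))
```

Under these assumptions, a strongly associated property appearing as the subject of a sentence contributes more to the score than a weakly associated property appearing as a modifier, and unmatched terms such as "desk" contribute nothing.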
19 Claims
15. A computer system for identifying one or more topics or topic domains or attributes related to a text content, comprising:
a computer processor operable to receive a text content comprising one or more words or phrases or sentences as terms, and to tokenize the text content into one or more tokens, each being an instance of a term in the text content;
identify a grammatical attribute, or a semantic attribute, or an external term frequency associated with the one or more tokens or terms in the text content, wherein the grammatical attribute includes at least a subject, a predicate or part of a predicate, a modifier in a phrase, a head of a phrase, a sub-phrase of a phrase, an object, a noun, a verb, an adjective, or an adverb, wherein the semantic attribute includes at least semantic roles and attribute values, wherein the external term frequency is obtained from text contents other than the received text content;
determine an importance measure for each token or term based on the grammatical attribute, or the semantic attribute, or the external term frequency;
receive one or more datasets, wherein each dataset is associated with a name or description representing an object, wherein the object comprises a physical object or a concept, a topic, a topic domain, or a pre-defined attribute, wherein each dataset comprises one or more words or phrases as names of properties associated with the corresponding object, wherein the names of properties represent other objects or concepts or topics or attributes related to the object, wherein the names of properties collectively represent a type of definition or representation of the object;
match at least two tokens or terms in the text content with at least two property names in each of the one or more datasets;
produce, for each of the one or more datasets, a score based at least on the importance measure of each token or term that matches a property name in the dataset, when the importance measure is in the form of a term importance score that is calculated based on the external frequency, or based on the grammatical attribute, or based on the semantic attribute or attribute value, and when the score based on the importance measure is in the form of a relevance score, the relevance score is produced as a function of the term importance score;
select one or more names or descriptions of the one or more datasets as relevant objects or topics or topic domains or attributes to the text content if the corresponding score is above a predefined threshold; and
display, or provide an instruction to display in a user interface, or to store in the computer storage, the selected one or more names or descriptions, or the score associated with the one or more selected names or descriptions, wherein the function of the score includes serving as a quantitative measure of relevance between the text content and the selected one or more names or descriptions of the objects, or topics or topic domains or attributes.
19. A computer system for searching documents and ranking search results based on association, comprising:
a computer storage storing one or more datasets, each dataset having a name or description representing an object, wherein the object comprises a physical object or a topic or concept, and each dataset comprises one or more words or phrases as names of properties associated with the corresponding named object, wherein the names of properties represent other objects or concepts or topics or attributes related to the object, wherein the names of properties collectively represent a type of definition or representation of the object; and
a computer processor operable to receive a search query,
receive one or more documents each comprising one or more words or phrases or sentences as terms,
tokenize each of the documents into one or more tokens, each being an instance of a term in the text content,
identify a grammatical attribute, or a semantic attribute, or an external term frequency associated with the one or more tokens or terms in the text content, wherein the grammatical attribute includes at least a subject, a predicate or part of a predicate, a modifier in a phrase, a head of a phrase, a sub-phrase of a phrase, an object, a noun, a verb, an adjective, or an adverb, wherein the semantic attribute includes at least semantic roles and attribute values, wherein the external term frequency is obtained from text contents other than the received documents,
determine an importance measure for each token or term based on the grammatical attribute, or the semantic attribute, or the external term frequency,
receive one or more names or descriptions of the datasets, wherein the one or more names or descriptions match a word or phrase in the query,
obtain one or more of the corresponding datasets,
for at least one of the documents, match at least two tokens or terms in the document with at least two property names in at least one of the one or more datasets,
for at least one of the one or more datasets or documents, produce a relevance score based at least on the importance measure of the token or term that matches a property name in the dataset, when the importance measure is in the form of a term importance score that is calculated based on the external frequency, or based on the grammatical attribute, or based on the semantic attribute or attribute value, the relevance score is produced as a function of the term importance score,
select one or more of the documents as relevant documents to the query and rank the selected documents based at least on the relevance score, and
output the ranked documents or document representations as a search result.
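The search-and-rank flow of this claim can be sketched roughly as follows. This is a simplified illustration, not the claimed system: the importance table is a hypothetical stand-in for scores derived from external frequency or grammatical attributes, the relevance score is assumed to be a plain sum, and the names `score_document` and `search` are invented for the example.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_document(doc, properties, importance):
    """Sum the importance of distinct document terms matching property names."""
    matched = set(tokenize(doc)) & properties
    if len(matched) < 2:  # the claim requires at least two matches
        return 0.0
    return sum(importance.get(term, 1.0) for term in matched)


def search(query, documents, datasets, importance):
    """Rank documents against datasets whose name appears in the query."""
    query_terms = set(tokenize(query))
    names = [name for name in datasets if name in query_terms]
    ranked = []
    for doc in documents:
        score = sum(score_document(doc, datasets[n], importance) for n in names)
        if score > 0:  # keep only documents with some relevance
            ranked.append((score, doc))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked]


# Hypothetical stored dataset and term importance scores.
datasets = {"computer": {"cpu", "memory", "keyboard"}}
importance = {"cpu": 3.0, "memory": 2.0, "keyboard": 1.5}
documents = [
    "The CPU and memory were upgraded.",
    "A keyboard, memory, and CPU bundle.",
    "Fresh bread recipes.",
]
for doc in search("computer reviews", documents, datasets, importance):
    print(doc)
```

Note that neither of the returned documents contains the query word "computer"; in this sketch they are ranked purely through terms associated with it in the dataset, which is the sense in which the ranking is "based on association".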
Specification