Generation of a semantic model from textual listings
First Claim
1. A method comprising:
receiving, by a processing device, a corpus of textual listings,
  textual listings, in the corpus, including text without a grammatical structure;
tokenizing, by the processing device, each textual listing of the textual listings,
  tokenizing each textual listing including tokenizing at least one of an alphanumeric token or a token that comprises uppercase and lowercase characters;
identifying, by the processing device, main concept words and attribute words in the corpus after tokenizing each textual listing of the textual listings,
  identifying the main concept words and the attribute words including:
    tagging, in each textual listing of the textual listings, at least one word as a head noun word based on at least one of:
      a previously identified main concept word, or
      a head noun identification rule,
    tagging, in the textual listing and after tagging the at least one word, remaining nouns as at least one modifier word, and
    assigning one word of the at least one head noun word as a main concept word and one word of the at least one modifier word as an attribute word;
clustering, by the processing device, words in the corpus based on at least one of the main concept words or the attribute words according to at least one clustering rule,
  the at least one clustering rule including at least one of:
    a first rule preventing two quantitative attribute tokens from being clustered based on a frequency of appearance of the two quantitative attribute tokens in a same listing,
    a second rule preventing clustering of a quantitative attribute token with a qualitative attribute token, or
    a third rule indicating that a first token is to be clustered with a second token when characters of the first token are included in the second token; and
providing, by the processing device and after clustering the words, the main concept words and the attribute words as at least a portion of a semantic model.
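For illustration only, the three clustering rules recited above might be sketched as follows. The claim does not define how a token is classified as quantitative, what co-occurrence frequency blocks clustering, or what "characters included in" means; the digit test in `is_quantitative`, the `max_cooccurrence` threshold, the substring test, and all function names are assumptions made for this sketch. The claim also requires only at least one of the rules, whereas the sketch combines all three.

```python
from collections import Counter
from itertools import combinations


def is_quantitative(token):
    # Assumption: a token is quantitative if it contains a digit.
    return any(ch.isdigit() for ch in token)


def cooccurrence_counts(listings):
    """Count, for each unordered token pair, the listings containing both."""
    counts = Counter()
    for tokens in listings:
        for a, b in combinations(sorted(set(tokens)), 2):
            counts[(a, b)] += 1
    return counts


def may_cluster(a, b, counts, max_cooccurrence=0):
    """Return True if the first and second rules allow a and b to share a cluster."""
    qa, qb = is_quantitative(a), is_quantitative(b)
    # Second rule: never mix a quantitative token with a qualitative one.
    if qa != qb:
        return False
    # First rule: two quantitative tokens that appear together in the same
    # listing more often than the threshold are kept apart.
    if qa and qb and counts[tuple(sorted((a, b)))] > max_cooccurrence:
        return False
    return True


def should_cluster(a, b, counts, max_cooccurrence=0):
    """Third rule (as a substring test): cluster a with b when a occurs inside b."""
    return may_cluster(a, b, counts, max_cooccurrence) and a.lower() in b.lower()
```

Under these assumptions, "2br" and "1200sqft" appearing in the same listing would be kept in separate clusters, while "bed" would be clustered with "bedroom".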
Abstract
A corpus of textual listings is received and main concept words and attribute words therein are identified via an iterative process of parsing listings and expanding a semantic model. During the parsing phase, the corpus of textual listings is parsed to tag one or more head noun words and/or one or more identifier words in each listing based on previously identified main concept words or using a head noun identification rule. Once substantially each listing in the corpus has been parsed in this manner, the expansion phase assigns head noun words as main concept words and modifier words as attribute words, where possible. During the next iteration, the newly identified main concept words and/or attribute words are used to further parse the listings. These iterations are repeated until a termination condition is reached. Remaining words in the corpus are clustered based on the main concept words and attribute words.
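The iterative parse-and-expand process described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: each listing is assumed to be a pre-tokenized list of nouns, the fallback head noun rule (the last token not yet known as an attribute), the `min_support` promotion threshold, and the "no new words" termination condition are all assumptions, and the names `parse` and `build_model` are hypothetical.

```python
from collections import Counter


def parse(listing, concepts, attributes):
    """Tag one head noun in a listing; the remaining tokens become modifiers."""
    # Prefer a token that was already identified as a main concept.
    head = next((t for t in listing if t in concepts), None)
    if head is None:
        # Stand-in head noun rule (assumption): last token not known as an attribute.
        candidates = [t for t in listing if t not in attributes]
        head = candidates[-1] if candidates else None
    modifiers = [t for t in listing if t != head]
    return head, modifiers


def build_model(corpus, min_support=2, max_iterations=10):
    """Alternate parsing and expansion until no new words are identified."""
    concepts, attributes = set(), set()
    for _ in range(max_iterations):
        # Parsing phase: tag a head noun and modifiers in every listing.
        parsed = [parse(listing, concepts, attributes) for listing in corpus]
        # Expansion phase: promote frequently tagged head nouns to main concepts
        # and the modifiers that accompany them to attributes.
        head_counts = Counter(head for head, _ in parsed if head is not None)
        new_concepts = {h for h, n in head_counts.items() if n >= min_support}
        new_attributes = {m for head, mods in parsed if head in new_concepts for m in mods}
        new_attributes -= concepts | new_concepts
        # Assumed termination condition: the iteration added nothing new.
        if new_concepts <= concepts and new_attributes <= attributes:
            break
        concepts |= new_concepts
        attributes |= new_attributes
    return concepts, attributes
```

With the corpus `[["leather", "sofa"], ["red", "sofa"], ["sofa", "blue"]]`, the first pass tags "blue" as the head of the third listing. Once "sofa" has been promoted to a main concept, the second pass re-parses that listing and "blue" becomes an attribute, which is the kind of refinement the iterations are meant to provide.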
20 Claims
1. A method comprising:
receiving, by a processing device, a corpus of textual listings,
  textual listings, in the corpus, including text without a grammatical structure;
tokenizing, by the processing device, each textual listing of the textual listings,
  tokenizing each textual listing including tokenizing at least one of an alphanumeric token or a token that comprises uppercase and lowercase characters;
identifying, by the processing device, main concept words and attribute words in the corpus after tokenizing each textual listing of the textual listings,
  identifying the main concept words and the attribute words including:
    tagging, in each textual listing of the textual listings, at least one word as a head noun word based on at least one of:
      a previously identified main concept word, or
      a head noun identification rule,
    tagging, in the textual listing and after tagging the at least one word, remaining nouns as at least one modifier word, and
    assigning one word of the at least one head noun word as a main concept word and one word of the at least one modifier word as an attribute word;
clustering, by the processing device, words in the corpus based on at least one of the main concept words or the attribute words according to at least one clustering rule,
  the at least one clustering rule including at least one of:
    a first rule preventing two quantitative attribute tokens from being clustered based on a frequency of appearance of the two quantitative attribute tokens in a same listing,
    a second rule preventing clustering of a quantitative attribute token with a qualitative attribute token, or
    a third rule indicating that a first token is to be clustered with a second token when characters of the first token are included in the second token; and
providing, by the processing device and after clustering the words, the main concept words and the attribute words as at least a portion of a semantic model.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
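As a rough illustration of the tokenizing step in claim 1, the sketch below splits a listing on non-alphanumeric characters so that alphanumeric tokens such as "2BR" and mixed-case tokens such as "iPhone" survive as single tokens. The regular expression, the category labels, and the names `tokenize` and `token_kind` are assumptions made for this example and are not taken from the patent.

```python
import re


def tokenize(listing):
    """Split a listing into runs of letters and digits, preserving case."""
    return re.findall(r"[A-Za-z0-9]+", listing)


def token_kind(token):
    """Classify a token as alphanumeric, mixed_case, or other (assumed labels)."""
    has_alpha = any(ch.isalpha() for ch in token)
    has_digit = any(ch.isdigit() for ch in token)
    has_upper = any(ch.isupper() for ch in token)
    has_lower = any(ch.islower() for ch in token)
    if has_alpha and has_digit:
        return "alphanumeric"
    if has_upper and has_lower:
        return "mixed_case"
    return "other"
```

For a listing such as "2BR/1BA iPhone 4S, $500 OBO", this keeps "2BR", "1BA", and "4S" as alphanumeric tokens and "iPhone" as a mixed-case token, rather than lowercasing them or splitting them at letter-digit boundaries.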
10. An apparatus comprising:
at least one storage device storing instructions; and
a processor to execute the instructions to:
  receive a corpus of textual listings,
    textual listings, in the corpus, including one or more advertisements,
    the one or more advertisements including text without a grammatical structure;
  identify main concept words and attribute words in the corpus,
    when identifying the main concept words and the attribute words, the processor is to:
      tag, in each textual listing of the textual listings, at least one word as a head noun word based on at least one of:
        a previously identified main concept word, or
        a head noun identification rule,
      tag, in the textual listing and after tagging the at least one word, remaining nouns as at least one modifier word, and
      assign one word of the at least one head noun word as a main concept word and one word of the at least one modifier word as an attribute word;
  cluster words in the corpus based on at least one of the main concept words or the attribute words according to at least one clustering rule,
    the at least one clustering rule including at least one of:
      a first rule relating to clustering two quantitative attribute tokens based on a frequency of appearance of the two quantitative attribute tokens in a same listing,
      a second rule relating to clustering of a quantitative attribute token with a qualitative attribute token, or
      a third rule relating to clustering a first token and a second token based on characters of the first token being included in the second token; and
  provide, after clustering the words, the main concept words and the attribute words as at least a portion of a semantic model.
Dependent claims: 11, 12, 13, 14, 15, 16, 17
18. A non-transitory computer-readable medium storing instructions, the instructions comprising:
one or more instructions that, when executed by a processor, cause the processor to:
  receive a corpus of textual listings,
    textual listings, of the corpus of textual listings, including at least one of an advertisement or a product listing,
    the at least one of the advertisement or the product listing including text without a grammatical structure;
  identify main concept words and attribute words in the corpus,
    the one or more instructions to identify the main concept words and the attribute words including:
      one or more instructions to tag, in each textual listing of the textual listings, at least one word as a head noun word based on at least one of a previously identified main concept word or a head noun identification rule,
        the one or more instructions to tag the at least one word including one or more instructions to tag the at least one word as the at least one head noun word when the at least one word is a last noun, in a first noun phrase, that has not previously been tagged as a modifier word,
      one or more instructions to tag, in the textual listing and after tagging the at least one word, remaining nouns as at least one modifier word, and
      one or more instructions to assign one word of the at least one head noun word as a main concept word and one word of the at least one modifier word as an attribute word;
  cluster words in the corpus based on at least one of the main concept words or the attribute words according to at least one clustering rule,
    the at least one clustering rule including at least one of:
      a first rule relating to clustering two quantitative attribute tokens based on a frequency of appearance of the two quantitative attribute tokens in a same listing,
      a second rule relating to clustering of a quantitative attribute token with a qualitative attribute token, or
      a third rule relating to clustering a first token and a second token based on characters of the first token being included in the second token; and
  provide, after clustering the words, the main concept words and the attribute words as at least a portion of a semantic model.
Dependent claims: 19, 20
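Claim 18 recites a specific head noun rule: the head noun is the last noun, in the first noun phrase, that has not previously been tagged as a modifier word. A minimal sketch of that rule follows. It assumes the listing has already been part-of-speech tagged as `(word, tag)` pairs with the hypothetical tags "ADJ" and "NOUN", and it approximates a noun phrase as a maximal run of adjective and noun tokens containing at least one noun; neither assumption comes from the claim.

```python
def first_noun_phrase(tagged):
    """Return the first maximal run of ADJ/NOUN tokens that contains a noun."""
    run = []
    for word, pos in tagged:
        if pos in ("ADJ", "NOUN"):
            run.append((word, pos))
            continue
        # A token of any other tag ends the current run.
        if any(p == "NOUN" for _, p in run):
            return run
        run = []
    return run if any(p == "NOUN" for _, p in run) else []


def tag_head_noun(tagged, known_modifiers):
    """Pick the last noun in the first noun phrase not already tagged as a modifier."""
    for word, pos in reversed(first_noun_phrase(tagged)):
        if pos == "NOUN" and word not in known_modifiers:
            return word
    return None
```

For the tagged listing "black/ADJ leather/NOUN sofa/NOUN with/ADP cushions/NOUN", the first noun phrase is "black leather sofa" and the rule selects "sofa". If "sofa" had previously been tagged as a modifier, the rule would fall back to "leather".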
Specification