Generation of a semantic model from textual listings

US 9,244,908 B2
Filed: 08/24/2012
Issued: 01/26/2016
Est. Priority Date: 03/27/2012
Status: Active Grant

Abstract

A corpus of textual listings is received and main concept words and attribute words therein are identified via an iterative process of parsing listings and expanding a semantic model. During the parsing phase, the corpus of textual listings is parsed to tag one or more head noun words and/or one or more identifier words in each listing based on previously identified main concept words or using a head noun identification rule. Once substantially each listing in the corpus has been parsed in this manner, the expansion phase assigns head noun words as main concept words and modifier words as attribute words, where possible. During the next iteration, the newly identified main concept words and/or attribute words are used to further parse the listings. These iterations are repeated until a termination condition is reached. Remaining words in the corpus are clustered based on the main concept words and attribute words.
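
The parse-and-expand loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each listing is assumed to be already reduced to its list of nouns, the fallback head noun rule is simplified to "last noun not already known as an attribute", the expansion test borrows the head/modifier frequency ratio recited in claims 6 and 7, and the function names, default thresholds, and the "no change" termination condition are all hypothetical.

```python
from collections import Counter


def pick_head(nouns, concepts, attributes):
    """Parsing step: prefer a known main concept word, else a fallback rule."""
    for word in nouns:
        if word in concepts:
            return word
    # Assumed fallback rule: the last noun not already known as an attribute.
    candidates = [w for w in nouns if w not in attributes]
    return candidates[-1] if candidates else None


def build_semantic_model(listings, seed_concepts=(), concept_threshold=2.0,
                         attribute_threshold=0.5, max_iterations=10):
    """Alternate parsing and expansion until nothing changes (assumed stop rule)."""
    concepts, attributes = set(seed_concepts), set()
    for _ in range(max_iterations):
        head_counts, modifier_counts = Counter(), Counter()
        # Parsing phase: tag one head noun per listing, other nouns as modifiers.
        for nouns in listings:
            head = pick_head(nouns, concepts, attributes)
            if head is None:
                continue  # listing could not be parsed in this iteration
            head_counts[head] += 1
            modifier_counts.update(w for w in nouns if w != head)
        # Expansion phase: promote words whose head/modifier ratio is decisive.
        new_concepts, new_attributes = set(concepts), set(attributes)
        for word in set(head_counts) | set(modifier_counts):
            if word in concepts or word in attributes:
                continue  # already assigned in an earlier iteration
            heads, mods = head_counts[word], modifier_counts[word]
            ratio = heads / mods if mods else float("inf")
            if ratio > concept_threshold:
                new_concepts.add(word)
            elif ratio < attribute_threshold:
                new_attributes.add(word)
        if new_concepts == concepts and new_attributes == attributes:
            break  # termination condition: the model stopped growing
        concepts, attributes = new_concepts, new_attributes
    return concepts, attributes
```

Calling `build_semantic_model(listings)` returns a pair of sets (main concept words, attribute words). Words whose ratio falls between the two thresholds stay unassigned, which matches the abstract's "where possible" qualifier.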

20 Claims

1. A method comprising:
- receiving, by a processing device, a corpus of textual listings,
  - textual listings, in the corpus, including text without a grammatical structure;
- tokenizing, by the processing device, each textual listing of the textual listings,
  - tokenizing each textual listing including tokenizing at least one of an alphanumeric token or a token that comprises uppercase and lowercase characters;
- identifying, by the processing device, main concept words and attribute words in the corpus after tokenizing each textual listing of the textual listings,
  - identifying the main concept words and the attribute words including:
    - tagging, in each textual listing of the textual listings, at least one word as a head noun word based on at least one of:
      - a previously identified main concept word, or
      - a head noun identification rule,
    - tagging, in the textual listing and after tagging the at least one word, remaining nouns as at least one modifier word, and
    - assigning one word of the at least one head noun word as a main concept word and one word of the at least one modifier word as an attribute word;
- clustering, by the processing device, words in the corpus based on at least one of the main concept words or the attribute words according to at least one clustering rule,
  - the at least one clustering rule including at least one of:
    - a first rule preventing two quantitative attribute tokens from being clustered based on a frequency of appearance of the two quantitative attribute tokens in a same listing,
    - a second rule preventing clustering of a quantitative attribute token with a qualitative attribute token, or
    - a third rule indicating that a first token is to be clustered with a second token when characters of the first token are included in the second token; and
- providing, by the processing device and after clustering the words, the main concept words and the attribute words as at least a portion of a semantic model.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9)
  - 2. The method of claim 1, where the at least one clustering rule includes at least two of the first rule, the second rule, or the third rule.
  - 3. The method of claim 1, where identifying the main concept words and the attribute words in the corpus further comprises:
    - determining that each textual listing, of the textual listings, is parsable.
  - 4. The method of claim 1, where tagging the at least one word as the head noun word based on the at least one of the previously identified main concept word or the head noun identification rule comprises:
    - tagging the at least one word as the at least one head noun word when the at least one word matches the previously identified main concept word.
  - 5. The method of claim 1, where tagging the at least one word as the head noun word based on the at least one of the previously identified main concept word or the head noun identification rule comprises:
    - tagging the at least one word as the at least one head noun word when the at least one word is a last noun, in a first noun phrase, that has not previously been tagged as a modifier word.
  - 6. The method of claim 1, where assigning the one word, of the at least one head noun word, as the main concept word includes:
    - assigning the one word, of the at least one head noun word, as the main concept word when a ratio of a frequency of the one word, of the at least one head noun word, being tagged as a head noun word to a frequency of the one word, of the at least one head noun word, being tagged as a modifier word is greater than a main concept threshold.
  - 7. The method of claim 1, where assigning the one word, of the at least one modifier word, as the attribute word includes:
    - assigning the one word, of the at least one modifier word, as the attribute word when a ratio of a frequency of the one word, of the at least one modifier word, being tagged as a head noun word to a frequency of the one word, of the at least one modifier word, being tagged as a modifier word is less than an attribute threshold.
  - 8. The method of claim 1, where tokenizing each textual listing of the textual listings further includes:
    - tokenizing a particular textual listing, of the textual listings, to obtain a single token that includes multiple words or a dashed token that includes the multiple words.
  - 9. The method of claim 1, further comprising:
    - performing information extraction on at least another corpus of textual listings based on the semantic model.
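
The tokenization recited in claims 1 and 8 can be illustrated with a short sketch. It is only one plausible reading: the regular expression, the `phrases` parameter (a hypothetical list of multi-word expressions to keep together), and the function name are assumptions made for illustration, and the claims do not say how multi-word tokens are detected.

```python
import re

# Keeps alphanumeric tokens ("42in"), mixed-case tokens ("iPhone"), and
# dashed tokens ("flat-screen") intact; punctuation is dropped.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(listing, phrases=()):
    """Split a listing into tokens, merging known multi-word phrases."""
    tokens = TOKEN_RE.findall(listing)
    # Longest phrases first so the most specific match wins.
    phrase_tuples = sorted(
        (tuple(p.lower().split()) for p in phrases), key=len, reverse=True
    )
    merged, i = [], 0
    while i < len(tokens):
        for phrase in phrase_tuples:
            n = len(phrase)
            window = tuple(t.lower() for t in tokens[i:i + n])
            if n > 1 and window == phrase:
                merged.append(" ".join(tokens[i:i + n]))  # single multi-word token
                i += n
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

For example, `tokenize("Samsung 42in flat-screen TV, iPhone dock incl.", phrases=["iPhone dock"])` keeps `42in` and `flat-screen` as single tokens and merges `iPhone dock` into one token.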

10. An apparatus comprising:
- at least one storage device storing instructions; and
- a processor to execute the instructions to:
  - receive a corpus of textual listings,
    - textual listings, in the corpus, including one or more advertisements,
    - the one or more advertisements including text without a grammatical structure;
  - identify main concept words and attribute words in the corpus,
    - when identifying the main concept words and the attribute words, the processor is to:
      - tag, in each textual listing of the textual listings, at least one word as a head noun word based on at least one of:
        - a previously identified main concept word, or
        - a head noun identification rule,
      - tag, in the textual listing and after tagging the at least one word, remaining nouns as at least one modifier word, and
      - assign one word of the at least one head noun word as a main concept word and one word of the at least one modifier word as an attribute word;
  - cluster words in the corpus based on at least one of the main concept words or the attribute words according to at least one clustering rule,
    - the at least one clustering rule including at least one of:
      - a first rule relating to clustering two quantitative attribute tokens based on a frequency of appearance of the two quantitative attribute tokens in a same listing,
      - a second rule relating to clustering of a quantitative attribute token with a qualitative attribute token, or
      - a third rule relating to clustering a first token and a second token based on characters of the first token being included in the second token; and
  - provide, after clustering the words, the main concept words and the attribute words as at least a portion of a semantic model.
- Dependent claims (11, 12, 13, 14, 15, 16, 17)
  - 11. The apparatus of claim 10, where the at least one clustering rule includes at least two of the first rule, the second rule, or the third rule.
  - 12. The apparatus of claim 10, where, when identifying the main concept words and the attribute words in the corpus, the processor is further to:
    - determine that each textual listing, of the textual listings, is parsable.
  - 13. The apparatus of claim 10, where, when tagging the at least one word as the head noun word based on the at least one of the previously identified main concept word or the head noun identification rule, the processor is to:
    - tag the at least one word as the at least one head noun word when the at least one word matches the previously identified main concept word.
  - 14. The apparatus of claim 10, where, when tagging the at least one word as the head noun word based on the at least one of the previously identified main concept word or the head noun identification rule, the processor is to:
    - tag the at least one word as the at least one head noun word when the at least one word is a last noun, in a first noun phrase, that has not previously been tagged as a modifier word.
  - 15. The apparatus of claim 10, where, when assigning the one word of the at least one head noun word as the main concept word, the processor is to:
    - assign the one word, of the at least one head noun word, as the main concept word when a ratio of a frequency of the one word, of the at least one head noun word, being tagged as a head noun word to a frequency of the one word, of the at least one head noun word, being tagged as a modifier word is greater than a main concept threshold.
  - 16. The apparatus of claim 10, where, when assigning the one word of the at least one modifier word as the attribute word, the processor is to:
    - assign the one word, of the at least one modifier word, as an attribute word when a ratio of a frequency of the one word, of the at least one modifier word, being tagged as a head noun word to a frequency of the one word, of the at least one modifier word, being tagged as a modifier word is less than an attribute threshold.
  - 17. The apparatus of claim 10, where the processor is further to:
    - tokenize words in the corpus prior to identifying the main concept words and the attribute words; and
    - perform information extraction on at least another corpus of textual listings based on the semantic model.
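
The three clustering rules can be read as a pairwise check on candidate tokens. The sketch below follows the more specific wording of claim 1 (two rules that prevent clustering, one that calls for it) and rests on several assumptions not found in the claims: a token counts as quantitative if it contains a digit, "characters of the first token are included in the second token" is interpreted as a substring test, and `max_cooccurrence` is a hypothetical cut-off on how often two quantitative tokens may share a listing.

```python
def is_quantitative(token):
    """Assumed heuristic: a token with a digit is a quantitative attribute."""
    return any(ch.isdigit() for ch in token)


def cooccurrence_count(a, b, listings):
    """Number of listings (token lists) that contain both tokens."""
    return sum(1 for tokens in listings if a in tokens and b in tokens)


def clustering_decision(a, b, listings, max_cooccurrence=1):
    """Return 'block', 'merge', or 'undecided' for a candidate token pair."""
    # Second rule: never mix quantitative and qualitative tokens.
    if is_quantitative(a) != is_quantitative(b):
        return "block"
    # First rule: quantitative tokens seen together in the same listing
    # probably describe different attributes, so keep them apart.
    if is_quantitative(a) and cooccurrence_count(a, b, listings) >= max_cooccurrence:
        return "block"
    # Third rule: one token's characters are contained in the other.
    if a != b and (a in b or b in a):
        return "merge"
    return "undecided"
```

With listings such as `[["2br", "1ba", "apt"], ["3br", "2ba", "condo"]]`, the pair `2br`/`1ba` is blocked by the first rule because both appear in one listing, while `bed`/`bedroom` is merged by the third rule. Pairs that no rule covers are left to whatever similarity measure the clustering step uses.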

18. A non-transitory computer-readable medium storing instructions, the instructions comprising:
- one or more instructions that, when executed by a processor, cause the processor to:
  - receive a corpus of textual listings,
    - textual listings, of the corpus of textual listings, including at least one of an advertisement or a product listing,
    - the at least one of the advertisement or the product listing including text without a grammatical structure;
  - identify main concept words and attribute words in the corpus,
    - the one or more instructions to identify the main concept words and the attribute words including:
      - one or more instructions to tag, in each textual listing of the textual listings, at least one word as a head noun word based on at least one of a previously identified main concept word or a head noun identification rule,
        - the one or more instructions to tag the at least one word including one or more instructions to tag the at least one word as the at least one head noun word when the at least one word is a last noun, in a first noun phrase, that has not previously been tagged as a modifier word,
      - one or more instructions to tag, in the textual listing and after tagging the at least one word, remaining nouns as at least one modifier word, and
      - one or more instructions to assign one word of the at least one head noun word as a main concept word and one word of the at least one modifier word as an attribute word;
  - cluster words in the corpus based on at least one of the main concept words or the attribute words according to at least one clustering rule,
    - the at least one clustering rule including at least one of:
      - a first rule relating to clustering two quantitative attribute tokens based on a frequency of appearance of the two quantitative attribute tokens in a same listing,
      - a second rule relating to clustering of a quantitative attribute token with a qualitative attribute token, or
      - a third rule relating to clustering a first token and a second token based on characters of the first token being included in the second token; and
  - provide, after clustering the words, the main concept words and the attribute words as at least a portion of a semantic model.
- Dependent claims (19, 20)
  - 19. The non-transitory computer-readable medium of claim 18, where the one or more instructions to assign the one word of the at least one head noun word as the main concept word comprise:
    - one or more instructions to assign the one word, of the at least one head noun word, as the main concept word when a ratio of a frequency of the one word, of the at least one head noun word, being tagged as a head noun word to a frequency of the one word, of the at least one head noun word, being tagged as a modifier word is greater than a main concept threshold.
  - 20. The non-transitory computer-readable medium of claim 18, where the one or more instructions to assign the one word of the at least one modifier word as the attribute word comprise:
    - one or more instructions to assign the one word, of the at least one modifier word, as an attribute word when a ratio of a frequency of the one word, of the at least one modifier word, being tagged as a head noun word to a frequency of the one word, of the at least one modifier word, being tagged as a modifier word is less than an attribute threshold.
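
The head noun identification rule in claim 18 ("a last noun, in a first noun phrase, that has not previously been tagged as a modifier word") can be sketched as below. The sketch assumes each listing arrives as `(word, tag)` pairs from some part-of-speech tagger using coarse tags such as `NOUN`, `ADJ`, and `NUM`, and it approximates the "first noun phrase" as the first contiguous run of those tags that contains a noun. Both the tag set and that approximation are assumptions, as are the function and parameter names.

```python
NP_TAGS = {"ADJ", "NUM", "NOUN"}  # assumed tags allowed inside a noun phrase


def first_noun_phrase(tagged):
    """Return the first contiguous run of NP_TAGS tokens containing a noun."""
    phrase = []
    for word, tag in tagged:
        if tag in NP_TAGS:
            phrase.append((word, tag))
        elif any(t == "NOUN" for _, t in phrase):
            break  # the first noun phrase has ended
        else:
            phrase = []  # run held no noun; start over
    return phrase if any(t == "NOUN" for _, t in phrase) else []


def tag_listing(tagged, main_concepts=frozenset(), known_modifiers=frozenset()):
    """Return (head_noun, modifiers) for one part-of-speech-tagged listing."""
    nouns = [w for w, t in tagged if t == "NOUN"]
    # Basis 1: a noun that matches a previously identified main concept word.
    head = next((w for w in nouns if w in main_concepts), None)
    if head is None:
        # Basis 2: last noun of the first noun phrase not already a modifier.
        candidates = [
            w for w, t in first_noun_phrase(tagged)
            if t == "NOUN" and w not in known_modifiers
        ]
        head = candidates[-1] if candidates else None
    # Remaining nouns become modifier words once a head noun is tagged.
    modifiers = [w for w in nouns if w != head] if head is not None else []
    return head, modifiers
```

For the tagged listing "brown leather sofa with ottoman", the first noun phrase is "brown leather sofa", so `sofa` is tagged as the head noun and `leather` and `ottoman` as modifiers. If `sofa` had previously been tagged as a modifier, the rule would fall back to `leather`.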

Current Assignee: Accenture Global Services Limited (Accenture PLC)
Original Assignee: Accenture Global Services Limited (Accenture PLC)
Inventors: Kim, Doo Soon; Yeh, Peter Z.; Verma, Kunal
Primary Examiner(s): ROBERTS, SHAUN A
Application Number: US13/593,778
Publication Number: US 20130262086A1
Time in Patent Office: 1,250 Days
Field of Search: 704/9, 705/14.54
US Class Current: 1/1
CPC Class Codes:
- G06F 40/205   Parsing
- G06F 40/284   Lexical analysis, e.g. toke...
- G06F 40/30   Semantic analysis
- G06Q 30/0256   User search
