Generation of a semantic model from textual listings

US 9,594,747 B2
Filed: 01/21/2016
Issued: 03/14/2017
Est. Priority Date: 03/27/2012
Status: Active Grant

Abstract

A corpus of textual listings is received and main concept words and attribute words therein are identified via an iterative process of parsing listings and expanding a semantic model. During the parsing phase, the corpus of textual listings is parsed to tag one or more head noun words and/or one or more modifier words in each listing based on previously identified main concept words or using a head noun identification rule. Once substantially each listing in the corpus has been parsed in this manner, the expansion phase assigns head noun words as main concept words and modifier words as attribute words, where possible. During the next iteration, the newly identified main concept words and/or attribute words are used to further parse the listings. These iterations are repeated until a termination condition is reached. Remaining words in the corpus are clustered based on the main concept words and attribute words.
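
The parse-and-expand loop described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `parse_listing` and `build_model`, the `max_iters` cap, and in particular the fallback used when no known main concept word is present (treat the single word that is not yet a known attribute as the head). That fallback is a stand-in and not the head noun identification rule of the patent. The termination condition assumed here is "no new words were identified".

```python
def parse_listing(words, concepts, attributes):
    """Parsing phase for one listing: return (head, modifiers), or None if
    the listing cannot be parsed with the current model."""
    known = [w for w in words if w in concepts]
    if known:
        head = known[-1]  # use a previously identified main concept word
    else:
        # Stand-in fallback (assumption): the only word that is not already
        # a known attribute is taken as the head.
        unknown = [w for w in words if w not in attributes]
        if len(unknown) != 1:
            return None
        head = unknown[0]
    modifiers = [w for w in words if w != head]
    return head, modifiers


def build_model(listings, concept_seeds, max_iters=10):
    """Alternate parsing and expansion until nothing new is identified."""
    concepts, attributes = set(concept_seeds), set()
    for _ in range(max_iters):
        # Parse every listing against the model as it stood at the start
        # of this iteration.
        parsed = [parse_listing(ws, concepts, attributes) for ws in listings]
        # Expansion phase: heads become main concepts, modifiers attributes.
        new_concepts = {p[0] for p in parsed if p} - concepts
        new_attributes = {m for p in parsed if p for m in p[1]}
        new_attributes -= attributes | concepts
        if not new_concepts and not new_attributes:
            break  # assumed termination condition: no new words
        concepts |= new_concepts
        attributes |= new_attributes
    return concepts, attributes


example_listings = [
    ["spacious", "apartment"],
    ["spacious", "loft"],
    ["renovated", "loft"],
    ["renovated", "spacious", "condo"],
]
model = build_model(example_listings, {"apartment"})
```

In this toy corpus each iteration unlocks one more listing: the seed identifies an attribute, the attribute exposes a new head, and so on until nothing changes.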

20 Claims

1. A system comprising:
- a processing device to:

  identify main concept words and attribute words in textual listings;

  cluster words, in the textual listings, based on at least one of the main concept words or the attribute words according to at least one clustering rule, the at least one clustering rule including at least one of:

  a first rule preventing clustering of words based on a frequency of appearance of words in a same textual listing,

  a second rule preventing clustering of a quantitative attribute word with a qualitative attribute word, or

  a third rule indicating clustering of two words when characters of a first word, of the two words, are included in a second word of the two words; and

  provide, after clustering the words, the main concept words and the attribute words as at least a portion of a semantic model, the semantic model being used for subsequent clustering.
- 2. The system of claim 1, where the processing device is further to:
  - receive user input that includes main concept seed words and attribute seed words;

    identify the main concept words based on the main concept seed words; and

    identify the attribute words based on the attribute seed words.
- 3. The system of claim 1, where the processing device is further to:
  - identify multiword tokens based on a frequency of multiple words appearing as a single token; and

    identify the main concept words and the attribute words using the multiword tokens.
- 4. The system of claim 1, where the processing device is further to:
  - prevent clustering of the words, according to the first rule, when the frequency of appearance of the words in the same textual listing satisfies a threshold.
- 5. The system of claim 1, where the processing device is further to:
  - cluster the words, according to the third rule, when a first order of the characters of the first word matches a second order of the characters included in the second word.
- 6. The system of claim 1, where the processing device is further to:
  - determine that the first rule, the second rule, and the third rule are not applicable to the words in the textual listings; and

    determine whether to cluster the words using context-based similarity, based on determining that the first rule, the second rule, and the third rule are not applicable to the words in the textual listings.
- 7. The system of claim 1, where the processing device is further to:
  - determine whether a textual listing of the textual listings is parsable based on at least one of:

    a quantity of consecutive nouns included in the textual listing, or

    a threshold percentage of nouns included in the textual listing are not identified as the main concept words or the attribute words.
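
As an illustration of the three clustering rules in claims 1, 4, 5 and 6, the following Python sketch applies them to a pair of words. It is a reading of the claim language under stated assumptions, not the patented implementation: the function names, the `kinds` mapping from word to attribute type, the `cooccur_threshold` parameter, and the interpretation of "satisfies a threshold" as "greater than or equal to" are all hypothetical. The third rule is modeled as an in-order character (subsequence) match, following claim 5.

```python
def is_subsequence(short, long_):
    """True if the characters of `short` appear in `long_` in the same order."""
    remaining = iter(long_)
    return all(ch in remaining for ch in short)


def cooccurrence_count(a, b, listings):
    """Number of listings in which both words appear."""
    return sum(1 for words in listings if a in words and b in words)


def should_cluster(a, b, listings, kinds, cooccur_threshold=1):
    """Return False (rule forbids), True (rule indicates), or None when no
    rule applies and a context-based similarity check would be needed."""
    # First rule: words that appear together in the same listing often
    # enough are kept apart (threshold semantics are an assumption).
    if cooccurrence_count(a, b, listings) >= cooccur_threshold:
        return False
    # Second rule: never mix a quantitative with a qualitative attribute.
    kind_a, kind_b = kinds.get(a), kinds.get(b)
    if kind_a and kind_b and kind_a != kind_b:
        return False
    # Third rule: cluster when the shorter word's characters occur, in
    # order, inside the longer word (for example an abbreviation).
    short, long_ = sorted((a, b), key=len)
    if is_subsequence(short, long_):
        return True
    return None


example_listings = [
    ["2", "bdrm", "apartment"],
    ["2", "bedroom", "apartment"],
    ["bedroom", "bath", "condo"],
]
example_kinds = {
    "bdrm": "quantitative",
    "bedroom": "quantitative",
    "bath": "quantitative",
    "sunny": "qualitative",
}
```

Returning `None` rather than `False` when no rule fires keeps the "rules not applicable" case of claim 6 distinct from an explicit prohibition, so a caller can fall back to context-based similarity.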

8. A non-transitory computer-readable medium storing instructions, the instructions comprising:
- one or more instructions that, when executed by one or more processors, cause the one or more processors to:

  identify main concept tokens and attribute tokens in text corpora;

  cluster tokens, in the text corpora, based on at least one of the main concept tokens or the attribute tokens according to at least one clustering rule, the at least one clustering rule including at least one of:

  a first rule associated with a frequency of appearance of tokens in a same text corpora,

  a second rule associated with a type of an attribute token, or

  a third rule associated with characters of a first token and a second token; and

  provide, after clustering the tokens, the main concept tokens and the attribute tokens as at least a portion of a semantic model, the semantic model being used for subsequent clustering.
- 9. The non-transitory computer-readable medium of claim 8, where the instructions further comprise:
  - one or more instructions, that when executed by the one or more processors, cause the one or more processors to:

    determine whether one or more words preceding a particular attribute token, of the attribute tokens, are numbers;

    determine a percentage of the one or more words that are numbers; and

    determine a type of the particular attribute token based on determining whether the percentage satisfies a threshold percentage, the type of the particular attribute token including:

    a quantitative attribute token, or

    a qualitative attribute token.
- 10. The non-transitory computer-readable medium of claim 9, where:
  - the quantitative attribute token includes numbers and characters.
- 11. The non-transitory computer-readable medium of claim 9, where:
  - the qualitative attribute token includes characters.
- 12. The non-transitory computer-readable medium of claim 8, where the instructions further comprise:
  - one or more instructions that, when executed by the one or more processors, cause the one or more processors to:

    select one or more tokens as attribute seed tokens based on a likelihood that the one or more tokens are to be treated as one or more attribute tokens, the attribute seed tokens being utilized to identify the attribute tokens in the text corpora.
- 13. The non-transitory computer-readable medium of claim 8, where the instructions further comprise:
  - one or more instructions that, when executed by the one or more processors, cause the one or more processors to:

    cluster the tokens based on other tokens that precede the tokens in the text corpora or that follow the tokens in the text corpora.
- 14. The non-transitory computer-readable medium of claim 8, where the instructions further comprise:
  - one or more instructions that, when executed by the one or more processors, cause the one or more processors to one of:

    identify a particular token, of the tokens, as a head noun, when the particular token is included in a last noun position of a noun phrase in the text corpora; or

    identify the particular token as the head noun when the particular token is included in a noun phrase that excludes a prepositional phrase.
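
The attribute typing step of claim 9 can be sketched in Python as follows. This is a minimal illustration under assumptions: the function name `attribute_type`, the default `threshold` of 0.5, the "greater than or equal to" reading of "satisfies a threshold percentage", the regular expression used to decide what counts as a number, and the choice to treat a token with no preceding word as qualitative are all hypothetical and not taken from the patent.

```python
import re

NUMBER = re.compile(r"\d+(\.\d+)?")  # assumed notion of "a number"


def attribute_type(token, corpora, threshold=0.5):
    """Classify `token` as quantitative or qualitative from the share of
    words immediately preceding it that are numbers."""
    preceding = [
        words[i - 1]
        for words in corpora
        for i, word in enumerate(words)
        if word == token and i > 0
    ]
    if not preceding:
        return "qualitative"  # assumption: no evidence means qualitative
    numeric = sum(1 for w in preceding if NUMBER.fullmatch(w))
    share = numeric / len(preceding)
    return "quantitative" if share >= threshold else "qualitative"


example_corpora = [
    ["2", "bedroom", "apartment"],
    ["3", "bedroom", "condo"],
    ["large", "bedroom"],
    ["sunny", "apartment"],
]
```

Here "bedroom" is preceded by a number in two of its three occurrences, while "apartment" is never preceded by one, so the two tokens fall on opposite sides of the assumed threshold.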

15. A method, comprising:
- identifying, by a device, main concepts and attributes in listing corpora;

  clustering, by the device, words, in the listing corpora, based on at least one of the main concepts or the attributes according to one or more rules, the one or more rules including one or more of:

  a first rule preventing clustering of words based on a frequency of appearance of words in a same listing corpora,

  a second rule preventing clustering of a quantitative attribute word with a qualitative attribute word, or

  a third rule indicating clustering of two words when characters of a first word, of the two words, are included in a second word of the two words; and

  providing, by the device, after clustering the words, the main concept words and the attribute words as at least a portion of a semantic model, the semantic model being used for subsequent clustering.
- 16. The method of claim 15, further comprising:
  - identifying a noun phrase, included in the listing corpora, that includes one or more nouns;

    identifying a last noun, of the one or more nouns, included in the noun phrase; and

    designating the last noun as a head noun based on identifying the last noun included in the noun phrase.
- 17. The method of claim 15, further comprising:
  - identifying a noun phrase, in the listing corpora, that includes one or more nouns and one or more adjectives, the noun phrase excluding prepositions; and

    designating a noun, of the one or more nouns, as a head noun based on identifying the noun phrase.
- 18. The method of claim 15, further comprising:
  - clustering one or more first words, of the words, as main concepts;

    identifying one or more second words, of the words, that have not been clustered as main concepts; and

    clustering the one or more second words as attributes, based on identifying the one or more second words.
- 19. The method of claim 15, further comprising:
  - clustering according to the first rule prior to clustering according to the second rule or the third rule;

    clustering according to the second rule after clustering according to the first rule and prior to clustering according to the third rule; and

    clustering according to the third rule after clustering according to the first rule and the second rule.
- 20. The method of claim 15, where the device is a semantic model generation device.
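
The head noun designation of claims 16 and 17 can be illustrated with a short Python sketch. It assumes the listing has already been part-of-speech tagged by some other component, and it uses a simplified, hypothetical tag set (`"NOUN"`, `"ADJ"`, `"PREP"`). The function name `head_noun` and the choice to cut the noun phrase at the first preposition are assumptions for illustration, not the patent's procedure.

```python
def head_noun(tagged_words):
    """Return the last noun of the leading noun phrase, or None.

    `tagged_words` is a list of (word, tag) pairs. The noun phrase is taken
    to be everything before the first preposition, so any prepositional
    phrase is excluded.
    """
    phrase = []
    for word, tag in tagged_words:
        if tag == "PREP":
            break  # stop before the prepositional phrase
        phrase.append((word, tag))
    nouns = [word for word, tag in phrase if tag == "NOUN"]
    return nouns[-1] if nouns else None


example = [
    ("black", "ADJ"),
    ("leather", "NOUN"),
    ("office", "NOUN"),
    ("chair", "NOUN"),
    ("with", "PREP"),
    ("wheels", "NOUN"),
]
```

For the example listing, the nouns after "with" are ignored and the last noun of the remaining phrase is selected as the head.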

Current Assignee: Accenture Global Services Limited (Accenture PLC)
Original Assignee: Accenture Global Services Limited (Accenture PLC)
Inventors: Kim, Doo Soon; Yeh, Peter Z.; Verma, Kunal
Primary Examiner(s): ROBERTS, SHAUN A
Application Number: US15/003,344
Publication Number: US 20160140109A1
Time in Patent Office: 418 Days
Field of Search: 704/9, 705/14.54, 707/2, 707/100, 707/737
US Class Current: 1/1
CPC Class Codes:
G06F 40/205   Parsing
G06F 40/284   Lexical analysis, e.g. toke...
G06F 40/30   Semantic analysis
G06Q 30/0256   User search
