Systems and methods regarding keyword extraction

US 8,874,568 B2
Filed: 11/02/2011
Issued: 10/28/2014
Est. Priority Date: 11/05/2010
Status: Active Grant

Abstract

One exemplary aspect comprises a computer system comprising: (a) a preprocessing unit that extracts text from a webpage to produce at least a first set of candidate keywords, applies language processing to produce at least a second set of candidate keywords, and combines said first and second sets of candidate keywords into a first candidate pool; (b) a candidate extraction unit that receives data from said preprocessing unit describing at least said first candidate pool and produces a second candidate pool; (c) a feature extraction unit that receives data describing at least said second candidate pool and analyzes said second candidate pool for general features and linguistic features; and (d) a classification unit that receives said data describing at least said second candidate pool and related data from said feature extraction unit, and determines a likelihood of each candidate in said second candidate pool being a primary or secondary keyword.
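
As a rough illustration of steps (a) and (b), the sketch below pulls metadata keywords and visible text from a page, merges them into a first candidate pool, and then derives a second pool. Everything here is an assumption made for illustration: the names `PageReader`, `preprocess`, `extract_candidates`, and `STOPWORDS` are hypothetical, a regular-expression tokenizer stands in for the "language processing", and stopword-filtered n-grams stand in for whatever candidate extraction the patent actually uses (noun phrases would need a tagger or parser).

```python
import re
from html.parser import HTMLParser

# Illustrative stopword list only; not taken from the patent.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


class PageReader(HTMLParser):
    """Collects visible text and <meta name="keywords"> entries."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.meta_keywords = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            self.meta_keywords.extend(
                k.strip().lower() for k in content.split(",") if k.strip()
            )

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def preprocess(html):
    """Step (a): metadata keywords plus tokenized text form the first pool."""
    reader = PageReader()
    reader.feed(html)
    reader.close()
    tokens = re.findall(r"[A-Za-z0-9]+", " ".join(reader.text_parts))
    first_set = set(reader.meta_keywords)        # from page metadata
    second_set = {t.lower() for t in tokens}     # from simple tokenization
    return tokens, first_set | second_set


def extract_candidates(tokens, first_pool, max_n=2):
    """Step (b): drop stopwords, then add n-grams that do not start or end with one."""
    words = [t.lower() for t in tokens]
    second_pool = {c for c in first_pool if c not in STOPWORDS}
    for n in range(2, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            second_pool.add(" ".join(gram))
    return second_pool
```

Calling `preprocess(html)` and passing its two return values to `extract_candidates` yields the second candidate pool that the later feature extraction and classification steps would consume.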

19 Claims

1. A computer system comprising one or more processors that function as:
- (a) a preprocessing unit that extracts text from a webpage to produce at least a first set of candidate keywords, applies language processing to produce at least a second set of candidate keywords, and combines said first and second sets of candidate keywords into a first candidate pool;
  
  (b) a candidate extraction unit that receives data from said preprocessing unit describing at least said first candidate pool and produces a second candidate pool;
  
  (c) a feature extraction unit that receives data describing at least said second candidate pool and analyzes said second candidate pool for general features and linguistic features, wherein said general features include number of times a term appears in the text extracted from the webpage; and
  
  (d) a classification unit that receives said data describing at least said second candidate pool and related data from said feature extraction unit, and determines a likelihood of each candidate in said second candidate pool being a primary or secondary keyword.
- Dependent claims (2–17):
  - 2. A computer system as in claim 1, wherein at least part of said language processing is performed by a tokenizer and a syntax parser.
  - 3. A computer system as in claim 1, wherein at least part of said language processing is performed by a tokenizer, a syntax parser, a part of speech tagger, and a named entity tagger.
  - 4. A computer system as in claim 1, wherein at least part of said language processing is performed by a tokenizer.
  - 5. A computer system as in claim 1, wherein at least part of said language processing is performed by a syntax parser.
  - 6. A computer system as in claim 1, wherein at least part of said language processing is performed by a part of speech tagger.
  - 7. A computer system as in claim 1, wherein at least part of said language processing is performed by a named entity tagger.
  - 8. A computer system as in claim 1, wherein said first set of candidate keywords comprises metadata text.
  - 9. A computer system as in claim 1, wherein said second candidate pool comprises noun phrases and noun sequences.
  - 10. A computer system as in claim 1, wherein said second candidate pool comprises noun phrases, noun sequences, and n-grams.
  - 11. A computer system as in claim 1, wherein said general features comprise one or more of frequency, position in the document, and capitalization.
  - 12. A computer system as in claim 1, wherein said linguistic features relate to one or more of part of speech, phrase structure, and named entity information.
  - 13. A computer system as in claim 1, wherein said general features comprise frequency features, and said frequency features comprise one or more of relative term frequency within said webpage and log of term frequency.
  - 14. A computer system as in claim 1, wherein said determination of likelihood of each candidate being a primary or secondary keyword is based on annotated training data.
  - 15. A computer system as in claim 1, wherein said determination of likelihood of each candidate being a primary or secondary keyword is based on training data created by combining annotation input from multiple annotators, and wherein each annotation includes a distinction between primary and secondary keywords.
  - 16. A computer system as in claim 1, wherein said general features comprise frequency, position in the document, and capitalization, and said linguistic features relate to part of speech, phrase structure, and named entity information.
  - 17. A computer system as in claim 1, wherein said general features comprise frequency features, said frequency features comprise one or more of relative term frequency within said webpage and log of term frequency, and said linguistic features relate to part of speech, phrase structure, and named entity information.
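
The general features named in claims 1, 11, and 13 (term count, relative term frequency, log of term frequency, position in the document, capitalization) can be sketched as below. This is a minimal illustration, not the patented implementation: the function name `general_features`, the regular-expression tokenizer, the use of `log(1 + count)` to avoid `log(0)`, and the exact definitions of the position and capitalization features are all assumptions.

```python
import math
import re


def general_features(candidate, text):
    """Compute simple frequency, position, and capitalization features."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    lowered = [t.lower() for t in tokens]
    cand = candidate.lower().split()
    n = len(cand)
    if n == 0:
        raise ValueError("candidate must contain at least one word")

    # Token offsets where the candidate occurs (case-insensitive match).
    starts = [i for i in range(len(lowered) - n + 1) if lowered[i:i + n] == cand]
    count = len(starts)
    total = max(len(lowered), 1)

    # Occurrences in which every word of the candidate is capitalized.
    capitalized = sum(
        1 for i in starts if all(tok[0].isupper() for tok in tokens[i:i + n])
    )

    return {
        "count": count,                           # times the term appears
        "relative_tf": count / total,             # count relative to page length
        "log_tf": math.log(1 + count),            # smoothed log frequency (assumption)
        "first_position": starts[0] / total if starts else 1.0,
        "capitalized_ratio": capitalized / count if count else 0.0,
    }
```

Linguistic features (part of speech, phrase structure, named entity information) are left out here because they depend on a tagger or parser that the claims do not specify.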

18. A method comprising steps implemented by a computer processing system, said steps comprising:
- (a) extracting text from a webpage to produce at least a first set of candidate keywords, applying language processing to produce at least a second set of candidate keywords, and combining said first and second sets of candidate keywords into a first candidate pool;
  
  (b) receiving data describing at least said first candidate pool and producing a second candidate pool;
  
  (c) receiving data describing at least said second candidate pool and analyzing said second candidate pool for general features and linguistic features, wherein said general features include number of times a term appears in the text extracted from the webpage; and
  
  (d) receiving said data describing at least said second candidate pool and related data from analyzing said second candidate pool, and determining a likelihood of each candidate in said second candidate pool being a primary or secondary keyword.

19. A non-transitory computer readable medium storing software instructions comprising:
- (a) extracting text from a webpage to produce at least a first set of candidate keywords, applying language processing to produce at least a second set of candidate keywords, and combining said first and second sets of candidate keywords into a first candidate pool;
  
  (b) receiving data describing at least said first candidate pool and producing a second candidate pool;
  
  (c) receiving data describing at least said second candidate pool and analyzing said second candidate pool for general features and linguistic features, wherein said general features include number of times a term appears in the text extracted from the webpage; and
  
  (d) receiving said data describing at least said second candidate pool and related data from analyzing said second candidate pool, and determining a likelihood of each candidate in said second candidate pool being a primary or secondary keyword.
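
Step (d) of the claims above determines, for each candidate, a likelihood of being a primary or secondary keyword. One way to picture that is a three-way softmax over "primary", "secondary", and "none". The sketch below assumes this model only for illustration; the claims do not name a classifier. The weights are hypothetical hand-set values standing in for parameters that would be learned from annotated training data, and the feature names `log_tf`, `in_title`, and `is_noun_phrase` are likewise invented.

```python
import math

CLASSES = ("primary", "secondary", "none")

# Hypothetical weights; a real system would learn these from annotated data.
WEIGHTS = {
    "primary": {"bias": -1.0, "log_tf": 1.5, "in_title": 2.0, "is_noun_phrase": 1.0},
    "secondary": {"bias": -0.5, "log_tf": 1.0, "in_title": 0.5, "is_noun_phrase": 1.0},
    "none": {"bias": 0.5},
}


def keyword_likelihoods(features):
    """Map a candidate's feature dict to a probability for each class."""
    scores = {}
    for label in CLASSES:
        w = WEIGHTS[label]
        scores[label] = w.get("bias", 0.0) + sum(
            w.get(name, 0.0) * value for name, value in features.items()
        )
    # Softmax, shifted by the maximum score for numerical stability.
    top = max(scores.values())
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    z = sum(exps.values())
    return {label: e / z for label, e in exps.items()}
```

With these illustrative weights, a frequent noun phrase that appears in the title scores highest as "primary", while a candidate with all-zero features falls to "none".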

Current Assignee: Rakuten Group, Inc.
Original Assignee: Rakuten, Inc. (Rakuten Group, Inc.)
Inventors: Stankiewicz, Zofia; Sekine, Satoshi
Primary Examiner(s): Ahn, Sangwoo
Application Number: US13/287,294
Publication Number: US 20120117092A1
Time in Patent Office: 1,091 Days

Field of Search
US Class Current: 707/728
CPC Class Codes:
G06F 16/3329   Natural language query form...
G06F 16/3334   Selection or weighting of t...
G06F 16/35   Clustering; Classification
G06F 16/353   into predefined classes
