Document key phrase extraction method

US 8,935,260 B2
Filed: 05/12/2009
Issued: 01/13/2015
Est. Priority Date: 05/12/2009
Status: Expired due to Fees

First Claim

1. A computer-implemented method of extracting key phrases from a document comprising:

accessing a repository comprising hyperlinked subjects, the repository comprising first and second data structures representing the relationship between said hyperlinked subjects using different representation criteria;

pruning the first data structure by removing hyperlinks between subjects based on a further relationship between said subjects in the second data structure;

matching phrases in said document to said subjects in the pruned first data structure;

further pruning the pruned first data structure by removing unmatched subjects that are not hyperlinked to matched subjects;

determining a ranking for each matched subject; and

selecting key phrases using the determined subject rankings, wherein the first data structure is a directional graph comprising the subjects as nodes and the hyperlinks between subjects as edges between nodes;

the second data structure is a directional graph comprising organized subject categories; and

the further relationship comprises the shortest distance between respective categories to which respective subjects belong in the second data structure, the hyperlink between said subjects being removed if the shortest distance exceeds a threshold value.
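
The first pruning step of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The names `build_adjacency`, `shortest_distance` and `prune_links` are hypothetical, and the sketch makes several assumptions that the claim does not state: category distance is measured by breadth-first search ignoring edge direction, a subject may belong to several categories (the closest pair is used), and subjects whose categories are not connected at all are treated as exceeding the threshold.

```python
from collections import deque
from itertools import product


def build_adjacency(category_edges):
    """Adjacency map of the category graph (direction ignored: an assumption)."""
    adjacency = {}
    for parent, child in category_edges:
        adjacency.setdefault(parent, set()).add(child)
        adjacency.setdefault(child, set()).add(parent)
    return adjacency


def shortest_distance(adjacency, start, goal):
    """Breadth-first shortest path length, or None if goal is unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def prune_links(links, subject_categories, category_edges, threshold):
    """Keep a hyperlink only if the subjects' categories are close enough."""
    adjacency = build_adjacency(category_edges)
    kept = set()
    for src, dst in links:
        best = None
        pairs = product(subject_categories.get(src, ()),
                        subject_categories.get(dst, ()))
        for cat_a, cat_b in pairs:
            dist = shortest_distance(adjacency, cat_a, cat_b)
            if dist is not None and (best is None or dist < best):
                best = dist
        # The hyperlink is removed when the shortest distance exceeds the threshold.
        if best is not None and best <= threshold:
            kept.add((src, dst))
    return kept


# Illustrative data only; not taken from the patent.
category_edges = [("Science", "Physics"), ("Science", "Biology"), ("Arts", "Music")]
subject_categories = {
    "Quantum mechanics": ["Physics"],
    "Cell": ["Biology"],
    "Jazz": ["Music"],
}
links = {("Quantum mechanics", "Cell"), ("Quantum mechanics", "Jazz")}
pruned = prune_links(links, subject_categories, category_edges, threshold=2)
```

With a threshold of 2 the link between "Quantum mechanics" and "Cell" survives (their categories are two steps apart through "Science"), while the link to "Jazz" is removed because no path joins "Physics" and "Music" in this toy category graph.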

2 Assignments

0 Petitions

Abstract

A computer-implemented method of extracting key phrases from a document is disclosed comprising the steps of accessing a repository comprising linked subjects, the repository comprising first and second data structures representing the relationship between said subjects using different representation criteria; pruning the first data structure by removing links between subjects based on a further relationship between said subjects in the second data structure; matching phrases in said document to subjects in the pruned first data structure; further pruning the pruned first data structure by removing unmatched subjects that are not linked to matched subjects; determining a ranking for each matched subject; and selecting key phrases using the determined subject rankings. A computer program for implementing the steps of this method when executed on a computer is also disclosed.

11 Citations

13 Claims

1. A computer-implemented method of extracting key phrases from a document comprising:
- accessing a repository comprising hyperlinked subjects, the repository comprising first and second data structures representing the relationship between said hyperlinked subjects using different representation criteria;
  
  pruning the first data structure by removing hyperlinks between subjects based on a further relationship between said subjects in the second data structure;
  
  matching phrases in said document to said subjects in the pruned first data structure;
  
  further pruning the pruned first data structure by removing unmatched subjects that are not hyperlinked to matched subjects;
  
  determining a ranking for each matched subject; and
  
  selecting key phrases using the determined subject rankings, wherein the first data structure is a directional graph comprising the subjects as nodes and the hyperlinks between subjects as edges between nodes;
  
  the second data structure is a directional graph comprising organized subject categories; and
  
  the further relationship comprises the shortest distance between respective categories to which respective subjects belong in the second data structure, the hyperlink between said subjects being removed if the shortest distance exceeds a threshold value.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
  - 2. The method of claim 1, wherein the threshold value is configurable.
  - 3. The method of claim 2, further comprising restoring a hyperlink between subjects in said pruned first data structure if a bidirectional hyperlink exists between the subjects in said repository.
  - 4. The method of claim 1, wherein the phrase matching step includes a disambiguation evaluation step.
  - 5. The method of claim 1, further comprising adding a bi-directional hyperlink between matched subjects prior to said further pruning step, wherein said bi-directional hyperlink is added if the phrases matched to said subjects occur in the document within a defined distance from each other.
  - 6. The method of claim 5, wherein the defined distance is configurable.
  - 7. The method of claim 1, wherein the matched subject ranking step utilizes an algorithm considering the number of hyperlinks to a subject and the ranking of the subjects from which said hyperlinks originate.
  - 8. The method of claim 1, wherein the subject ranking step further comprises determining an initial ranking based on the number of occurrences of the corresponding phrase in the document.
  - 9. The method of claim 1, wherein the repository is an Internet-accessible database.
  - 10. The method of claim 9, wherein the database is Wikipedia.
  - 11. The method of claim 1, further comprising extracting key phrases from a further document by repeating the phrase matching, further pruning, subject ranking, and key phrase selection steps for the further document.
  - 12. The method of claim 1, further comprising inserting the hyperlinks to the respective subjects corresponding to the selected key phrases into the document.
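
Claims 5, 7 and 8 can be read together as a link-based ranking seeded by phrase counts. The sketch below is one possible reading, not the patented algorithm. It uses a PageRank-style iteration, which the claims do not name. The function names, the `damping` and `iterations` parameters, and the use of token offsets to measure the "defined distance" are all assumptions made for illustration.

```python
def cooccurrence_links(positions, max_gap):
    """Claim 5 (sketch): link subjects whose phrases occur close together.

    positions maps each matched subject to the token offsets of its phrase.
    """
    links = set()
    subjects = sorted(positions)
    for i, a in enumerate(subjects):
        for b in subjects[i + 1:]:
            near = any(abs(p - q) <= max_gap
                       for p in positions[a] for q in positions[b])
            if near:
                # Bi-directional hyperlink: one edge in each direction.
                links.add((a, b))
                links.add((b, a))
    return links


def rank_subjects(links, occurrences, damping=0.85, iterations=50):
    """Claims 7 and 8 (sketch): PageRank-style ranking seeded by occurrence counts."""
    total = sum(occurrences.values())
    if total == 0:
        raise ValueError("at least one subject must occur in the document")
    subjects = list(occurrences)
    # Claim 8: initial ranking proportional to phrase occurrences.
    initial = {s: occurrences[s] / total for s in subjects}
    out_links = {s: [] for s in subjects}
    for src, dst in links:
        if src in out_links and dst in out_links:
            out_links[src].append(dst)
    rank = dict(initial)
    for _ in range(iterations):
        new_rank = {s: (1 - damping) * initial[s] for s in subjects}
        for src in subjects:
            targets = out_links[src]
            if targets:
                # Claim 7: a subject passes its rank along its hyperlinks.
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Subjects without outgoing links return their rank to the seed.
                for s in subjects:
                    new_rank[s] += damping * rank[src] * initial[s]
        rank = new_rank
    return rank


# Illustrative data only; not taken from the patent.
example_rank = rank_subjects(
    links=[("A", "C"), ("B", "C")],
    occurrences={"A": 1, "B": 1, "C": 1},
)
```

In this toy graph "C" receives links from both "A" and "B" and therefore ends up with the highest score, even though all three phrases occur equally often. The scores sum to 1 at every iteration because rank is only redistributed, never created or lost.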

13. A non-transitory computer-readable data storage device comprising instructions which cause the computer program to:
- access a repository comprising hyperlinked subjects, the repository comprising first and second data structures representing the relationship between said hyperlinked subjects using different representation criteria;
  
  prune the first data structure by removing hyperlinks between subjects based on a further relationship between said subjects in the second data structure;
  
  match phrases in said document to said subjects in the pruned first data structure;
  
  further prune the pruned first data structure by removing unmatched subjects that are not hyperlinked to matched subjects;
  
  determine a ranking for each matched subject; and
  
  select key phrases using the determined subject rankings, wherein the first data structure is a directional graph comprising the subjects as nodes and the hyperlinks between subjects as edges between nodes;
  
  the second data structure is a directional graph comprising organized subject categories; and
  
  the further relationship comprises the shortest distance between respective categories to which respective subjects belong in the second data structure, the hyperlink between said subjects being removed if the shortest distance exceeds a threshold value.
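
The phrase-matching, further-pruning and selection steps recited in both independent claims can be sketched as follows. This is a hedged illustration rather than the claimed implementation. The helper names are hypothetical, matching is simplified to case-insensitive whole-phrase search on subject titles (no disambiguation as in claim 4), and "hyperlinked to matched subjects" is assumed to cover links in either direction.

```python
import re


def match_subjects(document, subject_titles):
    """Return {subject: occurrence count} for titles that appear in the text."""
    counts = {}
    for title in subject_titles:
        # Lookarounds act as word boundaries even for titles ending in punctuation.
        pattern = r"(?<!\w)" + re.escape(title) + r"(?!\w)"
        hits = re.findall(pattern, document, flags=re.IGNORECASE)
        if hits:
            counts[title] = len(hits)
    return counts


def further_prune(links, matched):
    """Drop unmatched subjects that are not hyperlinked to any matched subject."""
    keep = set(matched)
    for src, dst in links:
        if src in matched or dst in matched:
            keep.update((src, dst))
    kept_links = {(s, d) for s, d in links if s in keep and d in keep}
    return keep, kept_links


def select_key_phrases(ranking, matched, top_k=3):
    """Pick the top_k matched subjects by rank (top_k is an assumed parameter)."""
    ordered = sorted(matched, key=lambda s: ranking.get(s, 0.0), reverse=True)
    return ordered[:top_k]


# Illustrative data only; not taken from the patent.
document = "Graph theory underpins PageRank. PageRank ranks nodes in a graph."
titles = ["PageRank", "Graph theory", "Jazz", "Markov chain"]
links = {
    ("PageRank", "Markov chain"),
    ("Graph theory", "PageRank"),
    ("Jazz", "Blues"),
}

matched = match_subjects(document, titles)
nodes, kept_links = further_prune(links, matched)
```

Here "PageRank" and "Graph theory" are matched, "Markov chain" is retained because it is linked to a matched subject, and "Jazz" and "Blues" are removed together with the link between them. A ranking such as the one determined in the ranking step would then be passed to `select_key_phrases` to pick the final key phrases.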

Current Assignee
Hewlett Packard Enterprise Development LP (Hewlett-Packard Enterprise Company)
Original Assignee
Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors
Zhou, Bao-Yao; Luo, Ping; Yang, Sheng-Wen; Xiong, Yuhong; Liu, Wei
Primary Examiner(s)
Trujillo, James
Assistant Examiner(s)
Vu, Thong H

Application Number

US13/264,806
Publication Number

US 20120047149A1
Time in Patent Office

2,072 Days
Field of Search

707/102, 707/749, 707/798, 707/705, 707/3, 707/5, 707/767
US Class Current

707/748
CPC Class Codes

G06F 16/2246   Trees, e.g. B+trees

G06F 16/9027   Trees

G06F 40/205   Parsing

G06F 40/258   Heading extraction; Automat...

G06N 5/022   Knowledge engineering; Know...
