Document Key Phrase Extraction Method

US 20120047149A1
Filed: 05/12/2009
Published: 02/23/2012
Est. Priority Date: 05/12/2009
Status: Active Grant

2 Assignments

0 Petitions

Abstract

A computer-implemented method of extracting key phrases from a document is disclosed comprising the steps of accessing a repository comprising linked subjects, the repository comprising first and second data structures representing the relationship between said subjects using different representation criteria; pruning the first data structure by removing links between subjects based on a further relationship between said subjects in the second data structure; matching phrases in said document to subjects in the pruned first data structure; further pruning the pruned first data structure by removing unmatched subjects that are not linked to matched subjects; determining a ranking for each matched subject; and selecting key phrases using the determined subject rankings. A computer program for implementing the steps of this method when executed on a computer is also disclosed.
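The steps named in the abstract can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the "further relationship" is assumed here to be a shared category, phrase matching is a plain case-insensitive substring count, and the ranking is occurrences plus retained incoming links. All function names and the toy data shapes are hypothetical.

```python
from typing import Dict, List, Set

Links = Dict[str, Set[str]]  # subject -> subjects it links to (first data structure)


def prune_links(links: Links, categories: Dict[str, Set[str]]) -> Links:
    """Keep a link only if both subjects share a category (assumed criterion)."""
    return {
        src: {d for d in dsts if categories.get(src, set()) & categories.get(d, set())}
        for src, dsts in links.items()
    }


def match_subjects(text: str, subjects: Set[str]) -> Dict[str, int]:
    """Count case-insensitive substring occurrences of each subject name."""
    lowered = text.lower()
    counts = {s: lowered.count(s.lower()) for s in subjects}
    return {s: n for s, n in counts.items() if n > 0}


def further_prune(links: Links, matched: Set[str]) -> Links:
    """Drop unmatched subjects that have no link to or from a matched subject."""
    keep = set(matched)
    for src, dsts in links.items():
        for dst in dsts:
            if src in matched or dst in matched:
                keep.update((src, dst))
    return {s: {d for d in links.get(s, set()) if d in keep} for s in keep}


def extract_key_phrases(text: str, links: Links,
                        categories: Dict[str, Set[str]], top_k: int = 3) -> List[str]:
    """Run the pruning, matching, further pruning, ranking, and selection steps."""
    subjects = set(links) | {d for dsts in links.values() for d in dsts}
    pruned = prune_links(links, categories)
    matched = match_subjects(text, subjects)
    graph = further_prune(pruned, set(matched))
    # Assumed ranking: occurrence count plus number of retained incoming links.
    scores = {
        s: matched[s] + sum(1 for dsts in graph.values() if s in dsts)
        for s in matched
    }
    return sorted(scores, key=lambda s: (-scores[s], s))[:top_k]
```

In this sketch the repository pruning depends only on the repository, so its result could be computed once and reused across documents, while matching, further pruning, ranking, and selection run per document.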

33 Citations

15 Claims

1. A computer-implemented method of extracting key phrases from a document comprising:
  - accessing a repository comprising linked subjects, the repository comprising first and second data structures representing the relationship between said subjects using different representation criteria;
  - pruning the first data structure by removing links between subjects based on a further relationship between said subjects in the second data structure;
  - matching phrases in said document to subjects in the pruned first data structure;
  - further pruning the pruned first data structure by removing unmatched subjects that are not linked to matched subjects;
  - determining a ranking for each matched subject; and
  - selecting key phrases using the determined subject rankings.
2. The method of claim 1, wherein:
  - the first data structure is a directional graph comprising the subjects as nodes and the links between subjects as edges between nodes;
  - the second data structure is a directional graph comprising organized subject categories; and
  - the further relationship comprises the shortest distance between the respective categories to which the respective subjects belong, the link between said subjects being removed if the shortest distance exceeds a threshold value.
3. The method of claim 2, wherein the threshold value is configurable.
4. The method of claim 2, further comprising restoring a link between subjects in said pruned first data structure if a bidirectional link exists between the subjects in said repository.
5. The method of claim 1, wherein the phrase matching step includes a disambiguation evaluation step.
6. The method of claim 1, further comprising adding a bi-directional link between matched subjects prior to said further pruning step, wherein said bi-directional link is added if the phrases matched to said subjects occur in the document within a defined distance from each other.
7. The method of claim 6, wherein the defined distance is configurable.
8. The method of claim 1, wherein the matched subject ranking step utilizes an algorithm considering the number of links to a subject and the ranking of the subjects from which said links originate.
9. The method of claim 1, wherein the subject ranking step further comprises determining an initial ranking based on the number of occurrences of the corresponding phrase in the document.
10. The method of claim 1, wherein the repository is an Internet-accessible database.
11. The method of claim 10, wherein the database is Wikipedia.
12. The method of claim 1, further comprising extracting key phrases from a further document by repeating the phrase matching, further pruning, subject ranking, and key phrase selection steps for the further document.
13. The method of claim 1, further comprising inserting hyperlinks to the respective subjects corresponding to the selected key phrases into the document.
14. A computer program product for, when loaded onto a computer, executing the steps of the method of any of claims 1-13.
15. A computer-readable data storage device comprising the computer program product of claim 14.
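Two of the dependent claims describe concrete mechanisms: claim 2 removes a link when the shortest distance between the subjects' categories exceeds a threshold, and claims 8 and 9 rank subjects from incoming links with an initial ranking based on phrase occurrences. The sketch below illustrates both under assumptions that the claims do not state: each subject has exactly one category, category distance ignores edge direction, and the ranking is a PageRank-style power iteration with a hypothetical damping factor of 0.85. The function and parameter names are illustrative only.

```python
from collections import deque
from typing import Dict, Set

Graph = Dict[str, Set[str]]


def category_distance(cat_graph: Graph, start: str, goal: str) -> float:
    """Shortest hop count between two categories, ignoring edge direction (assumption)."""
    undirected: Graph = {}
    for src, dsts in cat_graph.items():
        for dst in dsts:
            undirected.setdefault(src, set()).add(dst)
            undirected.setdefault(dst, set()).add(src)
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in undirected.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")  # unreachable categories are treated as infinitely far apart


def prune_by_distance(links: Graph, category_of: Dict[str, str],
                      cat_graph: Graph, threshold: int) -> Graph:
    """Drop a link when the category distance exceeds the threshold (cf. claim 2)."""
    return {
        src: {dst for dst in dsts
              if category_distance(cat_graph, category_of[src], category_of[dst]) <= threshold}
        for src, dsts in links.items()
    }


def rank_subjects(graph: Graph, occurrences: Dict[str, int],
                  damping: float = 0.85, iterations: int = 50) -> Dict[str, float]:
    """PageRank-style scores seeded from occurrence counts (cf. claims 8 and 9).

    Assumes every subject in `occurrences` was matched at least once.
    """
    nodes = list(occurrences)
    total = sum(occurrences.values())
    prior = {n: occurrences[n] / total for n in nodes}  # initial ranking
    rank = dict(prior)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * prior[n] for n in nodes}
        for src in nodes:
            targets = [d for d in graph.get(src, ()) if d in prior]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
            else:
                # No outgoing links: redistribute this subject's rank by the prior.
                for n in nodes:
                    nxt[n] += damping * rank[src] * prior[n]
        rank = nxt
    return rank
```

Passing `threshold` as a parameter mirrors claim 3, which makes the threshold configurable. A fixed iteration count keeps the example short; a convergence check on the change in scores would be a natural alternative.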

Current Assignee: Hewlett Packard Enterprise Development LP (Hewlett-Packard Enterprise Company)
Original Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors: Yang, Sheng-Wen; Liu, Wei; Zhou, Bao-Yao; Luo, Ping; Xiong, Yuhong
Granted Patent: US 8,935,260 B2
US Class Current: 707/748
CPC Class Codes:
- G06F 16/2246: Trees, e.g. B+trees
- G06F 16/9027: Trees
- G06F 40/205: Parsing
- G06F 40/258: Heading extraction; Automat...
- G06N 5/022: Knowledge engineering; Know...
