Document key phrase extraction method
First Claim
1. A computer-implemented method of extracting key phrases from a document comprising:
accessing a repository comprising hyperlinked subjects, the repository comprising first and second data structures representing the relationship between said hyperlinked subjects using different representation criteria;
pruning the first data structure by removing hyperlinks between subjects based on a further relationship between said subjects in the second data structure;
matching phrases in said document to said subjects in the pruned first data structure;
further pruning the pruned first data structure by removing unmatched subjects that are not hyperlinked to matched subjects;
determining a ranking for each matched subject; and
selecting key phrases using the determined subject rankings, wherein the first data structure is a directional graph comprising the subjects as nodes and the hyperlinks between subjects as edges between nodes;
the second data structure is a directional graph comprising organized subject categories; and
the further relationship comprises the shortest distance between respective categories to which respective subjects belong in the second data structure, the hyperlink between said subjects being removed if the shortest distance exceeds a threshold value.
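The category-distance pruning recited in the last limitation can be illustrated with a minimal sketch. It is not the patented implementation. It assumes that each subject belongs to exactly one category, that distance is counted in hops with category edges followed in either direction, and that unreachable categories count as exceeding the threshold. All function names, the sample data, and the threshold of 2 are hypothetical.

```python
from collections import deque

def undirected_neighbours(category_graph):
    """Build a neighbour map that ignores edge direction (an assumption)."""
    nbrs = {}
    for parent, children in category_graph.items():
        nbrs.setdefault(parent, set())
        for child in children:
            nbrs[parent].add(child)
            nbrs.setdefault(child, set()).add(parent)
    return nbrs

def shortest_distance(nbrs, source, target):
    """Breadth-first hop count between two categories; None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in nbrs.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def prune_links(subject_links, subject_category, category_graph, threshold):
    """Drop a hyperlink when the category distance exceeds the threshold."""
    nbrs = undirected_neighbours(category_graph)
    pruned = {}
    for subject, targets in subject_links.items():
        kept = set()
        for target in targets:
            d = shortest_distance(
                nbrs, subject_category[subject], subject_category[target]
            )
            # Unreachable categories are treated as too distant (assumption).
            if d is not None and d <= threshold:
                kept.add(target)
        pruned[subject] = kept
    return pruned

# Hypothetical sample data: a small category graph and subject graph.
example_categories = {
    "Root": ["Science", "Arts"],
    "Science": ["Physics", "Biology"],
    "Arts": ["Music"],
}
example_subject_category = {
    "Quantum mechanics": "Physics",
    "Cell": "Biology",
    "Jazz": "Music",
}
example_links = {
    "Quantum mechanics": {"Cell", "Jazz"},
    "Cell": set(),
    "Jazz": set(),
}

example_pruned = prune_links(
    example_links, example_subject_category, example_categories, threshold=2
)
```

In this sample, Physics and Biology are two hops apart (via Science), so the link to "Cell" survives, while Physics and Music are four hops apart, so the link to "Jazz" is removed.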
Abstract
A computer-implemented method of extracting key phrases from a document is disclosed comprising the steps of accessing a repository comprising linked subjects, the repository comprising first and second data structures representing the relationship between said subjects using different representation criteria; pruning the first data structure by removing links between subjects based on a further relationship between said subjects in the second data structure; matching phrases in said document to subjects in the pruned first data structure; further pruning the pruned first data structure by removing unmatched subjects that are not linked to matched subjects; determining a ranking for each matched subject; and selecting key phrases using the determined subject rankings. A computer program for implementing the steps of this method when executed on a computer is also disclosed.
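The matching, further pruning, ranking and selection steps summarized in the abstract can be sketched as follows. The source does not say how phrases are matched or how the ranking is computed, so this sketch makes assumptions: phrases are matched by case-insensitive substring search on subject titles, a link in either direction counts as "linked to a matched subject", and the ranking is a plain link count. All names and sample data are hypothetical, and the input graph is assumed to be the already pruned first data structure.

```python
def match_subjects(document, graph):
    """Return subjects whose title occurs in the document (naive matching)."""
    text = document.lower()
    return {subject for subject in graph if subject.lower() in text}

def further_prune(graph, matched):
    """Remove unmatched subjects that have no link to or from a matched one."""
    keep = set(matched)
    for subject, targets in graph.items():
        if subject in matched:
            keep.update(targets)      # neighbours a matched subject points to
        elif targets & matched:
            keep.add(subject)         # unmatched subject pointing to a match
    return {s: graph.get(s, set()) & keep for s in keep}

def rank_matched(graph, matched):
    """Score each matched subject by its link count (stand-in ranking)."""
    score = {s: len(graph[s]) for s in matched}
    for targets in graph.values():
        for target in targets:
            if target in score:
                score[target] += 1
    return score

def select_key_phrases(document, graph, top_k):
    """Run the steps in order and return the top_k highest-ranked subjects."""
    matched = match_subjects(document, graph)
    pruned = further_prune(graph, matched)
    scores = rank_matched(pruned, matched)
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(scores, key=lambda s: (-scores[s], s))[:top_k]

# Hypothetical sample data.
example_graph = {
    "Graph theory": {"Shortest path problem", "Vertex"},
    "Shortest path problem": {"Graph theory"},
    "Vertex": set(),
    "Jazz": {"Saxophone"},
    "Saxophone": set(),
    "Opera": {"Aria"},
    "Aria": set(),
}
example_document = (
    "Graph theory gives us tools such as the shortest path problem. "
    "Some people enjoy jazz."
)

example_key_phrases = select_key_phrases(example_document, example_graph, top_k=2)
```

Here "Opera" and "Aria" are neither matched nor linked to a matched subject, so the further pruning step drops them. A production system would likely replace the link count with a graph-based ranking of its own choosing.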
13 Claims
1. A computer-implemented method of extracting key phrases from a document comprising:
accessing a repository comprising hyperlinked subjects, the repository comprising first and second data structures representing the relationship between said hyperlinked subjects using different representation criteria;
pruning the first data structure by removing hyperlinks between subjects based on a further relationship between said subjects in the second data structure;
matching phrases in said document to said subjects in the pruned first data structure;
further pruning the pruned first data structure by removing unmatched subjects that are not hyperlinked to matched subjects;
determining a ranking for each matched subject; and
selecting key phrases using the determined subject rankings, wherein the first data structure is a directional graph comprising the subjects as nodes and the hyperlinks between subjects as edges between nodes;
the second data structure is a directional graph comprising organized subject categories; and
the further relationship comprises the shortest distance between respective categories to which respective subjects belong in the second data structure, the hyperlink between said subjects being removed if the shortest distance exceeds a threshold value.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A non-transitory computer-readable data storage device comprising instructions which cause the computer program to:
access a repository comprising hyperlinked subjects, the repository comprising first and second data structures representing the relationship between said hyperlinked subjects using different representation criteria;
prune the first data structure by removing hyperlinks between subjects based on a further relationship between said subjects in the second data structure;
match phrases in said document to said subjects in the pruned first data structure;
further prune the pruned first data structure by removing unmatched subjects that are not hyperlinked to matched subjects;
determine a ranking for each matched subject; and
select key phrases using the determined subject rankings, wherein the first data structure is a directional graph comprising the subjects as nodes and the hyperlinks between subjects as edges between nodes;
the second data structure is a directional graph comprising organized subject categories; and
the further relationship comprises the shortest distance between respective categories to which respective subjects belong in the second data structure, the hyperlink between said subjects being removed if the shortest distance exceeds a threshold value.
Specification