Document processing method and system

US 8,359,327 B2
Filed: 05/25/2010
Issued: 01/22/2013
Est. Priority Date: 05/27/2009
Status: Expired due to Fees

Abstract

A method and system are provided for expanding a document set that serves as a search data source in the field of business-related search. The method expands a seed document in a seed document set and includes: identifying one or more entity words of the seed document; identifying, based on each entity word, one or more topic words related to that entity word in the seed document where the entity word is located; forming an entity word-topic word pair from each identified topic word and the entity word on the basis of which that topic word is identified; and obtaining one or more expanded documents through the web by taking the entity word and topic word in each entity word-topic word pair as key words at the same time. A system for executing the above method is also provided.
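
The last two steps of the abstract (pairing and searching with both key words at once) can be sketched as follows. This is a minimal illustration only: the function names, the quoted-term query syntax, and the substring check are assumptions made for the example and are not taken from the patent, and no real search engine is called.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (entity word, topic word)


def form_pairs(topics_by_entity: Dict[str, List[str]]) -> List[Pair]:
    """Pair each topic word with the entity word it was identified from."""
    return [
        (entity, topic)
        for entity, topics in topics_by_entity.items()
        for topic in topics
    ]


def build_query(pair: Pair) -> str:
    """Use both words as key words at the same time.

    Quoting each term is an assumed query syntax, not one named in the source.
    """
    entity, topic = pair
    return f'"{entity}" "{topic}"'


def contains_both(document_text: str, pair: Pair) -> bool:
    """Keep only candidate documents that contain both the entity word
    and the topic word (simple case-insensitive substring check)."""
    text = document_text.lower()
    return all(word.lower() in text for word in pair)
```

A caller would send `build_query(pair)` to whatever search service is available and then keep only the results for which `contains_both` is true, so that every expanded document mentions both words of the pair.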

5 Claims

1. A method for expanding a seed document in a seed document set, wherein the seed document set comprises at least one seed document, the method comprising:
- identifying one or more entity words of the seed document in memory by a processor, wherein the one or more identified entity words are words indicating focused entities of the seed document and the one or more identified entity words of the seed document are identified with focused named entity recognition (FNER) technology, the FNER technology comprising:
  - segmenting the seed document;
  - applying part of speech tagging;
  - identifying candidate entity words;
  - extracting feature values for each candidate entity word to form a feature vector;
  - setting a threshold and setting a weight for each feature value in the feature vector;
  - calculating a score of each candidate entity word with the feature vector and the weight; and
  - comparing the score with the set threshold and determining entity words from the candidate entity words as the one or more identified entity words;
- identifying by the processor, based on each of the one or more identified entity words of the seed document, one or more topic words related to each of the one or more identified entity words, the one or more identified topic words located in the seed document, wherein the one or more identified topic words of the seed document are identified with focused topic detection (FTD) technology using the segmenting of the seed document and the part of speech tagging of the FNER technology as a basis for identifying the one or more topic words;
- forming, by the processor, an entity word-topic word pair from each of the one or more identified topic words and each of the one or more identified entity words upon which each of the one or more identified topic words is identified; and
- obtaining one or more expanded documents by the processor by taking the entity word and topic word in each entity word-topic word pair as key words for web searching at the same time, wherein the expanded documents comprise not only the entity word in the each entity word-topic word pair but also the topic word in the each entity word-topic word pair.

2. The method according to claim 1, wherein the identifying by the processor, based on the each of the one or more identified entity words of the seed document, one or more topic words related to each of the one or more identified entity words further comprises:
- identifying the one or more topic words based on a distance between words other than the one or more identified entity words in the seed document.

3. The method according to claim 1, wherein the identifying by the processor, based on the each of the one or more identified entity words of the seed document, one or more topic words related to each of the one or more identified entity words further comprises:
- identifying the one or more topic words based on a frequency of other words than the one or more identified entity words in the seed document and the one or more identified entity words upon which each of the one or more identified topic words is identified appearing in a same sentence in the seed document.

4. The method according to claim 1, further comprising receiving a recommended seed document to form the seed document set.
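
The FNER scoring steps recited in claim 1 (feature vector, per-feature weights, score, threshold comparison) might be sketched as below. The claim does not say how the score is computed from the feature vector and the weights; a weighted sum is assumed here. The feature names in the comments, the `Candidate` type, and the `>=` comparison are likewise hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    word: str
    # Hypothetical feature values, e.g. [appears_in_title,
    # relative_frequency, appears_in_first_sentence].
    features: List[float]


def score(features: List[float], weights: List[float]) -> float:
    """Weighted sum of feature values (an assumed scoring rule)."""
    if len(features) != len(weights):
        raise ValueError("feature vector and weights must have equal length")
    return sum(f * w for f, w in zip(features, weights))


def select_entity_words(
    candidates: List[Candidate], weights: List[float], threshold: float
) -> List[str]:
    """Keep candidates whose score reaches the set threshold."""
    return [c.word for c in candidates if score(c.features, weights) >= threshold]
```

With weights `[0.5, 0.3, 0.2]` and a threshold of `0.5`, a candidate with features `[1.0, 0.8, 1.0]` would be kept and one with `[0.0, 0.1, 0.0]` would be dropped. In practice the weights and threshold would have to be tuned or learned; the source only states that they are set.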

5. A system for expanding a seed document in a seed document set, wherein the seed document set comprises at least one seed document, the system comprising:
- entity word identifying means for identifying one or more entity words of the seed document in memory by a processor, the one or more identified entity words being words indicating focused entities of the seed document, wherein the entity word identifying means includes focused named entity recognition (FNER) technology, the FNER technology comprising:
  - segmenting the seed document;
  - applying part of speech tagging;
  - identifying candidate entity words;
  - extracting feature values for each candidate entity word to form a feature vector;
  - setting a threshold and setting a weight for each feature value in the feature vector;
  - calculating a score of each candidate entity word with the feature vector and the weight; and
  - comparing the score with the set threshold and determining entity words from the candidate entity words as the one or more identified entity words;
- topic word identifying means for identifying by the processor, based on each of the one or more identified entity words of the seed document, one or more topic words related to each of the one or more identified entity words, the one or more identified topic words located in the seed document, wherein the topic word identifying means is configured to identify the one or more topic words of the seed document with focused topic detection (FTD) technology using the segmenting of the seed document and the part of speech tagging of the FNER technology as a basis for identifying the one or more topic words;
- pairing means for forming, by the processor, an entity word-topic word pair from each of the one or more identified topic words and each of the one or more identified entity words upon which each of the one or more identified topic words is identified; and
- document expanding means for obtaining one or more expanded documents by the processor by taking the entity word and topic word in the each entity word-topic word pair as key words for web searching at the same time, the expanded documents comprising not only the entity word in each entity word-topic word pair but also the topic word in the each entity word-topic word pair.
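
One way a topic word identifying component could work is sketched below, using the same-sentence frequency criterion described in claim 3. This is an assumption-laden illustration: the regex-based sentence splitter and tokenizer, the small `STOPWORDS` set (standing in for the part-of-speech filtering the claims mention), and the `top_k` parameter are all hypothetical and not drawn from the patent.

```python
import re
from collections import Counter
from typing import List

# Hypothetical stand-in for part-of-speech filtering of function words.
STOPWORDS = {"a", "an", "the", "on", "of", "in", "to", "and"}


def split_sentences(text: str) -> List[str]:
    """Naive sentence segmentation on ., ! and ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence: str) -> List[str]:
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", sentence.lower())


def topic_words(text: str, entity: str, top_k: int = 2) -> List[str]:
    """Rank non-entity words by how many sentences they share with the entity."""
    entity = entity.lower()
    counts: Counter = Counter()
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        if entity in tokens:
            # Count each word once per sentence that also contains the entity.
            counts.update({t for t in tokens if t != entity and t not in STOPWORDS})
    # Sort by descending count, then alphabetically, for a deterministic order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]
```

For a seed text in which "cloud" shares two sentences with the entity "IBM" and every other word shares at most one, `topic_words(text, "IBM", top_k=1)` returns `["cloud"]`. A distance-based variant, as in claim 2, would instead weight words by how close they appear to the entity word.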

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Bao, Sheng Hua; Cui, Jie; Su, Hui; Su, Zhong; Zhang, Li
Primary Examiner(s): Lee, Wilson
Assistant Examiner(s): Le, Jessica N
Application Number: US12/786,557
Publication Number: US 20100306248A1
Time in Patent Office: 973 Days
Field of Search: None
US Class Current: 707/769
CPC Class Codes:
G06F 16/34 Browsing; Visualisation the...
G06F 16/93 Document management systems
