System and method for extraction of factoids from textual repositories

US 8,706,730 B2
Filed: 12/29/2005
Issued: 04/22/2014
Est. Priority Date: 12/29/2005
Status: Expired due to Fees

Abstract

A method (400) is disclosed of extracting factoids from text repositories, with the factoids being associated with a given factoid category. The method (400) starts by training a classifier (230) to recognize factoids relevant to that given factoid category. Documents or document summaries relevant to the given factoid category are next collected (410) from the text repositories. Sentences having a predetermined association to the given factoid category are extracted (420) from the documents or document summaries. Those sentences are classified (440), in a noisy environment, using the classifier (230) to extract snippets containing phrases relevant to the given factoid category. The extracted snippets are the factoids associated with the given factoid category.
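The four steps of the abstract (train, collect, extract sentences, classify) can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the patented implementation: the source does not name a classifier type, so a tiny two-class Naive Bayes stands in for classifier (230); the "predetermined association" is assumed to be the presence of a cue word; and all names (`NaiveBayes`, `extract_factoids`, `cue_words`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case word tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Two-class multinomial Naive Bayes with add-one smoothing.

    Assumption: stands in for the trained classifier; the source
    does not say which kind of classifier is used.
    """

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}

    def train(self, labelled_sentences):
        for text, label in labelled_sentences:
            self.counts[label].update(tokenize(text))
            self.docs[label] += 1

    def log_score(self, text, label):
        vocab = set(self.counts[True]) | set(self.counts[False])
        total = sum(self.counts[label].values())
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for tok in tokenize(text):
            score += math.log((self.counts[label][tok] + 1) / (total + len(vocab)))
        return score

    def is_relevant(self, text):
        return self.log_score(text, True) > self.log_score(text, False)


def split_sentences(document):
    """Split on sentence-final punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]


def extract_factoids(documents, cue_words, classifier):
    """Collect documents, keep cue-bearing sentences, classify, store snippets."""
    entity_store = list(documents)  # collected documents or summaries
    candidates = [
        s for d in entity_store for s in split_sentences(d)
        if cue_words & set(tokenize(s))  # assumed "predetermined association"
    ]
    snippet_store = [s for s in candidates if classifier.is_relevant(s)]
    return snippet_store


# Hypothetical training data for an "acquisition" factoid category.
clf = NaiveBayes()
clf.train([
    ("Acme acquired Widgets Inc for ten million dollars.", True),
    ("The company acquired its rival last year.", True),
    ("The weather was sunny in the city.", False),
    ("He bought a coffee and read the news.", False),
])
docs = ["Globex acquired Initech last year. The office has a nice view of the city."]
snippets = extract_factoids(docs, {"acquired"}, clf)
```

The cue-word filter discards sentences before the classifier sees them, which keeps the classification step cheap; the classifier then rejects cue-bearing sentences that are off-topic.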

15 Citations

20 Claims

1. A method of extracting factoids associated with a given factoid category of a plurality of categories from text repositories, said method comprising the steps of:
- training a classifier to recognise factoids relevant to said given factoid category;
- collecting, by a processor within a computer, documents or document summaries relevant to said given factoid category from the text repositories and storing the documents or document summaries in an entity store;
- extracting sentences having a predetermined association to said given factoid category from said documents or said document summaries; and
- classifying, in a noisy environment, said sentences using said classifier to extract snippets containing phrases relevant to said given factoid category, said extracted snippets being said factoid associated with said given factoid category, and storing the snippets in a snippet store.
- Dependent claims:
  - 2. A method according to claim 1 wherein said collecting step comprises performing a search of the text repositories, wherein said documents are referenced by the results of said search.
  - 3. A method according to claim 2 wherein said search is performed on text repositories using a search engine.
  - 4. A method according to claim 1 comprising the further step of annotating entities in said sentences according to said given factoid category before said classifying step is performed.
  - 5. A method according to claim 1 comprising the further step of ordering said factoids associated with said given factoid category.
  - 6. A method according to claim 1 wherein said training step includes generating a collection of documents related to said given factoid category by querying the text repositories.
  - 7. A method according to claim 2 wherein example instances of the given factoid category are treated as input queries for said search on the text repositories.
  - 8. A method according to claim 7 wherein strongly correlated example instances are treated as input queries for said search on the text repositories.
  - 9. A method according to claim 6, wherein said training step further includes appending to said collection of documents related to said given factoid category a collection of manually generated documents that are strongly related to said given factoid category.
  - 10. A method according to claim 4 wherein said entities are replaced in said sentences by associated annotation types.
  - 11. A method according to claim 4 comprising the further step of filtering said sentences after said annotation step and before said classifying step by selecting the sentences containing a pre-determined combination of entities having a pre-determined ordering of said entities.
  - 12. A method according to claim 4 wherein said collecting step comprises performing a search of the text repositories using a search phrase, wherein said documents are referenced by the results of said search, and said method comprises the further step of filtering said sentences after said annotation step and before said classifying step by selecting only said sentences containing said search phrase.
  - 13. A method according to claim 5 comprising the further step of ordering said factoids based upon a scoring function applied to the snippets associated with each factoid.
  - 14. A method according to claim 13 wherein said ordering is done based on a score that is assigned to each factoid, with said score being a function of a confidence score applied when classifying said sentences.
  - 15. A method according to claim 13 comprising the further step of annotating entities in said sentences according to said given factoid category before said classifying step is performed, and during said ordering step, all factoids related to respective entities are grouped in order to assign an overall score to respective entities, where said overall score is the basis for said scoring function.
  - 16. A method as claimed in claim 15 wherein said overall score assigned to respective entities is a function of the number of factoids related to respective entities.
  - 17. A method as claimed in claim 15 wherein said overall score assigned to respective entities is a function of the number of instances of each related factoid weighted by a confidence score associated with that factoid.
  - 18. A method according to claim 15 wherein the overall score assigned to respective entities is a function of the language being used in each related factoid.
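Some of the dependent claims describe concrete operations: replacing entities with their annotation types (claim 10), filtering sentences by an ordered combination of entities (claim 11), and grouping factoids per entity to compute an overall score (claims 15 to 17). A minimal sketch of one possible reading follows. The dictionary-based annotator, the `<TYPE>` placeholder format, and the choice of summing confidence scores are assumptions made for illustration; the claims do not prescribe them, and all function names are hypothetical.

```python
from collections import defaultdict


def annotate(sentence, gazetteer):
    """Replace known entity mentions with their annotation type (cf. claim 10).

    Assumption: entities are single whitespace-delimited words looked up
    in a dictionary; a real annotator would be far more capable.
    """
    tokens, types = [], []
    for word in sentence.split():
        key = word.strip(".,;")
        if key in gazetteer:
            tokens.append(word.replace(key, f"<{gazetteer[key]}>"))
            types.append(gazetteer[key])
        else:
            tokens.append(word)
    return " ".join(tokens), types


def has_ordered_types(types, required):
    """True if `required` appears in `types` as an in-order subsequence (cf. claim 11)."""
    remaining = iter(types)
    # `t in remaining` consumes the iterator, so order is enforced.
    return all(t in remaining for t in required)


def rank_entities(snippets):
    """Group factoids by entity and order entities by overall score (cf. claims 15 to 17).

    `snippets` holds (entity, factoid, confidence) triples. The overall
    score here is the sum of confidences, i.e. instance counts weighted
    by classifier confidence; this is one choice among many.
    """
    scores = defaultdict(float)
    for entity, _factoid, confidence in snippets:
        scores[entity] += confidence
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical usage.
gazetteer = {"Globex": "ORG", "Initech": "ORG", "2005": "DATE"}
annotated, types = annotate("Globex acquired Initech in 2005.", gazetteer)
ranked = rank_entities([
    ("Globex", "acquired Initech", 0.75),
    ("Initech", "was acquired", 0.5),
    ("Globex", "opened an office", 0.5),
])
```

Setting every confidence to 1.0 reduces the score to a plain factoid count per entity, which corresponds to the simpler variant in claim 16.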

19. An apparatus for extracting factoids associated with a given factoid category of a plurality of factoid categories from text repositories, said apparatus comprising:
- means for training a classifier to recognise factoids relevant to said given factoid category;
- means for collecting documents or document summaries relevant to said given factoid category from the text repositories;
- means for extracting sentences having a predetermined association to said given factoid category from said documents or said document summaries;
- means for classifying, in a noisy environment, said sentences to extract snippets containing phrases relevant to said given factoid category, said extracted snippets being said factoid associated with said given factoid category, and
- a memory to store the snippets in a snippet store within the memory.

20. A computer program product comprising a machine-readable storage medium having machine-readable program code recorded thereon for controlling the operation of a data processing apparatus on which the program code executes to perform a method of extracting factoids associated with a given factoid category of a plurality of factoid categories from text repositories, said method comprising the steps of:
- training a classifier to recognise factoids relevant to said given factoid category;
- collecting documents or document summaries relevant to said given factoid category from the text repositories;
- extracting sentences having a predetermined association to said given factoid category from said documents or said document summaries;
- classifying, in a noisy environment, said sentences using said classifier to extract snippets containing phrases relevant to said given factoid category, said extracted snippets being said factoid associated with said given factoid category; and
- storing the snippets on a snippet store.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Joshi, Sachindra; Krishnapuram, Raghuram; Kumar, Nimit; Mehta, Kiran; Negi, Sumit; Ramakrishnan, Ganesh; Holmes, Scott R
Primary Examiner(s): LE, DEBBIE M
Application Number: US11/321,177
Publication Number: US 20070162447A1
Time in Patent Office: 3,036 Days
Field of Search: None
US Class Current: 707/737
CPC Class Codes:
G06F 16/35 Clustering; Classification
G06F 16/951 Indexing; Web crawling tech...
