Method and system for mining a document containing dirty text
Abstract
A method and system for mining a document containing dirty text. Dirty text is removed or replaced and the document is processed using a variety of text mining techniques. In one embodiment, dirty text removal and replacement occur in two stages. In the first stage, a general cleaning is applied to all documents without regard to the domain they belong to or the mining task to be performed. In the second stage, document cleaning is more specific to the anomalies of the domain and the mining task to be performed. In a third stage, after cleaning, the document is processed using a variety of data mining techniques according to the mining task. In one embodiment, the present invention scores and ranks sentences in a document according to their relevance, extracts the highest ranked sentences, and presents a summary. The present invention allows users to leverage existing domain knowledge and can be customized according to the domain and task requirements.
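The two cleaning stages described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `REPLACEMENTS` table, the `CODE_LINE` and `TABLE_LINE` patterns, and the function names do not come from the patent, which does not specify how dirty text, computer code, or tables are detected.

```python
import re

# Hypothetical correction table (assumption): known misspelling -> replacement.
REPLACEMENTS = {"teh": "the", "recieve": "receive"}

# Hypothetical heuristics (assumption): lines that look like code or table rows.
CODE_LINE = re.compile(r"[{};]\s*$|^\s*(def |class |import |#include|public |private )")
TABLE_LINE = re.compile(r"\|.*\||\t.*\t|^\s*[-+=]{3,}\s*$")


def general_clean(text):
    """Stage 1: domain-independent cleaning (here: replace known misspellings)."""
    def fix(match):
        word = match.group(0)
        fixed = REPLACEMENTS.get(word.lower())
        if fixed is None:
            return word
        # Preserve a leading capital letter from the original word.
        return fixed.capitalize() if word[0].isupper() else fixed
    return re.sub(r"[A-Za-z]+", fix, text)


def domain_clean(text):
    """Stage 2: domain- and task-specific cleaning (here: drop code-like and table-like lines)."""
    kept = [line for line in text.splitlines()
            if not CODE_LINE.search(line) and not TABLE_LINE.search(line)]
    return "\n".join(kept)


def clean(text):
    """Run the general stage, then the domain-specific stage."""
    return domain_clean(general_clean(text))


raw = ("Teh server crashed after restart.\n"
       "int x = 0;\n"
       "| id | status |\n"
       "Please recieve the patch.")
cleaned = clean(raw)
```

In this sketch the cleaned document keeps only the two prose lines, with the misspellings replaced. A third stage would then run the mining task on `cleaned`.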
26 Claims
1. A computer-implemented method for mining a document containing dirty text comprising:
- removing an instance of dirty text within said document to produce a cleaned document having a content; and
- performing a data mining operation on said cleaned document thereby deriving relevant information from said cleaned document and providing a summary of the content of said document, and scoring and ranking each sentence of said document, wherein said removing further comprises removing an instance of computer code from said document, and removing a table from said document.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A computer system comprising:
- a bus;
- a memory unit coupled to said bus; and
- a processor coupled to said bus, said processor for executing a method for mining a document containing dirty text comprising:
  - producing a cleaned document having a content comprising performing a general cleaning of said document by removing an instance of dirty text within said document including instances of misspelling and grammatical errors, and performing a domain and task specific cleaning of said document including removing instances of computer code and tables to produce a cleaned document; and
  - performing a data mining operation on said cleaned document including providing a summary of the content of said document including scoring and ranking each sentence.
Dependent claims: 10
11. A computer system comprising:
- a bus;
- a memory unit coupled to said bus; and
- a processor coupled to said bus, said processor for executing a method for mining a document containing dirty text comprising:
  - producing a cleaned document having a content comprising performing a general cleaning of said document by removing an instance of dirty text within said document including instances of misspelling and grammatical errors, and performing a domain and task specific cleaning of said document including removing instances of computer code and tables to produce a cleaned document; and
  - performing a data mining operation on said cleaned document including providing a summary of the content of said document, wherein said performing a data mining operation further comprises identifying a sentence within said cleaned document by identifying a beginning and an end of said sentence, wherein said performing a data mining operation further comprises scoring and ranking said sentence and wherein scoring said sentence further comprises:
    - selecting scoring techniques operable for summarizing non-narrative, grammatically incorrect text;
    - selecting scoring techniques operable for summarizing narrative, grammatically correct text; and
    - using said scoring techniques to score said sentence.
Dependent claims: 9, 12, 13, 14
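As an illustration of the sentence identification and scoring steps recited in claim 11, the following Python sketch locates the beginning and end of each sentence and scores it with two groups of techniques. The boundary rule (terminal punctuation or a line break), the keyword scorer, the cue phrases, and all function names are assumptions chosen for the example; the claim itself does not name specific scoring techniques.

```python
import re


def split_sentences(text):
    """Return (begin, end, sentence) triples.

    Assumption: a sentence ends at '.', '!' or '?', or at a line break,
    so fragments without terminal punctuation are still captured.
    """
    spans = []
    for m in re.finditer(r"[^.!?\n]+[.!?]?", text):
        raw = m.group(0)
        stripped = raw.strip()
        if stripped:
            begin = m.start() + (len(raw) - len(raw.lstrip()))
            spans.append((begin, begin + len(stripped), stripped))
    return spans


def keyword_score(sentence, keywords):
    """Technique that needs no grammar, so it also works on non-narrative text."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return sum(1 for w in words if w in keywords)


def cue_phrase_score(sentence):
    """Technique aimed at narrative text: counts hypothetical cue phrases."""
    cues = ("in summary", "the problem", "the solution")
    lowered = sentence.lower()
    return sum(1 for c in cues if c in lowered)


def score_sentence(sentence, keywords):
    """Select both groups of techniques and combine their scores by summing."""
    techniques = [lambda s: keyword_score(s, keywords), cue_phrase_score]
    return sum(t(sentence) for t in techniques)


text = "Disk full. Reboot fixed it!\nlogs attached"
spans = split_sentences(text)
scores = [score_sentence(s, {"disk", "full"}) for _, _, s in spans]
```

Summing the two groups is only one possible way to combine them; a weighted combination would fit the same structure.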
15. A computer-useable medium having computer-readable program code embodied therein for causing a computer system to perform the steps of:
- removing an instance of dirty text within said document to produce a cleaned document having a content; and
- performing a data mining operation on said cleaned document to provide a summary of said content, removing an instance of computer code from said document and removing a table from said document, and scoring and ranking each sentence.
Dependent claims: 16, 17
18. A computer-useable medium having computer-readable program code embodied therein for causing a computer system to perform the steps of:
- removing an instance of dirty text within said document to produce a cleaned document having a content; and
- performing a data mining operation on said cleaned document to provide a summary of said content, wherein said performing a data mining operation further comprises identifying a sentence within said cleaned document by identifying a beginning and an end of said sentence, wherein said performing a data mining operation further comprises, wherein scoring said sentence further comprises:
  - selecting scoring techniques operable for summarizing non-narrative, grammatically incorrect text;
  - selecting scoring techniques operable for summarizing narrative, grammatically correct text; and
  - using said scoring techniques to score said sentence.
Dependent claims: 19, 20, 21
22. A computer-implemented method for mining a document containing dirty text comprising:
- producing a cleaned document having a content comprising performing a general cleaning of said document by removing one or more instance of dirty text within said document including instances of misspelling and grammatical errors, and performing a domain and task specific cleaning of said document including removing instances of computer code and tables; and
- performing a data mining operation on said cleaned document, including determining a sentence score for each sentence of said cleaned document and ranking the sentences from highest to lowest based on the sentence score;
- generating a summary of the content of the document using the highest ranked sentences.
Dependent claims: 23, 24, 25, 26
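The score, rank, and summarize steps of claim 22 can be sketched in Python as follows. The scoring function here is average word frequency, a common extractive-summarization heuristic substituted for illustration; it is not stated in the claim. The `STOPWORDS` set, the `top_n` parameter, and the function name `summarize` are likewise assumptions.

```python
import re
from collections import Counter

# Hypothetical stopword list (assumption); a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "for", "your", "after", "is", "was"}


def summarize(sentences, top_n=2):
    """Score each sentence, rank from highest to lowest, and build a summary.

    Returns (summary_sentences, ranked_indices). The score is the average
    document-wide frequency of a sentence's non-stopword tokens.
    """
    tokenized = [
        [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOPWORDS]
        for s in sentences
    ]
    freq = Counter(w for words in tokenized for w in words)
    scores = [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]
    # Rank highest score first; sorted() is stable, so ties keep document order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Present the highest ranked sentences in their original order.
    chosen = sorted(ranked[:top_n])
    return [sentences[i] for i in chosen], ranked


sentences = [
    "The printer driver fails after the update.",
    "Thanks for your help.",
    "Reinstalling the printer driver fixed the update failure.",
]
summary, ranked = summarize(sentences, top_n=2)
```

With this input the courtesy sentence ranks last and is left out of the two-sentence summary, which is the behavior the abstract describes as extracting the highest ranked sentences.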
Specification