Efficient method for information extraction
Abstract
The invention provides a method and system for extracting information from text documents. A document intake module receives and stores a plurality of text documents for processing, an input format conversion module converts each document into a standard format for processing, an extraction module identifies and extracts desired information from each text document, and an output format conversion module converts the information extracted from each document into a standard output format. These modules operate simultaneously on multiple documents in a pipeline fashion so as to maximize the speed and efficiency of extracting information from the plurality of documents.
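The pipelined arrangement described in the abstract can be illustrated with a minimal sketch. It assumes one worker thread per stage connected by FIFO queues, which is one possible reading of "pipeline fashion" and not necessarily the patented implementation. The stage functions (to_standard_format, tokenize, extract, to_output_format) are hypothetical placeholders that do not appear in the source.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the document stream


def stage(func, inbox, outbox):
    """Apply func to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def run_pipeline(documents, funcs):
    """Run funcs as concurrent stages; results come back in input order."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for doc in documents:
        queues[0].put(doc)
    queues[0].put(_DONE)
    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Hypothetical stage functions, standing in for the modules in the abstract.
def to_standard_format(raw):
    return raw.strip().lower()


def tokenize(text):
    return text.split()


def extract(tokens):
    # Toy "desired information": keep only purely numeric tokens.
    return [tok for tok in tokens if tok.isdigit()]


def to_output_format(fields):
    return ",".join(fields)


PIPELINE = [to_standard_format, tokenize, extract, to_output_format]
```

Because each stage owns its own thread and queue, a later document can be in input conversion while an earlier one is still in extraction; a single worker per stage keeps the output in input order.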
60 Claims
1. A system for extracting information from text documents, comprising:
an input module for receiving a plurality of text documents for information extraction, wherein said plurality of documents may be formatted in accordance with any one of a plurality of formats;
an input conversion module for converting said plurality of text documents into a single format for processing;
a tokenizer module for generating and assigning tokens to symbols contained in said plurality of text documents;
an extraction module for receiving said tokens from said tokenizer module and extracting desired information from each of said plurality of text documents;
an output conversion module for converting said extracted information into a single output format; and
an output module for outputting said converted extracted information, wherein each of the above modules operate simultaneous and independently of one another so as to process said plurality of text documents in a pipeline fashion.
12. A method of extracting information from a plurality of text documents, comprising the acts of:
receiving a plurality of text documents for information extraction, wherein said plurality of documents may be formatted in accordance with any one of a plurality of formats;
converting said plurality of text documents into a single format for processing;
generating and assigning tokens to symbols contained in said plurality of text documents;
extracting desired information from each of said plurality of text documents based in part on said token assignments;
converting said extracted information into a single output format; and
outputting the converted information, wherein each of the above acts are performed simultaneous and independently of one another so as to process said plurality of text documents in a pipeline fashion.
23. A system for extracting information from a plurality of text documents, comprising:
means for receiving a plurality of text documents for information extraction, wherein said plurality of documents may be formatted in accordance with any one of a plurality of formats;
means for converting said plurality of text documents into a single format for processing;
means for generating and assigning tokens to symbols contained in said plurality of text documents;
means for extracting desired information from each of said plurality of text documents based in part on said token assignments;
means for converting said extracted information into a single output format; and
means for outputting the converted information, wherein each of the above means operate simultaneous and independently of one another so as to process said plurality of text documents in a pipeline fashion.
34. A computer-readable medium having computer executable instructions for performing a method of extracting information from a plurality of text documents, the method comprising:
receiving a plurality of text documents for information extraction, wherein said plurality of documents may be formatted in accordance with any one of a plurality of formats;
converting said plurality of text documents into a single format for processing;
generating and assigning tokens to symbols contained in said plurality of text documents;
extracting desired information from each of said plurality of text documents based in part on said token assignments;
converting said extracted information into a single output format; and
outputting the converted information, wherein each of the above acts are performed simultaneous and independently of one another so as to process said plurality of text documents in a pipeline fashion.
45. A method of extracting information from a text document, comprising:
finding a best path sequence of states in a HMM, wherein said HMM is trained using a plurality of training documents each having a sequence of tagged states;
extracting information from said text document based on said best path sequence of states; and
calculating a confidence score for said extracted information, wherein said confidence score is based on a measure of similarity between said best path sequence of states and at least one of said sequence of tagged states from at least one of said plurality of training documents.
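As an illustration of the steps recited in claim 45, the sketch below finds a best path with the standard Viterbi algorithm and then scores it against tagged training sequences. The claim does not specify the similarity measure; the use of difflib.SequenceMatcher here is an assumption chosen for brevity, and the toy states, symbols, and probabilities are hypothetical.

```python
import math
from difflib import SequenceMatcher


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for obs (log-space Viterbi)."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[s] = (log-probability, path) of the best path ending in state s
    best = {
        s: (log(start_p[s]) + log(emit_p[s].get(obs[0], 0)), [s])
        for s in states
    }
    for symbol in obs[1:]:
        step = {}
        for s in states:
            score, path = max(
                ((best[p][0] + log(trans_p[p].get(s, 0)), best[p][1])
                 for p in states),
                key=lambda cand: cand[0],
            )
            step[s] = (score + log(emit_p[s].get(symbol, 0)), path + [s])
        best = step
    return max(best.values(), key=lambda cand: cand[0])[1]


def confidence(best_path, training_paths):
    """Highest similarity between best_path and any tagged training sequence.

    SequenceMatcher.ratio() is a stand-in similarity measure (assumption).
    """
    return max(
        SequenceMatcher(None, best_path, path).ratio()
        for path in training_paths
    )


# Hypothetical toy model: two states emitting coarse token classes.
STATES = ["label", "value"]
START_P = {"label": 0.9, "value": 0.1}
TRANS_P = {
    "label": {"label": 0.3, "value": 0.7},
    "value": {"label": 0.4, "value": 0.6},
}
EMIT_P = {
    "label": {"WORD": 0.9, "NUM": 0.1},
    "value": {"WORD": 0.2, "NUM": 0.8},
}
```

A best path that exactly reproduces a tagged training sequence scores 1.0 under this measure, and the score falls as the decoded structure departs from anything seen in training.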
49. A method of extracting information from a text document, comprising:
finding a best path sequence of states in a HMM, wherein said HMM is trained using a plurality of training documents each having a sequence of tagged states and said HMM states are modeled with non-exponential length distributions so as to allow their probability length distributions to be changed dynamically during information extraction; and
extracting information from said text document based on said best path sequence of states, wherein if a first state's best transition was from itself, its self-transition probability is adjusted to (1 − cdf(t+1))/(1 − cdf(t)) and all other outgoing transitions from said first state are scaled by (cdf(t+1) − cdf(t))/(1 − cdf(t)), and if said first state is transitioned to by another state, its self-transition probability is reset to its original value of (1 − cdf(1))/(1 − cdf(0)), where cdf is the cumulative probability distribution function for said first state's length distribution, and t is the number of symbols emitted by said first state in said best path.
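The transition adjustment recited in claim 49 can be worked through in a short sketch. It assumes that the other outgoing transitions are stored as a distribution conditional on leaving the state (summing to 1), so that the adjusted row stays normalized; the claim itself does not state this. The function name, the uniform_cdf length distribution, and the state names are hypothetical.

```python
def adjusted_transitions(state, exit_probs, cdf, t):
    """Outgoing transition row for `state` after it has emitted t symbols.

    exit_probs: probabilities of the other targets, conditional on leaving
                `state`; assumed to sum to 1 (assumption, not from the claim).
    cdf:        cumulative distribution function of the state's length.
    """
    survive = 1.0 - cdf(t)
    if survive <= 0.0:
        raise ValueError("state cannot still be active after t symbols")
    stay = (1.0 - cdf(t + 1)) / survive       # adjusted self-transition
    leave = (cdf(t + 1) - cdf(t)) / survive   # scale for all other transitions
    row = {target: p * leave for target, p in exit_probs.items()}
    row[state] = stay
    return row


def uniform_cdf(n):
    """Hypothetical length distribution: 1 to 4 symbols, equally likely."""
    return min(max(n, 0), 4) / 4.0


# On entry from another state t = 0, which gives the reset value
# (1 - cdf(1)) / (1 - cdf(0)) for the self-transition.
on_entry = adjusted_transitions("name", {"date": 0.5, "end": 0.5}, uniform_cdf, 0)
```

With this length distribution the self-transition probability is 0.75 on entry and drops to 0 once three symbols have been emitted (so that the fourth is the last), which is the kind of length-dependent behaviour a fixed self-transition probability cannot express.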
50. A computer-readable medium having computer executable instructions for performing a method of extracting information from a text document, said method comprising:
finding a best path sequence of states in a HMM, wherein said HMM is trained using a plurality of training documents each having a sequence of tagged states;
extracting information from said text document based on said best path sequence of states; and
calculating a confidence score for said extracted information, wherein said confidence score is based on a measure of similarity between said best path sequence of states and at least one of said sequence of tagged states from at least one of said plurality of training documents.
54. A computer-readable medium having computer executable instructions for performing a method of extracting information from a text document, said method comprising:
finding a best path sequence of states in a HMM, wherein said HMM is trained using a plurality of training documents each having a sequence of tagged states and said HMM states are modeled with non-exponential length distributions so as to allow their probability length distributions to be changed dynamically during information extraction; and
extracting information from said text document based on said best path sequence of states, wherein if a first HMM state's best transition was from itself, its self-transition probability is adjusted to (1 − cdf(t+1))/(1 − cdf(t)) and all other outgoing transitions from said first HMM state are scaled by (cdf(t+1) − cdf(t))/(1 − cdf(t)), and if said first HMM state is transitioned to by another state, its self-transition probability is reset to its original value of (1 − cdf(1))/(1 − cdf(0)), where cdf is the cumulative probability distribution function for said first state's length distribution, and t is the number of symbols emitted by said first state in said best path.
55. A method of extracting information from a text document, comprising:
creating a HMM using a plurality of training documents of a known type, wherein said training documents comprise tagged sequences of states;
generalizing said HMM by merging repeating sequences of states;
finding a best path through said HMM representative of said text document, wherein information is extracted from said text document based on said best path.
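The generalizing step of claim 55 can be sketched on a single tagged training sequence. The sketch assumes that state labels identify states and that only immediately repeated blocks of at least two states are merged; the claim does not fix either detail. The function name merge_repeats and the example labels are hypothetical.

```python
def merge_repeats(seq, min_len=2):
    """Collapse immediately repeated blocks of states in seq.

    Returns (merged, loops): the shortened sequence and the set of
    (last_state, first_state) loop-back transitions that replace the repeats.
    """
    merged = list(seq)
    loops = set()
    changed = True
    while changed:
        changed = False
        n = len(merged)
        # Try the longest candidate blocks first.
        for size in range(n // 2, min_len - 1, -1):
            for i in range(n - 2 * size + 1):
                if merged[i:i + size] == merged[i + size:i + 2 * size]:
                    loops.add((merged[i + size - 1], merged[i]))
                    del merged[i + size:i + 2 * size]  # drop the second copy
                    changed = True
                    break
            if changed:
                break
    return merged, loops


def transitions(merged, loops):
    """Edges of the generalized model: the linear chain plus the loop-backs."""
    return set(zip(merged, merged[1:])) | loops


tagged = ["header", "item", "price", "item", "price", "item", "price", "footer"]
merged, loops = merge_repeats(tagged)
edges = transitions(merged, loops)
```

Here three item/price pairs collapse into a single pair with a transition from price back to item, so the generalized model accepts any number of pairs and not only the count seen in training.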
56. A method of extracting information from a text document, comprising:
creating a HMM using a plurality of training documents of a known type, wherein said training documents comprise tagged sequences of states and said HMM comprises HMM states that are modeled with non-exponential length distributions so as to allow their probability length distributions to be changed dynamically during information extraction;
finding a best path through said HMM representative of said text document, wherein information is extracted from said text document based on said best path, and wherein if a first HMM state's best transition was from itself, its self-transition probability is adjusted to (1 − cdf(t+1))/(1 − cdf(t)) and all other outgoing transitions from said first HMM state are scaled by (cdf(t+1) − cdf(t))/(1 − cdf(t)), and if said first HMM state is transitioned to by another state, its self-transition probability is reset to its original value of (1 − cdf(1))/(1 − cdf(0)), where cdf is the cumulative probability distribution function for said first state's length distribution, and t is the number of symbols emitted by said first state in said best path.
57. A computer-readable medium having computer executable instructions for performing a method of extracting information from a text document, said method comprising:
creating a HMM using a plurality of training documents of a known type, wherein said training documents comprise tagged sequences of states;
generalizing said HMM by merging repeating sequences of states;
finding a best path through said HMM representative of said text document, wherein information is extracted from said text document based on said best path.
58. A computer-readable medium having computer executable instructions for performing a method of extracting information from a text document, said method comprising:
creating a HMM using a plurality of training documents of a known type, wherein said training documents comprise tagged sequences of states and said HMM comprises HMM states that are modeled with non-exponential length distributions so as to allow their probability length distributions to be changed dynamically during information extraction;
finding a best path through said HMM representative of said text document, wherein information is extracted from said text document based on said best path, and wherein if a first HMM state's best transition was from itself, its self-transition probability is adjusted to (1 − cdf(t+1))/(1 − cdf(t)) and all other outgoing transitions from said first HMM state are scaled by (cdf(t+1) − cdf(t))/(1 − cdf(t)), and if said first HMM state is transitioned to by another state, its self-transition probability is reset to its original value of (1 − cdf(1))/(1 − cdf(0)), where cdf is the cumulative probability distribution function for said first state's length distribution, and t is the number of symbols emitted by said first state in said best path.
59. A computer readable storage medium encoded with information comprising a HMM data structure including a plurality of states in which at least one sequence of states in said HMM data structure is created by merging a repeated sequence of states.
60. A computer readable storage medium encoded with information comprising a HMM data structure including a plurality of states in which at least one sequence of more than two states in said HMM data structure includes a transition from a last state in the at least one sequence to the first state in the sequence.
Specification