Data detection
Abstract
A method for detecting data in a sequence of characters or text using both a statistical engine and a pattern engine. The statistical engine is trained to recognize certain types of data and the pattern engine is programmed to recognize the grammatical pattern of certain types of data. The statistical engine may scan the sequence of characters to output first data, and the pattern engine may break down the first data into subsets of data. Alternatively, the statistical engine may output items that have a predetermined probability or greater of being a certain type of data and the pattern engine may then detect the data from the output items and/or remove incorrect information from the output items.
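The second alternative in the abstract (a probability cutoff followed by a pattern check) can be sketched as a two-stage filter. This is only an illustration under stated assumptions: `toy_phone_score` is a hypothetical stand-in for a trained statistical engine, and `PHONE_PATTERN`, the 0.5 threshold, and the sample strings are invented for the example; the source does not name a concrete model or grammar.

```python
import re
from typing import Callable, List

# Hypothetical grammar for US-style phone numbers (assumption, not from the source).
PHONE_PATTERN = re.compile(r"\(?\d{3}\)?[ -]?\d{3}-\d{4}")


def toy_phone_score(candidate: str) -> float:
    """Stand-in for a trained statistical engine: the fraction of digits.

    A real engine would be a trained model; this heuristic is an assumption
    made only so the sketch can run.
    """
    if not candidate:
        return 0.0
    return sum(ch.isdigit() for ch in candidate) / len(candidate)


def detect(candidates: List[str],
           score: Callable[[str], float],
           threshold: float = 0.5) -> List[str]:
    # Stage 1: the statistical engine keeps items whose probability of being
    # the target type is at least the predetermined threshold.
    likely = [c for c in candidates if score(c) >= threshold]
    # Stage 2: the pattern engine removes items that do not fit the grammar.
    return [c for c in likely if PHONE_PATTERN.fullmatch(c)]


items = ["555-867-5309", "call me later", "12/25/2024", "(408) 555-0100"]
detected = detect(items, toy_phone_score)
print(detected)
```

Here the date string passes the statistical stage because it is mostly digits, and the pattern stage then discards it, which is the "remove incorrect information" role the abstract assigns to the pattern engine.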
158 Citations
29 Claims
1. A machine-implemented method of detecting data of a plurality of types in a sequence of characters, the method comprising:
- combining the use of a pattern detection method and a statistical learning method to detect the data, the statistical learning method converting the sequence of characters into a first sequence of tokens, each token comprising a lexeme and a token type relating to the function of the lexeme within the sequence of characters and having at least a predetermined probability that the corresponding data is of at least one of said types, the pattern detection method converting the sequence of characters into a second sequence of tokens, each token corresponding to data that matches a predetermined pattern indicative of the at least one of said types, the pattern detection method further parsing a combination of the first and second sequence of tokens;
- comparing the first and second sequence of tokens, wherein when corresponding tokens from the first and second sequence of tokens for a portion of the sequence of characters are the same, parsing only one of the corresponding tokens, when the tokens are not name tokens and the corresponding tokens are different, parsing both corresponding tokens, and when the tokens are name tokens and the corresponding tokens are different, parsing the corresponding token only from the statistical learning method; and
- outputting the data corresponding to the combination of tokens as the data that matches the predetermined pattern.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10

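The comparing step of claim 1 can be sketched as a merge over two token sequences. This is a minimal illustration, not the patented implementation: the `Token` class, its `start`/`end` offsets (used to decide which tokens "correspond"), the `"NAME"` type label, and the reading that a pair counts as name tokens when either token has the name type are all assumptions. The claim also does not say what happens when only one engine produces a token for a portion; the sketch keeps that token.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    lexeme: str      # the matched text
    token_type: str  # function of the lexeme, e.g. "NAME" or "DATE" (hypothetical labels)
    start: int       # character offsets, assumed here to align the two sequences
    end: int


def merge_tokens(statistical: List[Token], pattern: List[Token]) -> List[Token]:
    """Combine the first (statistical) and second (pattern) token sequences."""
    stat_by_span = {(t.start, t.end): t for t in statistical}
    pat_by_span = {(t.start, t.end): t for t in pattern}
    merged: List[Token] = []
    for span in sorted(set(stat_by_span) | set(pat_by_span)):
        s = stat_by_span.get(span)
        p = pat_by_span.get(span)
        if s is None or p is None:
            # Not covered by the claim; assumption: keep the lone token.
            merged.append(s if s is not None else p)
        elif s == p:
            # Same token from both engines: parse only one of them.
            merged.append(s)
        elif "NAME" in (s.token_type, p.token_type):
            # Name tokens that differ: keep only the statistical token.
            merged.append(s)
        else:
            # Non-name tokens that differ: parse both.
            merged.extend([s, p])
    return merged


# "Call April on 4/5": the engines disagree on "April" and agree on "4/5".
stat = [Token("April", "NAME", 5, 10), Token("4/5", "DATE", 14, 17)]
pat = [Token("April", "MONTH", 5, 10), Token("4/5", "DATE", 14, 17)]
merged = merge_tokens(stat, pat)
print(merged)
```

In this example the name reading of "April" wins over the pattern engine's month reading, and the agreed date token appears once.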
11. A machine-implemented method of processing data comprising:
- receiving text;
- processing the text using both a pattern engine and a statistical engine to detect data of a plurality of predetermined types, the statistical engine converting the text into a first sequence of tokens, each token comprising a lexeme and a token type relating to the function of the lexeme within the text and having at least a predetermined probability that the corresponding data is of at least one of the predetermined types, the pattern engine converting the text into a second sequence of tokens, each token corresponding to data that matches a predetermined pattern indicative of the at least one of the predetermined types, the pattern engine further parsing a combination of the first and second sequence of tokens; and
- comparing the first and second sequence of tokens, wherein when corresponding tokens from the first and second sequence of tokens for a portion of the text are the same, parsing only one of the corresponding tokens, when the tokens are not name tokens and the corresponding tokens are different, parsing both corresponding tokens, and when the tokens are name tokens and the corresponding tokens are different, parsing the corresponding token only from the statistical engine; and
- outputting the data corresponding to the combination of tokens as the data that matches the predetermined pattern.
Dependent claims: 12, 13, 14

15. An article of manufacture comprising:
- a non-transitory computer readable storage medium including instructions that, when accessed by a machine, causes the machine to detect data of a plurality of types in an input sequence of characters by performing operations comprising;
  - combining the use of a pattern detection method and a statistical learning method to detect the data, the statistical learning method converting the sequence of characters into a first sequence of tokens, each token comprising a lexeme and a token type relating to the function of the lexeme within the sequence of characters and having at least a predetermined probability that the corresponding data is of at least one of the types, the pattern detection method converting the sequence of characters into a second sequence of tokens, each token corresponding to data that matches a predetermined pattern indicative of the at least one of the types, the pattern detection method further parsing a combination of the first and second sequence of tokens; and
  - comparing the first and second sequence of tokens, wherein when corresponding tokens from the first and second sequence of tokens for a portion of the sequence of characters are the same, parsing only one of the corresponding tokens, when the tokens are not name tokens and the corresponding tokens are different, parsing both corresponding tokens, and when the tokens are name tokens and the corresponding tokens are different, parsing the corresponding token only from the statistical learning method; and
  - outputting the data corresponding to the combination of tokens as the data that matches the predetermined pattern.
Dependent claims: 16

17. An article of manufacture comprising:
- a non-transitory computer readable storage including data that, when accessed by a machine, causes the machine to perform operations comprising;
  - receiving text;
  - processing the text using both a pattern engine and a statistical engine to detect data of a plurality of predetermined types, the statistical engine converting the text into a first sequence of tokens, each token comprising a lexeme and a token type relating to the function of the lexeme within the text and having at least a predetermined probability that the corresponding data is of at least one of the predetermined types, the pattern engine converting the text into a second sequence of tokens, each token corresponding to data that matches a predetermined pattern indicative of the at least one of the predetermined types, the pattern engine further parsing a combination of the first and second sequence of tokens; and
  - comparing the first and second sequence of tokens, wherein when corresponding tokens from the first and second sequence of tokens for a portion of the text are the same, parsing only one of the corresponding tokens, when the tokens are not name tokens and the corresponding tokens are different, parsing both corresponding tokens, and when the tokens are name tokens and the corresponding tokens are different, parsing the corresponding token only from the statistical engine; and
  - outputting the data corresponding to the combination of tokens as the data that matches the predetermined pattern.
Dependent claims: 18, 19, 20

21. A data processing system, the system comprising:
- an input for receiving text, the input coupled to a processor through a bus;
- a pattern engine executing on the processor;
- a statistical engine executing on the processor, wherein the pattern engine and the statistical engine together detect data of a plurality of predetermined types in the text, the statistical engine converting the text into a first sequence of tokens, each token comprising a lexeme and a token type relating to the function of the lexeme within the text and having a predetermined probability that the corresponding data is of at least one of the predetermined types, the pattern engine converting the text into a second sequence of tokens, each token corresponding to data that matches a predetermined pattern indicative of the at least one of the predetermined types, the pattern engine further parsing a combination of the first and second sequence of tokens; and
- a comparison engine executing on the processor to compare the first and second sequence of tokens, wherein when corresponding tokens from the first and second sequence of tokens for a portion of the text are the same, parsing only one of the corresponding tokens, when the tokens are not name tokens and the corresponding tokens are different, parsing both corresponding tokens, and when the tokens are name tokens and the corresponding tokens are different, parsing the corresponding token only from the statistical engine; and
- an output for outputting the data corresponding to the combination of tokens as the data that matches the predetermined pattern, the output coupled to a processor through the bus.
Dependent claims: 22, 23, 24

25. A data detecting system for detecting data of a plurality of types in text, the system comprising:
- a pattern detection means;
- a statistical engine means, wherein the pattern detection means and the statistical learning means are operative together to detect the data, the statistical engine means converting the text into a first sequence of tokens, each token comprising a lexeme and a token type relating to the function of the lexeme within the text and having at least a predetermined probability that the corresponding data is of at least one of said types, the pattern detection means converting the text into a second sequence of tokens, each token corresponding to data that matches a predetermined pattern indicative of the at least one of said types, the pattern detection means further parsing a combination of the first and second sequence of tokens; and
- a comparison means to compare the first and second sequence of tokens, wherein when corresponding tokens from the first and second sequence of tokens for a portion of the text are the same, parsing only one of the corresponding tokens, when the tokens are not name tokens and the corresponding tokens are different, parsing both corresponding tokens, and when the tokens are name tokens and the corresponding tokens are different, parsing the corresponding token only from the statistical engine means; and
- means for outputting the data corresponding to the combination of tokens as the data that matches the predetermined pattern.
Dependent claims: 26, 27, 28, 29

Specification