Active learning framework for automatic field extraction from network traffic

US 7,650,317 B2
Filed: 12/06/2006
Issued: 01/19/2010
Est. Priority Date: 12/06/2006
Status: Active Grant

First Claim

1. A method for extracting at least one data field from a stream of data received from a network, comprising:

inputting at least one positive example of an instance of the at least one data field; and

based on the at least one positive example, analyzing the stream of data received from the network to determine a first result set including at least one candidate for the at least one data field in the stream, wherein the analyzing is based on no knowledge or only partial knowledge of any protocol represented by the stream received from the network.

2 Assignments

0 Petitions

Abstract

An active learning framework is provided to extract information from particular fields from a variety of protocols. Extraction is performed in an unknown protocol, in which the user presents the system with a small number of labeled instances. The system then automatically generates an abundance of features and negative examples. A boosting approach is then used for feature selection and classifier combination. The system then displays its results for the user to correct and/or add new examples. The process can be iterated until the user is satisfied with the performance of the extraction capabilities provided by the classifiers generated by the system.
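The boosting step mentioned in the abstract can be illustrated with a minimal AdaBoost sketch over binary features. This is an assumption-laden illustration, not the patented implementation: the function names `adaboost` and `predict`, the single-feature weak rules, and the 0/1 feature encoding are all hypothetical choices made here. It shows how boosting selects features (by picking the lowest weighted-error rule each round) and combines classifiers (by weighting each rule with its `alpha`).

```python
import math

def adaboost(X, y, rounds=5):
    """X: list of 0/1 feature vectors; y: labels in {+1, -1}.
    Returns a list of (alpha, feature_index, polarity) weak rules."""
    n, d = len(X), len(X[0])
    w = [1.0 / n] * n  # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        best = None
        for j in range(d):
            for polarity in (1, -1):
                # Weak rule: predict +1 when feature j fires (or the reverse).
                preds = [polarity * (1 if x[j] else -1) for x in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, polarity, preds)
        err, j, polarity, preds = best
        if err >= 0.5:
            break  # no rule beats chance on the current weights
        err = max(err, 1e-10)  # avoid division by zero for a perfect rule
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, polarity))
        if err <= 1e-10:
            break  # a perfect rule leaves nothing to reweight
        # Increase the weight of misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of the selected weak rules."""
    score = sum(a * pol * (1 if x[j] else -1) for a, j, pol in ensemble)
    return 1 if score >= 0 else -1
```

In this sketch each feature vector would describe one candidate field location, with label +1 for a user-supplied positive example and -1 for a negative example.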

20 Claims

1. A method for extracting at least one data field from a stream of data received from a network, comprising:
- inputting at least one positive example of an instance of the at least one data field; and
- based on the at least one positive example, analyzing the stream of data received from the network to determine a first result set including at least one candidate for the at least one data field in the stream, wherein the analyzing is based on no knowledge or only partial knowledge of any protocol represented by the stream received from the network.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10):
  - 2. The method of claim 1, further including:
    - at least one of designating at least one candidate of the first result set as incorrect or providing at least one positive example of an instance of the at least one data field, or both; and
    - re-analyzing the stream of data based on the additional knowledge that the designated at least one incorrect candidate is incorrect, or the additional knowledge of the at least one positive example, or both, and generating a second result set that improves the performance over said first result set.
  - 3. The method of claim 2, further including:
    - utilizing each of said at least one designated incorrect candidates as a negative example of an instance of the at least one data field when performing said re-analyzing.
  - 4. The method of claim 2, further including:
    - iteratively performing said designating, providing, or both, and re-analyzing steps until a user is satisfied with a subsequent result set.
  - 5. The method of claim 1, further including:
    - automatically generating a plurality of negative examples of instances of the at least one data field for use by said analyzing step when determining the first result set.
  - 6. The method of claim 5, wherein said automatically generating includes generating a non-exhaustive plurality of negative examples according to a pre-determined algorithm to reduce the number of negative examples utilized during said analyzing step.
  - 7. The method of claim 1, further including:
    - automatically generating a plurality of features based on the at least one positive example from which a plurality of classifiers are formed.
  - 8. The method of claim 7, further including:
    - combining and weighting the plurality of classifiers to generate optimal results in said first result set.
  - 9. The method of claim 8, further including:
    - ascribing a weight associated with said at least one designated incorrect candidate higher than a weight associated with any randomly generated negative examples to emphasize the certainty of the at least one designated incorrect candidate as designated by the user.
  - 10. A computer readable medium comprising computer executable instructions for performing the method of claim 1.
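Claims 5, 6, and 9 describe generating a non-exhaustive set of negative examples and weighting user-designated incorrect candidates above randomly generated ones. The sketch below is one way that could look, under stated assumptions: examples are `(offset, length)` windows in a packet, the "pre-determined algorithm" is taken to be seeded random sampling, and the names `sample_negatives`, `build_training_set`, and the default `user_weight=5.0` are hypothetical, not taken from the patent.

```python
import random

def sample_negatives(packet_len, positive, k, seed=0):
    """Sample up to k distinct (offset, length) windows that differ from
    the positive example, instead of enumerating every possible window."""
    rng = random.Random(seed)  # fixed seed keeps the sampling repeatable
    negatives = set()
    attempts = 0
    while len(negatives) < k and attempts < 100 * k:
        attempts += 1
        offset = rng.randrange(packet_len)
        length = rng.randint(1, packet_len - offset)  # window stays in bounds
        if (offset, length) != positive:
            negatives.add((offset, length))
    return sorted(negatives)

def build_training_set(positives, random_negs, user_negs, user_weight=5.0):
    """Return (example, label, weight) rows. Candidates the user marked as
    incorrect get a higher weight than randomly generated negatives."""
    rows = [(ex, +1, 1.0) for ex in positives]
    rows += [(ex, -1, 1.0) for ex in random_negs]
    rows += [(ex, -1, user_weight) for ex in user_negs]
    return rows
```

The weights produced here could serve as the initial example weights of a boosting-style learner, so that a mistake on a user-corrected candidate costs more than a mistake on a random negative.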

11. A method for automatically generating features for classifying data in a data stream, comprising:
- specifying an item of interest in example data of the data stream; and
- automatically generating a plurality of features from the specified item, wherein the item and the plurality of features are used to form classifiers for classifying data in the data stream.
- Dependent claims (12, 13, 14, 15):
  - 12. The method of claim 11, wherein said generating includes intelligently generating said plurality of features based on the item of interest by eliminating features that are less likely to form useful classifiers.
  - 13. The method of claim 11, wherein said specifying includes specifying an offset and a field length for the item in the example data.
  - 14. The method of claim 13, wherein said generating includes automatically generating negative examples from the positive example.
  - 15. A computing device comprising means for performing the method of claim 11.
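Claims 11 and 13 describe specifying an item by offset and field length and generating features from it automatically. As a rough illustration only, the sketch below derives a handful of features from the field and its neighboring bytes. The specific features (surrounding bytes, printability, digits) and the name `generate_features` are assumptions chosen for this example; the claims do not enumerate the features.

```python
def generate_features(data: bytes, offset: int, length: int, context: int = 2):
    """Derive simple features for the field at data[offset:offset+length].
    `context` is the number of neighboring bytes captured on each side."""
    field = data[offset:offset + length]
    before = data[max(0, offset - context):offset]
    after = data[offset + length:offset + length + context]
    return {
        "offset": offset,
        "length": length,
        "prefix": before.hex(),   # bytes immediately preceding the field
        "suffix": after.hex(),    # bytes immediately following the field
        "all_printable": all(32 <= b < 127 for b in field),
        "all_digits": field.isdigit(),
    }
```

For example, marking `alice` in `b"\x00\x01user=alice\r\n"` (offset 7, length 5) yields a prefix of `723d` (`r=`) and a suffix of `0d0a` (`\r\n`), which a classifier could use as delimiter cues.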

16. A computing device for automatically extracting data of interest from a binary stream of data received or stored by the computing device without reference to full knowledge of the structure of any protocol of the binary stream of data, comprising:
- an analysis engine for analyzing the binary stream based on at least one classifier formed from at least one positive example of the data of interest provided to the analysis engine to determine a result set including at least one candidate from the binary stream of data as a potential match for the data of interest; and
- a user interface for outputting the result set and for receiving either at least one additional positive example or a designation of at least one candidate of the result set as an incorrect match for the data of interest, or both, wherein the analysis engine re-analyzes the binary stream based on the at least one additional positive example, or the at least one designated candidate, or both, to revise the at least one classifier and improve the accuracy of the result set.
- Dependent claims (17, 18, 19, 20):
  - 17. The computing device of claim 16, wherein the analysis engine iteratively re-determines the result set based on iterative receiving by the user interface of designations of at least one candidate of the result set as an incorrect match for the data of interest.
  - 18. The computing device of claim 16, wherein the analysis engine automatically analyzes the binary stream based on at least one classifier formed from a plurality of negative examples automatically generated from the at least one positive example.
  - 19. The computing device of claim 18, wherein the plurality of negative examples are generated according to a pre-determined algorithm that generates a non-exhaustive number of negative examples.
  - 20. The computing device of claim 16, wherein the positive example is specified by an offset and a field length.
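The interaction between the analysis engine and the user interface in claims 16 and 17 amounts to an iterative feedback loop. The following is a minimal sketch of such a loop, assuming the caller supplies three hypothetical callables: `train` (builds a classifier from positive and negative examples), `extract` (runs the classifier over the stream and returns candidates), and `get_feedback` (returns the user's new positives and rejected candidates). None of these names come from the patent.

```python
def active_learning_loop(train, extract, get_feedback, positives, max_rounds=10):
    """Retrain and re-extract until the user has no further corrections.
    Assumes max_rounds >= 1."""
    positives = list(positives)
    negatives = []
    for _ in range(max_rounds):
        classifier = train(positives, negatives)
        results = extract(classifier)
        new_pos, new_neg = get_feedback(results)
        if not new_pos and not new_neg:
            break  # user is satisfied with this result set
        # Fold the corrections into the training data for the next round.
        positives += new_pos
        negatives += new_neg
    return classifier, results
```

If the round limit is reached, the sketch returns the most recent classifier and result set, which do not yet reflect the final round of feedback.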

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Gopalratnam, Karthik; Dunagan, John David; Wang, Jiahe Helen; Basu, Sumit
Primary Examiner(s): Holmes, Michael B
Application Number: US11/567,328
Publication Number: US 20080140589A1
Time in Patent Office: 1,140 Days
Field of Search: 706/12
US Class Current: 706/12
CPC Class Codes:
- G06F 18/214   Generating training pattern...
- G06N 20/00   Machine learning
- H04L 69/18   Multiprotocol handlers, e.g...
- H04L 69/22   Parsing or analysis of headers
