System and method for extracting information from unstructured text

US 10,002,129 B1
Filed: 03/30/2017
Issued: 06/19/2018
Est. Priority Date: 02/15/2017
Status: Active Grant

Abstract

This disclosure relates generally to natural language processing, and more particularly to a system and method for extracting subject-verb-object (SVO) chunked text from an unstructured text. In one embodiment, a method is provided for extracting SVO chunked text from an unstructured text. The method comprises identifying a plurality of part of speech (PoS) tokens in the unstructured text, and determining a plurality of SVO chunked text directly from the plurality of PoS tokens using a machine learning chunker model. The machine learning chunker model is trained on a subject-verb-object (SVO) annotated training data.

15 Claims

1. A method for extracting subject-verb-object (SVO) chunked text from unstructured text, the method comprising:
- identifying, by a SVO chunked text computing device, a plurality of part of speech (PoS) tokens in an unstructured text; and
- determining, by the SVO chunked text computing device, a SVO chunked text directly from the plurality of PoS tokens using a machine learning chunker model, wherein the machine learning chunker model is trained on an SVO annotated training data, wherein the SVO annotated training data comprises a plurality of tokens, a plurality of corresponding PoS tags, and a plurality of corresponding SVO tags, the plurality of corresponding SVO tags comprises one or more of a subject tag, a verb tag, an object tag, or an object-subject tag, and the plurality of corresponding SVO tags is in beginning-inside-other (BIO) format, and wherein the SVO annotated training data is generated based on a plurality of corresponding span information for the plurality of tokens by, for each of a plurality of PoS tokens in each of a plurality of sets of syntactically related PoS tokens in a sentence, detecting a span information for a PoS token and tagging the PoS token as a subject, a verb, an object, or an object-subject based on the span information and a previous tagging of the PoS token.
- Dependent claims (2, 3, 4, 5):
  - 2. The method of claim 1, wherein identifying the plurality of PoS tokens comprises:
    - extracting a plurality of tokens from the input text, wherein each of the plurality of tokens comprises a word or a phrase; and
    - determining a PoS tag for each of the plurality of tokens.
  - 3. The method of claim 1, wherein each of the plurality of SVO chunked text is a set of semantically related PoS tokens and comprises a verb phrase and at least two of a subject phrase, an object phrase, or an object-subject phrase and the object-subject phrase corresponds to an overlapping contiguous chunks that is an object phrase in an initial part of a sentence and a subject phrase in the subsequent part of the sentence.
  - 4. The method of claim 1, wherein the machine learning chunker model is trained on one or more of:
    - a non-overlapping SVO annotated training data comprising one set of subject, verb, and object in each of the sentences; or
    - an overlapping SVO annotated training data comprising one or more sets of subject, verb, object, and object-subject in each of the sentences.
  - 5. The method of claim 1, wherein the machine learning chunker model determines the plurality of SVO chunked text directly from the plurality of PoS tokens without a set of heuristics or a set of rules.
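
The training-data step recited in claim 1 (tagging each PoS token from its span information and its previous tagging) can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration and does not come from the patent: the function `bio_tags_from_spans`, the span layout `(start, end_exclusive, role)`, the label names `SUBJ`, `VERB`, `OBJ`, and `OBJ-SUBJ`, the outside tag `O`, and the rule that a token already tagged as an object becomes an object-subject when a later set marks it as a subject.

```python
from typing import List, Optional, Tuple

# Hypothetical span layout: (start, end_exclusive, role) over token indices,
# with role one of "SUBJ", "VERB", "OBJ". Not taken from the patent.
Span = Tuple[int, int, str]


def bio_tags_from_spans(n_tokens: int, svo_sets: List[List[Span]]) -> List[str]:
    """Assign one BIO-style SVO tag per token from span information."""
    roles: List[Optional[str]] = [None] * n_tokens  # role assigned so far
    begins: List[bool] = [False] * n_tokens         # True if token opens its chunk
    for spans in svo_sets:                          # one set per S-V-O relation
        for start, end, role in spans:
            for i in range(start, end):
                previous = roles[i]
                if previous is None:
                    # First time this token is covered by a span.
                    roles[i] = role
                    begins[i] = (i == start)
                elif previous == "OBJ" and role == "SUBJ":
                    # Assumed rule: object of an earlier set that is the
                    # subject of a later set becomes an object-subject.
                    roles[i] = "OBJ-SUBJ"
                # Any other overlap keeps the earlier tag (assumption).
    return [
        "O" if role is None else ("B-" if begin else "I-") + role
        for role, begin in zip(roles, begins)
    ]


# Illustrative sentence; the PoS tags are Penn Treebank style placeholders.
tokens = ["John", "asked", "Mary", "to", "buy", "milk"]
pos_tags = ["NNP", "VBD", "NNP", "TO", "VB", "NN"]
svo_sets = [
    [(0, 1, "SUBJ"), (1, 2, "VERB"), (2, 3, "OBJ")],   # John / asked / Mary
    [(2, 3, "SUBJ"), (3, 5, "VERB"), (5, 6, "OBJ")],   # Mary / to buy / milk
]
svo_tags = bio_tags_from_spans(len(tokens), svo_sets)
for row in zip(tokens, pos_tags, svo_tags):
    print(*row)
# svo_tags is ["B-SUBJ", "B-VERB", "B-OBJ-SUBJ", "B-VERB", "I-VERB", "B-OBJ"]
```

Each printed row pairs a token with its PoS tag and its SVO tag, which matches the three columns the claim lists for the annotated training data. "Mary" ends up as an object-subject because it is the object of the first set and the subject of the second.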

6. A subject-verb-object (SVO) chunked computing device, comprising:
- at least one processor; and
- memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising:
  - identify a plurality of part of speech (PoS) tokens in an unstructured text; and
  - determine a SVO chunked text directly from the plurality of PoS tokens using a machine learning chunker model, wherein the machine learning chunker model is trained on an SVO annotated training data, wherein the SVO annotated training data comprises a plurality of tokens, a plurality of corresponding PoS tags, and a plurality of corresponding SVO tags, the plurality of corresponding SVO tags comprises one or more of a subject tag, a verb tag, an object tag, or an object-subject tag, and the plurality of corresponding SVO tags is in beginning-inside-other (BIO) format, and wherein the SVO annotated training data is generated based on a plurality of corresponding span information for the plurality of tokens by, for each of a plurality of PoS tokens in each of a plurality of sets of syntactically related PoS tokens in a sentence, detecting a span information for a PoS token and tagging the PoS token as a subject, a verb, an object, or an object-subject based on the span information and a previous tagging of the PoS token.
- Dependent claims (7, 8, 9, 10):
  - 7. The SVO chunked computing device of claim 6, wherein the instructions, when executed by the at least one processor, further cause the at least one processor to:
    - extract a plurality of tokens from the input text, wherein each of the plurality of tokens comprises a word or a phrase; and
    - determine a PoS tag for each of the plurality of tokens.
  - 8. The SVO chunked computing device of claim 6, wherein each of the plurality of SVO chunked text is a set of semantically related PoS tokens and comprises a verb phrase and at least two of a subject phrase, an object phrase, or an object-subject phrase and the object-subject phrase corresponds to an overlapping contiguous chunks that is an object phrase in an initial part of a sentence and a subject phrase in the subsequent part of the sentence.
  - 9. The SVO chunked computing device of claim 6, wherein the machine learning chunker model is trained on one or more of:
    - a non-overlapping SVO annotated training data comprising one set of subject, verb, and object in each of the sentences; or
    - an overlapping SVO annotated training data comprising one or more sets of subject, verb, object, and object-subject in each of the sentences.
  - 10. The SVO chunked computing device of claim 6, wherein the machine learning chunker model determines the plurality of SVO chunked text directly from the plurality of PoS tokens without a set of heuristics or a set of rules.

11. A non-transitory computer-readable medium having stored thereon instructions for extracting subject-verb-object (SVO) chunked text from unstructured text comprising executable code which, when executed by one or more processors, causes the one or more processors to:
- identify a plurality of part of speech (PoS) tokens in the unstructured text; and
- determine a plurality of SVO chunked text directly from the plurality of PoS tokens using a machine learning chunker model, wherein the machine learning chunker model is trained on a subject-verb-object (SVO) annotated training data, wherein the SVO annotated training data comprises a plurality of tokens, a plurality of corresponding PoS tags, and a plurality of corresponding SVO tags, the plurality of corresponding SVO tags comprises one or more of a subject tag, a verb tag, an object tag, or an object-subject tag, and the plurality of corresponding SVO tags is in beginning-inside-other (BIO) format, and wherein the SVO annotated training data is generated based on a plurality of corresponding span information for the plurality of tokens by, for each of a plurality of PoS tokens in each of a plurality of sets of syntactically related PoS tokens in a sentence, detecting a span information for a PoS token and tagging the PoS token as a subject, a verb, an object, or an object-subject based on the span information and a previous tagging of the PoS token.
- Dependent claims (12, 13, 14, 15):
  - 12. The non-transitory computer-readable medium of claim 11, wherein the executable code, when executed by the one or more processors, further causes the one or more processors to:
    - extract a plurality of tokens from the input text, wherein each of the plurality of tokens comprises a word or a phrase; and
    - determine a PoS tag for each of the plurality of tokens.
  - 13. The non-transitory computer-readable medium of claim 11, wherein each of the plurality of SVO chunked text is a set of semantically related PoS tokens and comprises a verb phrase and at least two of a subject phrase, an object phrase, or an object-subject phrase and the object-subject phrase corresponds to an overlapping contiguous chunks that is an object phrase in an initial part of a sentence and a subject phrase in the subsequent part of the sentence.
  - 14. The non-transitory computer-readable medium of claim 11, wherein the machine learning chunker model is trained on one or more of:
    - a non-overlapping SVO annotated training data comprising one set of subject, verb, and object in each of the sentences; or
    - an overlapping SVO annotated training data comprising one or more sets of subject, verb, object, and object-subject in each of the sentences.
  - 15. The non-transitory computer-readable medium of claim 11, wherein the machine learning chunker model determines the plurality of SVO chunked text directly from the plurality of PoS tokens without a set of heuristics or a set of rules.
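
As a reading aid for the claims above, the following Python sketch shows one way a per-token tag sequence in BIO format could be turned back into SVO chunked text. It is not taken from the patent. The function names `decode_bio` and `to_triples`, the label names (`SUBJ`, `VERB`, `OBJ`, `OBJ-SUBJ`, with `O` for tokens outside any chunk), and the grouping of chunks into triples are all assumptions of this sketch. The tag sequence is written by hand and stands in for the output of a trained chunker model.

```python
from typing import List, Optional, Tuple

Chunk = Tuple[str, str]  # (role, chunk text)


def decode_bio(tokens: List[str], tags: List[str]) -> List[Chunk]:
    """Merge B-/I- tagged tokens into contiguous chunks; skip 'O' tokens."""
    chunks: List[Chunk] = []
    role: Optional[str] = None
    words: List[str] = []
    for token, tag in zip(tokens, tags):
        prefix, _, tag_role = tag.partition("-")  # "B-OBJ-SUBJ" -> "B", "OBJ-SUBJ"
        if prefix == "I" and tag_role == role:
            words.append(token)                   # continue the open chunk
            continue
        if role is not None:
            chunks.append((role, " ".join(words)))  # close the open chunk
        if prefix in ("B", "I"):                  # a stray I- tag opens a chunk
            role, words = tag_role, [token]
        else:
            role, words = None, []
    if role is not None:
        chunks.append((role, " ".join(words)))
    return chunks


def to_triples(chunks: List[Chunk]) -> List[Tuple[str, str, str]]:
    """Illustrative grouping: an OBJ-SUBJ chunk closes one triple as its
    object and opens the next one as its subject."""
    triples = []
    subj: Optional[str] = None
    verb: Optional[str] = None
    for role, text in chunks:
        if role == "SUBJ":
            subj, verb = text, None
        elif role == "VERB":
            verb = text
        elif role in ("OBJ", "OBJ-SUBJ"):
            if subj is not None and verb is not None:
                triples.append((subj, verb, text))
            subj, verb = (text, None) if role == "OBJ-SUBJ" else (None, None)
    return triples


# Hand-written tags standing in for chunker model output.
tokens = ["John", "asked", "Mary", "to", "buy", "milk"]
tags = ["B-SUBJ", "B-VERB", "B-OBJ-SUBJ", "B-VERB", "I-VERB", "B-OBJ"]
chunks = decode_bio(tokens, tags)
print(to_triples(chunks))
# prints [('John', 'asked', 'Mary'), ('Mary', 'to buy', 'milk')]
```

The triple grouping is post-processing added for readability only. The dependent claims state that the model determines the chunked text directly from the PoS tokens without heuristics or rules, so `to_triples` should not be read as part of the claimed method.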

Current Assignee
Wipro Limited
Original Assignee
Wipro Limited
Inventors
D'Souza, Shaun Cyprian
Primary Examiner(s)
Abebe, Daniel

Application Number

US15/474,194
Time in Patent Office

446 Days
Field of Search

704/9
CPC Class Codes

G06F 40/211 Syntactic parsing, e.g. bas...

G06F 40/289 Phrasal analysis, e.g. fini...
