Systems and methods for extracting attributes from text content

US 9,934,218 B2
Filed: 04/18/2012
Issued: 04/03/2018
Est. Priority Date: 12/05/2011
Status: Active Grant

Abstract

Systems and methods for extracting attributes from text content are described. Example embodiments may include a computer-implemented method for extracting attributes from text data, wherein the text data is obtained from at least one information source. As described, the implementation may include: receiving, from a user, an address for the at least one information source and an attribute name; creating a tagged information file by associating a part of speech tag with text data obtained from the at least one information source; identifying a location of the attribute name in the tagged information file using an approximate text matching technique; and determining at least one attribute descriptor from the tagged information file, wherein the tagged information file is parsed based on a part of speech tag associated with the attribute name to determine a conclusion of the attribute descriptor.
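The tagging step in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the lexicon `TOY_LEXICON`, the functions `tag_text` and `to_tagged_file`, the `UNK` tag, and the `word/TAG` serialization are all assumptions made for illustration, and a real system would presumably use a trained part of speech tagger rather than a lookup table.

```python
import re

# Hypothetical toy lexicon (Penn Treebank style tags); an assumption for
# illustration only, standing in for a real part of speech tagger.
TOY_LEXICON = {
    "the": "DT", "a": "DT", "has": "VBZ", "is": "VBZ", "and": "CC",
    "large": "JJ", "bright": "JJ",
    "phone": "NN", "screen": "NN", "battery": "NN",
}


def tag_text(text):
    """Return a list of (word, tag) pairs; unknown words get the tag 'UNK'."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    return [(w, TOY_LEXICON.get(w.lower(), "UNK")) for w in words]


def to_tagged_file(tagged):
    """Serialize the pairs as word/TAG tokens separated by spaces."""
    return " ".join(f"{w}/{t}" for w, t in tagged)


tagged = tag_text("The phone has a large bright screen.")
print(to_tagged_file(tagged))
# The/DT phone/NN has/VBZ a/DT large/JJ bright/JJ screen/NN
```

Keeping the tags next to the words is what lets the later steps reason about the grammatical neighborhood of the attribute name instead of a fixed number of characters or words.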

19 Claims

1. A method implemented by one or more computers for extracting one or more descriptors from text data associated with a specified term in the text data, the method comprising:
- receiving, by at least one of the one or more computers, the text data;
  
  receiving, by at least one of the one or more computers, the specified term to be located in the text data, the specified term being at least one word;
  
  creating, by at least one of the one or more computers, a tagged information file by associating part of speech tags to words in the text data, including any descriptors present in the text data, wherein a descriptor comprises one or more words of the text data that succeed or precede the specified term;
  
  identifying, by at least one of the one or more computers, a location of the specified term in the tagged information file using an approximate text matching technique, wherein the approximate text matching technique:
  
  detects the specified term grouped together with the descriptors of the specified term in the text data using the tagged information file, the specified term grouped together with the descriptors of the specified term forming a variable region or variable window that is context sensitive and not of a fixed size; and
  
  identifies, through a finite state machine, a grammatical context shift in the context sensitive region pertaining to the specified term in the text data by analyzing the part of speech tags of the tagged information file, wherein the grammatical context shift is indicated by an autonomous transition of the finite state machine from a first state associated with a first part of speech tag of the tagged information file to a second state associated with a second part of speech tag of the tagged information file for parts of speech associated with words before and after the specified term;
  
  determining based on the determined grammatical context shift, by at least one of the one or more computers, the one or more descriptors of the specified term;
  
  extracting, by at least one of the one or more computers, the one or more descriptors of the specified term from the text data; and
  
  providing, by at least one of the one or more computers, a report comprising the extracted one or more descriptors of the specified term.
  - 2. The method of claim 1, wherein the text data is structured.
  - 3. The method of claim 1, wherein the text data is unstructured.
  - 4. The method of claim 1, wherein the text data is received from a specified information source, the information source comprising a website on the Internet, a text data repository, or text on the Intranet of an organization.
  - 5. The method of claim 1, wherein determining the one or more descriptors comprises:
    - storing at least one first string before an attribute name in the tagged information file until a word having a part of speech tag belonging to a reference group is determined.
  - 6. The method of claim 1, wherein the first and second part of speech tags are associated with adjacent words.
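
The context-shift idea in claims 1, 5 and 6 can be sketched as a two-state machine that walks outward from the specified term. This is a hedged illustration, not the claimed implementation: the state names `COLLECT` and `SHIFT`, the tag set `DESCRIPTOR_TAGS`, and the functions `scan` and `extract_descriptors` are assumptions. The machine stays in `COLLECT` while adjacent words carry descriptor-like tags and moves to `SHIFT` at the first word whose tag falls outside that set, so the extracted window varies in size with the surrounding grammar.

```python
# Assumed set of tags treated as descriptor-like (adjectives, adverbs, numbers).
DESCRIPTOR_TAGS = {"JJ", "JJR", "JJS", "RB", "CD"}


def scan(tagged, start, step):
    """Walk from index `start` in direction `step` (+1 or -1).

    Words are collected while the machine is in COLLECT; the first tag
    outside DESCRIPTOR_TAGS triggers the transition to SHIFT and ends the walk.
    """
    state = "COLLECT"
    collected = []
    i = start
    while 0 <= i < len(tagged) and state == "COLLECT":
        word, tag = tagged[i]
        if tag in DESCRIPTOR_TAGS:
            collected.append(word)
            i += step
        else:
            state = "SHIFT"  # grammatical context shift detected
    return collected


def extract_descriptors(tagged, term_index):
    """Return (descriptors_before, descriptors_after) for the term at term_index."""
    before = scan(tagged, term_index - 1, -1)[::-1]  # restore reading order
    after = scan(tagged, term_index + 1, +1)
    return before, after


sentence = [("The", "DT"), ("phone", "NN"), ("has", "VBZ"), ("a", "DT"),
            ("large", "JJ"), ("bright", "JJ"), ("screen", "NN"), ("today", "NN")]
print(extract_descriptors(sentence, 6))
# (['large', 'bright'], [])
```

In this example the leftward walk stops at the determiner "a", which plays the role of a word whose tag belongs to the reference group in claim 5.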

7. A system for extracting one or more descriptors from text data for a specified term in the text data, wherein the text data is obtained from at least one information source, the system comprising:
- a user interface configured to receive, from a user:
  
  an address for the at least one information source, the address being a uniform resource locator (URL) address or a location of a text file within a storage device, the term being at least one word to be located in the text data; and
  
  the specified term; and
  
  at least one hardware processor operatively coupled to a memory and a non-transitory storage storing instructions which when executed by at least one of the processors cause the at least one hardware processor to:
  
  generate a tagged information file by associating part of speech tags to the text data obtained from the at least one information source, including any descriptors present in the text data, wherein a descriptor comprises one or more words of the text data that succeed or precede the specified term;
  
  identify a location of the specified term in the tagged information file using an approximate text matching technique, wherein the approximate text matching technique:
  
  detects the specified term grouped together with the descriptors of the specified term in the text data using the tagged information file, the specified term grouped together with the descriptors of the specified term forming a variable region or variable window that is context sensitive and not of a fixed size; and
  
  identifies, through a finite state machine, a grammatical context shift in the context sensitive region pertaining to the specified term in the text data by analyzing the part of speech tags of the tagged information file, wherein the grammatical context shift is indicated by an autonomous transition of the finite state machine from a first state associated with a first part of speech tag of the tagged information file to a second state associated with a second part of speech tag of the tagged information file for parts of speech associated with words before and after the specified term;
  
  determine based on the grammatical context shift the one or more descriptors of the specified term;
  
  extract the one or more descriptors of the specified term from the text data; and
  
  return the one or more extracted descriptors of the specified term.
  - 8. The system of claim 7, wherein the text data is structured.
  - 9. The system of claim 7, wherein the text data is unstructured.
  - 10. The system of claim 7, wherein the at least one information source is a URL address.
  - 11. The system of claim 7, wherein the processor module is further configured to:
    - store at least one first string before the term in the text data in the tagged information file when the associated part of speech tag associated with a word of the at least one first string belongs to a first group until a word having a part of speech tag belonging to a second group is determined.
  - 12. The system of claim 11, further comprising a report generation module configured to generate a report comprising the extracted at least one descriptor of the specified term.
  - 13. The system of claim 7, wherein the instructions further cause the at least one hardware processor to store at least one first string before the specified term in the tagged information file until a word having a part of speech tag belonging to a second set is determined.
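
The claims do not say which approximate text matching technique is used to locate the specified term. As one possible sketch, the standard-library `difflib.SequenceMatcher` can score each window of tokens against the term; the function `locate_term`, its `threshold` parameter, and the default of 0.8 are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def locate_term(tagged, term, threshold=0.8):
    """Return (start, end) token indices of the best approximate match, or None.

    `tagged` is a list of (word, tag) pairs; `term` may contain several words.
    """
    words = [w.lower() for w, _ in tagged]
    term_words = term.lower().split()
    n = len(term_words)
    if n == 0:
        return None
    target = " ".join(term_words)
    best, best_score = None, 0.0
    for i in range(len(words) - n + 1):
        candidate = " ".join(words[i:i + n])
        score = SequenceMatcher(None, candidate, target).ratio()
        if score >= threshold and score > best_score:
            best, best_score = (i, i + n), score
    return best


tokens = [("The", "DT"), ("phone", "NN"), ("has", "VBZ"), ("a", "DT"),
          ("bright", "JJ"), ("screeen", "NN")]  # note the misspelling
print(locate_term(tokens, "screen"))
# (5, 6)
```

A similarity threshold tolerates misspellings and minor variants in the source text at the cost of occasional false matches, so the value would need tuning for real data.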

14. A non-transitory computer readable medium comprising a plurality of computer-executable instructions stored thereon that, when executed, cause a computing system to perform processing for extracting one or more descriptors of a specified term in text data from the text data, the processing comprising:
- receiving, from a user:
  
  an address for at least one information source, the address being a uniform resource locator (URL) address or a location of a text file within a storage device, the term being at least one word or other text token; and
  
  the specified term;
  
  creating a tagged information file by associating part of speech tags to text data obtained from the at least one information source, including to any descriptors present in the text data, wherein a descriptor comprises one or more words of the text data that succeed or precede the specified term;
  
  identifying a location of the specified term in the tagged information file using an approximate text matching technique, wherein the approximate text matching technique:
  
  detects the specified term grouped together with the descriptors of the specified term in the text data using the tagged information file, the specified term grouped together with the descriptors of the specified term forming a variable region or variable window that is context sensitive and not of a fixed size; and
  
  identifies, through a finite state machine, a grammatical context shift in the context sensitive region pertaining to the specified term in the text data by analyzing the part of speech tags of the tagged information file, wherein the grammatical context shift is indicated by an autonomous transition of the finite state machine from a first state associated with a first part of speech tag of the tagged information file to a second state associated with a second part of speech tag of the tagged information file for parts of speech associated with words before and after the specified term;
  
  determining based on the determined grammatical context shift the one or more descriptors of the specified term;
  
  extracting the one or more descriptors of the specified term from the text data; and
  
  providing a report comprising the extracted one or more descriptors of the specified term.
  - 15. The non-transitory computer readable medium of claim 14, wherein the input text data is structured.
  - 16. The non-transitory computer readable medium of claim 14, wherein the input text data is unstructured.
  - 17. The non-transitory computer readable medium of claim 14, wherein the at least one information source is a URL.
  - 18. The non-transitory computer readable medium of claim 14, wherein determining the one or more descriptors comprises storing at least one first string before an attribute name in the tagged information file until a word or other text token having a part of speech tag belonging to a second set is determined.
  - 19. The non-transitory computer readable medium of claim 14, wherein the first and second part of speech tags are associated with adjacent words.

Current Assignee: Infosys Limited
Original Assignee: Infosys Limited
Inventors: Gopinathan, Madhu; Guha, Sarbendu; Basu, Indranil; Menon, Nikhil; Sampige, Tejas Prabhakara
Primary Examiner(s): Sirjani, Fariba
Application Number: US13/450,435
Publication Number: US 20130144604A1
Time in Patent Office: 2,176 Days
Field of Search: 704/1-8, 704/9-10, 704/1-10, 704/251-257, 704/270-277, 704/E13.001-E13.014
CPC Class Codes:
G06F 40/289 Phrasal analysis, e.g. fini...
G06F 40/30 Semantic analysis
