Identifying glossary terms from natural language text documents
First Claim
1. A device, comprising:
- one or more processors to:
receive, using an input component, a request to process text of a document to identify glossary terms included in the text;
determine, using the one or more processors and based on the request, a plurality of sections of the text to process; and
process a first section, of the plurality of sections, in parallel with a second section, of the plurality of sections, to identify the glossary terms included in the text, when processing the first section in parallel with the second section, the one or more processors are, for each of the first section and the second section, to:
determine a linguistic unit analysis technique based on a file format of a file that includes the text;
perform, using the linguistic unit analysis technique, a linguistic unit analysis on a linguistic unit, included in the text, to generate a plurality of ambiguous linguistic units from the linguistic unit,
the one or more processors, when performing the linguistic unit analysis on the linguistic unit to generate the plurality of ambiguous linguistic units, being to:
perform at least one of:
a coordinating conjunction analysis that generates the plurality of ambiguous linguistic units from the linguistic unit when the linguistic unit includes a coordinating conjunction,
an adjectival modifier analysis that generates the plurality of ambiguous linguistic units from the linguistic unit when the linguistic unit includes an adjective, or
a headword analysis that generates the plurality of ambiguous linguistic units from the linguistic unit when the linguistic unit includes an abstract noun;
resolve the plurality of ambiguous linguistic units to generate a set of potential glossary terms that includes a subset of the plurality of ambiguous linguistic units;
perform a glossary term analysis on the set of potential glossary terms to generate a set of glossary terms that includes a subset of the set of potential glossary terms;
identify a set of included terms, of the set of potential glossary terms, that are included in the set of glossary terms;
identify a set of excluded terms, of the set of potential glossary terms, that are excluded from the set of glossary terms;
determine a semantic relatedness score between at least one excluded term, of the set of excluded terms, and at least one included term, of the set of included terms;
selectively add the at least one excluded term to the set of glossary terms to form a final set of glossary terms based on the semantic relatedness score; and
output, using an output component, the final set of glossary terms for the document for presentation via a user interface.
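The parallel-processing step of the claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the blank-line section splitter, the `find_candidate_terms` placeholder, and the choice of a thread pool are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_sections(text: str) -> list[str]:
    # Hypothetical splitter: blank-line-separated blocks count as sections.
    return [s.strip() for s in text.split("\n\n") if s.strip()]


def find_candidate_terms(section: str) -> set[str]:
    # Stand-in for the claimed per-section analysis; it only keeps long words.
    words = (w.strip(".,;:()").lower() for w in section.split())
    return {w for w in words if len(w) >= 8}


def process_document(text: str, max_workers: int = 2) -> set[str]:
    sections = split_into_sections(text)
    # Each section is analyzed independently, so the work can run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_section = list(pool.map(find_candidate_terms, sections))
    # Merge the per-section results into one candidate set for the document.
    return set().union(*per_section)
```

Threads keep the sketch short; for CPU-bound analysis a `ProcessPoolExecutor` could be swapped in with the same `map` call.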
Abstract
A device may obtain text to be analyzed to identify glossary terms. The device may analyze a linguistic unit to generate multiple linguistic units related to the linguistic unit. The device may analyze the multiple linguistic units to generate potential glossary terms. The device may perform a glossary term analysis on the potential glossary terms to generate glossary terms that include a subset of the potential glossary terms. The device may identify included terms that are included in the glossary terms. The device may identify excluded terms that are excluded from the glossary terms. The device may determine a semantic relatedness score between at least one excluded term and at least one included term. The device may selectively add the excluded linguistic term to the glossary terms to form a final set of glossary terms based on the semantic relatedness score, and may output the final set of glossary terms.
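The final step of the abstract, selectively adding excluded terms back based on a semantic relatedness score, might be sketched as follows. The Jaccard word-overlap score and the `threshold` value are placeholders chosen for illustration; the source does not specify either.

```python
def relatedness(a: str, b: str) -> float:
    # Toy score: Jaccard overlap of word sets, a stand-in for a real
    # semantic relatedness measure.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def finalize_glossary(potential: set[str], glossary: set[str],
                      threshold: float = 0.3) -> set[str]:
    included = potential & glossary   # potential terms that made the glossary
    excluded = potential - glossary   # potential terms that were filtered out
    final = set(glossary)
    for term in excluded:
        # Compare each excluded term with its closest included term.
        best = max((relatedness(term, inc) for inc in included), default=0.0)
        if best >= threshold:         # hypothetical cut-off
            final.add(term)
    return final
```

With these toy definitions, an excluded term such as "payment processor" is restored when "payment gateway" is already in the glossary, while an unrelated term such as "weekly meeting" stays out.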
20 Claims
1. A device, comprising:
one or more processors to:
receive, using an input component, a request to process text of a document to identify glossary terms included in the text;
determine, using the one or more processors and based on the request, a plurality of sections of the text to process; and
process a first section, of the plurality of sections, in parallel with a second section, of the plurality of sections, to identify the glossary terms included in the text, when processing the first section in parallel with the second section, the one or more processors are, for each of the first section and the second section, to:
determine a linguistic unit analysis technique based on a file format of a file that includes the text;
perform, using the linguistic unit analysis technique, a linguistic unit analysis on a linguistic unit, included in the text, to generate a plurality of ambiguous linguistic units from the linguistic unit,
the one or more processors, when performing the linguistic unit analysis on the linguistic unit to generate the plurality of ambiguous linguistic units, being to:
perform at least one of:
a coordinating conjunction analysis that generates the plurality of ambiguous linguistic units from the linguistic unit when the linguistic unit includes a coordinating conjunction,
an adjectival modifier analysis that generates the plurality of ambiguous linguistic units from the linguistic unit when the linguistic unit includes an adjective, or
a headword analysis that generates the plurality of ambiguous linguistic units from the linguistic unit when the linguistic unit includes an abstract noun;
resolve the plurality of ambiguous linguistic units to generate a set of potential glossary terms that includes a subset of the plurality of ambiguous linguistic units;
perform a glossary term analysis on the set of potential glossary terms to generate a set of glossary terms that includes a subset of the set of potential glossary terms;
identify a set of included terms, of the set of potential glossary terms, that are included in the set of glossary terms;
identify a set of excluded terms, of the set of potential glossary terms, that are excluded from the set of glossary terms;
determine a semantic relatedness score between at least one excluded term, of the set of excluded terms, and at least one included term, of the set of included terms;
selectively add the at least one excluded term to the set of glossary terms to form a final set of glossary terms based on the semantic relatedness score; and
output, using an output component, the final set of glossary terms for the document for presentation via a user interface.
Dependent claims: 2, 3, 4, 5, 6
7. A non-transitory computer-readable medium storing instructions, the instructions comprising:
one or more instructions that, when executed by one or more processors, cause the one or more processors to:
receive, using an input component, a request to process text to identify glossary terms included in the text;
determine, using the one or more processors and based on the request, a plurality of sections of the text to process; and
process a first section, of the plurality of sections, in parallel with a second section, of the plurality of sections, to identify the glossary terms included in the text,
the one or more instructions to process the first section in parallel with the second section include:
one or more instructions that, when executed by the one or more processors, cause the one or more processors, for each of the first section and the second section, to:
determine a linguistic unit analysis technique based on a file format of a file associated with the text;
perform, using the linguistic unit analysis technique, a linguistic unit analysis on a linguistic unit, included in the text, to generate a plurality of linguistic units related to the linguistic unit,
the one or more instructions, that cause the one or more processors to perform the linguistic unit analysis on the linguistic unit to generate the plurality of linguistic units, causing the one or more processors to:
perform at least one of:
a coordinating conjunction analysis that generates the plurality of linguistic units from the linguistic unit when the linguistic unit includes a coordinating conjunction,
an adjectival modifier analysis that generates the plurality of linguistic units from the linguistic unit when the linguistic unit includes an adjective, or
a headword analysis that generates the plurality of linguistic units from the linguistic unit when the linguistic unit includes an abstract noun;
analyze the plurality of linguistic units to generate a set of potential glossary terms that includes a subset of the plurality of linguistic units;
perform a glossary term analysis on the set of potential glossary terms to generate a set of glossary terms that includes a subset of the set of potential glossary terms;
identify a set of included terms, of the set of potential glossary terms, that are included in the set of glossary terms;
identify a set of excluded terms, of the set of potential glossary terms, that are excluded from the set of glossary terms;
determine a semantic relatedness score between at least one excluded term, of the set of excluded terms, and at least one included term, of the set of included terms;
selectively add the at least one excluded term to the set of glossary terms to form a final set of glossary terms based on the semantic relatedness score; and
output, using an output component, the final set of glossary terms for presentation via a user interface.
Dependent claims: 8, 9, 10, 11, 12, 13
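As an illustration of the coordinating conjunction analysis named in the claim, the sketch below expands a phrase such as "input and output devices" into several candidate units. The claims do not say how the expansion is done; the rule used here (the single word after the conjunction is a modifier and everything after it is a shared head) is a deliberately naive assumption, and `expand_coordination` is a hypothetical helper name.

```python
CONJUNCTIONS = {"and", "or"}


def expand_coordination(phrase: str) -> list[str]:
    tokens = phrase.split()
    for i, tok in enumerate(tokens):
        # Require words before the conjunction and at least a modifier
        # plus a head after it.
        if tok.lower() in CONJUNCTIONS and 0 < i < len(tokens) - 2:
            left = tokens[:i]              # modifier(s) before the conjunction
            right_modifier = tokens[i + 1]  # assumed single-word modifier
            head = tokens[i + 2:]          # assumed shared head
            return [
                " ".join(left + head),
                " ".join([right_modifier] + head),
                phrase,                    # keep the unexpanded reading too
            ]
    # No usable conjunction: the phrase is its own only candidate.
    return [phrase]
```

The candidates are intentionally over-generated; a later step, corresponding to the claim's analysis of the generated units, would decide which readings survive as potential glossary terms.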
14. A method, comprising:
receiving, by an input component of a device, a request to process text to be analyzed to identify glossary terms included in the text;
determining, by one or more processors of the device and based on the request, a plurality of sections of the text to process; and
processing, by the device, a first section, of the plurality of sections, in parallel with a second section, of the plurality of sections, to identify the glossary terms included in the text,
processing the first section in parallel with the second section, for each of the first section and the second section, including:
determining a linguistic unit analysis technique based on a file format of a file that is associated with the text;
performing, by the device and using the linguistic unit analysis technique, a linguistic unit analysis on a linguistic unit, included in the text, to generate a plurality of ambiguous linguistic units from the linguistic unit,
the performing the linguistic unit analysis on the linguistic unit to generate the plurality of ambiguous linguistic units comprising:
performing at least one of:
a coordinating conjunction analysis that generates the plurality of ambiguous linguistic units from the linguistic unit when the linguistic unit includes a coordinating conjunction,
an adjectival modifier analysis that generates the plurality of ambiguous linguistic units from the linguistic unit when the linguistic unit includes an adjective, or
a headword analysis that generates the plurality of ambiguous linguistic units from the linguistic unit when the linguistic unit includes an abstract noun;
analyzing, by the device, the plurality of ambiguous linguistic units to generate a set of potential glossary terms that includes a subset of the plurality of ambiguous linguistic units;
performing, by the device, a glossary term analysis on the set of potential glossary terms to generate a set of glossary terms that includes a subset of the set of potential glossary terms;
identifying, by the device, a set of included terms, of the set of potential glossary terms, that are included in the set of glossary terms;
identifying, by the device, a set of excluded terms, of the set of potential glossary terms, that are excluded from the set of glossary terms;
determining, by the device, a semantic relatedness score between an excluded term, of the set of excluded terms, and an included term, of the set of included terms;
selectively adding, by the device, the excluded term to the set of glossary terms to form a final set of glossary terms based on the semantic relatedness score; and
outputting, by an output component of the device, the final set of glossary terms for presentation via a user interface.
Dependent claims: 15, 16, 17, 18, 19, 20
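The step of determining a linguistic unit analysis technique based on a file format could be sketched as a lookup keyed on the file extension. The two techniques, the extension table, and the fallback to plain text are all assumptions for illustration; the claims do not name specific formats or techniques.

```python
import re
from pathlib import Path


def units_from_plain_text(text: str) -> list[str]:
    # Hypothetical technique: split on sentence punctuation and newlines.
    return [u.strip() for u in re.split(r"[.!?\n]+", text) if u.strip()]


def units_from_markup(text: str) -> list[str]:
    # Hypothetical technique: drop tags first, then reuse the plain-text rule.
    return units_from_plain_text(re.sub(r"<[^>]+>", " ", text))


# Assumed mapping from file extension to analysis technique.
TECHNIQUES = {
    ".txt": units_from_plain_text,
    ".html": units_from_markup,
    ".xml": units_from_markup,
}


def select_technique(filename: str):
    # Unknown formats fall back to plain-text handling.
    return TECHNIQUES.get(Path(filename).suffix.lower(), units_from_plain_text)
```

A caller would pick the technique once per file and then apply it to each section, for example `select_technique("spec.html")(section_text)`.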
Specification