Preprocessing of text
First Claim
1. A method comprising:
receiving, by a device, a document;
determining, by the device, a plurality of topics associated with the document,
  each of the plurality of topics being associated with text;
determining, by the device, one or more desired topics of the plurality of topics;
filtering, by the device, a first portion of text from the document without filtering a second portion of text from the document,
  the second portion of text being associated with the one or more desired topics,
  the first portion of text not being associated with the one or more desired topics,
  the first portion of text being removed from the document, and
  the second portion of text being different than the first portion of text;
splitting, by the device, the second portion of text into a plurality of segments;
clustering, by the device, each of the plurality of segments into one or more clusters of a plurality of clusters,
  each cluster, of the plurality of clusters, including at least one of the plurality of segments, and
  each cluster, of the plurality of clusters, being associated with the one or more desired topics;
identifying, by the device, at least one segment, of the plurality of segments, having low relevance to a cluster, of the plurality of clusters, that includes the at least one segment; and
removing, by the device, the at least one segment from the cluster.
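
The topic-related steps of the claim (determining topics, selecting desired topics, and filtering out text not associated with them) might be sketched as follows. This is only an illustration under stated assumptions: the claim does not say how topics are determined, so the keyword lexicon `TOPIC_KEYWORDS`, the paragraph-level granularity, and the function names `topic_of` and `filter_by_topic` are all hypothetical.

```python
import re

# Hypothetical topic lexicon; the source does not specify how topics are found.
TOPIC_KEYWORDS = {
    "product": {"camera", "lens", "battery", "zoom"},
    "shipping": {"shipping", "delivery", "returns"},
    "legal": {"copyright", "trademark", "reserved"},
}


def topic_of(paragraph, topic_keywords):
    """Return the topic whose keywords occur most often in the paragraph, or None."""
    tokens = set(re.findall(r"[a-z0-9]+", paragraph.lower()))
    best, best_hits = None, 0
    for topic, keywords in topic_keywords.items():
        hits = len(tokens & keywords)
        if hits > best_hits:
            best, best_hits = topic, hits
    return best


def filter_by_topic(document, desired_topics, topic_keywords=TOPIC_KEYWORDS):
    """Return (topics found in the document, text associated with desired topics).

    Assumption: blank-line-separated paragraphs are the units that carry a topic.
    """
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    labelled = [(topic_of(p, topic_keywords), p) for p in paragraphs]
    topics = {t for t, _ in labelled if t is not None}
    # The "first portion" (paragraphs outside the desired topics) is dropped;
    # the "second portion" is what remains.
    kept = [p for t, p in labelled if t in desired_topics]
    return topics, "\n\n".join(kept)


if __name__ == "__main__":
    doc = (
        "The camera has a zoom lens.\n\n"
        "Shipping and delivery take five days.\n\n"
        "Copyright 2010, all rights reserved."
    )
    found, remaining = filter_by_topic(doc, {"product"})
    print(sorted(found))
    print(remaining)
```

Paragraphs that match no topic are treated here as not associated with any desired topic and are removed along with the rest of the first portion; a real system could reasonably make a different choice.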
Abstract
Performance of statistical machine learning techniques, particularly classification techniques applied to the extraction of attributes and values concerning products, is improved by preprocessing a body of text to be analyzed to remove extraneous information. The body of text is split into a plurality of segments. In an embodiment, sentence identification criteria are applied to identify sentences as the plurality of segments. Thereafter, the plurality of segments are clustered to provide a plurality of clusters. One or more of the resulting clusters are then analyzed to identify segments having low relevance to their respective clusters. Such low relevance segments are then removed from their respective clusters and, consequently, from the body of text. As the resulting relevance-filtered body of text no longer includes portions of the body of text containing mostly extraneous information, the reliability of any subsequent statistical machine learning techniques may be improved.
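
The split, cluster, and remove sequence described in the abstract can be sketched in a few lines. The abstract names no particular algorithm, so everything concrete here is an assumption made for illustration: a regex-based sentence splitter, bag-of-words term counts, a small k-means-style clustering under cosine similarity seeded with the first `k` segments, relevance measured as cosine similarity between a segment and the centroid of its own cluster, and an arbitrary default threshold of 0.5.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Crude sentence identification: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def vectorize(segment):
    """Bag-of-words term counts for one segment."""
    return Counter(re.findall(r"[a-z0-9]+", segment.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Summed term counts; the scale does not matter for cosine similarity."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def cluster(vectors, k, iterations=10):
    """k-means-style clustering, seeded with the first k vectors (an assumption)."""
    centroids = [Counter(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every segment to its most similar centroid (ties go to the first).
        assignment = [
            max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        # Recompute each centroid from its current members.
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment, centroids


def relevance_filter(text, k=2, min_relevance=0.5):
    """Return the segments whose relevance to their own cluster meets the threshold."""
    segments = split_sentences(text)
    vectors = [vectorize(s) for s in segments]
    assignment, centroids = cluster(vectors, k)
    return [
        s
        for s, v, a in zip(segments, vectors, assignment)
        if cosine(v, centroids[a]) >= min_relevance
    ]


if __name__ == "__main__":
    sample = (
        "Camera lens zoom works. Battery power lasts long. "
        "Camera lens stays sharp. Battery power charges fast. "
        "Camera zoom lens included. Free shipping today only."
    )
    print(" ".join(relevance_filter(sample)))
```

In very small clusters a single off-topic segment contributes heavily to its own centroid, so the threshold and the relevance measure would need tuning against real data.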
21 Claims
1. A method comprising:
receiving, by a device, a document;
determining, by the device, a plurality of topics associated with the document,
  each of the plurality of topics being associated with text;
determining, by the device, one or more desired topics of the plurality of topics;
filtering, by the device, a first portion of text from the document without filtering a second portion of text from the document,
  the second portion of text being associated with the one or more desired topics,
  the first portion of text not being associated with the one or more desired topics,
  the first portion of text being removed from the document, and
  the second portion of text being different than the first portion of text;
splitting, by the device, the second portion of text into a plurality of segments;
clustering, by the device, each of the plurality of segments into one or more clusters of a plurality of clusters,
  each cluster, of the plurality of clusters, including at least one of the plurality of segments, and
  each cluster, of the plurality of clusters, being associated with the one or more desired topics;
identifying, by the device, at least one segment, of the plurality of segments, having low relevance to a cluster, of the plurality of clusters, that includes the at least one segment; and
removing, by the device, the at least one segment from the cluster.
Dependent claims: 2, 3, 4, 5, 6, 19

7. An apparatus comprising:
a memory including instructions; and
a processor to execute the instructions to:
  receive a document;
  determine a plurality of topics associated with the document,
    each of the plurality of topics being associated with text;
  determine one or more desired topics of the plurality of topics;
  filter a first portion of text from the document without filtering a second portion of text from the document,
    the second portion of text being associated with the one or more desired topics,
    the first portion of text not being associated with the one or more desired topics,
    the first portion of text being removed from the document, and
    the second portion of text being different than the first portion of text;
  split the second portion of text into a plurality of segments;
  cluster each of the plurality of segments into one or more clusters of a plurality of clusters,
    each cluster, of the plurality of clusters, including at least one of the plurality of segments, and
    each cluster, of the plurality of clusters, being associated with the one or more desired topics;
  identify at least one segment, of the plurality of segments, having low relevance to a cluster, of the plurality of clusters, that includes the at least one segment; and
  remove the at least one segment from the cluster.
Dependent claims: 8, 9, 10, 11, 12, 20

13. A non-transitory computer-readable medium storing instructions, the instructions comprising:
one or more instructions which, when executed by at least one processor, cause the at least one processor to:
  receive a document;
  determine a plurality of topics associated with the document,
    each of the plurality of topics being associated with text;
  determine one or more desired topics of the plurality of topics;
  filter a first portion of text from the document without filtering a second portion of text from the document,
    the second portion of text being associated with the one or more desired topics,
    the first portion of text not being associated with the one or more desired topics,
    the first portion of text being removed from the document, and
    the second portion of text being different than the first portion of text;
  split the second portion of text into a plurality of segments;
  cluster each of the plurality of segments into one or more clusters of a plurality of clusters,
    each cluster, of the plurality of clusters, including at least one of the plurality of segments, and
    each cluster, of the plurality of clusters, being associated with the one or more desired topics;
  identify at least one segment, of the plurality of segments, having low relevance to a cluster, of the plurality of clusters, that includes the at least one segment; and
  remove the at least one segment from the cluster.
Dependent claims: 14, 15, 16, 17, 18, 21
Specification