Preprocessing of text

US 8,620,836 B2
Filed: 01/10/2011
Issued: 12/31/2013
Est. Priority Date: 01/10/2011
Status: Active Grant

Abstract

Performance of statistical machine learning techniques, particularly classification techniques applied to the extraction of attributes and values concerning products, is improved by preprocessing a body of text to be analyzed to remove extraneous information. The body of text is split into a plurality of segments. In an embodiment, sentence identification criteria are applied to identify sentences as the plurality of segments. Thereafter, the plurality of segments are clustered to provide a plurality of clusters. One or more of the resulting clusters are then analyzed to identify segments having low relevance to their respective clusters. Such low relevance segments are then removed from their respective clusters and, consequently, from the body of text. As the resulting relevance-filtered body of text no longer includes portions of the body of text containing mostly extraneous information, the reliability of any subsequent statistical machine learning techniques may be improved.
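
The split, cluster, and relevance-filter steps described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the abstract names neither a clustering algorithm nor a relevance measure, so the greedy single-pass clustering, the bag-of-words cosine similarity, the leave-one-out relevance score, and both thresholds (`join_threshold`, `min_relevance`) are stand-ins chosen for brevity. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence identification: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bag_of_words(segment):
    """Lowercased word counts for one segment."""
    return Counter(re.findall(r"[a-z0-9]+", segment.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Sum of word-count vectors (scale does not affect cosine similarity)."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return total


def cluster_segments(vectors, join_threshold):
    """Greedy single-pass clustering (assumed stand-in for any clustering method)."""
    clusters = []  # each cluster is a list of indexes into `vectors`
    for i, vec in enumerate(vectors):
        best, best_sim = None, join_threshold
        for cluster in clusters:
            sim = cosine(vec, centroid(vectors[j] for j in cluster))
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([i])  # nothing similar enough: start a new cluster
        else:
            best.append(i)
    return clusters


def relevance_filter(text, join_threshold=0.2, min_relevance=0.5):
    """Drop segments that have low similarity to the rest of their own cluster."""
    segments = split_sentences(text)
    vectors = [bag_of_words(s) for s in segments]
    removed = set()
    for cluster in cluster_segments(vectors, join_threshold):
        for i in cluster:
            others = [vectors[j] for j in cluster if j != i]
            # Single-segment clusters have nothing to compare against and are kept.
            if others and cosine(vectors[i], centroid(others)) < min_relevance:
                removed.add(i)
    return [s for i, s in enumerate(segments) if i not in removed]
```

With these assumed defaults, a call such as `relevance_filter("The camera has a zoom lens. The camera lens has optical zoom. Order the camera today for free shipping.")` keeps the two product sentences and drops the shipping sentence, because that sentence shares only two words with the rest of its cluster.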

21 Claims

1. A method comprising:
- receiving, by a device, a document;
  
  determining, by the device, a plurality of topics associated with the document,
    each of the plurality of topics being associated with text;
  
  determining, by the device, one or more desired topics of the plurality of topics;
  
  filtering, by the device, a first portion of text from the document without filtering a second portion of text from the document,
    the second portion of text being associated with the one or more desired topics,
    the first portion of text not being associated with the one or more desired topics,
    the first portion of text being removed from the document, and
    the second portion of text being different than the first portion of text;
  
  splitting, by the device, the second portion of text into a plurality of segments;
  
  clustering, by the device, each of the plurality of segments into one or more clusters of a plurality of clusters,
    each cluster, of the plurality of clusters, including at least one of the plurality of segments, and
    each cluster, of the plurality of clusters, being associated with the one or more desired topics;
  
  identifying, by the device, at least one segment, of the plurality of segments, having low relevance to a cluster, of the plurality of clusters, that includes the at least one segment; and
  
  removing, by the device, the at least one segment from the cluster.
- Dependent claims (2, 3, 4, 5, 6, 19):
  - 2. The method of claim 1, further comprising:
    - extracting the first portion of text and the second portion of text from the document.
  - 3. The method of claim 1, further comprising:
    - identifying one or more headings in the document;
      
      associating the first portion of text with a first heading of the one or more headings; and
      
      filtering the first portion of text based on the association.
  - 4. The method of claim 1, where, when splitting the second portion of text into the plurality of segments, the method includes:
    - applying sentence identification criteria to the text to identify sentences in the text; and
      
      associating each identified sentence with a segment of the plurality of segments.
  - 5. The method of claim 1, further comprising:
    - tokenizing one or more infrequent words in one or more of the plurality of segments to replace the one or more infrequent words with a token,
      where at least one of the plurality of clusters includes a segment with at least one token.
  - 6. The method of claim 1, further comprising:
    - identifying one or more titles in the text;
      
      identifying at least one segment of the plurality of segments that does not include at least one word from the identified one or more titles; and
      
      removing the identified at least one segment from the text.
  - 19. The method of claim 6, where, when identifying the one or more titles in the text, the method includes:
    - identifying at least one of:
      
      a paragraph break, or
      text including a bold formatting; and
      
      identifying the one or more titles in the text based on the identified at least one of the paragraph break or text including the bold formatting.
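
The infrequent-word tokenization recited in claim 5 can be sketched as below. This is an illustration under assumptions only: the claim does not say how "infrequent" is decided or what the token looks like, so the corpus-wide count threshold `min_count` and the placeholder string `<RARE>` are hypothetical choices, as is the function name. The sketch also lowercases words and discards punctuation for simplicity.

```python
import re
from collections import Counter


def tokenize_infrequent(segments, min_count=2, token="<RARE>"):
    """Replace words seen fewer than `min_count` times across all segments."""
    words_per_segment = [re.findall(r"\w+", s.lower()) for s in segments]
    counts = Counter(w for words in words_per_segment for w in words)
    return [
        " ".join(w if counts[w] >= min_count else token for w in words)
        for words in words_per_segment
    ]
```

Collapsing rare words into one shared token means that segments differing only in, say, a model number look identical to a downstream clustering step.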

7. An apparatus comprising:
- a memory including instructions; and
  
  a processor to execute the instructions to:
  
  receive a document;
  
  determine a plurality of topics associated with the document,
    each of the plurality of topics being associated with text;
  
  determine one or more desired topics of the plurality of topics;
  
  filter a first portion of text from the document without filtering a second portion of text from the document,
    the second portion of text being associated with the one or more desired topics,
    the first portion of text not being associated with the one or more desired topics,
    the first portion of text being removed from the document, and
    the second portion of text being different than the first portion of text;
  
  split the second portion of text into a plurality of segments;
  
  cluster each of the plurality of segments into one or more clusters of a plurality of clusters,
    each cluster, of the plurality of clusters, including at least one of the plurality of segments, and
    each cluster, of the plurality of clusters, being associated with the one or more desired topics;
  
  identify at least one segment, of the plurality of segments, having low relevance to a cluster, of the plurality of clusters, that includes the at least one segment; and
  
  remove the at least one segment from the cluster.
- Dependent claims (8, 9, 10, 11, 12, 20):
  - 8. The apparatus of claim 7, where the processor is further to:
    - extract the first portion of text and the second portion of text from the document.
  - 9. The apparatus of claim 7, where the processor is further to:
    - identify one or more headings in the document;
      
      associate the first portion of text with a first heading of the one or more headings; and
      
      filter the first portion of text based on the association.
  - 10. The apparatus of claim 7, where, when splitting the second portion of text into the plurality of segments, the processor is further to:
    - apply sentence identification criteria to the text to identify sentences in the text; and
      
      associate each identified sentence with a segment of the plurality of segments.
  - 11. The apparatus of claim 7, where the processor is further to:
    - tokenize one or more infrequent words in one or more of the plurality of segments to replace the one or more infrequent words with a token,
      where at least one of the plurality of clusters includes a segment with at least one token.
  - 12. The apparatus of claim 7, where the processor is further to:
    - identify one or more titles in the text;
      
      identify at least one segment of the plurality of segments that does not include at least one word from the identified one or more titles; and
      
      remove the identified at least one segment from the text.
  - 20. The apparatus of claim 12, where the processor, when identifying the one or more titles in the text, is further to:
    - identify at least one of:
      
      a paragraph break, or
      text including a bold formatting; and
      
      identify the one or more titles in the text based on the identified at least one of the paragraph break or text including the bold formatting.
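
The title-based filtering of claims 12 and 20 (and the parallel claims 6 and 19) can be sketched as follows. The claims do not fix an input format, so this sketch assumes, purely for illustration, that paragraphs are separated by blank lines and that a title is a paragraph wrapped entirely in `<b>...</b>` tags. The function names are hypothetical.

```python
import re


def find_titles(text):
    """Return paragraphs that consist solely of bold-tagged text (assumed title cue)."""
    titles = []
    for paragraph in re.split(r"\n\s*\n", text):  # paragraph break = blank line
        match = re.fullmatch(r"<b>(.+?)</b>", paragraph.strip(), flags=re.DOTALL)
        if match:
            titles.append(match.group(1))
    return titles


def word_set(text):
    """Lowercased set of words in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def filter_by_titles(segments, titles):
    """Keep only segments that share at least one word with some title."""
    title_words = set()
    for title in titles:
        title_words |= word_set(title)
    return [s for s in segments if word_set(s) & title_words]
```

For a page titled `<b>Zoom Camera</b>`, the segment "The camera is light." is kept because it shares the word "camera" with the title, while "Shipping is free." shares no title word and is removed.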

13. A non-transitory computer-readable medium storing instructions, the instructions comprising:
- one or more instructions which, when executed by at least one processor, cause the at least one processor to:
  
  receive a document;
  
  determine a plurality of topics associated with the document,
    each of the plurality of topics being associated with text;
  
  determine one or more desired topics of the plurality of topics;
  
  filter a first portion of text from the document without filtering a second portion of text from the document,
    the second portion of text being associated with the one or more desired topics,
    the first portion of text not being associated with the one or more desired topics,
    the first portion of text being removed from the document, and
    the second portion of text being different than the first portion of text;
  
  split the second portion of text into a plurality of segments;
  
  cluster each of the plurality of segments into one or more clusters of a plurality of clusters,
    each cluster, of the plurality of clusters, including at least one of the plurality of segments, and
    each cluster, of the plurality of clusters, being associated with the one or more desired topics;
  
  identify at least one segment, of the plurality of segments, having low relevance to a cluster, of the plurality of clusters, that includes the at least one segment; and
  
  remove the at least one segment from the cluster.
- Dependent claims (14, 15, 16, 17, 18, 21):
  - 14. The computer-readable medium of claim 13, where the instructions further comprise:
    - one or more instructions to extract the first portion of text and the second portion of text from the document.
  - 15. The computer-readable medium of claim 13, where the instructions further comprise:
    - one or more instructions to identify one or more headings in the document;
      
      one or more instructions to associate the first portion of text with a first heading of the one or more headings; and
      
      one or more instructions to filter the first portion of text based on the association.
  - 16. The computer-readable medium of claim 13, where the one or more instructions to split the second portion of text into the plurality of segments include:
    - one or more instructions to apply sentence identification criteria to the text to identify sentences in the text; and
      
      one or more instructions to associate each identified sentence with a segment of the plurality of segments.
  - 17. The computer-readable medium of claim 13, where the instructions further comprise:
    - one or more instructions to tokenize infrequent words in one or more of the plurality of segments to replace the one or more infrequent words with a token,
      where at least one of the plurality of clusters includes a segment with at least one token.
  - 18. The computer-readable medium of claim 13, where the instructions further comprise:
    - one or more instructions to identify one or more titles in the text;
      
      one or more instructions to identify at least one segment of the plurality of segments that does not include at least one word from the identified one or more titles; and
      
      one or more instructions to remove the identified at least one segment from the text.
  - 21. The computer-readable medium of claim 18, where the one or more instructions to identify one or more titles in the text include:
    - one or more instructions to identify at least one of:
      
      a paragraph break, or
      text including a bold formatting; and
      
      one or more instructions to identify the one or more titles in the text based on the identified at least one of the paragraph break or text including the bold formatting.
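
The topic filtering of the independent claims, combined with the heading association of claims 3, 9, and 15, can be sketched as below. This is a simplified illustration rather than the claimed method: it assumes the document has already been extracted into `(heading, text)` pairs and that each heading names its topic directly, so a case-insensitive match against the desired topics decides which portion is kept. The function name and the input shape are hypothetical.

```python
def filter_sections(sections, desired_topics):
    """Split (heading, text) pairs into kept and removed text by heading topic."""
    desired = {topic.casefold() for topic in desired_topics}
    kept, removed = [], []
    for heading, text in sections:
        # The heading stands in for the topic associated with this portion of text.
        if heading.casefold() in desired:
            kept.append(text)  # "second portion": passed on to segmentation
        else:
            removed.append(text)  # "first portion": filtered from the document
    return kept, removed
```

Only the kept portions would then be split into segments and clustered, which keeps text from undesired sections out of the later relevance analysis.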

Current Assignee: Accenture Global Services Limited (Accenture PLC)
Original Assignee: Accenture Global Services Limited (Accenture PLC)
Inventors: Ghani, Rayid; Cumby, Chad; Krema, Marko
Primary Examiner(s): Chaki, Kakali
Assistant Examiner(s): Seck, Ababacar
Application Number: US12/987,469
Publication Number: US 20120179453A1
Time in Patent Office: 1,086 Days
Field of Search: 703/12
US Class Current: 706/12
CPC Class Codes:
G06F 16/355   Class or cluster creation o...
G06F 40/20   Natural language analysis s...
G06F 40/205   Parsing
