Text analysis technique

US 7,158,983 B2
Filed: 09/23/2002
Issued: 01/02/2007
Est. Priority Date: 09/23/2002
Status: Active Grant

Abstract

One embodiment of the present invention includes means for determining a concept representation for a set of text documents based on partial order analysis and modifying this representation if it is determined to be unidentifiable. Furthermore, the embodiment includes means for labeling the representation, mapping documents to it to provide a corresponding document representation, generating a number of document signatures each of a different type, and performing several data processing applications each with a different one of the document signatures of differing types.

22 Claims

1. A method for text analysis, comprising:
- selecting a set of text documents;
  
  selecting a number of terms included in the set;
  
  establishing a multidimensional document space with a computer system as a function of the terms;
  
  performing a bump hunting procedure with the computer system to identify a number of document space features, the features each corresponding to a composition of two or more concepts of the documents; and
  
  deconvolving the features with the computer system to separately identify the concepts, wherein the concepts are stored in memory of the computer system.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8)
  - 2. The method of claim 1, which includes providing a concept representation corresponding to an acyclic graph with a number of nodes each corresponding to one of the concepts and different levels to represent related concepts of differing degrees of specificity.
  - 3. The method of claim 2, which includes identifying a number of different multilevel groups in accordance with a mathematically determined degree of desired fit of the different multilevel groups.
  - 4. The method of claim 1, which includes determining the multidimensional document space in accordance with frequency of each of the terms in each of the text documents.
  - 5. The method of claim 1, which includes determining a plurality of different signature vectors from the concepts for different text processing applications.
  - 6. The method of claim 1, wherein said deconvolving includes performing a latent variable analysis as a function of the features and the terms to identify the concepts.
  - 7. The method of claim 6, wherein said deconvolving includes:
    - identifying one of a number of first level concepts of the text documents by determining each of the terms associated with one of the features; and
      
      establishing one of several second level concepts of the text documents by identifying at least one of the terms found in each member of a subset of the first level concepts.
  - 8. The method of claim 7, which includes:
    - providing a concept representation of the text documents, the representation including the first level concepts and the second level concepts with the subset of the first level concepts being subordinate to the one of the second level concepts;
      
      testing identifiability of the concept representation; and
      
      providing a modified concept representation in response to said testing if the concept representation is nonidentifiable.
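
Two of the steps recited above lend themselves to a small illustration: building a document space from term frequencies (claim 4) and forming a second level concept from terms shared by a subset of first level concepts (claim 7). The following is only a minimal sketch under stated assumptions. The function names, the whitespace tokenization, and the toy data are hypothetical and do not come from the patent, and the bump hunting and latent variable analysis steps are not implemented here.

```python
from collections import Counter


def term_frequency_space(documents, terms):
    """Return one vector per document holding the count of each selected term.

    Assumption: documents are plain strings split on whitespace.
    """
    vectors = []
    for text in documents:
        counts = Counter(text.lower().split())
        vectors.append([counts[term] for term in terms])
    return vectors


def second_level_concept(first_level_concepts):
    """Return the terms found in every member of a subset of first level concepts.

    Assumption: each first level concept is represented as a set of terms.
    """
    term_sets = [set(concept) for concept in first_level_concepts]
    if not term_sets:
        return set()
    return set.intersection(*term_sets)


# Hypothetical toy data for illustration only.
docs = ["solar panel cost", "solar cell cost cost", "wind turbine"]
terms = ["solar", "cost", "wind"]
space = term_frequency_space(docs, terms)

first_level = [{"solar", "panel"}, {"solar", "cell"}]
shared = second_level_concept(first_level)
```

Here each row of `space` is one document's coordinates in the term space, and `shared` holds the terms common to both first level concepts, which is one simple reading of "at least one of the terms found in each member of a subset".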

9. A method for text analysis, comprising:
- performing a routine with a computer system, including:
  
  extracting terminological features from a set of text documents by executing a bump hunting procedure;
  
  establishing a representation of a number of concepts of the text documents as a function of the terminological features, the representation hierarchically indicating different degrees of specificity among related members of the concepts and corresponding to an acyclic graph organization;
  
  determining the representation is nonidentifiable;
  
  in response to said determining, constraining one or more processing parameters of the routine; and
  
  providing a modified concept representation after said constraining, the modified concept representation being identifiable and stored in memory of the computer system, wherein the concepts are determined by executing a deconvolution procedure with respect to the features.
- Dependent Claims (10, 11, 12)
  - 10. The method of claim 9, wherein said constraining one or more processing parameters of the routine includes limiting the modified concept representation to a quantity of levels.
  - 11. The method of claim 9, wherein said constraining one or more processing parameters of the routine includes limiting the modified concept representation to a strict hierarchy form in which each one of the concepts is subordinate to at most one other of the concepts.
  - 12. The method of claim 9, wherein said constraining one or more processing parameters of the routine includes mapping the representation into a number of multilevel subgroupings each corresponding to an acyclic graph arrangement.
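
The structural constraints named in claims 9 and 11 (an acyclic graph organization, and a strict hierarchy in which each concept is subordinate to at most one other concept) can be checked mechanically. The sketch below is an illustration under assumptions: the representation is modeled as a hypothetical dictionary mapping each concept to the list of concepts it is subordinate to, and the function names are invented here. It does not test identifiability, which is a separate statistical property that the claims do not spell out.

```python
def is_strict_hierarchy(parents):
    """True if every concept is subordinate to at most one other concept."""
    return all(len(superiors) <= 1 for superiors in parents.values())


def is_acyclic(parents):
    """True if following 'subordinate to' links never returns to a concept."""
    IN_PROGRESS, DONE = 1, 2
    state = {}

    def visit(node):
        if state.get(node) == IN_PROGRESS:
            return False  # reached a concept already on the current path
        if state.get(node) == DONE:
            return True
        state[node] = IN_PROGRESS
        for superior in parents.get(node, ()):
            if not visit(superior):
                return False
        state[node] = DONE
        return True

    return all(visit(node) for node in list(parents))


# Hypothetical concept graphs for illustration only.
dag = {"solar": ["energy", "technology"], "wind": ["energy"], "energy": []}
tree = {"solar": ["energy"], "wind": ["energy"], "energy": []}
loop = {"a": ["b"], "b": ["a"]}
```

In this toy data `dag` is acyclic but not a strict hierarchy because "solar" has two superiors, `tree` satisfies both checks, and `loop` fails the acyclicity check.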

13. A method for text analysis, comprising:
- performing a routine with a computer system, including:
  
  extracting terminological features from a set of text documents by executing a bump hunting procedure;
  
  establishing a representation of a number of concepts of the text documents as a function of the terminological features, the representation hierarchically indicating different degrees of specificity among related ones of the concepts in correspondence to different levels of an acyclic graph organization;
  
  evaluating a selected document relative to the representation; and
  
  generating and storing in memory of the computer system a number of different document signatures for the selected document with the representation, wherein the concepts are determined by executing a deconvolution procedure with respect to the features.
- Dependent Claims (14, 15, 16, 17, 18)
  - 14. The method of claim 13, which includes identifying several different groups of related concepts, the groups each corresponding to several of the different levels of the representation.
  - 15. The method of claim 14, wherein said generating includes preparing each of the different document signatures in accordance with a different one of the groups.
  - 16. The method of claim 13, wherein said generating includes preparing each of the different document signatures for a different text data processing application.
  - 17. The method of claim 16, wherein the different text data application is one or more of the group consisting of event detection, document summarization, document clustering, document filtering, querying, and synonym analysis.
  - 18. The method of claim 13, wherein:
    - said extracting includes determining the terminological features as a function of a set of terms contained in the set of text documents; and
      
      said evaluating includes mapping the selected document to the concept representation as a function of any terms of the selected document contained in the set of terms.
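
Claims 15 and 18 describe mapping a selected document onto the concept representation through its terms and preparing a different signature for each group of concepts. One way to picture this is shown below, purely as a sketch: the scoring rule (counting document terms that belong to each concept), the dictionary layout, and all names are assumptions made for illustration, not the signature construction defined in the patent.

```python
def document_signature(document_terms, concepts, group):
    """Score a document against each concept in one group of concepts.

    Assumptions: `concepts` maps a concept name to its set of terms, and the
    score is the number of document terms that fall in that set. Terms that
    appear in no concept are simply ignored.
    """
    return [
        sum(1 for term in document_terms if term in concepts[name])
        for name in group
    ]


# Hypothetical concept representation and groups for illustration only.
concepts = {
    "energy": {"solar", "wind", "power"},
    "solar": {"solar", "panel"},
    "finance": {"cost", "price"},
}
doc_terms = "solar panel cost rises".split()

energy_signature = document_signature(doc_terms, concepts, ["energy", "solar"])
finance_signature = document_signature(doc_terms, concepts, ["finance"])
```

The same document yields a different vector for each group, which mirrors the idea of preparing separate signatures for separate text processing applications.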

19. A method for text analysis, comprising:
- selecting a set of text documents;
  
  representing the documents with a number of terms;
  
  identifying a number of multiterm features of the text documents with a computer system as a function of frequency of each of the terms in each of the documents;
  
  relating the multiterm features and the terms with one or more data structures corresponding to a sparse matrix with the computer system;
  
  performing a latent variable analysis as a function of the terms to determine a number of concepts of the text documents from the one or more data structures with the computer system; and
  
  providing and storing in memory of the computer system a concept representation corresponding to a multilevel acyclic graph organization in which each node of the graph corresponds to one of the concepts, wherein the identifying is via a bump hunting procedure; and
  
  wherein the latent variable analysis includes deconvolving the features to determine the concepts.
- Dependent Claims (20, 21, 22)
  - 20. The method of claim 19, wherein the latent variable analysis includes:
    - identifying one of the concepts in a first level of the concept representation by determining each of the terms associated with one of the features; and
      
      establishing one of the concepts in a second level of the concept representation by identifying at least one of the terms found in each member of a subset of the concepts in the first level.
  - 21. The method of claim 20, wherein the concept representation indicates the one of the concepts in the first level is related and subordinate to the one of the concepts in the second level.
  - 22. The method of claim 19, which includes:
    - determining a number of related subsets of the concepts, the subsets each spanning several levels of the concept representation and each corresponding to a different facet of the representation;
      
      testing identifiability of the concept representation; and
      
      providing several different document signatures from the concept representation.
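
Claim 19 relates multiterm features and terms through data structures corresponding to a sparse matrix. A minimal sketch of such a structure follows, assuming a dictionary-of-dictionaries layout that stores only nonzero entries. The layout, the binary membership values, and the names are hypothetical choices for illustration. The claim does not specify a storage format.

```python
def build_sparse_feature_term_matrix(features, terms):
    """Store feature-to-term membership as rows of {column_index: 1}.

    Assumptions: `features` maps a feature id to the terms it contains, and
    column indices follow the order of `terms`. Zero entries are not stored.
    """
    column = {term: index for index, term in enumerate(terms)}
    return {
        feature_id: {column[t]: 1 for t in feature_terms if t in column}
        for feature_id, feature_terms in features.items()
    }


def terms_of_feature(matrix, terms, feature_id):
    """Recover the terms associated with one feature from the sparse rows."""
    return [terms[index] for index in sorted(matrix[feature_id])]


# Hypothetical features for illustration only.
terms = ["solar", "cost", "wind"]
features = {"f0": ["solar", "cost"], "f1": ["wind"]}
matrix = build_sparse_feature_term_matrix(features, terms)
f0_terms = terms_of_feature(matrix, terms, "f0")
```

Looking up the terms of a feature, as in `terms_of_feature`, corresponds to the first step of claim 20, where a first level concept is identified from the terms associated with one of the features.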

Current Assignee: Battelle Memorial Institute
Original Assignee: Battelle Memorial Institute
Inventors: Nakamura, Grant C., Turner, Alan E., Hetzler, Elizabeth G., Tanasse, Theodore E., Havre, Susan L., Willse, Alan R., Hope, Lawrence L., MacGregor, deceased, Naucarrow, legal representative
Primary Examiner(s): HWANG, JOON H
Application Number: US10/252,984
Publication Number: US 20040059736A1
Time in Patent Office: 1,562 Days
Field of Search: 707/1-10,100-104.1
US Class Current: 1/1
CPC Class Codes:
G06F 16/345   Summarisation for human users
G06F 40/20   Natural language analysis s...
Y10S 707/99942   Manipulating data structure...
Y10S 707/99943   Generating database or data...
