System and method for detecting and decoding semantically encoded natural language messages

US 7,356,463 B1
Filed: 12/18/2003
Issued: 04/08/2008
Est. Priority Date: 12/18/2003
Status: Expired due to Fees

Abstract

A system detects and decodes semantic camouflage in natural language messages. The system is adapted to identify entities such as words or phrases in overt messages that are being used to disguise different and unrelated entities or concepts. The system automatically determines the semantic plausibility of the overt message and identifies entities that appear in implausible contexts. In addition, the system automatically estimates covert meanings for the entities identified in the overt message that appear in implausible contexts.
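
The three stages named in the abstract (scoring plausibility, flagging units in implausible contexts, estimating covert meanings) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the patent: the toy `CORPUS`, the `STOPWORDS` list, same-sentence co-occurrence as the only context relation, the "fraction of context words seen with the unit" score, and the 0.5 threshold.

```python
from itertools import permutations

# Hypothetical stop list and reference corpus of ordinary usage (assumptions).
STOPWORDS = {"the", "will", "on"}
CORPUS = [
    "the courier will deliver the package on friday",
    "the courier will deliver the parcel on monday",
    "the chef will bake the cake on sunday",
    "the chef will bake the bread on friday",
]

def content_units(text):
    """Segment text into lowercase word units, dropping stop words."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

def build_cooccurrence(corpus):
    """Collect ordered pairs of units that share a corpus sentence."""
    seen = set()
    for sentence in corpus:
        for a, b in permutations(set(content_units(sentence)), 2):
            seen.add((a, b))
    return seen

def score(unit, context, seen):
    """Fraction of context units that co-occurred with `unit` in the corpus."""
    if not context:
        return 1.0
    return sum((unit, c) in seen for c in context) / len(context)

def decode(message, corpus=CORPUS, threshold=0.5):
    """Flag units scoring below `threshold` and propose replacements."""
    seen = build_cooccurrence(corpus)
    vocabulary = sorted({u for s in corpus for u in content_units(s)})
    units = content_units(message)
    substitutions = {}
    for unit in units:
        context = [c for c in units if c != unit]
        if score(unit, context, seen) < threshold:
            # Covert-meaning estimate: the corpus unit that best fits the context.
            candidates = [v for v in vocabulary if v not in units]
            if candidates:
                substitutions[unit] = max(
                    candidates, key=lambda v: score(v, context, seen)
                )
    # Replace every occurrence of each flagged unit in the message.
    decoded = " ".join(substitutions.get(t, t) for t in message.lower().split())
    return substitutions, decoded
```

With this toy corpus, `decode("the courier will bake the package on friday")` flags `bake` (it co-occurs with only one of its three context words) and proposes `deliver`, the only vocabulary unit seen with all three.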

16 Claims

1. A method for detecting semantically encoded natural language in textual input data, comprising:
- segmenting the textual input data into a plurality of token linguistic units to define a linguistic event;
  
  assigning a context to one or more of the token linguistic units in the linguistic event;
  
  computing a score for the one or more token linguistic units in the linguistic event that are assigned a context;
  
  applying a predetermined threshold to the score of the one or more token linguistic units to determine whether their use in the linguistic event in their assigned contexts is implausible;
  
  estimating covert meanings of token linguistic units identified as being below the predetermined threshold; and
  
  detecting semantically encoded natural language based on replacing all occurrences, within the linguistic event, of the token linguistic units identified as being below the predetermined threshold with the estimated covert meaning of the token linguistic units identified as being below the predetermined threshold;
  
  wherein the computed scores for the one or more token linguistic units indicate whether the token linguistic units are expected to appear with their assigned contexts in the linguistic event; and
  
  wherein each context, which is assigned to a token linguistic unit in the linguistic event, has at least one linguistic relation that relates the token linguistic unit in the linguistic event and at least one other token linguistic unit in the linguistic event.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9):
  - 2. The method according to claim 1, further comprising using a priori knowledge concerning the textual input data to estimate the covert meanings of token linguistic units identified as being below the predetermined threshold.
  - 3. The method according to claim 1, further comprising estimating a covert meaning of a token linguistic unit when its computed score indicates that its use in the linguistic event in its assigned contexts is implausible.
  - 4. The method according to claim 3, wherein the token linguistic units are one of words, phrases, and n-grams.
  - 5. The method according to claim 1, further comprising estimating a probability model P(u,C) based on contexts C for a set of linguistic units U, where U is given by {u₁, u₂, ..., u_m}, occurring in a corpus V_L of natural language L;
    - wherein the scores for linguistic units in the linguistic event are assigned using the probability model P(u,C).
  - 6. The method according to claim 5, further comprising collecting the corpus V_L of events E of the natural language L, wherein the corpus V_L includes a set of linguistic units U of the natural language L, and wherein tuples of linguistic units in the set of linguistic units U have one or more relations defined therebetween that are selected from a set of linguistic relations R, where R is given by {r₁, r₂, ..., r_k}.
  - 7. The method according to claim 6, wherein relations in the set of linguistic relations are one or more of a combination of (a) co-occurrence relations, (b) morphological relations, (c) syntactic relations, (d) semantic relations, and (e) discourse relations.
  - 8. The method according to claim 7, further comprising reducing linguistic events in the natural language L to a sequence of unit-context pairs S, where S is given by [<u₁, C₁>, <u₂, C₂>, ..., <u_n, C_n>], where C_i is a context of linguistic unit u_i that is made up of a set of contextual elements {c₁, c₂, ..., c_w}, and where c = <r_f, t₁, ..., t_h> provides that each contextual element c in the set of contextual elements C_i is made up of a relation r_f from the set of linguistic relations and h = Ar(r_f) − 1 other token linguistic units from the same linguistic event E.
  - 9. The method according to claim 1, further comprising converting audio or image input data with textual content to the textual input data.
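
The reduction in claim 8 and the probability model in claim 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the relation inventory `ARITY` (`prev`, `next`, `between`) is hypothetical, events are whitespace-split strings, and the model is a plain relative frequency over (unit, contextual element) pairs rather than a model over whole contexts C.

```python
from collections import Counter
from typing import NamedTuple

class ContextElement(NamedTuple):
    """One contextual element c = <r_f, t_1, ..., t_h>."""
    relation: str   # r_f, a relation name
    tokens: tuple   # t_1 ... t_h, the other units taking part in the relation

# Hypothetical relation inventory R with arities Ar(r) (an assumption).
ARITY = {"prev": 2, "next": 2, "between": 3}

def reduce_event(tokens):
    """Reduce one linguistic event to S = [(u_1, C_1), ..., (u_n, C_n)]."""
    pairs = []
    last = len(tokens) - 1
    for i, unit in enumerate(tokens):
        context = set()
        if i > 0:
            context.add(ContextElement("prev", (tokens[i - 1],)))
        if i < last:
            context.add(ContextElement("next", (tokens[i + 1],)))
        if 0 < i < last:
            context.add(
                ContextElement("between", (tokens[i - 1], tokens[i + 1]))
            )
        # Each element carries h = Ar(r_f) - 1 other units from the same event.
        assert all(len(c.tokens) == ARITY[c.relation] - 1 for c in context)
        pairs.append((unit, frozenset(context)))
    return pairs

def estimate_model(corpus):
    """Relative frequency of each (unit, contextual element) pair in the corpus."""
    counts = Counter()
    for event in corpus:
        for unit, context in reduce_event(event.split()):
            for element in context:
                counts[(unit, element)] += 1
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}
```

A binary relation such as `prev` contributes one other unit (h = 1), while the ternary `between` contributes two (h = 2), which is the arity bookkeeping that h = Ar(r_f) − 1 describes.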

10. An apparatus for detecting semantically encoded natural language in textual input data, comprising:
- a segmentation module for segmenting the textual input data into a plurality of token linguistic units to define a linguistic event;
  
  a context computation module for assigning a context to one or more of the token linguistic units in the linguistic event;
  
  a score computation module for computing a score for the one or more token linguistic units in the linguistic event that are assigned a context;
  
  a context verification module for applying a predetermined threshold to the score of the one or more token linguistic units to determine whether their use in the linguistic event in their assigned contexts is implausible; and
  
  a covert meaning estimation module for estimating covert meanings of token linguistic units identified as being below the predetermined threshold and the covert meaning estimation module detecting semantically encoded natural language based on replacing all occurrences, within the linguistic event, of the token linguistic units identified as being below the predetermined threshold with the estimated covert meaning of the token linguistic units identified as being below the predetermined threshold;
  
  wherein the computed scores for the one or more token linguistic units indicate whether the token linguistic units are expected to appear with their assigned contexts in the linguistic event; and
  
  wherein each context, which is assigned to a token linguistic unit in the linguistic event by the context computation module, has at least one linguistic relation that relates the token linguistic unit in the linguistic event and at least one other token linguistic unit in the linguistic event.
- Dependent claims (11, 12, 13, 14, 15):
  - 11. The apparatus according to claim 10, further comprising memory for storing a priori knowledge concerning the textual input data for estimating the covert meanings of token linguistic units identified as being below the predetermined threshold.
  - 12. The apparatus according to claim 10, further comprising:
    - means for estimating a probability model P(u,C) based on contexts C for a set of linguistic units U, where U is given by {u₁, u₂, ..., u_m}, occurring in a corpus V_L of natural language L; and
      
      memory for storing an estimated probability model P(u,C);
      
      wherein the scores for linguistic units in the linguistic event are assigned using the probability model P(u,C).
  - 13. The apparatus according to claim 12, further comprising means for collecting the corpus V_L of events E of the natural language L, wherein the corpus V_L includes a set of linguistic units U of the natural language L, and wherein tuples of linguistic units in the set of linguistic units U have one or more relations defined therebetween that are selected from a set of linguistic relations R, where R is given by {r₁, r₂, ..., r_k}.
  - 14. The apparatus according to claim 13, wherein relations in the set of linguistic relations are one or more of a combination of (a) co-occurrence relations, (b) morphological relations, (c) syntactic relations, (d) semantic relations, and (e) discourse relations.
  - 15. The apparatus according to claim 14, further comprising reducing linguistic events in the natural language L to a sequence of unit-context pairs S, where S is given by [<u₁, C₁>, <u₂, C₂>, ..., <u_n, C_n>], where C_i is a context of linguistic unit u_i that is made up of a set of contextual elements {c₁, c₂, ..., c_w}, and where c = <r_f, t₁, ..., t_h> provides that each contextual element c in the set of contextual elements C_i is made up of a relation r_f from the set of linguistic relations and h = Ar(r_f) − 1 other token linguistic units from the same linguistic event E.
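
The module decomposition of claim 10 can be pictured as a set of pluggable components wired together. The sketch below is one possible arrangement, assuming each module is a plain callable; the `Detector` class, its field names, and the toy `EXPECTED` table are hypothetical and only illustrate how the modules hand data to one another.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

@dataclass
class Detector:
    segment: Callable[[str], List[str]]                      # segmentation module
    assign_context: Callable[[Sequence[str], int], Tuple]    # context computation module
    score: Callable[[str, Tuple], float]                     # score computation module
    estimate_covert: Callable[[str, Tuple], str]             # covert meaning estimation module
    threshold: float                                         # used by context verification

    def run(self, text: str):
        units = self.segment(text)
        replacements: Dict[str, str] = {}
        for i, unit in enumerate(units):
            context = self.assign_context(units, i)
            # Context verification: a score below the threshold is implausible.
            if self.score(unit, context) < self.threshold:
                replacements.setdefault(unit, self.estimate_covert(unit, context))
        # Replace all occurrences of each flagged unit within the event.
        return replacements, [replacements.get(u, u) for u in units]

# Toy plug-ins (assumptions): the context is the previous unit, and EXPECTED
# lists which units are plausible after it.
EXPECTED = {("<s>",): {"we"}, ("we",): {"drink"}, ("drink",): {"tea"}}

def previous_unit(units, i):
    return (units[i - 1],) if i > 0 else ("<s>",)

def table_score(unit, context):
    return 1.0 if unit in EXPECTED.get(context, set()) else 0.0

def table_estimate(unit, context):
    options = sorted(EXPECTED.get(context, set()))
    return options[0] if options else unit

detector = Detector(str.split, previous_unit, table_score, table_estimate, 0.5)
```

Calling `detector.run("we drink gravel")` flags `gravel` after `drink` and substitutes `tea`. Swapping any callable (for example, a corpus-based scorer) leaves the rest of the wiring unchanged, which is the main point of the modular layout.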

16. A memory device for storing a set of program instructions executable on a data processing device and usable for detecting semantically encoded natural language in textual input data, the set of program instructions comprising instructions for:
- segmenting the textual input data into a plurality of token linguistic units to define a linguistic event;
  
  assigning a context to one or more of the token linguistic units in the linguistic event;
  
  computing a score for the one or more token linguistic units in the linguistic event that are assigned a context;
  
  applying a predetermined threshold to the score of the one or more token linguistic units to determine whether their use in the linguistic event in their assigned contexts is implausible;
  
  estimating covert meanings of token linguistic units identified as being below the predetermined threshold; and
  
  detecting semantically encoded natural language based on replacing all occurrences, within the linguistic event, of the token linguistic units identified as being below the predetermined threshold with the estimated covert meaning of the token linguistic units identified as being below the predetermined threshold;
  
  wherein the computed scores for the one or more token linguistic units indicate whether the token linguistic units are expected to appear with their assigned contexts in the linguistic event; and
  
  wherein each context, which is assigned to a token linguistic unit in the linguistic event, has at least one linguistic relation that relates the token linguistic unit in the linguistic event and at least one other token linguistic unit in the linguistic event.

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Isabelle, Pierre
Primary Examiner(s): Hudspeth; David
Assistant Examiner(s): Neway; Samuel G
Application Number: US10/737,975
Time in Patent Office: 1,573 Days
Field of Search: None
US Class Current: 704/9
CPC Class Codes: G06F 40/30 Semantic analysis
