Method and system for analyzing text

US 9,292,491 B2
Filed: 06/13/2014
Issued: 03/22/2016
Est. Priority Date: 11/04/2008
Status: Active Grant

Abstract

An apparatus for providing a control input signal for an industrial process or technical system having one or more controllable elements includes elements for generating a semantic space for a text corpus, and elements for generating a norm from one or more reference words or texts, each reference word or text being associated with a defined respective value on a scale, and the norm being calculated as a reference point or set of reference points in the semantic space for each reference word or text with its associated respective scale value. The apparatus further includes elements for reading at least one target word included in the text corpus, elements for predicting a value of a variable associated with the target word based on the semantic space and the norm, and elements for providing the predicted value in a control input signal to the industrial process or technical system. A method for predicting a value of a variable associated with a target word is also disclosed, together with an associated system and computer-readable medium.

9 Citations

19 Claims

1. A method for predicting a value of a variable associated with a target word or set of words, performed by an apparatus comprising at least one computer and comprising the steps of:
- the apparatus collecting a text corpus comprising a set of words that include the target word,
  
  the apparatus generating a representation of the text corpus,
  
  the at least one computer creating a semantic space for the set of words, based on the representation of the text corpus,
  
  the at least one computer defining, for a location in the semantic space, a value of the variable,
  
  the at least one computer estimating, for the target word, a value of the variable, based on the semantic space and the defined variable value of the location in the semantic space,
  
  calculating, by the at least one computer, a predicted value of the target word, on basis of the semantic space, the defined variable value of the location in the semantic space and the estimated variable value of the target word, and
  
  statistically testing if two sets of words or two sets of documents of the text corpora differ in semantic representation, wherein the step of statistically testing comprises:
  
  i) calculating a first vector to represent a mean location in the semantic space for a first of the two sets of words or documents;
  
  ii) calculating a second vector to represent a mean location in the semantic space for a second of the two sets of words or documents;
  
  iii) calculating a distance between the first and second vectors;
  
  iv) repeating the steps i), ii), and iii) above while assigning the words randomly to the first of the two sets of words or documents and to the second of the two sets of words or documents;
  
  v) counting a percentage of occasions when the distance for the randomly assigned words is larger than when the distance is based on the non-randomly assigned words; and
  
  vi) providing the counted percentage as a probability for whether the two sets of words or documents differ in semantic representation.
- Dependent claims (2, 3, 4, 5, 6, 7, 8)
  - 2. A method according to claim 1, in said collecting step, the apparatus collects the text corpus from a database, said apparatus further comprising a search robot connected to and searching a network containing a further apparatus containing the database, wherein the semantic space is created from the text corpus using Latent Semantic Analysis, and wherein the estimating of the target word variable value comprises performing regression analysis having the target word variable value as a dependent variable.
  - 3. A method according to claim 1, wherein the creating of the semantic space comprises performing a singular value decomposition on the representation of the text corpus.
  - 4. A method according to claim 1, further comprising the step of statistically testing the predicted value of the variable, by comparing the predicted value with known values.
  - 5. A method according to claim 1, wherein the collecting of the text corpus comprises collecting time information associated with text in the text corpus.
  - 6. A method according to claim 1, wherein the collecting of the text corpus comprises collecting time information associated with text in the text corpus from the apparatus accessing a plurality of distributed text sources via one or more networks, and wherein the predicting of the value of the variable comprises associating the predicted value with the time information of the text corpus.
  - 7. A method according to claim 1, wherein the collecting of the text corpus comprises collecting a relevance indicating measure associated with text in the text corpus.
  - 8. A method according to claim 7, wherein the predicting of the value of the variable comprises numerically weighting the value with the relevance indicating measure.
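
Steps i) to vi) of claim 1 describe what is commonly called a permutation test. The following is a minimal sketch of one possible reading, not the patented implementation. It assumes that each word or document is already represented as a coordinate vector in the semantic space, that the distance is Euclidean, and that random assignment keeps the two group sizes fixed. None of these choices is stated in the claim, and the function names are hypothetical.

```python
import math
import random


def mean_vector(vectors):
    """Component-wise mean of equal-length vectors (steps i and ii)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two vectors (step iii).

    Assumption: the claim only says "a distance"; Euclidean is one option.
    """
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def permutation_test(set_a, set_b, n_iter=1000, seed=0):
    """Share of random assignments whose mean distance exceeds the observed one.

    Returned as a fraction in [0, 1] rather than a percentage.
    """
    rng = random.Random(seed)
    observed = euclidean(mean_vector(set_a), mean_vector(set_b))
    pooled = list(set_a) + list(set_b)
    larger = 0
    for _ in range(n_iter):
        # Step iv: reassign items at random, keeping group sizes (assumption).
        rng.shuffle(pooled)
        rand_a = pooled[:len(set_a)]
        rand_b = pooled[len(set_a):]
        # Step v: count the occasions where the random distance is larger.
        if euclidean(mean_vector(rand_a), mean_vector(rand_b)) > observed:
            larger += 1
    # Step vi: the counted share serves as the reported probability.
    return larger / n_iter
```

Under this reading, a small returned value indicates that random assignment rarely produces a separation as large as the observed one, which is the usual evidence that the two sets differ.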

9. An apparatus (300) for providing a control input signal (412) for an industrial process or technical system (400) having one or more controllable elements (421-42n), the apparatus comprising:
- means (320; 909) for generating a semantic space (892) for a text corpus (302; 908);
  
  means (330; 914) for generating a norm (896) from one or more reference words or texts (332; 893), the reference word or text being associated with a defined respective value on a scale, and the norm being calculated as a reference point or set of reference points in the semantic space for the reference word or text with its associated respective scale value;
  
  means (340) for reading at least one target word;
  
  means (340) for predicting a value (350) of a variable associated with the target word based on the semantic space and the norm;
  
  means (340) for providing the predicted value in a control input signal (412) to said industrial process or technical system (400); and
  
  means for statistically testing if two sets of words or two sets of documents of the text corpora differ in semantic representation, wherein the means for statistically testing provides for:
  
  i) calculating a first vector to represent a mean location in the semantic space for a first of the two sets of words or documents;
  
  ii) calculating a second vector to represent a mean location in the semantic space for a second of the two sets of words or documents;
  
  iii) calculating a distance between the first and second vectors;
  
  iv) repeating the steps i), ii), and iii) above while assigning the words randomly to the first of the two sets of words or documents and to the second of the two sets of words or documents;
  
  v) counting a percentage of occasions when the distance for the randomly assigned words is larger than when the distance is based on the non-randomly assigned words; and
  
  vi) providing the counted percentage as a probability for whether the two sets of words or documents differ in semantic representation.
- Dependent claims (10, 11)
  - 10. An apparatus as defined in claim 9, further comprising means (310) for collecting said digital text corpus from a plurality of distributed text sources (314-316) accessible via one or more networks (312), wherein the semantic space is created from the text corpus using Latent Semantic Analysis, and wherein the estimating of the target word variable value comprises performing regression analysis having the target word variable value as a dependent variable.
  - 11. An apparatus as defined in claim 9, further comprising a data processing unit, wherein said data processing unit is configured to perform a method for predicting a value of a variable associated with a target word or set of words, performed on at least one computer and comprising the steps of:
    - collecting a text corpus comprising a set of words that include the target word,
      
      generating a representation of the text corpus,
      
      creating a semantic space for the set of words, based on the representation of the text corpus,
      
      defining, for a location in the semantic space, a value of the variable,
      
      estimating, for the target word, a value of the variable, based on the semantic space and the defined variable value of the location in the semantic space, and
      
      calculating a predicted value of the target word, on basis of the semantic space, the defined variable value of the location in the semantic space and the estimated variable value of the target word.
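
Claims 10 and 11 combine defined variable values at locations in the semantic space with a regression analysis in which the variable is the dependent variable. The sketch below illustrates one way this could look, assuming ordinary least squares with an intercept, with semantic-space coordinates as predictors. The claims do not specify the regression model, and all function names here are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: reference locations are degenerate")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_linear_regression(coords, values):
    """Least-squares fit of the variable on semantic-space coordinates.

    coords: locations with defined variable values (reference words).
    values: the defined variable value at each location.
    Assumption: a linear model with intercept, fitted via the normal equations.
    """
    rows = [[1.0] + list(c) for c in coords]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, values)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights, coord):
    """Predicted variable value for a target word at location coord."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], coord))
```

A possible use is to fit on the reference words and then call `predict` with the target word's coordinates; in practice a semantic space has many more dimensions than this toy setup, and a regularized model may be needed.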

12. A system for predicting a value of a variable associated with a target word or set of words, comprising at least one computer configured to:
- collect a text corpus comprising a set of words that include the target word,
  
  generate a representation of the text corpus,
  
  create a semantic space for the set of words, based on the representation of the text corpus,
  
  define, for a location in the semantic space, of a subset of the words, a value of the variable,
  
  estimate, for the target word, a value of the variable, based on the semantic space and the defined variable value of the location in the semantic space,
  
  calculate a predicted value of the target word, on basis of the semantic space, the defined variable value of the location in the semantic space and the estimated variable value of the target word, and
  
  statistically test if two sets of words or two sets of documents of the text corpora differ in semantic representation, wherein the statistical test comprises:
  
  i) calculating a first vector to represent a mean location in the semantic space for a first of the two sets of words or documents;
  
  ii) calculating a second vector to represent a mean location in the semantic space for a second of the two sets of words or documents;
  
  iii) calculating a distance between the first and second vectors;
  
  iv) repeating the steps i), ii), and iii) above while assigning the words randomly to the first of the two sets of words or documents and to the second of the two sets of words or documents;
  
  v) counting a percentage of occasions when the distance for the randomly assigned words is larger than when the distance is based on the non-randomly assigned words; and
  
  vi) providing the counted percentage as a probability for whether the two sets of words or documents differ in semantic representation.
- Dependent claims (13, 14, 15, 16, 17)
  - 13. A system according to claim 12, wherein the creating of the semantic space comprises performing a singular value decomposition on the representation of the text corpus.
  - 14. A system according to claim 12, further configured for statistically testing the predicted value of the variable, by comparing the predicted value with known values.
  - 15. A system according to claim 12, wherein the semantic space is created from the text corpus using Latent Semantic Analysis, and wherein the estimating of the target word variable value comprises performing regression analysis having the target word variable value as a dependent variable.
  - 16. A system according to claim 12, wherein the collecting of the text corpus comprises one of the group consisting of i) collecting a relevance indicating measure associated with text in the text corpus, and ii) collecting time information associated with text in the text corpus.
  - 17. A system according to claim 16, wherein the predicting of the value of the variable comprises numerically weighting the value with the relevance indicating measure.
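
Claims 16 and 17 recite numerically weighting the predicted value with a relevance indicating measure but do not say how. One plausible interpretation, shown here purely as an assumption, is a relevance-weighted average of per-text predicted values. The function name and the choice of a weighted mean are hypothetical.

```python
def relevance_weighted_value(predictions, relevances):
    """Combine per-text predicted values, weighting each by its relevance.

    Assumption: the weighting is a weighted mean; the claims do not fix this.
    """
    if len(predictions) != len(relevances):
        raise ValueError("need one relevance measure per prediction")
    total = sum(relevances)
    if total <= 0:
        raise ValueError("relevance measures must sum to a positive number")
    return sum(p * r for p, r in zip(predictions, relevances)) / total
```

With this choice, a text with three times the relevance of another contributes three times as much to the combined value.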

18. A non-transitory computer readable medium having stored thereon a computer program having software instructions which when run on a computer cause the computer to perform the steps of:
- collecting a text corpus comprising a set of words that include the target word,
  
  generating a representation of the text corpus,
  
  creating a semantic space for the set of words, based on the representation of the text corpus,
  
  defining, for a location in the semantic space, a value of the variable,
  
  estimating, for the target word, a value of the variable, based on the semantic space and the defined variable value of the location in the semantic space,
  
  calculating a predicted value of the target word, on basis of the semantic space, the defined variable value of the location in the semantic space and the estimated variable value of the target word; and
  
  statistically testing if two sets of words or two sets of documents of the text corpora differ in semantic representation, wherein the statistically testing comprises:
  
  i) calculating a first vector to represent a mean location in the semantic space for a first of the two sets of words or documents;
  
  ii) calculating a second vector to represent a mean location in the semantic space for a second of the two sets of words or documents;
  
  iii) calculating a distance between the first and second vectors;
  
  iv) repeating the steps i), ii), and iii) above while assigning the words randomly to the first of the two sets of words or documents and to the second of the two sets of words or documents;
  
  v) counting a percentage of occasions when the distance for the randomly assigned words is larger than when the distance is based on the non-randomly assigned words; and
  
  vi) providing the counted percentage as a probability for whether the two sets of words or documents differ in semantic representation.

19. A method for predicting a value of a variable associated with a target word or set of words, performed on an apparatus comprising at least one computer and comprising the steps of:
- connecting the apparatus to a network containing a plurality of text sources, wherein the apparatus further comprises a search robot, and collecting a text corpus comprising a set of words that include the target word from said text sources using the search robot,
  
  the apparatus generating a representation of the text corpus,
  
  the computer creating a semantic space for the set of words, based on the representation of the text corpus,
  
  the computer defining, for a location in the semantic space, a value of the variable,
  
  the computer estimating, for the target word, a value of the variable, based on the semantic space and the defined variable value of the location in the semantic space,
  
  the computer calculating a predicted value of the target word, on basis of the semantic space, the defined variable value of the location in the semantic space and the estimated variable value of the target word, wherein the estimating of the target word variable value comprises performing regression analysis having the target word variable value as a dependent variable, and
  
  the computer statistically testing if two sets of words or two sets of documents of the text corpora differ in semantic representation, wherein the step of statistically testing comprises:
  
  i) calculating a first vector to represent a mean location in the semantic space for a first of the two sets of words or documents;
  
  ii) calculating a second vector to represent a mean location in the semantic space for a second of the two sets of words or documents;
  
  iii) calculating a distance between the first and second vectors;
  
  iv) repeating the steps i), ii), and iii) above while assigning the words randomly to the first of the two sets of words or documents and to the second of the two sets of words or documents;
  
  v) counting a percentage of occasions when the distance for the randomly assigned words is larger than when the distance is based on the non-randomly assigned words; and
  
  vi) providing the counted percentage as a probability for whether the two sets of words or documents differ in semantic representation.

Current Assignee: Strossle International AB
Original Assignee: Strossle International AB
Inventors: Sikstrom, Sverker; Tyrberg, Mattias; Hall, Anders; Horte, Fredrik; Stenberg, Joakim
Primary Examiner(s): VO, HUYEN X
Application Number: US14/303,651
Publication Number: US 20140309989A1
Time in Patent Office: 648 Days
Field of Search: 704/1-10, 704/251, 704/255, 704/257, 704/270, 704/250, 707/739
US Class Current: 1/1
CPC Class Codes:
G06F 16/36   Creation of semantic tools,...
G06F 40/30   Semantic analysis
G06Q 10/04   Forecasting or optimisation...
