Method and system for analyzing text
First Claim
1. A method for predicting a value of a variable associated with a target word or set of words, performed by an apparatus comprising at least one computer and comprising the steps of:
- the apparatus collecting a text corpus comprising a set of words that include the target word,
- the apparatus generating a representation of the text corpus,
- the at least one computer creating a semantic space for the set of words, based on the representation of the text corpus,
- the at least one computer defining, for a location in the semantic space, a value of the variable,
- the at least one computer estimating, for the target word, a value of the variable, based on the semantic space and the defined variable value of the location in the semantic space,
- calculating, by the at least one computer, a predicted value of the target word, on basis of the semantic space, the defined variable value of the location in the semantic space and the estimated variable value of the target word, and
- statistically testing if two sets of words or two sets of documents of the text corpora differ in semantic representation, wherein the step of statistically testing comprises:
i) calculating a first vector to represent a mean location in the semantic space for a first of the two sets of words or documents;
ii) calculating a second vector to represent a mean location in the semantic space for a second of the two sets of words or documents;
iii) calculating a distance between the first and second vectors;
iv) repeating the steps i), ii), and iii) above while assigning the words randomly to the first of the two sets of words or documents and to the second of the two sets of words or documents;
v) counting a percentage of occasions when the distance for the randomly assigned words is larger than when the distance is based on the non-randomly assigned words; and
vi) providing the counted percentage as a probability for whether the two sets of words or documents differ in semantic representation.
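Steps i) to vi) describe what is commonly called a permutation (randomization) test. The following is a minimal sketch of that procedure, not the patented implementation. It assumes that each word or document is already represented as a NumPy vector in the semantic space and that Euclidean distance is used; the claim does not fix a distance measure. The names `permutation_test`, `n_permutations` and `seed` are hypothetical, and the function returns a fraction between 0 and 1 rather than a percentage.

```python
import numpy as np


def permutation_test(set_a, set_b, n_permutations=1000, seed=0):
    """Return (observed distance, fraction of random splits with a larger distance)."""
    rng = np.random.default_rng(seed)
    set_a = np.asarray(set_a, dtype=float)
    set_b = np.asarray(set_b, dtype=float)

    def mean_distance(a, b):
        # Steps i) to iii): mean location of each set, then the distance
        # between the two mean vectors (Euclidean distance is an assumption).
        return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

    observed = mean_distance(set_a, set_b)
    pooled = np.vstack([set_a, set_b])
    n_a = len(set_a)

    larger = 0
    for _ in range(n_permutations):
        # Step iv): reassign the pooled items to the two sets at random,
        # keeping the original set sizes.
        shuffled = rng.permutation(pooled)
        # Step v): count the occasions when the random split is further apart.
        if mean_distance(shuffled[:n_a], shuffled[n_a:]) > observed:
            larger += 1

    # Step vi): report the counted share as the probability estimate.
    return observed, larger / n_permutations
```

A small returned fraction indicates that random reassignment rarely produces a distance as large as the observed one, which suggests that the two sets differ in semantic representation.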
2 Assignments
0 Petitions
Abstract
An apparatus for providing a control input signal for an industrial process or technical system having one or more controllable elements includes elements for generating a semantic space for a text corpus, and elements for generating a norm from one or more reference words or texts, the or each reference word or text being associated with a defined respective value on a scale, and the norm being calculated as a reference point or set of reference points in the semantic space for the or each reference word or text with its associated respective scale value. The apparatus further includes elements for reading at least one target word included in the text corpus, elements for predicting a value of a variable associated with the target word based on the semantic space and the norm, and elements for providing the predicted value in a control input signal to the industrial process or technical system. A method for predicting a value of a variable associated with a target word is also disclosed together with an associated system and computer readable medium.
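The abstract does not state how the semantic space is generated from the text corpus. Purely as an illustration, the sketch below assumes one common approach: a word-by-document count matrix serves as the representation of the corpus, and a truncated singular value decomposition (as in latent semantic analysis) maps each word to a short vector. The function name `build_semantic_space`, the parameter `n_dims`, and the whitespace tokenization are all hypothetical choices and are not taken from the patent text.

```python
import numpy as np


def build_semantic_space(documents, n_dims=2):
    """Map each word in the corpus to a unit-length vector (illustrative only)."""
    # Representation of the corpus: a word-by-document count matrix.
    vocab = sorted({w for doc in documents for w in doc.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(documents)))
    for j, doc in enumerate(documents):
        for w in doc.lower().split():
            counts[index[w], j] += 1.0

    # Assumed technique: keep the strongest n_dims singular directions.
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    k = min(n_dims, len(s))
    vectors = u[:, :k] * s[:k]

    # Normalize to unit length so that words are compared by direction.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    vectors = vectors / np.where(norms == 0, 1.0, norms)
    return {w: vectors[i] for w, i in index.items()}
```

Words that occur in the same documents receive the same or similar vectors, which is the property the later prediction and testing steps rely on.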
9 Citations
19 Claims
9. An apparatus (300) for providing a control input signal (412) for an industrial process or technical system (400) having one or more controllable elements (421-42n), the apparatus comprising:
- means (320; 909) for generating a semantic space (892) for a text corpus (302; 908);
- means (330; 914) for generating a norm (896) from one or more reference words or texts (332; 893), the reference word or text being associated with a defined respective value on a scale, and the norm being calculated as a reference point or set of reference points in the semantic space for the reference word or text with its associated respective scale value;
- means (340) for reading at least one target word;
- means (340) for predicting a value (350) of a variable associated with the target word based on the semantic space and the norm;
- means (340) for providing the predicted value in a control input signal (412) to said industrial process or technical system (400); and
- means for statistically testing if two sets of words or two sets of documents of the text corpora differ in semantic representation, wherein the means for statistically testing provides for:
i) calculating a first vector to represent a mean location in the semantic space for a first of the two sets of words or documents;
ii) calculating a second vector to represent a mean location in the semantic space for a second of the two sets of words or documents;
iii) calculating a distance between the first and second vectors;
iv) repeating the steps i), ii), and iii) above while assigning the words randomly to the first of the two sets of words or documents and to the second of the two sets of words or documents;
v) counting a percentage of occasions when the distance for the randomly assigned words is larger than when the distance is based on the non-randomly assigned words; and
vi) providing the counted percentage as a probability for whether the two sets of words or documents differ in semantic representation.
12. A system for predicting a value of a variable associated with a target word or set of words, comprising at least one computer configured to:
- collect a text corpus comprising a set of words that include the target word,
- generate a representation of the text corpus,
- create a semantic space for the set of words, based on the representation of the text corpus,
- define, for a location in the semantic space, of a subset of the words, a value of the variable,
- estimate, for the target word, a value of the variable, based on the semantic space and the defined variable value of the location in the semantic space,
- calculate a predicted value of the target word, on basis of the semantic space, the defined variable value of the location in the semantic space and the estimated variable value of the target word, and
- statistically test if two sets of words or two sets of documents of the text corpora differ in semantic representation, wherein the statistically test comprises:
i) calculating a first vector to represent a mean location in the semantic space for a first of the two sets of words or documents;
ii) calculating a second vector to represent a mean location in the semantic space for a second of the two sets of words or documents;
iii) calculating a distance between the first and second vectors;
iv) repeating the steps i), ii), and iii) above while assigning the words randomly to the first of the two sets of words or documents and to the second of the two sets of words or documents;
v) counting a percentage of occasions when the distance for the randomly assigned words is larger than when the distance is based on the non-randomly assigned words; and
vi) providing the counted percentage as a probability for whether the two sets of words or documents differ in semantic representation.
18. A non-transitory computer readable medium having stored thereon a computer program having software instructions which when run on a computer cause the computer to perform the steps of:
- collecting a text corpus comprising a set of words that include the target word,
- generating a representation of the text corpus,
- creating a semantic space for the set of words, based on the representation of the text corpus,
- defining, for a location in the semantic space, a value of the variable,
- estimating, for the target word, a value of the variable, based on the semantic space and the defined variable value of the location in the semantic space,
- calculating a predicted value of the target word, on basis of the semantic space, the defined variable value of the location in the semantic space and the estimated variable value of the target word; and
- statistically testing if two sets of words or two sets of documents of the text corpora differ in semantic representation, wherein the statistically testing comprises:
i) calculating a first vector to represent a mean location in the semantic space for a first of the two sets of words or documents;
ii) calculating a second vector to represent a mean location in the semantic space for a second of the two sets of words or documents;
iii) calculating a distance between the first and second vectors;
iv) repeating the steps i), ii), and iii) above while assigning the words randomly to the first of the two sets of words or documents and to the second of the two sets of words or documents;
v) counting a percentage of occasions when the distance for the randomly assigned words is larger than when the distance is based on the non-randomly assigned words; and
vi) providing the counted percentage as a probability for whether the two sets of words or documents differ in semantic representation.
19. A method for predicting a value of a variable associated with a target word or set of words, performed on an apparatus comprising at least one computer and comprising the steps of:
- connecting the apparatus to a network containing a plurality of text sources, wherein the apparatus further comprises search robot, and collecting a text corpus comprising a set of words that include the target word from said text sources using the search robot,
- the apparatus generating a representation of the text corpus,
- the computer creating a semantic space for the set of words, based on the representation of the text corpus,
- the computer defining, for a location in the semantic space, a value of the variable,
- the computer estimating, for the target word, a value of the variable, based on the semantic space and the defined variable value of the location in the semantic space,
- the computer calculating a predicted value of the target word, on basis of the semantic space, the defined variable value of the location in the semantic space and the estimated variable value of the target word, wherein the estimating of the target word variable value comprises performing regression analysis having the target word variable value as a dependant variable, and
- the computer statistically testing if two sets of words or two sets of documents of the text corpora differ in semantic representation, wherein the step of statistically testing comprises:
i) calculating a first vector to represent a mean location in the semantic space for a first of the two sets of words or documents;
ii) calculating a second vector to represent a mean location in the semantic space for a second of the two sets of words or documents;
iii) calculating a distance between the first and second vectors;
iv) repeating the steps i), ii), and iii) above while assigning the words randomly to the first of the two sets of words or documents and to the second of the two sets of words or documents;
v) counting a percentage of occasions when the distance for the randomly assigned words is larger than when the distance is based on the non-randomly assigned words; and
vi) providing the counted percentage as a probability for whether the two sets of words or documents differ in semantic representation.
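Claim 19 recites regression analysis with the target word variable value as the dependent variable but does not name a particular regression model. The sketch below assumes ordinary least squares with an intercept: the semantic-space coordinates of words with defined values act as predictors, and the fitted coefficients are then applied to the target word's vector. The names `fit_norm` and `predict_value` are hypothetical, and the choice of ordinary least squares is an assumption made only for illustration.

```python
import numpy as np


def fit_norm(reference_vectors, reference_values):
    """Fit least-squares coefficients from reference vectors to their defined values."""
    x = np.asarray(reference_vectors, dtype=float)
    y = np.asarray(reference_values, dtype=float)
    # Append a column of ones so that the model includes an intercept term.
    x1 = np.hstack([x, np.ones((len(x), 1))])
    coef, *_ = np.linalg.lstsq(x1, y, rcond=None)
    return coef


def predict_value(coef, target_vector):
    """Apply the fitted coefficients to a target word's semantic-space vector."""
    v = np.append(np.asarray(target_vector, dtype=float), 1.0)
    return float(v @ coef)


# Toy usage with invented data: four reference words in a 2-D semantic space
# whose defined values happen to follow 2*x0 - x1 + 3 exactly.
coef = fit_norm([[0, 0], [1, 0], [0, 1], [1, 1]], [3.0, 5.0, 2.0, 4.0])
predicted = predict_value(coef, [0.5, 0.5])
```

With real data the reference values would come from the defined scale values of the reference words or texts, and the fit would generally not be exact.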
Specification