Methods and systems for identifying paraphrases from an index of information items and associated sentence fragments

US 8,290,963 B1
Filed: 05/02/2011
Issued: 10/16/2012
Est. Priority Date: 03/23/2005
Status: Active Grant

Abstract

Methods and systems for identification of paraphrases from an index of information items and associated sentence fragments are described. One method described comprises identifying a pair of sentence fragments each having a same associated information item from an index, wherein the index comprises a plurality of information items and associated sentence fragments, and identifying a paraphrase pair from the pair of sentence fragments.

22 Claims

1. A system comprising:
- a machine-readable index;
  
  one or more computers programmed to perform operations in accordance with computer-readable instructions, the operations comprising:
  
  identifying, in the machine-readable index, a first sentence fragment and a second sentence fragment that are both associated with a same first information item, wherein the first information item is one of a date, an entity name, and a concept, wherein the machine-readable index comprises a plurality of information items and sentence fragments associated with respective of the information items;
  
  in response to identifying that the first sentence fragment and the second sentence fragment are both associated with a same first information item, identifying a paraphrase pair in the first and second sentence fragments;
  
  repeating the identifying of the first sentence fragment and the second sentence fragment and the identifying of the paraphrase pair to identify a plurality of paraphrase pairs; and
  
  determining a frequency of occurrence value for each of the paraphrase pairs in the plurality of paraphrase pairs, wherein the frequency of occurrence value embodies the frequency at which each paraphrase pair appears in the plurality of paraphrase pairs,
  
  wherein
  
  the paraphrase pair comprises a first paraphrase and a second paraphrase,
  
  the first paraphrase comprises a proper subset of the words in the first sentence fragment,
  
  the second paraphrase comprises a proper subset of the words in the second sentence fragment, and
  
  the first paraphrase and the second paraphrase are in a same language, have a same or a similar meaning, and are not identical.
- Dependent claims (2, 3, 4, 5):
  - 2. The system of claim 1, wherein the first information item is an entity name.
  - 3. The system of claim 2, wherein the entity name comprises at least one of a name of a person, a name of a place, and a name of an organization.
  - 4. The system of claim 1, wherein:
    - the first sentence fragment and the second sentence fragment each comprises a plurality of tokens; and
      
      identifying the paraphrase pair comprises:
      
      aligning the first sentence fragment and the second sentence fragment to match tokens in the first sentence fragment with tokens in the second sentence fragment;
      
      identifying one or more tokens in the first sentence fragment that are dissimilar to one or more tokens in the second sentence fragment; and
      
      identifying the paraphrase pair from the dissimilar tokens.
  - 5. The system of claim 1, wherein the operations further comprise:
    - identifying a subset of the plurality of paraphrase pairs having a frequency of occurrence value above a threshold; and
      
      adding the subset of the plurality of paraphrase pairs to a machine-readable data collection.
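
The operations recited in claim 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the index is modeled as a plain dict from information item to a list of sentence fragments, the sample data is invented, and `extract_pair` is a hypothetical stand-in for the paraphrase-identification step (it strips the words shared at the start and end of two fragments). None of these names or heuristics comes from the patent itself.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy index: information item (here a date) -> sentence fragments.
INDEX = {
    "1969": [
        "Apollo 11 landed on the Moon",
        "Apollo 11 touched down on the Moon",
    ],
    "1971": [
        "Apollo 15 landed on the Moon",
        "Apollo 15 touched down on the Moon",
    ],
}


def extract_pair(frag_a, frag_b):
    """Assumed heuristic: drop the shared leading and trailing words and
    return the differing middles as a candidate paraphrase pair, or None."""
    a, b = frag_a.split(), frag_b.split()
    start = 0
    while start < min(len(a), len(b)) and a[start] == b[start]:
        start += 1
    end = 0
    while end < min(len(a), len(b)) - start and a[-1 - end] == b[-1 - end]:
        end += 1
    mid_a = " ".join(a[start:len(a) - end])
    mid_b = " ".join(b[start:len(b) - end])
    # Require some shared context (so each paraphrase is a proper subset of
    # its fragment) and two non-empty, non-identical middles.
    if start + end == 0 or not mid_a or not mid_b or mid_a == mid_b:
        return None
    return tuple(sorted((mid_a, mid_b)))


def count_paraphrase_pairs(index):
    """Pair up fragments that share an information item, extract a candidate
    paraphrase pair from each, and count how often each pair occurs."""
    counts = Counter()
    for fragments in index.values():
        for frag_a, frag_b in combinations(fragments, 2):
            pair = extract_pair(frag_a, frag_b)
            if pair is not None:
                counts[pair] += 1
    return counts


pair_counts = count_paraphrase_pairs(INDEX)
```

In this toy data the same pair is found under two different dates, so its count is 2. The count plays the role of the frequency of occurrence value in the claim.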

6. A system comprising:
- a machine-readable index that associates information items and sentence fragments, wherein the information items are each one of a date, an entity name, and a concept; and
  
  one or more computers programmed to perform operations in accordance with computer-readable instructions, the operations comprising:
  
  identifying a collection of paraphrase pairs from the machine-readable index;
  
  determining a frequency of occurrence value for a first paraphrase pair of the collection of paraphrase pairs, wherein the frequency of occurrence value embodies the frequency at which the first paraphrase pair appears in the collection; and
  
  adding the first paraphrase pair to a data collection based at least in part on the frequency of occurrence value meeting a criterion,
  
  wherein
  
  each paraphrase pair comprises a first paraphrase and a second paraphrase,
  
  the first paraphrase comprises a proper subset of the words in a first sentence fragment,
  
  the second paraphrase comprises a proper subset of the words in a second sentence fragment, and
  
  each of the first paraphrase and the second paraphrase in a paraphrase pair are in a same language, have a same or a similar meaning, and are not identical.
- Dependent claims (7, 8):
  - 7. The system of claim 6, wherein:
    - the sentence fragments each comprise a plurality of tokens; and
      
      identifying the collection of paraphrase pairs comprises:
      
      aligning a first sentence fragment and a second sentence fragment to match tokens in the first sentence fragment with tokens in the second sentence fragment;
      
      identifying one or more tokens in the first sentence fragment that are dissimilar to one or more tokens in the second sentence fragment; and
      
      identifying the paraphrase pair from the dissimilar tokens.
  - 8. The system of claim 6, further comprising:
    - determining a second frequency of occurrence value for a second paraphrase pair; and
      
      adding the second paraphrase pair to the machine-readable index based at least in part on the frequency of occurrence value meeting the criterion.
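
The align-then-compare step recited in dependent claims such as claim 7 can be illustrated with Python's standard `difflib.SequenceMatcher`. This is a sketch under assumptions, not the alignment method of the patent: whitespace tokenization and the use of `difflib` are choices made here for brevity, and `dissimilar_spans` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def dissimilar_spans(frag_a, frag_b):
    """Align two fragments token by token and return the token spans that
    differ, as (span_from_a, span_from_b) tuples."""
    tokens_a, tokens_b = frag_a.split(), frag_b.split()
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" marks tokens in A aligned against different tokens in B.
        if tag == "replace":
            spans.append((" ".join(tokens_a[i1:i2]), " ".join(tokens_b[j1:j2])))
    return spans


spans = dissimilar_spans(
    "the company acquired the startup in 2004",
    "the company bought the startup in 2004",
)
print(spans)
```

Here the matched tokens on both sides of the gap anchor the alignment, and the single dissimilar span yields the candidate pair ("acquired", "bought").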

9. A method performed by a system of one or more computers, the method comprising:
- identifying, by the system in a machine-readable index, a first sentence fragment and a second sentence fragment that are both associated with a same first information item, wherein the first information item is one of a date, an entity name, and a concept, wherein the machine-readable index comprises a plurality of information items and sentence fragments associated with respective of the information items;
  
  in response to identifying that the first sentence fragment and the second sentence fragment are both associated with a same first information item, identifying, by the system, a paraphrase pair in the first and second sentence fragments;
  
  repeating, by the system, the identifying of the first sentence fragment and the second sentence fragment and the identifying of the paraphrase pair to identify a plurality of paraphrase pairs; and
  
  determining, by the system, a frequency of occurrence value for each of the paraphrase pairs in the plurality of paraphrase pairs, wherein the frequency of occurrence value embodies the frequency at which each paraphrase pair appears in the plurality of paraphrase pairs,
  
  wherein
  
  the paraphrase pair comprises a first paraphrase and a second paraphrase,
  
  the first paraphrase comprises a proper subset of the words in the first sentence fragment,
  
  the second paraphrase comprises a proper subset of the words in the second sentence fragment, and
  
  the first paraphrase and the second paraphrase are in a same language, have a same or a similar meaning, and are not identical.
- Dependent claims (10, 11):
  - 10. The method of claim 9, wherein:
    - the first sentence fragment and a second sentence fragment each comprises a plurality of tokens; and
      
      identifying the paraphrase pair comprises:
      
      aligning the first sentence fragment and a second sentence fragment to match tokens in the first sentence fragment with tokens in the second sentence fragment;
      
      identifying one or more tokens in the first sentence fragment that are dissimilar to one or more tokens in the second sentence fragment; and
      
      identifying the paraphrase pair from the dissimilar tokens.
  - 11. The method of claim 9, further comprising:
    - identifying, by the system, a subset of the plurality of paraphrase pairs having a frequency of occurrence value above a threshold; and
      
      adding, by the system, the subset of the plurality of paraphrase pairs to a machine-readable data collection.
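
The threshold step in claim 11 (and in claims 5 and 18) amounts to counting pairs and keeping those whose count exceeds a cutoff. A minimal sketch, assuming the pairs have already been extracted into a list with one entry per identification; the sample pairs, the `frequent_pairs` name, and the threshold value are all invented for illustration.

```python
from collections import Counter

# Hypothetical extracted pairs, one entry per time a pair was identified.
observed = [
    ("landed", "touched down"),
    ("landed", "touched down"),
    ("acquired", "bought"),
    ("acquired", "bought"),
    ("acquired", "bought"),
    ("in", "near"),
]


def frequent_pairs(pairs, threshold):
    """Return the pairs whose frequency of occurrence is above the threshold."""
    counts = Counter(pairs)
    return {pair: n for pair, n in counts.items() if n > threshold}


# Pairs seen only once are treated as noise in this sketch.
kept = frequent_pairs(observed, threshold=1)
```

With a threshold of 1, the one-off pair ("in", "near") is dropped and the two recurring pairs are kept for the data collection.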

12. A method performed by a system of one or more computers, the method comprising:
- identifying, by the system, a collection of paraphrase pairs that each associates information items and sentence fragments, wherein the information items are each one of a date, an entity name, and a concept;
  
  determining, by the system, a frequency of occurrence value for a first paraphrase pair of the collection of paraphrase pairs, wherein the frequency of occurrence value embodies the frequency at which the first paraphrase pair appears in the collection;
  
  adding, by the system, the first paraphrase pair to a machine-readable index based at least in part on the frequency of occurrence value meeting a criterion,
  
  wherein
  
  each paraphrase pair comprises a first paraphrase and a second paraphrase,
  
  the first paraphrase comprises a proper subset of the words in a first sentence fragment in the index,
  
  the second paraphrase comprises a proper subset of the words in a second sentence fragment in the index, and
  
  each of the first paraphrase and the second paraphrase in a paraphrase pair are in a same language, have a same or a similar meaning, and are not identical.

13. A system comprising:
- a machine-readable index that associates information items and sentence fragments; and
  
  one or more computers programmed to perform operations in accordance with computer-readable instructions, the operations comprising:
  
  repeatedly
  
  identifying a first sentence fragment and a second sentence fragment, each comprising a plurality of tokens and each associated with a same information item in the machine-readable index,
  
  aligning the first sentence fragment and the second sentence fragment so that tokens in the first sentence fragment match tokens in the second sentence fragment,
  
  determining a number of matched non-stop tokens in the aligned first and second sentence fragments,
  
  determining a number of dissimilar tokens in the aligned first and second sentence fragments, and
  
  identifying a paraphrase pair in the dissimilar tokens based at least in part on the number of matched non-stop tokens and the number of dissimilar tokens, wherein paraphrases in the paraphrase pair are in a same language;
  
  determining a frequency of occurrence value for each of the paraphrase pairs in the plurality of paraphrase pairs, wherein the frequency of occurrence value embodies the frequency at which each paraphrase pair appears in a collection of the identified paraphrase pairs;
  
  identifying a subset of the plurality of paraphrase pairs, wherein each paraphrase pair in the subset has a frequency of occurrence value that is above a criteria; and
  
  adding the subset of the plurality of paraphrase pairs to a machine-readable index.
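
Claim 13 conditions the pair on two counts taken from the alignment: matched non-stop tokens and dissimilar tokens. The sketch below shows one way such a test could look. The stop-word list, the two cutoffs (`min_matched`, `max_dissimilar`), the single-gap requirement, and the use of `difflib` are all assumptions made for illustration; the claim does not specify any of them.

```python
from difflib import SequenceMatcher

# Assumed, deliberately tiny stop-word list for the example.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and"}


def candidate_pair(frag_a, frag_b, min_matched=2, max_dissimilar=4):
    """Return a (paraphrase_a, paraphrase_b) candidate, or None when the
    alignment gives too little shared context or too much difference."""
    a, b = frag_a.lower().split(), frag_b.lower().split()
    matched_non_stop = 0
    dissimilar = 0
    replaced = []
    opcodes = SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes()
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "equal":
            # Count only matched tokens that are not stop words.
            matched_non_stop += sum(1 for tok in a[i1:i2] if tok not in STOP_WORDS)
        else:
            # Tokens present on either side without a match are dissimilar.
            dissimilar += (i2 - i1) + (j2 - j1)
            if tag == "replace":
                replaced.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    if (matched_non_stop >= min_matched
            and dissimilar <= max_dissimilar
            and len(replaced) == 1):
        return replaced[0]
    return None


good = candidate_pair("the treaty was signed in Paris",
                      "the treaty was ratified in Paris")
bad = candidate_pair("the treaty was signed", "a law was passed")
```

The first call has enough matched content words around a single small gap and returns a candidate; the second shares only one non-stop token and is rejected.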

14. A system comprising:
- a machine-readable index that includes a collection of index entries, wherein each of the index entries comprises a sentence fragment and an associated date; and
  
  one or more computers programmed to perform operations in accordance with computer-readable instructions, the operations comprising:
  
  accessing the index;
  
  repeatedly selecting, from the index, first index entries comprising a date and second index entries comprising the same date;
  
  identifying whether first portions of first sentence fragments from the first index entries paraphrase second portions of second sentence fragments from the second index entries, including
  
  determining a frequency of occurrence value for each first portion and second portion in the selected index entries, wherein the frequency of occurrence value embodies the frequency at which the first portions and the second portions are in the sentence fragments of the selected entries and
  
  identifying a subset of the first portions and the second portions having a frequency of occurrence value above a threshold; and
  
  in response to identifying that the first portions paraphrase the second portions, storing the first portions and the second portions in a machine-readable data collection,
  
  wherein paraphrases are in a same language, have a same or a similar meaning, and are not identical.

15. An article comprising one or more computer-readable data storage media storing program code operable to cause one or more machines to perform operations, the operations comprising:
- identifying, in a machine-readable index, a first sentence fragment and a second sentence fragment that are both associated with a same first information item, wherein the first information item is an entity name, wherein the machine-readable index comprises a plurality of information items and sentence fragments associated with respective of the information items;
  
  in response to identifying that the first sentence fragment and the second sentence fragment are both associated with the same first information item, identifying a paraphrase pair in the first and second sentence fragments;
  
  repeating the identifying of the first sentence fragment and the second sentence fragment and the identifying of the paraphrase pair to identify a plurality of paraphrase pairs; and
  
  determining a frequency of occurrence value for each of the paraphrase pairs in the plurality of paraphrase pairs, wherein the frequency of occurrence value embodies the frequency at which each paraphrase pair appears in the plurality of paraphrase pairs,
  
  wherein
  
  the paraphrase pair comprises a first paraphrase and a second paraphrase,
  
  the first paraphrase comprises a proper subset of the words in the first sentence fragment,
  
  the second paraphrase comprises a proper subset of the words in the second sentence fragment, and
  
  the first paraphrase and the second paraphrase are in a same language, have a same or a similar meaning, and are not identical.
- Dependent claims (16, 17, 18):
  - 16. The article of claim 15, wherein the entity name comprises at least one of a name of a person, a name of a place, and a name of an organization.
  - 17. The article of claim 15, wherein:
    - the first sentence fragment and the second sentence fragment each comprises a plurality of tokens; and
      
      identifying the paraphrase pair comprises:
      
      aligning the first sentence fragment and the second sentence fragment to match tokens in the first sentence fragment with tokens in the second sentence fragment;
      
      identifying one or more tokens in the first sentence fragment that are dissimilar to one or more tokens in the second sentence fragment; and
      
      identifying the paraphrase pair from the dissimilar tokens.
  - 18. The article of claim 15, wherein the operations further comprise:
    - identifying a subset of the plurality of paraphrase pairs having a frequency of occurrence value above a threshold; and
      
      adding the subset of the plurality of paraphrase pairs to a machine-readable data collection.

19. An article comprising one or more computer-readable data storage media storing program code operable to cause one or more machines to perform operations, the operations comprising:
- identifying, in a machine-readable index, a first sentence fragment and a second sentence fragment that are both associated with a same first information item, wherein the first information item is a concept, wherein the machine-readable index comprises a plurality of information items and sentence fragments associated with respective of the information items;
  
  in response to identifying that the first sentence fragment and the second sentence fragment are both associated with a same first information item, identifying a paraphrase pair in the first and second sentence fragments;
  
  repeating the identifying of the first sentence fragment and the second sentence fragment and the identifying of the paraphrase pair to identify a plurality of paraphrase pairs; and
  
  determining a frequency of occurrence value for each of the paraphrase pairs in the plurality of paraphrase pairs, wherein the frequency of occurrence value embodies the frequency at which each paraphrase pair appears in the plurality of paraphrase pairs,
  
  wherein
  
  the paraphrase pair comprises a first paraphrase and a second paraphrase,
  
  the first paraphrase comprises a proper subset of the words in the first sentence fragment,
  
  the second paraphrase comprises a proper subset of the words in the second sentence fragment, and
  
  the first paraphrase and the second paraphrase are in a same language, have a same or a similar meaning, and are not identical.
- Dependent claims (20, 21):
  - 20. The article of claim 19, wherein:
    - the first sentence fragment and a second sentence fragment each comprises a plurality of tokens; and
      
      identifying the paraphrase pair comprises:
      
      aligning the first sentence fragment and a second sentence fragment to match tokens in the first sentence fragment with tokens in the second sentence fragment;
      
      identifying one or more tokens in the first sentence fragment that are dissimilar to one or more tokens in the second sentence fragment; and
      
      identifying the paraphrase pair from the dissimilar tokens.
  - 21. The article of claim 19, further comprising:
    - identifying, by the system, a subset of the plurality of paraphrase pairs having a frequency of occurrence value above a threshold; and
      
      adding, by the system, the subset of the plurality of paraphrase pairs to a machine-readable data collection.

22. An article comprising one or more computer-readable data storage media storing program code operable to cause one or more machines to perform operations, the operations comprising:
- repeatedly
  
  identifying a first sentence fragment and a second sentence fragment, each comprising a plurality of tokens and each associated with a same information item in a machine-readable index that associates information items and sentence fragments,
  
  aligning the first sentence fragment and the second sentence fragment so that tokens in the first sentence fragment match tokens in the second sentence fragment,
  
  determining a number of matched non-stop tokens in the aligned first and second sentence fragments,
  
  determining a number of dissimilar tokens in the aligned first and second sentence fragments, and
  
  identifying a paraphrase pair in the dissimilar tokens based at least in part on the number of matched non-stop tokens and the number of dissimilar tokens, wherein paraphrases in the paraphrase pair are in a same language;
  
  determining a frequency of occurrence value for each of the paraphrase pairs in the plurality of paraphrase pairs, wherein the frequency of occurrence value embodies the frequency at which each paraphrase pair appears in a collection of the identified paraphrase pairs;
  
  identifying a subset of the plurality of paraphrase pairs, wherein each paraphrase pair in the subset has a frequency of occurrence value that is above a criteria; and
  
  adding the subset of the plurality of paraphrase pairs to a machine-readable index.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Pasca, Alexandru Marius; Dienes, Peter Szabolcs
Primary Examiner(s): TRUONG, CAM Y T
Application Number: US13/098,697
Time in Patent Office: 533 Days
Field of Search: 707/750, 707/753, 707/999.06, 707/999.05
US Class Current: 707/750
CPC Class Codes: G06F 40/247 Thesauruses; Synonyms