METHOD FOR DISAMBIGUATED FEATURES IN UNSTRUCTURED TEXT

US 20150154286A1
Filed: 12/02/2014
Published: 06/04/2015
Est. Priority Date: 12/02/2013
Status: Active Grant

Abstract

A method for disambiguating features in unstructured text is provided. The disclosed method may not require pre-existing links to be present. The method for disambiguating features in unstructured text may use co-occurring features derived from both the source document and a large document corpus. The disclosed method may include multiple modules, including a linking module for linking the derived features from the source document to the co-occurring features of an existing knowledge base. The disclosed method for disambiguating features may allow identifying unique entities from a knowledge base that includes entities with a unique set of co-occurring features, which in turn may allow for increased precision in knowledge discovery and search results, employing advanced analytical methods over a massive corpus, employing a combination of entities, co-occurring entities, topic IDs, and other derived features.
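
As a rough illustration of the idea in the abstract, the sketch below chooses between knowledge-base entities that share a name by comparing the features that co-occur with a mention against each entity's stored co-occurring features. The names `Entity`, `jaccard`, and `disambiguate`, the Jaccard overlap measure, and the sample data are all assumptions made for illustration; the abstract does not specify a similarity measure.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A knowledge-base entity plus the features known to co-occur with it (hypothetical shape)."""
    unique_id: str
    name: str
    cooccurring: set = field(default_factory=set)


def jaccard(a: set, b: set) -> float:
    """Overlap of two feature sets; an assumed stand-in for 'relatedness'."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def disambiguate(mention: str, context: set, knowledge_base: list):
    """Return the same-named entity whose co-occurring features best overlap the context."""
    candidates = [e for e in knowledge_base if e.name == mention]
    if not candidates:
        return None
    return max(candidates, key=lambda e: jaccard(context, e.cooccurring))


# Two entities share the name "Jordan"; the surrounding features decide between them.
kb = [
    Entity("E1", "Jordan", {"basketball", "Chicago Bulls", "NBA"}),
    Entity("E2", "Jordan", {"Amman", "Middle East", "Dead Sea"}),
]
best = disambiguate("Jordan", {"NBA", "playoffs", "Chicago Bulls"}, kb)
print(best.unique_id)
```

Here the mention's context shares two features with `E1` and none with `E2`, so `E1` is selected. A production system would presumably weight features rather than count them equally.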

20 Claims

1. A method comprising:
- searching, by a node of a system hosting an in-memory database, a set of candidate records to identify one or more candidates matching one or more extracted features, wherein an extracted feature that matches a candidate is a primary feature;
  
  associating, by the node, each of the extracted features with one or more machine-generated topic identifiers (“topic IDs”);
  
  disambiguating, by the node, each of the primary features from one another based on relatedness of topic IDs;
  
  identifying, by the node, a set of secondary features associated with each primary feature based upon the relatedness of topic IDs;
  
  disambiguating, by the node, each of the primary features from each of the secondary features in the associated set of secondary features based on relatedness of topic IDs;
  
  linking, by the node, each primary feature to the associated set of secondary features to form a new cluster;
  
  determining, by the node, whether the new cluster matches an existing knowledgebase cluster, wherein, when there is a match, determining, by the disambiguation module of the in-memory database server computer, an existing unique identifier (“unique ID”) corresponding to each matching primary feature in the knowledgebase cluster and updating the knowledgebase cluster to include the new cluster; and
  
  when there is no match, creating, by the node, a new knowledgebase cluster and assigning a new unique ID to the primary feature of the new knowledgebase cluster; and
  
  transmitting, by the node, one of the existing unique ID and the new unique ID for the primary feature.
  - 2. The method according to claim 1, further comprising:
    - comparing, by the node, each of the candidate records matching an extracted feature; and
      
      assigning, by the node, a weighted match score result to each of the extracted features based upon the comparison.
  - 3. The method according to claim 2, further comprising associating, by the node, each of the extracted features with a set of weighted feature attributes.
  - 4. The method according to claim 3, further comprising determining, by the node, relatedness of each of the extracted features based on one or more weighted feature attributes.
  - 5. The method according to claim 1, further comprising:
    - recognizing and extracting, by an extraction module of the node, wherein one or more primary features are identified in the one or more extracted features; and
      
      storing, by the extraction module of the node, each of the extracted features in a database.
  - 6. The method according to claim 5, further comprising assigning, by the extraction module of the node, an extraction certainty score to each of the features.
  - 7. The method according to claim 1, wherein each primary feature is associated with a set of one or more feature attributes.
  - 8. The method according to claim 7, wherein a feature attribute is selected from the group consisting of:
    - a topic ID, a document identifier (“document ID”), a feature type, a feature name, a confidence score, and a feature position.
  - 9. The method according to claim 1, wherein each associated feature is associated with a set of lower-ordinal features according to a pre-defined cluster hierarchy.
  - 10. The method according to claim 1, further comprising performing, by a node, a fuzzy key search of the set of candidate records.
  - 11. The method according to claim 7, further comprising linking, by a link-on-the fly module of the node, two or more data sources based on co-occurrence of related topic IDs and one or more feature attributes.
  - 12. The method according to claim 1, further comprising:
    - determining, by the node, whether an extracted feature in a data source co-occurs in a second data source by comparing the extracted feature with a feature in the second data source; and
      
      linking, by the node, each of the data sources based upon the comparison.
  - 13. The method according to claim 1, further comprising analyzing, by the node, co-occurrence of an extracted feature from different data sources to improve accuracy of disambiguating extracted features.
  - 14. The method according to claim 1, further comprising:
    - continuously receiving, by the node, one or more new data sources;
      
      continuously extracting, by the node, one or more extracted features;
      
      continuously performing, by the node, candidate searching on the one or more extracted features;
      
      continuously disambiguating, by the node, the extracted features; and
      
      continuously linking, by the node, the extracted features into one or more new clusters.
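
The final steps of claim 1 (form a cluster, check it against the knowledgebase, then reuse an existing unique ID or assign a new one) can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the `Cluster` shape, the 0.5 threshold, the Jaccard comparison over secondary features and topic IDs, and the `uuid`-based IDs are all hypothetical choices.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Cluster:
    primary: str                                  # primary feature, e.g. an entity name
    secondary: set = field(default_factory=set)   # linked secondary features
    topic_ids: set = field(default_factory=set)   # machine-generated topic IDs
    unique_id: str = ""                           # empty until the cluster is resolved


def similarity(a: Cluster, b: Cluster) -> float:
    """Assumed relatedness: Jaccard overlap over secondary features and topic IDs."""
    left = a.secondary | a.topic_ids
    right = b.secondary | b.topic_ids
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def resolve(new: Cluster, knowledge_base: list, threshold: float = 0.5) -> str:
    """Return the existing unique ID on a match, otherwise store the cluster under a new ID."""
    same_name = [c for c in knowledge_base if c.primary == new.primary]
    best = max(same_name, key=lambda c: similarity(new, c), default=None)
    if best is not None and similarity(new, best) >= threshold:
        # Match: fold the new cluster into the existing knowledgebase cluster.
        best.secondary |= new.secondary
        best.topic_ids |= new.topic_ids
        return best.unique_id
    # No match: create a new knowledgebase cluster with a fresh unique ID.
    new.unique_id = uuid.uuid4().hex
    knowledge_base.append(new)
    return new.unique_id


kb = [Cluster("Jordan", {"NBA", "Chicago Bulls"}, {"T7"}, "E1")]
seen = resolve(Cluster("Jordan", {"NBA", "Chicago Bulls", "playoffs"}, {"T7"}), kb)
fresh = resolve(Cluster("Jordan", {"Amman", "Dead Sea"}, {"T3"}), kb)
print(seen, len(kb))
```

The first call overlaps the stored cluster strongly enough to reuse `E1` and enrich it with the new secondary feature; the second shares nothing with it, so a second "Jordan" cluster is created with its own ID.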

15. A non-transitory computer readable medium having stored thereon computer executable instructions comprising:
- searching, by a node of a system hosting an in-memory database, a set of candidate records to identify one or more candidates matching one or more extracted features, wherein an extracted feature that matches a candidate is a primary feature;
  
  associating, by the node, each of the extracted features with one or more machine-generated topic identifiers (“topic IDs”);
  
  disambiguating, by the node, each of the primary features from one another based on relatedness of topic IDs;
  
  identifying, by the node, a set of secondary features associated with each primary feature based upon the relatedness of topic IDs;
  
  disambiguating, by the node, each of the primary features from each of the secondary features in the associated set of secondary features based on relatedness of topic IDs;
  
  linking, by the node, each primary feature to the associated set of secondary features to form a new cluster;
  
  determining, by the node, whether the new cluster matches an existing knowledgebase cluster, wherein, when there is a match, determining, by the node, an existing unique identifier (“unique ID”) corresponding to each matching primary feature in the knowledgebase cluster and updating the knowledgebase cluster to include the new cluster; and
  
  when there is no match, creating a new knowledgebase cluster and assigning a new unique ID to the primary feature of the new knowledgebase cluster; and
  
  transmitting, by the node, one of the existing unique ID and the new unique ID for the primary feature.
  - 16. The computer readable medium according to claim 15, wherein the instructions further comprise:
    - comparing, by the node, each of the candidate records matching an extracted feature;
      
      and assigning a weighted match score result to each of the extracted features based upon the comparison.
  - 17. The computer readable medium according to claim 16, wherein the instructions further comprise associating, by the node, each of the extracted features with a set of weighted feature attributes.
  - 18. The computer readable medium according to claim 17, wherein the instructions further comprise determining, by the node, relatedness of each of the extracted features based on one or more weighted feature attributes.
  - 19. The computer readable medium according to claim 15, wherein the instructions further comprise:
    - recognizing and extracting, by an extraction module of the node, one or more extracted features, wherein one or more primary features are identified in the one or more extracted features; and
      
      storing, by the extraction module of the node, each of the extracted features in a database.
  - 20. The computer readable medium according to claim 19, wherein the instructions further comprise assigning, by the extraction module of the node, an extraction certainty score to each of the features.
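
The dependent claims mention a fuzzy key search, a weighted match score, and an extraction certainty score without fixing how any of them is computed. One possible reading is sketched below, assuming `difflib.SequenceMatcher` string similarity as the fuzzy match and a simple product with the extraction certainty as the weighting. Both choices, the 0.8 cutoff, and the function names are illustrative assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_candidates(feature: str, records: list, cutoff: float = 0.8) -> list:
    """Return (record, similarity) pairs whose key is close to the extracted feature."""
    scored = []
    for record in records:
        ratio = SequenceMatcher(None, feature.lower(), record.lower()).ratio()
        if ratio >= cutoff:
            scored.append((record, ratio))
    # Best matches first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def weighted_match_score(similarity: float, extraction_certainty: float) -> float:
    """Assumed weighting: discount the string match by how sure the extractor was."""
    return similarity * extraction_certainty


records = ["Finch Computing", "Finch Computng LLC", "Qbase"]
hits = fuzzy_candidates("Finch Computing", records)
for record, ratio in hits:
    # 0.9 is a made-up extraction certainty for the feature "Finch Computing".
    print(record, round(weighted_match_score(ratio, 0.9), 3))
```

An exact key match scores 1.0 before weighting, a near-miss spelling can still clear the cutoff, and unrelated keys are filtered out before any score is assigned.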


Current Assignee: Finch Computing LLC (Qbase, LLC)
Original Assignee: Qbase, LLC
Inventors: Scott Lightner; Franz Weckesser; Sanjay Boddhu; Rakesh Dave; Robert Flagg
Granted Patent: US 9,239,875 B2
US Class Current: 1/1
CPC Class Codes

G06F 16/3344   using natural language anal...

G06F 16/35   Clustering; Classification

G06F 40/279   Recognition of textual enti...

G16B 40/00   ICT specially adapted for b...

G16C 20/70   Machine learning, data mini...
