LATENT METONYMICAL ANALYSIS AND INDEXING (LMAI)

US 20100114561A1
Filed: 04/02/2007
Published: 05/06/2010
Est. Priority Date: 04/02/2007
Status: Active Grant


Abstract

The present invention relates to Latent Metonymical Analysis and Indexing (LMai), a novel concept for Advance Machine Learning or Unsupervised Machine Learning techniques that uses a statistical approach to identify the relationships between the words in a given set of documents (unstructured data). The approach does not necessarily need training data to decide which related words belong together; it is able to perform the classification by itself. All that is needed is to give the algorithm a set of natural documents. The method classifies the relationships automatically, without any human guidance during the process, as shown in FIGS. 6 and 7.

140 Citations

93 Claims

1-72. (canceled)

73. A method for advance and/or unsupervised machine learning by Latent Metonymical Analysis and Indexing (LMai), said method comprising steps of:
- a. inputting natural documents;
  
  b. eliminating special characters to count number of words within the given document, filtering the contents based on the predefined stop-words and calculating the fraction of the stop-words present in the document;
  
  c. determining Significant Single Value Term data set and Significant Multi Value Term data set from the document being processed;
  
  d. decomposing the words in Significant Single Value Term data set and Significant Multi Value Term data set to extract the Keywords of the document being processed;
  
  e. optionally, determining KeyTerms and their respective hand-in-hand (HiH) words automatically for further decomposition;
  
  f. identifying Topic in an unsupervised manner based not just on File Name but also by manipulating/comparing with various combinations of document attributes that are extracted to identify Best Topic candidates and thereafter defining an appropriate Topic based on predefined rules; and
  
  g. analyzing relationship between the Topics and the Keywords and thereafter indexing the Topics and their related Keywords, KeyTerms and their respective hand-in-hand terms into Metonymy cluster and KeyTerms HiH cluster respectively.
- - 74. The method as claimed in claim 73, wherein the method uses self-learning process to make decision in identifying the relationship between the words in natural documents in any electronic file format converted into tokenized format before data is given to the method to perform the classification of relationship between the related words without any human guidance by the virtue of defining an appropriate Topic for a given document based on its content.
  - 75. The method as claimed in claim 73, wherein documents with gibberish data, or documents having stop-words less than or equal to a predetermined percentage, are identified and not processed further to identify Keywords and Topics, so that data having no proper meaning is eliminated during indexing.
  - 76. The method as claimed in claim 73, wherein the method is designed to act as a plug-in to connect to any typical search engine, which indexes and retrieves unstructured data and analyzes and combines result set of the search engine with the metonymical terms to obtain context-based results and returns results for a given search keyword that match the Topic in LMai index along with the results returned by the base search engine and suggests the related Topics that match the search Keyword in separate sections in order to search within the Topic or to search related Topics and displays the Keywords of results returned in order for the user to select the appropriate link that matches the content they are looking for without having to traverse back and forth otherwise and wherein the metonymy or relationship index created by the method is incremental and dynamic based on new addition of data.
  - 77. The method as claimed in claim 73, wherein the method is capable of processing documents written in any language that is tokenized, and wherein the check for documents having stop-words less than or equal to a predetermined percentage is used to filter out or skip documents of other languages.
  - 78. The method as claimed in claim 73, wherein the method provides for advance and/or unsupervised machine learning in robots, guidance systems, knowledge management system, decision making machines and/or search engines.
  - 79. The method as claimed in claim 73, wherein the method automatically creates a personalized search profile based on the user's interest by maintaining previous search information, including but not limited to the various links the user visited and the corresponding related Topics that are extracted upon each search, and thereafter the profile is updated dynamically based on consecutive searches performed by the user.
  - 80. The method as claimed in claim 73, wherein the method classifies the documents precisely without the intervention of experts during the process using trained data and/or guidance to machine and depicts the percentage accuracy determined during classification and the percentage of content related to each of the sub-categories for ontology mapping.
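
Step b of claim 73, together with the threshold test of claims 75 and 77, can be illustrated with a minimal Python sketch. The stop-word list, the tokenizer, the function names and the 0.15 threshold are all assumptions made for illustration; the claims only speak of "predefined stop-words" and a "predetermined percentage".

```python
import re

# Hypothetical stop-word list; the claims only say "predefined stop-words".
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in", "that", "it"}


def tokenize(text):
    """Lowercase the text and drop special characters, keeping word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stop_word_fraction(text, stop_words=STOP_WORDS):
    """Return the fraction of tokens that are stop-words (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in stop_words)
    return hits / len(tokens)


def should_process(text, min_fraction=0.15, stop_words=STOP_WORDS):
    """Skip documents whose stop-word fraction is at or below the threshold.

    The 0.15 default is an arbitrary placeholder, not a value from the patent.
    """
    return stop_word_fraction(text, stop_words) > min_fraction
```

Under these assumptions, a string of random letter groups contains no stop-words and is skipped, while ordinary English prose passes the check. A document in another language would fail the same test against an English stop-word list, which is the filtering effect claim 77 describes.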

81. A decomposition method to extract Keywords and KeyTerms from the documents, said method comprising steps of:
- a. inputting natural documents;
  
  b. checking the document being processed to identify the prerequisite minimal size of data and/or word articles/words;
  
  c. storing the data or words in the document in a sequential order as per their occurrence in the document;
  
  d. creating two identical instances of the data to facilitate the identification of Significant Single Value Term data set and Significant Multi Value Term data set;
  
  e. determining Significant Single Value Term from one of the instance of the data set and Significant Multi Value Term from the other instance of the data set starting from the highest hand-in-hand range predefined, followed by consecutive hand-in-hand range terms of lesser dimension;
  
  f. storing the identified Significant Single Value Term and Significant Multi Value Term of different hand-in-hand range in their respective data sets;
  
  g. comparing words in Significant Multi Value Term data sets with the words in Significant Single Value Term data set to extract those words in the respective hand-in-hand range of each Significant Multi Value Term data set as Best-Terms, which have at least one instance of Single Value Terms within their range and the rest of the hand-in-hand terms are decomposed; and
  
  h. comparing the data sets in such way that every individual hand-in-hand range term that has at least one instance of any term in Significant Single Value Term data set is extracted as a Keyword and the rest are decomposed to determine the KeyTerms.
- - 82. The method as claimed in claim 81, wherein the method automatically extracts the Keywords and KeyTerms from the electronic documents without any guidance or training data given to the said method and extracted words and terms are stored in two data sets, which are the Significant Single Value Term data set and Significant Multi Value Term data set each having the same instance of data that has the words stored in sequential order as per their occurrence in the document in order to decompose the words to identify Keywords in the document processed.
  - 83. The method as claimed in claim 81, wherein the Significant Multi Value Term data set has its own predefined set of hand-in-hand range dimensions and wherein the extraction of Significant Multi Value Term data set is carried out with the first stage being the extraction of the maximum hand-in-hand dimensional range followed by consecutive hand-in-hand range words of lesser dimension and optionally KeyTerms are used for further decomposition of the Keywords.
  - 84. The method as claimed in claim 81, wherein the method for identifying Significant Single Value Term data set from the given document further comprises steps of:
    - a. retrieving words from the data set stored in sequential order as per their occurrences in the document;
      
      b. eliminating special characters and/or word articles/words in the document by comparing with a list of predefined stop-words in order to obtain informative words in the document;
      
      c. processing the informative words to determine the frequency of each word occurrence; and
      
      d. sorting the processed words in order to extract a predefined number of words with highest frequency to identify the Significant Single Value Term.
  - 85. The method as claimed in claim 81, wherein the method for identifying Significant Multi Value Term data set from the given document further comprises steps of:
    - a. retrieving words from the data set stored in sequential order as per their occurrences in the document;
      
      b. extracting hand-in-hand words of a predetermined range into appropriate data sets from retrieved words, thereafter extracting words of type Single Value Term that are left over by eliminating stop-words and void values into a different data set;
      
      c. processing the extracted words in each of the respective data set to determine frequency of each word occurrence; and
      
      d. sorting the processed words in order to extract a predefined number of words with highest frequency in each of the respective data set to identify Significant Multi Value Term data sets of various predefined hand-in-hand range dimensions; and
      
      another data set with words of type Single Value Term, which is the residue after Significant Multi Value Term extraction.
  - 86. The method as claimed in claim 85, wherein the range of hand-in-hand words have value within the practical limits of usage and the extraction of hand-in-hand words of predetermined range is carried out with the extraction of words based on maximum hand-in-hand range dimension followed by consecutive hand-in-hand range words of lesser dimension and wherein the hand-in-hand words of a predetermined range is identified by taking sequential words in the order of their occurrence from the document and adding them together with a space.
  - 87. The method as claimed in claim 81, wherein the Term Decomposition is carried out by comparing the Significant Single Value Term dataset and Significant Multi Value Term dataset in such a way that every individual hand-in-hand range term that has at least one instance of any of the term in Significant Single Value Term dimension are extracted as Keywords and the rest are decomposed.
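
The extraction and decomposition steps of claims 81 and 84 to 87 can be sketched as follows. This is a simplified reading rather than the patented implementation: the stop-word list, the `top_n` and `max_range` defaults, and the rule that a hand-in-hand window containing a stop-word is skipped are assumptions, and the residue data set of claim 85 is omitted.

```python
from collections import Counter

# Hypothetical stop-word list; the claims only say "predefined stop-words".
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in", "for", "on"}


def significant_single_value_terms(tokens, top_n=5, stop_words=STOP_WORDS):
    """Claim 84: drop stop-words, count frequencies, keep the top_n words."""
    counts = Counter(t for t in tokens if t not in stop_words)
    return dict(counts.most_common(top_n))


def hand_in_hand_terms(tokens, n, stop_words=STOP_WORDS):
    """Claim 86: join n sequential words with a space.

    Assumption: windows that contain a stop-word are skipped.
    """
    grams = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if not any(word in stop_words for word in window):
            grams.append(" ".join(window))
    return grams


def significant_multi_value_terms(tokens, max_range=3, top_n=5,
                                  stop_words=STOP_WORDS):
    """Claim 85: one data set per hand-in-hand range, largest range first."""
    result = {}
    for n in range(max_range, 1, -1):
        counts = Counter(hand_in_hand_terms(tokens, n, stop_words))
        result[n] = dict(counts.most_common(top_n))
    return result


def decompose(ssvt, smvt):
    """Claim 87: a hand-in-hand term that contains at least one Significant
    Single Value Term is kept as a Keyword; the rest are set aside."""
    keywords, rest = [], []
    for n in sorted(smvt, reverse=True):
        for term in smvt[n]:
            if any(word in ssvt for word in term.split()):
                keywords.append(term)
            else:
                rest.append(term)
    return keywords, rest
```

Here `ssvt` maps each Significant Single Value Term to its frequency, and `smvt` maps each hand-in-hand range to its terms. The `rest` list stands in for the terms that the claims describe as "decomposed"; how those are turned into KeyTerms is not spelled out in the claims, so the sketch stops there.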

88. A method of defining an appropriate Topic to the document based on the document content comprises steps of:
- a. cleaning up the document's File Name to remove the file dot (.) extension and any alphanumeric characters;
  
  b. extracting the first few predefined number of words from the beginning of the document as the Document Header;
  
  c. comparing each word in the File Name and each word in the Document Header with every word in Significant Single Value Terms data set to extract the words that match into two separate data sets;
  
  d. comparing each word in the Document Header with every word in File Name to extract the words that match into a separate data set;
  
  e. transferring the data from the said individual data sets achieved in steps c and d into another data set, thereafter processing the data/words to determine frequency of each word occurrence;
  
  f. comparing every word in the Significant Multi Value Term data sets of a predefined range with the File Name to extract the hand-in-hand words that match in the separate data set;
  
  g. comparing every word in the Significant Multi Value Term data set of a predefined range with the Document Header to extract the hand-in-hand words that match in the separate data set;
  
  h. transferring the data from the individual data sets achieved in steps f and g into another separate data set, thereafter processing the data/words to determine frequency of each word occurrence;
  
  i. comparison of the data set achieved in step e, which consists of words of type Single Value Term and the data set achieved in step h, which consists of words of type Multi Value Term to extract those hand-in-hand words as Best Topic candidates that have at least one instance of any of the words of type Single Value Term; and
  
  j. defining an appropriate Topic based on predefined rules.
- - 89. The method as claimed in claim 88, wherein the Topic of a given document is defined based on predefined rules as follows: the Best Topic candidates' data set is checked to see if there is only one such candidate, and if so, that candidate is defined as the Topic of the document; if there is more than one Best Topic candidate in the data set, then the frequency of each Best Topic candidate is calculated by matching the words in the Best Topic candidate data set with the words in the Significant Single Value Term data set to extract the corresponding frequency of each word that matches;
    - thereafter the individual frequencies of each word in the Best Topic candidate are added up to derive the Topic with the highest frequency; if no Best Topic candidates are extracted, then the matching words from the comparison of the Significant Single Value Term data set and the File Name are chosen, in the sequence of their occurrence in the File Name, to define the Topic of the document; but if no matching words are extracted from that comparison, then the collective words extracted from the various combinations of comparisons between the File Name, the Document Header and the Significant Single Value Term data set are compared with the words in the Significant Single Value Term data set, and the matching term that has the highest frequency in the Significant Single Value Term data set is chosen as the Topic of the document; and if no matching words are found from those combinations of comparisons, then no Topic is defined for the document by the method.
  - 90. The method as claimed in claim 89, wherein the method extracts Keywords, KeyTerms, and Topic for every document processed based on the predefined rules and each cluster represents the Topic and its related words in LMai index.
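
The Topic rules of claims 88 and 89 can be read as a fallback chain, sketched below in Python. Several details are assumptions: the File Name is reduced to alphabetic words, a hand-in-hand term "matches" when it occurs as consecutive words, the Document Header is the first `header_size` words, and the helper names are hypothetical.

```python
import os
import re


def clean_file_name(file_name):
    """Drop the extension, then keep alphabetic words only (assumption)."""
    stem = os.path.splitext(os.path.basename(file_name))[0]
    return re.findall(r"[a-z]+", stem.lower())


def contains_phrase(words, phrase):
    """True if the space-joined phrase occurs as consecutive words."""
    parts = phrase.split()
    n = len(parts)
    return any(words[i:i + n] == parts for i in range(len(words) - n + 1))


def define_topic(file_name, tokens, ssvt, smvt_terms, header_size=10):
    """ssvt maps word -> frequency; smvt_terms is a list of hand-in-hand terms."""
    name_words = clean_file_name(file_name)
    header = tokens[:header_size]

    # Steps c-e: single-word matches among File Name, Document Header and SSVT.
    name_vs_ssvt = [w for w in name_words if w in ssvt]
    header_vs_ssvt = [w for w in header if w in ssvt]
    header_vs_name = [w for w in header if w in name_words]
    single_matches = set(name_vs_ssvt) | set(header_vs_ssvt) | set(header_vs_name)

    # Steps f-h: hand-in-hand terms found in the File Name or Document Header.
    multi_matches = [t for t in smvt_terms
                     if contains_phrase(name_words, t) or contains_phrase(header, t)]

    # Step i: Best Topic candidates contain at least one single-word match.
    candidates = [t for t in multi_matches
                  if any(w in single_matches for w in t.split())]

    # Claim 89: a lone candidate wins outright; otherwise the candidate with
    # the highest summed SSVT frequency is chosen.
    if candidates:
        return max(candidates,
                   key=lambda t: sum(ssvt.get(w, 0) for w in t.split()))
    # No candidates: SSVT words found in the File Name, in File Name order.
    if name_vs_ssvt:
        return " ".join(dict.fromkeys(name_vs_ssvt))
    # Still nothing: the collective match with the highest SSVT frequency.
    scored = [w for w in sorted(single_matches) if w in ssvt]
    if scored:
        return max(scored, key=lambda w: ssvt[w])
    return None  # no Topic is defined for the document
```

For a file named `machine_learning_intro_2007.pdf` whose text repeats "machine learning", the candidates "machine learning" and "learning models" would both qualify under these assumptions, and the summed frequencies decide between them.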

91. A system for automatically identifying Keywords, KeyTerms and Topics from a set of documents and thereafter automatically identifying the metonymical/related words by Latent Metonymical Analysis and Indexing (LMai), said system comprising:
- a. document input module for providing unstructured data;
  
  b. analyzer to identify similar words having singular and plural form and to convert the words into one of the form;
  
  c. means for decomposing the words in Significant Single Value Term data set and Significant Multi Value Term data set to extract the Keywords of the document being processed;
  
  d. means for analyzing relationship between the Topics and the Keywords and thereafter indexing the Topics and their related Keywords, KeyTerms and their respective hand-in-hand terms into Metonymy cluster and KeyTerms HiH cluster respectively;
  
  e. an indexing module for indexing/clustering Topics and their related words, and also KeyTerm and their HiH terms;
  
  f. retrieval engine to analyze the Topics of each document during the retrieval process to identify the Topics that are related to each other based on a predefined threshold limit, in order to retrieve the context-based results from the index/cluster; and
  
  g. display system to display
  
    a. link to take the user to content page; and
  
    b. Topic and significant Keywords extracted by the method to understand the content within the link without having to visit result page.
- - 92. The system as claimed in claim 91, wherein the documents are in any electronic format and the method is designed in a way to act as a plug-in to connect to any base search engine, which indexes and retrieves unstructured data, and said system utilizes the search results returned by the Base Search Engine to identify, if there is a relationship between the Topics in the Index.
  - 93. The system as claimed in claim 91, wherein for every document returned by the Base Search Engine as a Search Result, the system would extract the corresponding Topic of the document from its Index and thereafter extracts a predefined set of Topics corresponding to the most relevant search results returned by the Base Search Engine and all those Topics that have a predefined frequency of co-occurrences are extracted as the Topics that are related to the Search Keyword.
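
The related-Topic lookup of claim 93 can be pictured with a short Python sketch. The index structure (a mapping from document identifier to Topic), the function name, and the `top_k` and `min_count` defaults are assumptions; the claim only refers to "a predefined set of Topics" and "a predefined frequency of co-occurrences".

```python
from collections import Counter


def related_topics(result_doc_ids, topic_index, top_k=10, min_count=2):
    """Return Topics that recur among the top_k results of the base search engine.

    result_doc_ids: document ids ordered by relevance (hypothetical format).
    topic_index: mapping of document id -> Topic, standing in for the LMai index.
    """
    top_results = result_doc_ids[:top_k]
    counts = Counter(topic_index[doc] for doc in top_results if doc in topic_index)
    return [topic for topic, count in counts.most_common() if count >= min_count]
```

A Topic that appears only once among the most relevant results is treated as incidental under this reading, while one that recurs at least `min_count` times is suggested as related to the search keyword.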


Current Assignee: Syed Yasin
Original Assignee: Syed Yasin
Inventors: Yasin, Syed
Granted Patent: US 8,583,419 B2
US Class Current: 704/9
CPC Class Codes: G06F 16/355 Class or cluster creation o...; G06F 40/216 using statistical methods
