Conceptual document analysis and characterization

US 9,886,488 B2
Filed: 07/20/2016
Issued: 02/06/2018
Est. Priority Date: 04/27/2015
Status: Active Grant

Abstract

Data files are received from data sources that include textual content. The data files are categorized using a taxonomy of categories, where each category has sample textual content that defines a concept for the category. The categorizing includes comparing the textual content of the data file with the sample textual content for the category. A file score is calculated for each data file to compare the degree of similarity between the defined concept of the category and a determined concept for the data file. Each data file is associated with the category if the file score is equal to or greater than a pre-determined minimum score for the category. A portion of the data file and/or file score is provided.
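The abstract does not say how the file score is computed. As a minimal sketch only, the following assumes cosine similarity over word counts as a stand-in for the concept comparison. The names `term_counts`, `cosine_similarity`, and `categorize`, and the dictionary layouts, are hypothetical and are not taken from the patent.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count word tokens (a crude stand-in for concept extraction)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def categorize(files, taxonomy):
    """Return category name -> list of (file_id, score) pairs meeting the category minimum.

    files: dict of file_id -> text (assumed layout)
    taxonomy: dict of category name -> (sample_text, min_score) (assumed layout)
    """
    result = {}
    for name, (sample_text, min_score) in taxonomy.items():
        sample_vec = term_counts(sample_text)
        matches = []
        for file_id, text in files.items():
            score = cosine_similarity(term_counts(text), sample_vec)
            # Associate the file with the category only at or above the minimum score.
            if score >= min_score:
                matches.append((file_id, score))
        result[name] = matches
    return result


files = {1: "quarterly revenue and profit forecast", 2: "team lunch on friday"}
taxonomy = {"finance": ("revenue profit forecast earnings", 0.5)}
associations = categorize(files, taxonomy)
```

In this toy run only file 1 shares terms with the "finance" sample, so it is the only file associated with that category. Any similarity measure that yields a comparable score could replace the cosine step.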

20 Claims

1. A method comprising:
- receiving, by at least one data processor, a plurality of data files from a plurality of data sources that comprise textual content;
  
  categorizing, by the at least one data processor, the plurality of data files into a taxonomy of categories in which each category has associated sample textual content defining a concept for the category and each category associated with a memory-optimized structure that comprises a collection of at least one identification corresponding to at least one of the plurality of data files, the categorizing comprising, for each category:
  
  comparing, by the at least one data processor, for each of the plurality of data files, the textual content of the data file with the sample textual content for the category;
  
  calculating, by the at least one data processor, based on the comparing and for each of the plurality of data files, a file score corresponding to the degree of similarity between the defined concept of the category and a determined concept for the data file; and
  
  generating, by the at least one data processor, the identification stored in the memory-optimized structure that comprises the collection by at least associating, for each of the plurality of data files, the data file with the category if the file score is equal to or greater than a pre-determined minimum score for the category; and
  
  providing, by the at least one data processor, at least a portion of the data file and/or the associated file score.
  - 2. The method of claim 1, further comprising:
    - generating, by the at least one data processor, the taxonomy by:
      
      adding, by the at least one data processor and to the taxonomy, at least one of the categories, with each of the at least one categories representing the first concept;
      
      adding, by the at least one data processor and to the at least one category, a sample comprising the sample textual content corresponding to the first concept; and
      
      adding, by the at least one data processor and to the at least one category, the minimum score.
  - 3. The method of claim 1, wherein the associating is between the data file and only one category, the category being the category generating the highest file score equal to or greater than the minimum score.
  - 4. The method of claim 1, further comprising clustering, by the at least one data processor, the textual content into at least one cluster, the cluster representative of identified concepts.
  - 5. The method of claim 1, wherein the data file further comprises a source identifier identifying the data source.
  - 6. The method of claim 1, wherein at least one text item is identified from the sample textual content and the text item is given a text item score to identify the relevance of the text item to the sample.
  - 7. The method of claim 1, wherein the providing includes providing, by the at least one data processor, a first representation of the data file along with a second representation of all attachments, metadata, or electronic associations.
  - 8. The method of claim 1, wherein providing at least a portion of the data file and/or the associated file score comprises at least one of:
    - displaying, by the at least one data processor, at least a portion of the data file and/or the associated file score,

      loading, by the at least one data processor, at least a portion of the data file and/or the associated file score into memory,

      transmitting, by the at least one data processor, data including at least a portion of the data file and/or the associated file score to a remote computing device, or

      storing, by the at least one data processor, at least a portion of the data file and/or the associated file score into persistent memory.
  - 9. The method of claim 1, wherein the memory-optimized structure that comprises the collection is generated by run-length encoding the collection.
  - 10. The method of claim 1, wherein the receiving is from an ongoing data stream providing an ongoing source of the data files to be categorized.
  - 11. The method of claim 10, wherein the ongoing data stream is from an e-mail server.
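Two of the dependent claims above lend themselves to a short illustration: claim 3 (associate a file only with its single highest-scoring eligible category) and claim 9 (run-length encode the collection of file identifications). The sketch below is one possible reading, not the patented implementation. It assumes integer file IDs and stores each run of consecutive IDs as a `(start, run_length)` pair. All function names are hypothetical.

```python
def best_category(scores, min_scores):
    """Pick the one category with the highest file score that meets its own minimum.

    scores: dict of category name -> file score for a single file
    min_scores: dict of category name -> minimum score for that category
    Returns None when no category qualifies.
    """
    eligible = {name: s for name, s in scores.items() if s >= min_scores[name]}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)


def rle_encode(file_ids):
    """Collapse a collection of integer file IDs into (start, run_length) pairs."""
    runs = []
    for file_id in sorted(set(file_ids)):
        if runs and file_id == runs[-1][0] + runs[-1][1]:
            # This ID directly follows the current run, so extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((file_id, 1))
    return runs


def rle_decode(runs):
    """Expand (start, run_length) pairs back into the sorted list of file IDs."""
    return [start + offset for start, length in runs for offset in range(length)]


def rle_contains(runs, file_id):
    """Check membership without expanding the runs."""
    return any(start <= file_id < start + length for start, length in runs)


runs = rle_encode([3, 1, 2, 7, 8, 10])
```

With this encoding, a category that holds long stretches of consecutive IDs needs one pair per stretch rather than one entry per file, which is the kind of saving a memory-optimized structure would aim for.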

12. A non-transitory computer program product storing instructions that, when executed by at least one programmable processor, cause the at least one programmable processor to perform operations comprising:
- receiving a plurality of data files from a plurality of data sources that comprise textual content;
  
  categorizing the plurality of data files into a taxonomy of categories in which each category has associated sample textual content defining a concept for the category and each category associated with a memory-optimized structure that comprises a collection of at least one identification corresponding to at least one of the plurality of data files, the categorizing comprising, for each category:
  
  comparing, for each of the plurality of data files, the textual content of the data file with the sample textual content for the category;
  
  calculating, based on the comparing and for each of the plurality of data files, a file score corresponding to the degree of similarity between the defined concept of the category and a determined concept for the data file; and
  
  generating the identification stored in the memory-optimized structure that comprises the collection by at least associating, for each of the plurality of data files, the data file with the category if the file score is equal to or greater than a pre-determined minimum score for the category; and
  
  providing at least a portion of the data file and/or the associated file score.
  - 13. The non-transitory computer program product of claim 12, wherein the operations of generating the taxonomy further comprise:
    - adding, to the taxonomy, at least one of the categories, with each of the at least one categories representing the first concept;
      
      adding, to the at least one category, a sample comprising the sample textual content corresponding to the first concept; and
      
      adding, to the at least one category, the minimum score.
  - 14. The non-transitory computer program product of claim 12, wherein the operations further comprise clustering the textual content into at least one cluster, the cluster representative of identified concepts.
  - 15. The non-transitory computer program product of claim 12, wherein at least one text item is identified from the sample textual content and the text item is given a text item score to identify the relevance of the text item to the sample.
  - 16. The non-transitory computer program product of claim 12, wherein the operations of providing includes providing a first representation of the data file along with a second representation of all attachments, metadata, or electronic associations.
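Claims 6 and 15 describe giving each text item in the sample a score that reflects its relevance to the sample, without fixing a formula. As a hedged illustration, the sketch below assumes a simple relative-frequency score after dropping common stop words. The `STOP_WORDS` set and the function name `text_item_scores` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list for the example; a fuller list would be used in practice.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "on"}


def text_item_scores(sample_text):
    """Score each remaining term by its share of the sample's non-stop-word tokens."""
    tokens = [
        token
        for token in re.findall(r"[a-z0-9]+", sample_text.lower())
        if token not in STOP_WORDS
    ]
    if not tokens:
        return {}
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


scores = text_item_scores("The merger and the merger agreement")
```

Here "merger" accounts for two of the three remaining tokens and "agreement" for one, so "merger" receives the higher text item score. A weighting such as TF-IDF could be substituted if a reference corpus is available.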

17. A system comprising:
- at least one programmable data processor device; and
  
  memory storing instructions which, when executed by the at least one programmable data processor, result in operations comprising:
  
  receiving a plurality of data files from a plurality of data sources that comprise textual content;
  
  categorizing the plurality of data files into a taxonomy of categories in which each category has associated sample textual content defining a concept for the category and each category associated with a memory-optimized structure that comprises a collection of at least one identification corresponding to at least one of the plurality of data files, the categorizing comprising, for each category:
  
  comparing, for each of the plurality of data files, the textual content of the data file with the sample textual content for the category;
  
  calculating, based on the comparing and for each of the plurality of data files, a file score corresponding to the degree of similarity between the defined concept of the category and a determined concept for the data file; and
  
  generating the identification stored in the memory-optimized structure that comprises the collection by at least associating, for each of the plurality of data files, the data file with the category if the file score is equal to or greater than a pre-determined minimum score for the category; and
  
  providing at least a portion of the data file and/or the associated file score.
  - 18. The system of claim 17, wherein the operations of generating the taxonomy further comprise:
    - generating the taxonomy by:
      
      adding, to the taxonomy, at least one of the categories, with each of the at least one categories representing the first concept;
      
      adding, to the at least one category, a sample comprising the sample textual content corresponding to the first concept; and
      
      adding, to the at least one category, the minimum score.
  - 19. The system of claim 17, wherein the operations further comprise clustering the textual content into at least one cluster, the cluster representative of identified concepts.
  - 20. The system of claim 17, wherein the operations of providing further comprise providing a first representation of the data file along with a second representation of all attachments, metadata, or electronic associations.
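The taxonomy-generation steps recited in claims 2, 13, and 18 (add a category, add sample textual content to it, add its minimum score) can be sketched as a small data structure. This is an illustrative assumption about layout only. The class and method names are hypothetical, and the restriction of scores to the range 0 to 1 is an assumption of the example rather than something the claims state.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """One taxonomy entry: a concept defined by sample text plus a minimum file score."""
    name: str
    samples: list = field(default_factory=list)
    min_score: float = 0.0


@dataclass
class Taxonomy:
    """Hypothetical container mapping category names to Category entries."""
    categories: dict = field(default_factory=dict)

    def add_category(self, name):
        """Step 1: add a category representing a concept."""
        self.categories[name] = Category(name)
        return self.categories[name]

    def add_sample(self, name, sample_text):
        """Step 2: attach sample textual content that defines the concept."""
        self.categories[name].samples.append(sample_text)

    def set_min_score(self, name, min_score):
        """Step 3: record the minimum file score required for association."""
        if not 0.0 <= min_score <= 1.0:
            raise ValueError("min_score is assumed to lie in [0, 1]")
        self.categories[name].min_score = min_score


taxonomy = Taxonomy()
taxonomy.add_category("contracts")
taxonomy.add_sample("contracts", "This agreement is entered into by the parties")
taxonomy.set_min_score("contracts", 0.6)
```

Keeping the minimum score on the category, rather than as one global setting, mirrors the claim language in which each category carries its own pre-determined minimum.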

Current Assignee: Consilio LLC
Original Assignee: Altep, Inc. (Consilio LLC)
Inventors: Miller, Roger W.; van den Berge, Willem R.
Primary Examiner(s): Dang, Thanh-Ha
Application Number: US15/215,470
Publication Number: US 20160328454A1
Time in Patent Office: 566 Days
Field of Search: 707728
US Class Current:
CPC Class Codes:
G06F 16/24575   using context
G06F 16/24578   using ranking
G06F 16/252   between a Database Manageme...
G06F 16/353   into predefined classes
G06F 16/93   Document management systems
