Apparatus and methods of identifying potentially similar content for data reduction

US 7,836,053 B2
Filed: 12/28/2007
Issued: 11/16/2010
Est. Priority Date: 12/28/2007
Status: Active Grant



Abstract

Apparatus and methods of identifying potentially similar content include utilizing workflow metadata to identify potential similarities in content to be processed, or between content to be processed and known content. As a result, a subset of potentially similar content is identified, and the subset can be used in data reduction operations to reduce data in the content to be processed.
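
As a rough illustration of the idea in the abstract, the Python sketch below compares the workflow metadata of content to be processed against the workflow metadata of known content, using a similarity rule chosen by workflow type. The field names (`workflow_type`, `project`, `source_id`), the two rule functions, and the example workflow types are hypothetical; the patent does not prescribe a metadata schema or a specific rule. Only the comparison against known content is shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class WorkflowMetadata:
    content_id: str
    workflow_type: str  # hypothetical, e.g. "video_edit" or "transcode"
    project: str        # hypothetical workflow processing characteristic
    source_id: str      # hypothetical: asset the content was derived from


Rule = Callable[[WorkflowMetadata, WorkflowMetadata], bool]


def same_project(a: WorkflowMetadata, b: WorkflowMetadata) -> bool:
    """Assumed rule: content edited within one project tends to share data."""
    return a.project == b.project


def same_source(a: WorkflowMetadata, b: WorkflowMetadata) -> bool:
    """Assumed rule: content derived from one source asset tends to share data."""
    return a.source_id == b.source_id


# Workflow-specific similarity rules, keyed by the workflow type of the
# content to be processed (both the keys and the mapping are assumptions).
RULES: Dict[str, Rule] = {
    "video_edit": same_project,
    "transcode": same_source,
}


def potentially_similar(
    to_process: WorkflowMetadata, known: List[WorkflowMetadata]
) -> List[str]:
    """Return ids of known content that the applicable rule flags as similar."""
    rule = RULES.get(to_process.workflow_type)
    if rule is None:
        return []  # no rule for this workflow type: nothing is flagged
    return [k.content_id for k in known if rule(to_process, k)]
```

The returned identifiers correspond to the "subset of potentially similar content" that a later data reduction step could compare in detail, which avoids comparing the new content against every known item.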


44 Claims

1. A computer-implemented method of identifying potentially similar content for data reduction, comprising:
- receiving content workflow metadata corresponding to content to be processed, wherein the content to be processed includes a data component, and wherein the content workflow metadata comprises a workflow processing characteristic of the data component;
  
  wherein the content workflow metadata corresponding to the content to be processed further comprises a plurality of content workflow metadata corresponding to a plurality of content to be processed, wherein each of the plurality of content to be processed includes a respective data component, and wherein each respective content workflow metadata comprises a respective workflow processing characteristic corresponding to a respective data component;
  
  receiving known content workflow metadata corresponding to a plurality of known content, wherein each known content includes a known data component, and wherein each known content workflow metadata comprises a workflow processing characteristic of the corresponding known data component;
  
  comparing the content workflow metadata of the content to be processed and the known content workflow metadata of the plurality of known content according to a similarity rule to identify a first subset of the plurality of known content having potentially similar content relative to the content to be processed;
  
  identifying a first subset of the plurality of content to be processed based on determining a potential similarity between respective data components based on the respective content workflow metadata; and
  
  outputting an identification of the first subset of the plurality of known content and the first subset of the plurality of content to be processed to use in reducing data in the content to be processed;
  
  wherein each workflow processing characteristic relates to a workflow process applicable to the corresponding content; and
  
  wherein the similarity rule comprises a workflow-specific similarity rule, wherein the workflow-specific similarity rule depends on a type of the workflow associated with the content to be processed.
  - 2. The method of claim 1, wherein receiving content workflow metadata further comprises obtaining the content workflow metadata from a system comprising a plurality of content workflow metadata corresponding to a plurality of workflow content, wherein the content to be processed comprises one of the plurality of workflow content.
  - 3. The method of claim 1, further comprising:
    - identifying a second subset of the plurality of known content, and wherein the second subset of the plurality of known content is not equal to the first subset of the plurality of known content;
      
      performing a data compression technique on the first subset of the plurality of content to be processed and the second subset of the plurality of known content to identify a reduced data representation of the content to be processed; and
      
      wherein outputting comprises outputting an identification of the reduced data representation.
  - 4. The method of claim 3, further comprising transmitting or receiving the reduced data representation.
  - 5. The method of claim 3, further comprising replacing a duplicate data component in the content to be processed with a token to form the reduced data representation.
  - 6. The method of claim 3, wherein performing the data compression technique comprises identifying a same data component in both the first subset of the plurality of content to be processed and the second subset of the plurality of known content.
  - 7. The method of claim 1, further comprising:
    - determining a data component difference between the content to be processed and the first subset of the plurality of known content;
      
      determining a network storage location of each of a plurality of network-based content having the data component difference;
      
      determining a network destination location for receiving a transmission of the data component difference;
      
      determining a delivery efficiency between each network storage location and the network destination location; and
      
      causing transmission of the data component difference to the network destination location from the respective network storage location having a most efficient one of the determined delivery efficiencies.
  - 8. The method of claim 1, further comprising:
    - identifying at least one of a same data component or a different data component between the first subset of the plurality of content to be processed and one of the first subset of the plurality of known content; and
      
      wherein the outputting further comprises outputting an identification of at least one of the same data component or the different data component.
  - 9. The method of claim 8, further comprising transmitting or receiving the different data component based on the identification of at least one of the same data component or the different data component.
  - 10. The method of claim 8, further comprising replacing the same data component in the content to be processed with a token based on the identification of at least one of the same data component or the different data component.
  - 11. The method of claim 1, further comprising:
    - transmitting the identification of the first subset of the plurality of known content to a data reduction component;
      
      receiving from the data reduction component an identification of at least one of a same data component or a different data component between the content to be processed and the first subset of the plurality of known content based on execution of a data compression protocol; and
      
      transmitting a reduced data representation of the content to be processed to a file transfer destination based on the identification of the at least one of a same data component or a different data component.
  - 12. The method of claim 11, further comprising identifying a second subset of the plurality of content to be processed that corresponds to the first subset of the plurality of known content according to the similarity rule, wherein the transmitting further comprises transmitting a respective data reduction signature of one or more portions of each of the second subset of the plurality of content to be processed or transmitting an identification of the first subset of the plurality of known content.
  - 13. The method of claim 12, wherein the receiving of the identification of at least one of a same data component or a different data component is further based on the data reduction component generating a respective data reduction signature of one or more portions of each of the first subset of the plurality of known content, and comparing the respective data reduction signatures to determine the same data component.
  - 14. The method of claim 11, wherein transmitting the reduced data representation of the content to be processed further comprises transmitting one or more different data components and one or more tokens representing a respective one or more same data components.
  - 15. The method of claim 1, further comprising:
    - obtaining a reduced data representation of the content to be processed based on the first subset of the plurality of known content;
      
      processing the reduced data representation; and
      
      updating the content workflow metadata corresponding to the content to be processed with information describing the processing.
  - 16. The method of claim 1, wherein receiving known content workflow metadata corresponding to the plurality of known content further comprises receiving known content workflow metadata corresponding to at least one of:
    - a plurality of previously-transferred content;
      
      or a plurality of previously-received content;
      
      or the plurality of content to be processed.
  - 17. The method of claim 1, further comprising:
    - identifying a proper subset of the plurality of content to be processed based on performing a data compression technique on the first subset of the plurality of content to be processed;
      
      identifying a second subset of the plurality of known content that represent content potentially similar to the proper subset of the plurality of content to be processed based on comparing a respective data component of a respective one of the proper subset of the plurality of content to be processed and a respective known data component of a respective one of the first subset of the plurality of known content according to the similarity rule, wherein the second subset of the plurality of known content is a proper subset of the first subset of the plurality of known content;
      
      performing a data compression technique on the proper subset of the plurality of content to be processed and the second subset of the plurality of known content to identify a reduced data representation of the plurality of content to be processed; and
      
      wherein outputting comprises outputting the reduced data representation.
  - 18. The method of claim 1, wherein receiving the content workflow metadata corresponding to the content to be processed further comprises receiving at a destination from a source located across a communication network, wherein the comparing further comprises comparing at the destination, and wherein outputting the identification of the first subset of the plurality of known content further comprises transmitting from the destination to the source.
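
The signature comparison and token replacement described in claims 5, 6, 13 and 14 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the fixed chunk size, the use of SHA-256 as the "data reduction signature", and the `("token", ...)` / `("data", ...)` tuple encoding are all hypothetical choices made for the example.

```python
import hashlib
from typing import Dict, List, Set, Tuple, Union

CHUNK_SIZE = 4  # hypothetical; kept tiny so the example is easy to follow

# One entry of a reduced representation: ("token", signature) or ("data", bytes).
Entry = Tuple[str, Union[str, bytes]]


def chunks(data: bytes, size: int = CHUNK_SIZE) -> List[bytes]:
    """Split data into fixed-size portions (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def signature(chunk: bytes) -> str:
    """Assumed data reduction signature: the SHA-256 hex digest of the portion."""
    return hashlib.sha256(chunk).hexdigest()


def reduce_against_known(content: bytes, known: List[bytes]) -> List[Entry]:
    """Replace portions that also occur in known content with tokens."""
    known_sigs: Set[str] = {signature(c) for k in known for c in chunks(k)}
    reduced: List[Entry] = []
    for c in chunks(content):
        sig = signature(c)
        if sig in known_sigs:
            reduced.append(("token", sig))  # same data component: send a token
        else:
            reduced.append(("data", c))     # different data component: send bytes
    return reduced


def restore(reduced: List[Entry], known: List[bytes]) -> bytes:
    """Rebuild the content at a receiver that already holds the known content."""
    lookup: Dict[str, bytes] = {signature(c): c for k in known for c in chunks(k)}
    return b"".join(
        lookup[value] if kind == "token" else value for kind, value in reduced
    )
```

For example, reducing `b"AAAAXXXXCCCC"` against the known content `b"AAAABBBBCCCC"` yields a token, the literal bytes `b"XXXX"`, and another token, so only the differing portion travels in full. Restricting `known` to the subset flagged by the workflow metadata comparison is what keeps the signature set small.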

19. A computer program product configured to identify potentially similar content for data reduction, comprising:
- a computer-readable medium comprising:
  
  at least one set of instructions operable to cause a computer to receive content workflow metadata corresponding to content to be processed, wherein the content to be processed includes a data component, and wherein the content workflow metadata comprises a workflow processing characteristic of the data component;
  
  wherein the content workflow metadata corresponding to the content to be processed further comprises a plurality of content workflow metadata corresponding to a plurality of content to be processed, wherein each of the plurality of content to be processed includes a respective data component, and wherein each respective content workflow metadata comprises a respective workflow processing characteristic corresponding to a respective data component;
  
  at least one set of instructions operable to cause the computer to receive known content workflow metadata corresponding to a plurality of known contents, wherein each known content includes a known data component, and wherein each known content workflow metadata comprises a workflow processing characteristic of the corresponding known data component;
  
  at least one set of instructions operable to cause the computer to compare the content workflow metadata of the content to be processed and the known content workflow metadata of the plurality of known content according to a similarity rule to identify a first subset of the plurality of known content having potentially similar content relative to the content to be processed;
  
  at least one set of instructions operable to cause the computer to identify a first subset of the plurality of content to be processed based on determining a potential similarity between respective data components based on the respective content workflow metadata; and
  
  at least one set of instructions operable to cause the computer to output an identification of the first subset of the plurality of known content and the first subset of the plurality of content to be processed to use in reducing data in the content to be processed;
  
  wherein each workflow processing characteristic relates to a workflow process applicable to the corresponding content; and
  
  wherein the similarity rule comprises a workflow-specific similarity rule, wherein the workflow-specific similarity rule depends on a type of the workflow associated with the content to be processed.

20. At least one processor configured to identify potentially similar content for data reduction, comprising:
- a first hardware module for receiving content workflow metadata corresponding to content to be processed, wherein the content to be processed includes a data component, and wherein the content workflow metadata comprises a workflow processing characteristic of the data component;
  
  wherein the content workflow metadata corresponding to the content to be processed further comprises a plurality of content workflow metadata corresponding to a plurality of content to be processed, wherein each of the plurality of content to be processed includes a respective data component, and wherein each respective content workflow metadata comprises a respective workflow processing characteristic corresponding to a respective data component;
  
  a second module for receiving known content workflow metadata corresponding to a plurality of known contents, wherein each known content workflow metadata comprises a workflow processing characteristic of the corresponding known data component;
  
  a third module for comparing the content workflow metadata of the content to be processed and the known content workflow metadata of the plurality of known content according to a similarity rule to identify a first subset of the plurality of known content having potentially similar content relative to the content to be processed;
  
  a fourth module for identifying a first subset of the plurality of content to be processed based on determining a potential similarity between respective data components based on the respective content workflow metadata; and
  
  a fifth module for outputting an identification of the first subset of the plurality of known content and the first subset of the plurality of content to be processed to use in reducing data in the content to be processed;
  
  wherein each workflow processing characteristic relates to a workflow process applicable to the corresponding content; and
  
  wherein the similarity rule comprises a workflow-specific similarity rule, wherein the workflow-specific similarity rule depends on a type of the workflow associated with the content to be processed.

21. A computing device for identifying potentially similar content for data reduction, comprising:
- means for receiving content workflow metadata corresponding to content to be processed, wherein the content to be processed includes a data component, and wherein the content workflow metadata comprises a workflow processing characteristic of the data component;
  
  wherein the content workflow metadata corresponding to the content to be processed further comprises a plurality of content workflow metadata corresponding to a plurality of content to be processed, wherein each of the plurality of content to be processed includes a respective data component, and wherein each respective content workflow metadata comprises a respective workflow processing characteristic corresponding to a respective data component;
  
  means for receiving known content workflow metadata corresponding to a plurality of known contents, wherein each known content includes a known data component, and wherein each known content workflow metadata comprises a workflow processing characteristic of the corresponding known data component;
  
  means for comparing the content workflow metadata of the content to be processed and the known content workflow metadata of the plurality of known content according to a similarity rule to identify a first subset of the plurality of known content having potentially similar content relative to the content to be processed;
  
  means for identifying a first subset of the plurality of content to be processed based on determining a potential similarity between respective data components based on the respective content workflow metadata; and
  
  means for outputting an identification of the first subset of the plurality of known content and the first subset of the plurality of content to be processed to use in reducing data in the content to be processed;
  
  wherein each workflow processing characteristic relates to a workflow process applicable to the corresponding content; and
  
  wherein the similarity rule comprises a workflow-specific similarity rule, wherein the workflow-specific similarity rule depends on a type of the workflow associated with the content to be processed.

22. A computing device for identifying potentially similar content for data reduction, comprising:
- a communications hardware module operable to receive content workflow metadata corresponding to content to be processed, wherein the content to be processed includes a data component, and wherein the content workflow metadata comprises a workflow processing characteristic of the data component;
  
  wherein the content workflow metadata corresponding to the content to be processed further comprises a plurality of content workflow metadata corresponding to a plurality of content to be processed, wherein each of the plurality of content to be processed includes a respective data component, and wherein each respective content workflow metadata comprises a respective workflow processing characteristic corresponding to a respective data component;
  
  wherein the communications module is further operable to receive known content workflow metadata corresponding to a plurality of known content, wherein each known content includes a known data component, and wherein each known content workflow metadata comprises a workflow processing characteristic of the corresponding known data component;
  
  a similarity identifier module having one or more similarity rules operable to compare the content workflow metadata of the content to be processed and the known content workflow metadata of the plurality of known content according to a similarity rule to identify a first subset of the plurality of known content having potentially similar content relative to the content to be processed;
  
  wherein the similarity identifier component is further operable to identify a first subset of the plurality of content to be processed based on determining a potential similarity between respective data components based on the respective content workflow metadata; and
  
  wherein the similarity identifier component is further operable to output an identification of the first subset of the plurality of known content and the first subset of the plurality of content to be processed to use in reducing data in the content to be processed;
  
  wherein each workflow processing characteristic relates to a workflow process applicable to the corresponding content; and
  
  wherein the similarity rule comprises a workflow-specific similarity rule, wherein the workflow-specific similarity rule depends on a type of the workflow associated with the content to be processed.
  - 23. The computing device of claim 22, wherein the communications module is further operable to obtain the content workflow metadata from a system comprising a plurality of content workflow metadata corresponding to a plurality of workflow content, wherein the content to be processed comprises one of the plurality of workflow content.
  - 24. The computing device of claim 22, further comprising:
    - wherein the similarity identifier component is further operable to identify a second subset of the plurality of known content, and wherein the second subset of the plurality of known content is not equal to the first subset of the plurality of known content;
      
      a data reduction component having a data compression protocol operable to compress the first subset of the plurality of content to be processed and the second subset of the plurality of known content to identify a reduced data representation of the content to be processed; and
      
      wherein outputting comprises outputting an identification of the reduced data representation.
  - 25. The computing device of claim 24, wherein the communications module is further operable to transmit or receive the reduced data representation.
  - 26. The computing device of claim 24, wherein the data reduction component is further operable to replace a duplicate data component in the content to be processed with a token to form the reduced data representation.
  - 27. The computing device of claim 24, wherein the data reduction component is further operable to identify a same data component in both the first subset of the plurality of content to be processed and the second subset of the plurality of known content.
  - 28. The computing device of claim 22, further comprising:
    - a data reduction component having a data compression protocol operable to determine a data component difference between the content to be processed and the first subset of the plurality of known content;
      
      a delivery management component having a data location identifier operable to determine a network storage location of each of a plurality of network-based content having the data component difference;
      
      a content processing component operable to determine a network destination location for receiving a transmission of the data component difference;
      
      wherein the delivery management component further comprises a delivery optimizer operable to determine a delivery efficiency between each network storage location and the network destination location; and
      
      wherein the delivery management component is further operable to cause transmission of the data component difference to the network destination location from the respective network storage location having a most efficient one of the determined delivery efficiencies.
  - 29. The computing device of claim 22, further comprising:
    - a data reduction component having a data compression protocol operable to identify at least one of a same data component or a different data component between the first subset of the content to be processed and one of the first subset of the plurality of known content; and
      
      wherein the data reduction component is further operable to output an identification of at least one of the same data component or the different data component.
  - 30. The computing device of claim 29, wherein the communications module is further operable to transmit or receive the different data component based on the identification of at least one of the same data component or the different data component.
  - 31. The computing device of claim 29, wherein the data reduction component is further operable to replace the same data component in the content to be processed with a token based on the identification of at least one of the same data component or the different data component.
  - 32. The computing device of claim 22, further comprising:
    - wherein the similarity identifier component is further operable to transmit the identification of the first subset of the plurality of known content to a data reduction component;
      
      a content processing component operable to receive from the data reduction component an identification of at least one of a same data component or a different data component between the content to be processed and the first subset of the plurality of known content based on execution of a data compression protocol; and
      
      wherein the content processing component is further operable to initiate transmission of a reduced data representation of the content to be processed to a file transfer destination based on the identification of the at least one of a same data component or a different data component.
  - 33. The computing device of claim 32, further comprising identifying a second subset of the plurality of the content to be processed that corresponds to the first subset of the plurality of known content according to the similarity rule, wherein the content processing component is further operable to initiate transmission of a respective data reduction signature of one or more portions of each of the second subset of the plurality of the content to be processed or transmitting an identification of the subset of the plurality of known content.
  - 34. The computing device of claim 33, wherein the identification of at least one of a same data component or a different data component is further based on the data reduction component generating a respective data reduction signature of one or more portions of each of the first subset of the plurality of known content, and comparing the respective data reduction signatures to determine the same data component.
  - 35. The computing device of claim 32, wherein the reduced data representation of the content to be processed further comprises one or more different data components and one or more tokens representing a respective one or more same data components.
  - 36. The computing device of claim 22, further comprising:
    - wherein the communications module is further operable to obtain a reduced data representation of the content to be processed based on the first subset of the plurality of known content;
      
      a content processing component operable to process the reduced data representation; and
      
      wherein the content processing component is further operable to update the content workflow metadata corresponding to the content to be processed with information describing the processing.
  - 37. The computing device of claim 22, wherein the known content workflow metadata corresponding to the plurality of known content further comprises known content workflow metadata corresponding to at least one of:
    - a plurality of previously-transferred content;
      
      or a plurality of previously-received content;
      
      or the plurality of content to be processed.
  - 38. The computing device of claim 22, further comprising:
    - a data reduction component having a data compression protocol operable to identify a proper subset of the plurality of content to be processed based on performing a data compression technique on the first subset of the plurality of content to be processed;
      
      wherein the similarity identifier component is further operable to identify a second subset of the plurality of known content that represent content potentially similar to the proper subset of the plurality of content to be processed based on comparing a respective data component of a respective one of the proper subset of the plurality of content to be processed and a respective known data component of a respective one of the first subset of the plurality of known content according to the similarity rule, wherein the second subset of the plurality of known content is a proper subset of the first subset of the plurality of known content;
      
      wherein the data reduction component is further operable to execute the data compression protocol on the proper subset of the plurality of content to be processed and the second subset of the plurality of known content to identify a reduced data representation of the plurality of content to be processed; and
      
      wherein the data reduction component is further operable to output the reduced data representation.
  - 39. The computing device of claim 22, wherein the computing device is located at a destination on a communications network, wherein the communications module is further operable to receive the content workflow metadata corresponding to the content to be processed from a source located across the communications network from the destination, and wherein the similarity identifier component is further operable to transmit the identification of the first subset of the plurality of known content from the destination to the source.
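
The delivery selection of claims 7 and 28 (choosing which network storage location should transmit the data component difference) might look like the Python sketch below. The efficiency measure used here, estimated transfer time computed from latency and bandwidth, is an assumption made for illustration; the claims only require that some delivery efficiency be determined for each location. The class and field names are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StorageLocation:
    name: str
    latency_s: float              # assumed round-trip latency to the destination
    bandwidth_bytes_per_s: float  # assumed throughput to the destination


def estimated_transfer_time(loc: StorageLocation, size_bytes: int) -> float:
    """Hypothetical efficiency measure: a lower time means a more efficient delivery."""
    return loc.latency_s + size_bytes / loc.bandwidth_bytes_per_s


def most_efficient_location(
    locations: List[StorageLocation], size_bytes: int
) -> StorageLocation:
    """Pick the location expected to deliver the difference fastest."""
    if not locations:
        raise ValueError("no storage location holds the data component difference")
    return min(locations, key=lambda loc: estimated_transfer_time(loc, size_bytes))
```

Under this measure the best source depends on the size of the difference: a nearby but slow location wins for small differences, while a distant high-bandwidth location wins for large ones.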

40. A computer-implemented method of identifying potentially similar content for data reduction, comprising:
- receiving content workflow metadata corresponding to content to be processed, wherein the content to be processed includes a data component, and wherein the content workflow metadata represents workflow processing information corresponding to the data component;
  
  receiving known content workflow metadata corresponding to a first plurality of known content, wherein each known content includes a known data component, and wherein the known content workflow metadata represents workflow processing information corresponding to each respective known data component;
  
  determining a potential similarity between the data component of the content to be processed and at least one known data component of at least one of the first plurality of known content based on a similarity between the respective content workflow metadata and the respective known content workflow metadata;
  
  outputting an identification of potentially similar content, based on the determined potential similarity, for use in reducing data in the content to be processed;
  
  wherein receiving content workflow metadata corresponding to a content to be processed further comprises receiving a plurality of content workflow metadata corresponding to a plurality of content to be processed, wherein each of the plurality of content to be processed includes a respective data component, and wherein each respective content workflow metadata represents workflow processing information corresponding to a respective data component;
  
  identifying potentially similar ones of the plurality of content to be processed based on determining a potential similarity between respective data components based on the respective content workflow metadata;
  
  identifying a proper subset of the plurality of content to be processed based on performing a data compression technique on the identified potentially similar ones of the plurality of content to be processed;
  
  wherein determining the potential similarity with the first plurality of known content further comprises determining a potential similarity between a respective data component of a respective one of the proper subset of the plurality of content to be processed and a respective known data component of a respective one of the first plurality of known content based on a similarity between the respective content workflow metadata and the respective known content metadata;
  
  identifying a second plurality of known content that represent content potentially similar to the proper subset of the plurality of content to be processed based on the determined potential similarity, wherein the second plurality of known content is a proper subset of the first plurality of known content;
  
  performing a data compression technique on the proper subset of the plurality of content to be processed and the second plurality of known content to identify a reduced data representation of the plurality of content to be processed; and
  
  wherein outputting comprises outputting the reduced data representation.
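
The sequence of steps in claim 40 can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the method as implemented by the patentee: the `Content` type, the Jaccard-overlap similarity rule, the 0.5 threshold, the use of exact-duplicate elimination as the first "data compression technique", and the use of zlib with a preset dictionary as the second are all choices made here for the example.

```python
import hashlib
import zlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Content:
    name: str
    metadata: frozenset  # workflow processing information (hypothetical tags)
    data: bytes          # the data component

def metadata_similarity(a, b):
    """Assumed similarity rule: Jaccard overlap of two metadata sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def reduce_content(to_process, known, threshold=0.5):
    # Step 1: find potentially similar items among the content to be
    # processed, looking only at workflow metadata.
    similar = [c for c in to_process
               if any(o is not c and
                      metadata_similarity(c.metadata, o.metadata) >= threshold
                      for o in to_process)]
    # Step 2: derive a smaller subset by dropping byte-identical duplicates
    # (an assumed stand-in for the claimed compression technique).
    seen, subset = set(), []
    for c in similar:
        digest = hashlib.sha256(c.data).digest()
        if digest not in seen:
            seen.add(digest)
            subset.append(c)
    # Step 3: pick the known content whose metadata resembles the subset.
    second = [k for k in known
              if any(metadata_similarity(c.metadata, k.metadata) >= threshold
                     for c in subset)]
    # Step 4: compress the subset, using the selected known data as a
    # preset dictionary when any was found.
    payload = b"".join(c.data for c in subset)
    zdict = b"".join(k.data for k in second)
    comp = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    reduced = comp.compress(payload) + comp.flush()
    return subset, second, reduced

block = bytes(range(256)) * 4
known = [
    Content("k1", frozenset({"job:42", "stage:proof", "app:layout"}), block),
    Content("k2", frozenset({"job:7", "stage:print"}), b"unrelated" * 10),
]
to_process = [
    Content("a", frozenset({"job:42", "stage:proof", "app:layout"}), block),
    Content("b", frozenset({"job:42", "stage:proof"}), block),
    Content("c", frozenset({"job:99"}), b"other"),
]
subset, second, reduced = reduce_content(to_process, known)
```

A receiver holding the same known content could restore the payload with `zlib.decompressobj(zdict=...)`. A fuller representation would also need to record which dropped items were duplicates of which retained ones; that bookkeeping is omitted here.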

41. A computer program product configured to identify potentially similar content for data reduction, comprising:
- a computer-readable medium comprising:
  
  at least one set of instructions operable to cause a computer to receive content workflow metadata corresponding to content to be processed, wherein the content to be processed includes a data component, and wherein the content workflow metadata represents workflow processing information corresponding to the data component;
  
  at least one set of instructions operable to cause the computer to receive known content workflow metadata corresponding to a first plurality of known contents, wherein each known content includes a known data component, and wherein the known content workflow metadata represents workflow processing information corresponding to each respective known data component;
  
  at least one set of instructions operable to cause the computer to determine a potential similarity between the data component of the content to be processed and at least one known data component of at least one of the first plurality of known contents based on a potential similarity between the respective content workflow metadata and the respective known content workflow metadata; and
  
  at least one set of instructions operable to cause the computer to output an identification of potentially similar content, based on the determined potential similarity, for use in reducing data in the content to be processed;
  
  wherein receiving content workflow metadata corresponding to a content to be processed further comprises receiving a plurality of content workflow metadata corresponding to a plurality of content to be processed, wherein each of the plurality of content to be processed includes a respective data component, and wherein each respective content workflow metadata represents workflow processing information corresponding to a respective data component;
  
  identifying potentially similar ones of the plurality of content to be processed based on determining a potential similarity between respective data components based on the respective content workflow metadata;
  
  identifying a proper subset of the plurality of content to be processed based on performing a data compression technique on the identified potentially similar ones of the plurality of content to be processed;
  
  wherein determining the potential similarity with the first plurality of known content further comprises determining a potential similarity between a respective data component of a respective one of the proper subset of the plurality of content to be processed and a respective known data component of a respective one of the first plurality of known content based on a similarity between the respective content workflow metadata and the respective known content metadata;
  
  identifying a second plurality of known content that represent content potentially similar to the proper subset of the plurality of content to be processed based on the determined potential similarity, wherein the second plurality of known content is a proper subset of the first plurality of known content;
  
  performing a data compression technique on the proper subset of the plurality of content to be processed and the second plurality of known content to identify a reduced data representation of the plurality of content to be processed; and
  
  wherein outputting comprises outputting the reduced data representation.

42. At least one processor configured to identify potentially similar content for data reduction, comprising:
- a first hardware module for receiving content workflow metadata corresponding to content to be processed, wherein the content to be processed includes a data component, and wherein the content workflow metadata represents workflow processing information corresponding to the data component;
  
  a second module for receiving known content workflow metadata corresponding to a first plurality of known contents, wherein each known content includes a known data component, and wherein the known content workflow metadata represents workflow processing information corresponding to each respective known data component;
  
  a third module for determining a potential similarity between the data component of the content to be processed and at least one known data component of at least one of the first plurality of known contents based on a potential similarity between the respective content workflow metadata and the respective known content workflow metadata; and
  
  a fourth module for outputting an identification of potentially similar content, based on the determined potential similarity, for use in reducing data in the content to be processed;
  
  wherein receiving content workflow metadata corresponding to a content to be processed further comprises receiving a plurality of content workflow metadata corresponding to a plurality of content to be processed, wherein each of the plurality of content to be processed includes a respective data component, and wherein each respective content workflow metadata represents workflow processing information corresponding to a respective data component;
  
  identifying potentially similar ones of the plurality of content to be processed based on determining a potential similarity between respective data components based on the respective content workflow metadata;
  
  identifying a proper subset of the plurality of content to be processed based on performing a data compression technique on the identified potentially similar ones of the plurality of content to be processed;
  
  wherein determining the potential similarity with the first plurality of known content further comprises determining a potential similarity between a respective data component of a respective one of the proper subset of the plurality of content to be processed and a respective known data component of a respective one of the first plurality of known content based on a similarity between the respective content workflow metadata and the respective known content metadata;
  
  identifying a second plurality of known content that represent content potentially similar to the proper subset of the plurality of content to be processed based on the determined potential similarity, wherein the second plurality of known content is a proper subset of the first plurality of known content;
  
  performing a data compression technique on the proper subset of the plurality of content to be processed and the second plurality of known content to identify a reduced data representation of the plurality of content to be processed; and
  
  wherein outputting comprises outputting the reduced data representation.

43. A computing device for identifying potentially similar content for data reduction, comprising:
- means for receiving content workflow metadata corresponding to content to be processed, wherein the content to be processed includes a data component, and wherein the content workflow metadata represents workflow processing information corresponding to the data component;
  
  means for receiving known content workflow metadata corresponding to a first plurality of known contents, wherein each known content includes a known data component, and wherein the known content workflow metadata represents workflow processing information corresponding to each respective known data component;
  
  means for determining a potential similarity between the data component of the content to be processed and at least one known data component of at least one of the first plurality of known contents based on a potential similarity between the respective content workflow metadata and the respective known content workflow metadata; and
  
  means for outputting an identification of potentially similar content, based on the determined potential similarity, for use in reducing data in the content to be processed;
  
  wherein receiving content workflow metadata corresponding to a content to be processed further comprises receiving a plurality of content workflow metadata corresponding to a plurality of content to be processed, wherein each of the plurality of content to be processed includes a respective data component, and wherein each respective content workflow metadata represents workflow processing information corresponding to a respective data component;
  
  identifying potentially similar ones of the plurality of content to be processed based on determining a potential similarity between respective data components based on the respective content workflow metadata;
  
  identifying a proper subset of the plurality of content to be processed based on performing a data compression technique on the identified potentially similar ones of the plurality of content to be processed;
  
  wherein determining the potential similarity with the first plurality of known content further comprises determining a potential similarity between a respective data component of a respective one of the proper subset of the plurality of content to be processed and a respective known data component of a respective one of the first plurality of known content based on a similarity between the respective content workflow metadata and the respective known content metadata;
  
  identifying a second plurality of known content that represent content potentially similar to the proper subset of the plurality of content to be processed based on the determined potential similarity, wherein the second plurality of known content is a proper subset of the first plurality of known content;
  
  performing a data compression technique on the proper subset of the plurality of content to be processed and the second plurality of known content to identify a reduced data representation of the plurality of content to be processed; and
  
  wherein outputting comprises outputting the reduced data representation.

44. A computing device for identifying potentially similar content for data reduction, comprising:
- a communications hardware module operable to receive content workflow metadata corresponding to content to be processed, wherein the content to be processed includes a data component, and wherein the content workflow metadata represents workflow processing information corresponding to the data component;
  
  wherein the communications module is further operable to receive known content workflow metadata corresponding to a first plurality of known content, wherein each known content includes a known data component, and wherein the known content workflow metadata represents workflow processing information corresponding to each respective known data component;
  
  a similarity identifier module having one or more similarity rules operable to determine a potential similarity between the data component of the content to be processed and at least one known data component of at least one of the first plurality of known content based on a potential similarity between the respective content workflow metadata and the respective known content workflow metadata;
  
  wherein the similarity identifier component is further operable to output an identification of potentially similar content, based on the determined potential similarity, for use in reducing data in the content to be processed;
  
  wherein the content workflow metadata corresponding to the content to be processed further comprises a plurality of content workflow metadata corresponding to a plurality of content to be processed, wherein each of the plurality of contents to be processed includes a respective data component, and wherein each respective content workflow metadata represents workflow processing information corresponding to a respective data component;
  
  wherein the similarity identifier component is further operable to identify potentially similar ones of the plurality of content to be processed based on determining a potential similarity between respective data components based on the respective content workflow metadata;
  
  a data reduction component having a data compression protocol operable to identify a proper subset of the plurality of content to be processed based on performing a data compression technique on the identified potentially similar ones of the plurality of content to be processed;
  
  wherein the similarity identifier component is further operable to determine a potential similarity between a respective data component of a respective one of the proper subset of the plurality of content to be processed and a respective known data component of a respective one of the first plurality of known content based on a similarity between the respective content workflow metadata and the respective known content metadata;
  
  wherein the similarity identifier component is further operable to identify a second plurality of known content that represent content potentially similar to the proper subset of the plurality of contents to be processed based on the determined potential similarity, wherein the second plurality of known content is a proper subset of the first plurality of known content;
  
  wherein the data reduction component is further operable to execute the data compression protocol on the proper subset of the plurality of content to be processed and the second plurality of known content to identify a reduced data representation of the plurality of content to be processed; and
  
  wherein the data reduction component is further operable to output the reduced data representation.

Current Assignee: MidCap Financial Trust
Original Assignee: Group Logic, Inc. (Acronis AG)
Inventors: Naef, Frederick E. III
Primary Examiner(s): Wassum; Luke S.
Assistant Examiner(s): Badawi; Sherief
Application Number: US11/966,618
Publication Number: US 20090171990A1
Time in Patent Office: 1,054 Days
Field of Search: 707/104.1, 707/620, 707/737, 707/754, 707/758, 707/748, 707/749, 715/234
US Class Current: 707/737
CPC Class Codes: G06Q 10/06 Resources, workflows, human...
