Efficient data infrastructure for high dimensional data analysis

US 20080313213A1
Filed: 06/15/2007
Published: 12/18/2008
Est. Priority Date: 06/15/2007
Status: Active Grant

Abstract

Described is a technology by which high dimensional source data corresponding to rows of records with identifiers, and columns comprising dimensions of data values, are processed into a file model for efficient access. An inverted index corresponding to any dimension is built by mapping data from raw dimension values to mapped values based on mapping entries in a dimension table. The record identifiers are arranged into subgroups based on their mapped value; a count and/or an offset may be maintained for locating each of the subgroups. The raw values for a dimension are maintained within a raw value file. For sparse data, the raw value file may be compressed, e.g., by excluding nulls and associating a record identifier with each non-null. A data manager provides access to data in the data files, such as by offering various functions, using caching for efficiency.
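The index construction described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `build_inverted_index`, the use of a plain dict as the dimension table, the choice to skip nulls, and the sample data are all hypothetical.

```python
from collections import defaultdict


def build_inverted_index(raw_values, dimension_table):
    """Group record identifiers into subgroups by mapped value.

    raw_values: list where the index is the record id and the entry is the
        raw dimension value (None stands for null).
    dimension_table: dict mapping raw value -> mapped value (assumed layout).
    Returns (record_ids, offsets, counts). record_ids is one flat list in
    which each subgroup is stored contiguously; offsets[m] and counts[m]
    locate the subgroup for mapped value m.
    """
    groups = defaultdict(list)
    for record_id, raw in enumerate(raw_values):
        if raw is None:
            continue  # assumption: null values are left out of the index
        groups[dimension_table[raw]].append(record_id)

    record_ids, offsets, counts = [], {}, {}
    for mapped in sorted(groups):
        offsets[mapped] = len(record_ids)
        counts[mapped] = len(groups[mapped])
        # ids were appended in ascending order, so each subgroup is sequential
        record_ids.extend(groups[mapped])
    return record_ids, offsets, counts


# Made-up data: six records, raw values bucketed into three mapped values.
raw = [3, 17, None, 5, 42, 17]
table = {3: 0, 5: 0, 17: 1, 42: 2}
record_ids, offsets, counts = build_inverted_index(raw, table)

# All records whose mapped value is 1:
start = offsets[1]
rows_for_1 = record_ids[start:start + counts[1]]
```

Storing every subgroup in one flat list, with a count and an offset per mapped value, means a lookup is a single slice rather than a scan of the whole column.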

20 Claims

1. In a computing environment in which source data is arranged as a data structure comprising record identifiers and dimensions, each dimension having a data value, which may be non-null or null for each record identifier, a method comprising, constructing an inverted index corresponding to a dimension, including by mapping data from raw dimension values to mapped values based on mapping entries in a dimension table, and arranging the record identifiers into subgroups within a record identifier data structure based on each record identifier's corresponding mapped value in the dimension table.
- Dependent claims (2, 3, 4, 5, 6, 7)
  - 2. The method of claim 1 wherein constructing the inverted index further comprises computing a count or an offset, or both a count and an offset, for locating each of the subgroups within the record identifier data structure.
  - 3. The method of claim 1 wherein when a subgroup contains a plurality of record identifiers, the record identifiers are arranged sequentially within their subgroup.
  - 4. The method of claim 1 further comprising maintaining the raw values within a raw value file, with the raw values arranged in an order corresponding to an ordering of the record identifiers.
  - 5. The method of claim 1 further comprising compressing the raw values by excluding at least some raw values from the raw value file.
  - 6. The method of claim 5 wherein excluding at least some of the raw values comprises excluding nulls, and further comprising, associating a record identifier with each non-null raw value in the raw value file.
  - 7. The method of claim 1 further comprising mapping the record identifier to a physical cache address when accessing a record.
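The sparse-data compression of claims 5 and 6 can be sketched as below. This is an illustrative assumption about one possible layout, not the format the patent specifies: the names `compress_raw_values` and `lookup_raw_value` are hypothetical, and the "file" is modeled as an in-memory list of `(record_id, value)` pairs sorted by record id.

```python
from bisect import bisect_left


def compress_raw_values(raw_values):
    """Drop nulls and pair each remaining raw value with its record id."""
    return [(rid, v) for rid, v in enumerate(raw_values) if v is not None]


def lookup_raw_value(compressed, record_id):
    """Binary-search the sorted pairs; an absent record id means null."""
    # (record_id,) sorts just before any (record_id, value) pair,
    # so the stored values themselves are never compared.
    i = bisect_left(compressed, (record_id,))
    if i < len(compressed) and compressed[i][0] == record_id:
        return compressed[i][1]
    return None


# Made-up sparse column: only records 1 and 4 have a value.
compressed = compress_raw_values([None, 7.5, None, None, 2.0])
```

Because every stored value carries its record identifier, any identifier missing from the compressed data is known to be null, which is the property claim 6 relies on.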

8. At least one computer-readable medium having computer-executable instructions, which when executed perform steps, comprising:
- processing high dimensional data corresponding to rows of record identifiers by columns of dimensions, including constructing a raw value file by which raw values for a dimension can be located, and constructing an inverted index containing subgroups of one or more record identifiers, each subgroup defined by a mapping value based on the raw value associated with each record identifier of that subgroup; and
- providing access to the raw value file and inverted index for use in analyzing the high dimensional data.
- Dependent claims (9, 10, 11, 12, 13)
  - 9. The computer-readable medium of claim 8 wherein constructing the inverted index further comprises computing a count or an offset, or both a count and an offset, for locating each of the subgroups.
  - 10. The computer-readable medium of claim 8 wherein constructing the raw value file comprises maintaining the raw values in an order corresponding to an ordering of the record identifiers.
  - 11. The computer-readable medium of claim 8 wherein constructing the raw value file comprises compressing the raw value file by excluding null values from the raw value file and associating the corresponding record identifier of each non-null raw value in the raw value file.
  - 12. The computer-readable medium of claim 8 further comprising mapping the record identifier to a physical cache address when accessing a record.
  - 13. The computer-readable medium of claim 8 wherein providing access to the raw value file and inverted index comprises providing at least one callable function, including at least one of:
    - a function to get the raw values of specified rows, a function to get the mapped values of specified rows, a function to get the rows of specified raw values, a function to get the rows of specified mapped values, a function to get a mapped value dictionary, or a function to get a row count, or any combination of a function to get the raw values of specified rows, a function to get the mapped values of specified rows, a function to get the rows of specified raw values, a function to get the rows of specified mapped values, a function to get a mapped value dictionary, or a function to get a row count.
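The six callable functions listed in claim 13 could be exposed through an interface along the following lines. This is a hedged sketch only: the class name `DimensionDataManager`, the method names, and the in-memory storage are assumptions made for illustration, and the row lookups scan the column for brevity where a real data manager would presumably consult the inverted index.

```python
class DimensionDataManager:
    """Hypothetical in-memory access layer for a single dimension."""

    def __init__(self, raw_values, dimension_table):
        self._raw = list(raw_values)         # index = row (record id)
        self._table = dict(dimension_table)  # raw value -> mapped value

    def get_raw_values(self, rows):
        return [self._raw[r] for r in rows]

    def get_mapped_values(self, rows):
        # A null raw value has no table entry, so it maps to None.
        return [self._table.get(self._raw[r]) for r in rows]

    def get_rows_of_raw_values(self, values):
        wanted = set(values)
        return [r for r, v in enumerate(self._raw) if v in wanted]

    def get_rows_of_mapped_values(self, mapped_values):
        wanted = set(mapped_values)
        return [r for r, v in enumerate(self._raw)
                if v is not None and self._table[v] in wanted]

    def get_mapped_value_dictionary(self):
        return dict(self._table)

    def get_row_count(self):
        return len(self._raw)


# Made-up dimension with four rows, one of them null.
mgr = DimensionDataManager(["XP", "Vista", None, "XP"], {"XP": 0, "Vista": 1})
```

For example, `mgr.get_rows_of_mapped_values([0])` returns the rows whose raw value maps to 0, and `mgr.get_row_count()` reports the number of rows including nulls.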

14. In a computing environment, a system comprising:
- a data importer and processing mechanism coupled to a data source containing data corresponding to rows of record identifiers by columns of dimensions, the data importer and processing mechanism writing files containing information corresponding to the data, including a raw value file by which raw values for a dimension can be located, and constructing an inverted index file containing subgroups of one or more record identifiers, each subgroup defined by a mapping value based on the raw value associated with each record identifier of that subgroup; and
- a data manager that provides access to data in the data files.
- Dependent claims (15, 16, 17, 18, 19, 20)
  - 15. The system of claim 14 wherein the data comprises software quality metrics data having session identifiers as record identifiers and dimension values arranged in a column for each dimension.
  - 16. The system of claim 14 further comprising a dimension table, wherein the mapping values are contained in the dimension table for a corresponding dimension.
  - 17. The system of claim 14 wherein the data importer and processing mechanism compresses the raw value file by excluding at least some raw values from the raw value file, and maintaining information associated with the raw value file that indicates which raw values were excluded.
  - 18. The system of claim 17 wherein the information associated with the raw value file that indicates which raw values were excluded comprises associating the record identifier with each raw value included in the raw value file such that any record identifier not present in the raw value file is known to be associated with an excluded raw value.
  - 19. The system of claim 14 further comprising a cache coupled to the data manager, the data manager mapping a record identifier to a physical cache address when accessing a record.
  - 20. The system of claim 14 wherein the data manager provides access to the data files via at least one function, including at least one of:
    - a function to get the raw values of specified rows, a function to get the mapped values of specified rows, a function to get the rows of specified raw values, a function to get the rows of specified mapped values, a function to get a mapped value dictionary, or a function to get a row count, or any combination of a function to get the raw values of specified rows, a function to get the mapped values of specified rows, a function to get the rows of specified raw values, a function to get the rows of specified mapped values, a function to get a mapped value dictionary, or a function to get a row count.
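Claim 19 has the data manager map a record identifier to a physical cache address. The claims do not say how, so the following is purely an assumed scheme for illustration: fixed-width records grouped into pages, a small least-recently-used page cache, and an address expressed as a record-sized index into the cache buffer. The class `PageCache` and its parameters are hypothetical.

```python
from collections import OrderedDict


class PageCache:
    """Toy LRU page cache over fixed-width records (assumed design)."""

    def __init__(self, records_per_page, max_pages):
        self.records_per_page = records_per_page
        self.max_pages = max_pages
        self._slots = OrderedDict()  # page number -> slot in the cache buffer

    def address_of(self, record_id):
        """Return the cache-buffer index that holds the given record."""
        page, offset = divmod(record_id, self.records_per_page)
        if page in self._slots:
            self._slots.move_to_end(page)  # mark as most recently used
        else:
            if len(self._slots) >= self.max_pages:
                # Evict the least recently used page and reuse its slot.
                _, slot = self._slots.popitem(last=False)
            else:
                slot = len(self._slots)
            self._slots[page] = slot
            # A real cache would read the page from the data file here.
        return self._slots[page] * self.records_per_page + offset


cache = PageCache(records_per_page=4, max_pages=2)
addresses = [cache.address_of(rid) for rid in (5, 9, 6, 0)]
```

In this run, records 5 and 6 share a page and therefore a slot, while record 0 arrives when the cache is full and takes over the slot of the least recently used page.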

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Li, Yantao; Sun, Bing; Zhang, Haidong; Wang, Jian; Liu, Guowei
Granted Patent: US 7,870,114 B2
US Class Current: 707/102
CPC Class Codes: G06F 16/283 Multi-dimensional databases...
