Skip to main content

Adding Datasets

The first step in designing an AtScale model is to import the physical tables and views that it will be based on.

About reusing datasets

Reusing datasets between the dimension Canvas and the model Canvas, or between multiple dimensions, can cause the engine to take unexpected join-paths at deployment time.

If you want to reference the same table from multiple points in the same model, it is recommended that you define separate Query Datasets (QDSs) that select from the desired table. The separate datasets should be used to back each dimension or fact model component. Doing so ensures the generation of runtime join-paths that are readily identified in Design Center.

note

Each QDS must be created from scratch. Converting an existing dataset to a QDS does not satisfy the uniqueness requirement.

About this task

If you are not sure what datasets you need for your model, you can drag them to the main Canvas to look at them first. Consider the following:

  • Adding datasets to the repository: You can build up datasets in your repository by dragging all the datasets from the Datasources panel to the main Canvas. The new datasets are automatically added to the datasets/ folder in the repository. This allows you to preview columns, add calculated columns, and plan how you will model relationships between the datasets. You can then remove all the datasets from the Canvas and begin modeling by adding them one by one from the Workspace panel.

  • Using a table partitioned in BigQuery: If you use Google BigQuery, if you import as a fact dataset a table that is partitioned in BigQuery, the \_PARTITIONTIME pseudo-column is listed among the columns of the dataset. When you are modeling data, you can build a date dimension from this pseudo-column. When users working in client BI applications query your deployed model, they can include WHERE clauses on \_PARTITIONTIME; BigQuery will perform partition pruning when serving the results.

    For more information about the \_PARTITIONTIME pseudo-column, refer to the Google BigQuery documentation.

Add a table-based dataset to the repository

You can add table-based datasets to a model by either dragging them from the Datasources panel directly onto the model Canvas, or by creating them via the Dataset properties panel. Creating them via the panel gives you the opportunity to edit them before they are added.

To add a dataset via the Dataset properties panel:

  1. In the Workspace panel, click the new object icon and select Dataset.yml. A new dataset appears in the panel.

  2. Enter a unique name for the dataset file and press Enter. The Dataset properties panel opens.

  3. Complete the following fields:

    • Unique name: A unique name for the dataset. This must be unique across all repositories and subrepositories.
    • Label: A label for the dataset.
    • Description: Optional. A description of the dataset.
    • Source Type: Determines whether the dataset is based on a table or a query. Select Table.
    • Connection: Select the data warehouse connection that contains the table the dataset will be based on.
    • Database: Select the database that contains the table the dataset will be based on.
    • Schema: Select the schema to base the dataset on.
    • Table: Select the table to base the dataset on.
  4. (Optional) If you want to allow incremental aggregates to be built for the dataset, enable the Allow incremental builds option, then select an Incremental indicator and enter a Grace period.

    For more information on enabling incremental aggregates for a dataset, see Setting Properties to Allow Incremental Rebuilds of Aggregates.

  5. Click Apply.

The dataset is added to the repository and appears in the datasets/ folder in the Workspace panel. From here, you can edit it, or add it to models and dimensions as needed.

Add a query-based dataset to the repository

A query-based dataset allows you to select data from one or more tables in your data warehouse and save the query in AtScale as a new dataset. This enables you to manipulate the tables you have in your data warehouse and select just the rows you are interested in for your model. For example, you can select only the columns you need, join rows from multiple tables, and do other data manipulation functions. Query-based datasets do not alter your data warehouse.

note

You cannot add calculated columns to a query-based dataset after it is created. Do all of the required data manipulation in the SELECT list of the query.

To add a query-based dataset:

  1. In the Workspace panel, click the new object icon and select Dataset.yml. A new dataset appears in the panel.

  2. Enter a unique name for the dataset file and press Enter. The Dataset properties panel opens.

  3. Complete the following fields:

    • Unique name: A unique name for the dataset. This must be unique across all repositories and subrepositories.
    • Label: A label for the dataset.
    • Description: Optional. A description of the dataset.
    • Source Type: Determines whether the dataset is based on a table or a query. Select Query.
    • Source: Select the data warehouse connection that contains the table you want to query.
  4. In the Selection SQL field, enter the SQL query you want to use to create the dataset.

    Note

    If needed, you can define environment variables and include them in your SQL query. This enables you to create standard models that work with a variety of schemas. For more information, see Working with Environment Variables.

  5. Click Preview SQL.

    note

    This button is disabled if your data warehouse is an instance of Google BigQuery. This is because Google BigQuery does not respect the LIMIT keyword in SQL statements that would limit the size of the result set in the preview. Rather, BigQuery performs a full table scan, which is costly.

    If results appear, the SQL statement is valid. If you see an error or do not see data, double-check your SQL syntax and try running the statement directly against your data warehouse.

  6. (Optional) If you want to allow incremental aggregates to be built for the dataset, enable the Allow incremental builds option, then select an Incremental indicator and enter a Grace period.

    For more information on enabling incremental aggregates for a dataset, see Setting Properties to Allow Incremental Rebuilds of Aggregates.

  7. Click Apply.

The dataset is added to the repository and appears in the datasets/ folder in the Workspace panel. From here, you can edit it, or add it to models and dimensions as needed.

Add a dataset to a model

  1. Open the model you want to add datasets to and switch to the Canvas tab.

  2. In the Workspace panel, locate the dataset you want to add and do one of the following:

    • Drag the dataset onto the model Canvas.
    • Click the context menu for the dataset and select Add to Model.

The dataset appears on the model Canvas.

Add a dataset to a dimension

  1. Open the dimension you want to add datasets to and switch to the Canvas tab.

  2. In the Workspace panel, locate the dataset you want to add and do one of the following:

    • Drag the dataset onto the dimension Canvas.
    • Click the context menu for the dataset and select Add to dimension.

The dataset appears on the dimension Canvas.