Skip to main content

Data Source configuration

Data sources refer to any resources that are used to capture the details necessary to extract data from a source. They may also be the source for data itself (e.g. an uploaded Excel workbook).

The data source(s) of a model can be seen by expanding the Data Sources section. You can view any of the data sources in the list by clicking it.

note

If a data source is indented in the list, it has been added as an additional data source (within the first non-indented data source above it).

Configuration of a data source can be done in two functional areas:

  • Data Source screen - Add new pipelines, refresh, and preview the source data
  • Data Source design panel - Configure more specialized settings like the behavior of the data source during an incremental process or how Data Hub profiles the data source.

Data Source screen

Data Source TABLES tab

The TABLES tab displays all tables available from the current data source. Tables already included in the model (as pipelines) have an existing pipeline icon beside their names.

You can hover over an individual table to view a tooltip.

The toolip displays the following information:

  • Table display name
  • Table database name (noted by the database icon)
  • Table's description (if one has been defined)
  • Defined relationships
tip

The text in the tooltip itself can be selected for copy and paste.

Data Source Preview pane

A Preview pane is available at the bottom of the data source. The rows displayed are taken directly from the selected source table on the TABLES tab which allows you to see data as it appears in the data source, before you import it via a pipeline.

Preview pane features:

  • Configure row limits - This applies a limit to the number of rows shown in the preview pane to improve performance. The default value can be adjusted by entering a new value into the Row Limit field and clicking apply.
  • Preview SQL - This allows the user to enter an SQL query and preview the result.

  • Download preview data - This downloads the data source preview as a .zip archive containing a .csv (Comma-Separated Values) file. The .csv file will contain all rows in the preview table, regardless of the Configure Row Limits setting.
  • Refresh preview - Refreshes the contents of the preview pane from the data source.

Change data source settings

Data source settings (that were initially defined when the data source was specified using the model wizard) can be updated at any time. Simply access the data source from the model tab, and change the settings, then save the data source.

When changes are made to a data source's connection details, a message appears to the left of the TABLES tab Refresh Table List button, informing you of any additional actions that need to be taken to update the data source.

Once the data source is saved, any changes made to these settings are automatically updated in the related model pipelines.

Refresh data source information

Use the Refresh options to verify that you are viewing the most up-to-date information from your data source. Refresh the currently displayed table or all of the source schema.

Click the Refresh button and then choose one of the following options:

  • Refresh Table List - Reloads the list of tables and their properties from the data source. This allows you to verify that you are viewing the most up-to-date list of tables available in the data source. Only the information shown on the TABLES tab, adjacent to this button, is reloaded. This option is useful if you know exactly what has changed in your data source (such as a new table being added), and you simply want it to be selectable from the TABLES tab.

  • Refresh Source Schema - This reloads the source schema, including the table list, for every pipeline that originates from the current data source (the data source whose tab is being viewed). This option is useful if your data source has changed, and you need to be sure that all changes have been propagated to Data Hub or if you are not sure exactly what has been altered.

note

If you only want to refresh pipeline descriptions (in the Pipelines list on the model tab) use the Refresh Source Schema option.

Data Source design panel

Some of the data source settings can be found in the Data Source design panel. These settings include the source incremental processing behavior for the data source (if the data source supports incremental processing) and the incremental processing behavior for the warehouse and cube (does not rely on the data source supporting incremental processing), how Data Hub profiles the data source, what version of the data source is in use and the data source's advanced settings.

The Data Source design panel with the Incremental Process (Source), Incremental Process (Warehouse & Cube) and Add Pipelines sections expanded

note

The Add Pipelines section appears in the Data Source design panel when pipelines are being added to the parent model.

Incremental Process (Source)

This section allows you to configure the incremental processing behavior for the data source (if supported).

  • Timestamp Column Name - This setting specifies the name of the column used as a timestamp for each pipeline.

  • Timestamp Comparison Operator - This setting changes the operator to use when comparing the timestamp column to determine which source rows to load. The options available are:

    • Greater Than or Equal - If a timestamp column is accurate to the minute, the last incremental process may not have loaded all the source rows in that minute (they may have not been written to the source yet). With Greater Than or Equal the next process would then pick up on the source rows that were missed as it would include that minute again (which Greater Than would not).

    • Greater Than - When the timestamp column is a batch identifier, with batches written together, you can make the assumption that the last incremental process found all the source rows (in every batch) and only start looking at new batches. This is particulary useful if the batches contain many rows, as with Greater Than or Equal Data Hub would still look at the already previously loaded batch.

  • Last Value Expression - This setting allows you to define how the maximum/last value is derived for the timestamp column. When left blank (the default setting), it uses MAX(@COLUMN). You can use @COLUMN to bind the timestamp column name in the expression.

example

When the source is replicated and an incremental process runs against this replicated source, some rows could be missed due to the replication still happening during the process. This may be one of the reasons why you would change Last Value Expression.

tip

These settings are inherited by each pipeline using the data source. You can override them on individual pipelines as needed.

Incremental Process (Warehouse and Cube)

This section allows you to configure the warehouse and cube incremental processing behavior for pipelines using this source. You can choose the default behaviour to use when processing with Default and Quick process types. The options available are:

  • Add & Update (default) - This option will add new rows and update existing rows.

  • Add Only - This option will only add new rows and will not update existing rows, which can make processing much faster.

tip

These settings are inherited by each pipeline using the data source. You can override them on individual pipelines as needed.

Profiling Settings

This section allows you to control how Data Hub profiles data from the data source. Profiling is used when adding a source pipeline to determine the best default configuration for the pipeline. It also controls the data source preview behavior.

note

See here for more information about pipeline profiling.

  • Rows to Profile - This setting determines the number of rows in each table from the data source that are included in the preview and examined to build a profile of each column. Tables in the data source with fewer than the specified Rows to Profile will be examined in full. You can reduce the number of rows to improve profiling speed, but accuracy may be reduced.

  • Profile additional data sources - When checked, additional data sources under this source will be included in profiling. Unchecking this setting makes adding pipelines faster but may lead to inaccurate preview results as rows from additional data sources will not be included.

Process Settings

  • Maximum parallel reading operations - This setting determines the maximum number of tables that may be read from the data source in parallel. This can be reduced in cases where the data source does not respond well to multiple parallel read operations or there are too many read operations happening in parallel which is overloading the data source.

  • Enable pipeline retries - This setting enables pipelines to retry loading data from this data source if an error is encountered during processing (instead of failing after one attempt). The options available are:

    • Time between retries - Sets the time between each retry, in minutes.
    • Number of retries - Sets the number of times to retry before failing.

Version

This section allows you to control which version of the data source is used. It displays the current version of the data source, the date it was installed on, and whether it is a hotfix version. You can switch to the hotfix version by selecting Apply hotfix.

warning

Only apply the hotfix version of a data source if you were directed to by support.

Advanced

This section allows you to control advanced settings of the data source. Note that the available settings depends on which data source you are configuring, so please check the specific documentation for your data source here.

Synchronize deletes

If selected, only rows not previously loaded from the source are loaded into the warehouse table, saving processing time.

If the pipeline contains complex SQL steps or calculations that aggregate, you should disable this setting to ensure the warehouse table is populated correctly.

With many database-type data sources such as SQL Server, deleting rows from data tables is slow. To optimize performance when incremental processing, only new rows are added and changed rows are updated by default.

Deleted rows are left in the table and are removed when the next full update is performed. Other data sources such as CSV, may not have a similar performance cost for incremental deletes. If rows in the source table are deleted frequently and the table is small, it may be worth selecting this check box.

note

Depending on the source database structure, deleted rows may affect the accuracy of reports. To minimize this problem, you should perform a full load of the source data periodically to effectively remove the deleted rows. A full load can be achieved by creating a model process schedule with the Disable Incremental check box selected.

In some data sources such as CRM, this problem is alleviated by not deleting rows but rather flagging them as disabled. In this case, the need for periodic full loads to maintain report accuracy is eliminated.

Additional data sources

Data Hub simplifies managing multiple data sources with the same schema. This is achieved with "additional" data sources which can be added to an existing data source on the ADDITIONAL SOURCES tab.

When a pipeline uses a source table from the parent data source, existing rows in the additional data source will be unioned in if the schema matches. The result is that the rows from both the parent and additional source are merged automatically.

example

One example of where you would use additional data sources is when the source system stores each company's data in a different database. By choosing the primary company's data source to be the parent and then adding an additional data source for each remaining company, you'll able to provide a single set of merged source tables that contain the data for all companies.

note

If you are connecting to a source known to store each company's data in a different database, Data Hub will automatically configure the ADDITIONAL SOURCES tab for you.

If merged source tables result in duplicate rows, you can bring in the Zap_DataSource column as an additional key column in your pipeline to differentiate the duplicated rows. This column is populated based on the code that was entered for the data source on the ADDITIONAL SOURCES tab.

Source tables from an additional source that don't match any parent table's schema can still be referenced by pipelines. Note that these pipelines only source the data from the additional source and not the parent.

Data source parameters

Parameters are designed to compliment data source-specific parameters, such as those currently available via a CRM data source (viewable on the PARAMETERS tab when a CRM data source's tab is open). Parameters added to individual data sources can only be used in pipelines that originate from that data source.

Add parameters

Parameters can be added to a model or an individual data source. Once defined, they can then be used in SQL expressions when creating calculated columns in pipelines. You can add parameters to two different locations within a model, depending on the desired scope of the parameter:

  • Data source. Parameters added to individual data sources can only be used in pipelines that originate from that data source.
    • Data source-level parameters are added using the Parameters tab on an individual data source.
    • These parameters can be used in pipelines that originate from the specific data source.
  • Model Parameters list. Parameters added to this list are model-wide and can be used in any pipeline in the model.

View and edit parameters

Parameters are viewed in one of the following two areas:

  • In the Parameters list on a model tab
  • On the Parameters tab for an individual data source

The PARAMETERS tab displays data source-level parameters. You can edit the data type, value, and description of parameters using the corresponding columns in the list of parameters.

note

Parameters that have a gray background cannot be edited.

Union sources

You can combine multiple pipeline source tables into a single pipeline using the source table union feature. You can manually combine (union) any table that exists in either the primary data source or in any additional data sources. The most common use of this feature is to combine corresponding tables from multiple companies into a single pipeline.

  • The source union action is performed by adding the additional pipeline tables to the Tables setting on the SOURCE pipeline tab.
  • This feature supports two scenarios: single data source and multiple data source.

Single data source

In the one data source there are multiple tables with the same schema, but different names — typically one for each company (named CompanyName.Tablename). You can union tables from different companies to have a single combined pipeline in the data warehouse. As the company tables have different names, you must manually add the appropriate tables to the pipeline.

The example below shows multiple tables containing address information that are combined into a single address pipeline. Initially, the EmployeeAddress pipeline is based on a single table. Additional tables may be added by typing their names, or clicking the + icon in the tag control.

In the following example, several additional tables (each of which contains address information) have been added to the EmployeeAddress pipeline, which originally only included the HumanResources.EmployeeAddress table.

If more than one tag has been manually added to the Tables tag control, as shown above, the table tag that was added first uses a solid border. It is the primary source table. Other table tags use a broken-line border. A different primary source table can be specified by right-clicking a table and selecting Set Primary Table. The primary table is used to populate any captions in the cube.

If you change the defined tables, you must click the Apply button for the changes to take effect.

Multiple data source

The multiple data source is typically used to join data sources from different companies typically containing corresponding tables with the same name and schema.

Starting with one company's data source, you add other company data sources, as additional data sources. In each pipeline (based on the initial data source) you can union corresponding tables from these multiple separate data sources to create a single combined pipeline in the data warehouse.

The Tables tag control initially displays the table in the currently selected data source, used for the pipeline.

If you have added additional data sources to the current pipeline data source , you can select the "Include matching tables from additional data sources" check box to have Data Hub automatically include any table in these additional data sources (that uses the same name as any tables currently listed in the Tables tag control).

In this scenario, the following behavior should be noted:

  • If any additional data sources are found to have matching table names, those data sources are listed directly below the check box. If a data source name appears in this list, it indicates that a table having the name shown in the Tables tag control in that data source is included in the pipeline source data. Otherwise, the (none) string is displayed (as shown above), meaning no matching tables have been found. ("Include matching tables from additional data sources" check box is selected by default.)

  • If only one tag has been added to the Tables tag control, the table in the initial data source will be the primary table. The primary table is used to populate any captions in the cube.

  • If the corresponding table in an additional data source has a different name to the one in the initial data source, it will not be included automatically. You may add it manually as a new tag by typing its name, or clicking the + icon in the tag control.

  • If more than one tag has been added to the Tables tag control, the table tag that uses a solid border is the primary source table. Other table tags use a broken-line border. A different primary source table can be specified by right-clicking a table and selecting Set Primary Table. The primary table is used to populate any captions in the cube.

You can add or remove tables using the tag control. If you change the defined tables, you must click the Apply button for the changes to take effect.

note
  • If no additional data sources have been added to the data source used by the current pipeline, the "Include matching tables from additional data source" check box does not appear.

  • Ideally, when tables are unioned, their schemas (column names and data types) should match exactly. Should the schemas vary, Data Hub includes all columns from all the source tables. When a column is not present in all source tables, it will contain null entries in the rows from the source tables that do not contain it.

  • Once a new data source or table is selected for a pipeline, you may need to make changes to the columns listed in the COLUMNS pipeline tab to ensure they still match your requirements.

How to preview data

How to add data to your model