December 2016 - Scaling Tableau to Enterprise

Tableau helps us to see and understand our data which is great. A lot of great things are happening every day when creative analysts have powerful Tableau Desktop with unlocked enterprise source data and Tableau server collaboration environment.

As Tableau adoption goes from teams to BU to enterprise, you quickly run into scalability challenges : Extract delays and enterprise data warehouse (EDW) struggles to meet ad-hoc workloads, etc.

My last blog talks about setting extract priority on server to improve 50% extract efficiency. This blog will focus on best practices for data source connection to scale EDW & server – use published data sources.

What is Tableau published data source?

It is nothing but Tableau’s semantic layer. For those who have been in BI space for a while, you may be familiar with Oracle BI’s Repository or Business Objects’ Universe. The problem of Repository or Universe is that they are too complex and are designed for specially trained IT professions only. Tableau is a new tool designed for business analysts who do not have to know SQL. Tableau has much simplified semantic layer. Tableau community has never focused enough on published data sources till recent when people start to realize that leveraging published data source is not only a great best practice but almost must to have in scaling Tableau to enterprise.

2. Again, what makes up Tableau published data source?

Information about how to access or refresh the data: server name & credentials, Excel path, etc.
The data connection information: table joins, field friendly names, etc
Customization and cleanup : calculations, sets, groups, bins, and parameters; define any custom field formatting; hide unused fields; and so on.

3. Why Tableau published data source?

Reusable: Published data sources are reusable connections to data. When you prep your data, add calculations, and make other changes to your fields, these changes are all captured in your data source. Then when you publish the data source, other people can use it to conduct their own analysis.

Single source of truth (SSoT): You can have data steward who defines the data model while workbook publishers who can consume the publish data source to create viz and analysis. Here is an example of how to set up permission to achieve SSoT.

Less workload to EDW: When you use extracts , one refresh of the published data source will refresh all data to its connected workbooks, which reduces a lot workloads to your EDW. This can be a very big deal to your EDW.

4. How many data sources are embedded vs published data sources? You can find it out from Data_Connections table. Look for the DBCLASS column, when value = ‘sqlproxy’, it means that it is a published data source. Work with your server admin if you do not have access to workgroup table of Tableau Postgre database.

If you have <20% data sources are published data sources, it means that published data sources is not well leveraged yet in your org or BU.

5. How to encourage people to use published data sources?

Control who can access to EDW: Let’s say you have a team of 10 Desktop users, you may want to give 2 of them the EDW access so you do not have to train all 10 people about table structure details while have the rest of 8 people to use published data sources created by the two data stewards.
If extracts are used, you can create higher priority to all published data sources as incentive for people to use published data sources. See my previous blog for details.
Make sure people know the version control feature works for data source as well
As data stewards, add comments to columns – here is how comment looks like when mouse over in Desktop Data pan:

Here is how to add comments: Tableaucomments1

Conclusions: Published data sources are not new Tableau feature but are not widely used although they are reusable, SSoT, scalable, less workload to your DB server. Tableau has been improving its publishing workflow by making data source publishing much easier than before since 9.3. Tableau v10 even gives you a new option to publish your data sources separately or not during workbook publish workflow. Data source revision history is great feature to control data source version. Tableau has announced big roadmap about data governance in TC16. However self-service practitioners do not have to wait any new Tableau features in order to leverage the published data sources.

Are you facing the situation that your Tableau server backgrounder jobs have much longer delay during peak hours ?

There are many good reasons why extracts are scheduled at peak hours, likely right after nightly ETL completions or even triggered automatically by ETL completions.

You always have limited backgrounders that you can have on your server. How to cut average extract delay without adding extract backgrounders and without rescheduling any extract jobs?

The keyword is job PRIORITY. There are some good priority suggestions in community(like https://community.tableau.com/thread/152689). However what I found the most effective approach to prioritize the extracts was duration based priority in additional to business criticality – I managed to reduce 50% extract waiting time after increased priority for all extracts with duration below median average runtime.

Here is what I will recommend as extract priority best practices:

Priority 10 for any business critical extracts : Hope nobody will disagree with me to give highest priority to business critical extracts..
Priority 20 for all incremental extracts : Not only normally incremental takes less time than full, but also it is an awesome incentive to encourage more and more people use incremental extracts
Priority 30 for any extracts with duration below median average (this is 50% of all extract jobs). This is another great incentive for publishers to make their extracts more effective. It is the responsibilities of both server admin and publishers to make backgrounder jobs more effective. There are many things that publishers can do to improve the extract efficiency : tune extracts to be more efficient, use incremental vs full extracts, hide unused columns, add extract filters to pull less data, reduce extract frequency, schedule extracts to off-peak hours, or better run extracts outside of Tableau by using Tableau SDK (see my blog @http://enterprisetableau.com/sdk/), etc.
Priority 50 for all the rest (default)
Turn on tabadmin set backgrounder.sort_jobs_by_run_time_history_observable_hours 36 which will prioritize full extracts in the same priority to run in order from shortest to longest based on their “last” run duration.

The combination of #3 and #5 will reduce your extract waiting time dramatically during peak hours.

What is this backgrounder sort by run time option (#5 above)? I am sure that you want to read official Tableau online help here.

In short, Tableau server can sort full extract refresh jobs with the same priority (like 50) so they are executed based on the duration of their “last run,” executing the fastest full extract refresh jobs first.

The “last run” duration of a particular job is determined from a random sample of a single instance of the full extract refresh job in last <n> hours which you can config. By default this sorting is disabled (-1). If enabling this, Tableau’s suggested value is 36 (hours)

Let’s say that you have the following jobs scheduled at 5am, here is how extracts are prioritized:

Priority	Job Name	Duration (min)	Background priority with sort_jobs_by_run_time option ON	Background priority with sort_jobs_by_run_time option OFF
10	Job 10.1	2	1	Those 4 jobs will go fist one by one w/o any priority among them. i.e. it could be Job 10.4, then Job 10.2, then Job 10.3 and Job 10.1
	Job 10.2	3	2
	Job 10.3	9	3
	Job 10.4	15	4
20	Job 20.1	1	5	Those 4 jobs will go one by one after all pririty 10 jobs. Again no priority among them.
	Job 20.2	2	6
	Job 20.3	14	7
	Job 20.4	15	8
30	Job 30.1	2	9	Those 4 jobs will go one by one after all pririty 20 jobs. Again no priority among them.
	Job 30.2	3	10
	Job 30.3	5	11
	Job 30.4	8	12
50	Job 50.1	1	13	Those 10 jobs will go one by one after all pririty 30 jobs. Again no priority among them.
	Job 50.2	9	14
	Job 50.3	20	15
	Job 50.4	25	16
	Job 50.5	30	17
	Job 50.6	50	18
	Job 50.7	55	19
	Job 50.8	60	20
	Job 50.9	70	21
	Job 50.10	80	22

For example, the max waiting time for the 1 min Job 20.1 will be 29 mins (all priority 10 jobs) with sort_jobs_by_run_time option ON. However the max waiting time could be 60 min with sort_jobs_by_run_time option OFF (all priority 10 jobs + other priority 2 jobs).

Re-cap on how extracts are run in this order:

Any task already in process is completed first.
Any task that you initiate manually using Run now starts when the next backgrounder process becomes available.
Tasks set with the highest priority (the lowest number) start next, independent of how long they have been in the queue. For example, a task with a priority of 20 will run before a task with a priority of 50, even if the second task has been waiting longer.
Tasks with the same priority are executed in the order they were added to the queue except if tabadmin set backgrounder.sort_jobs_by_run_time_history_observable_hours 36 is turned on. When the above option is on, the fastest full extract refresh jobs go first.

A few final practice guide:

Step 1: If most of your extracts have priority 50. You may want to try just turn on tabadmin set backgrounder.sort_jobs_by_run_time_history_observable_hours 36 to see how much waiting time improvement you can gain.
Step 2: If step 1 does not give you what you are looking for, try to change extract priority to higher priority (like 30 ) for any extracts with duration below median average. This will give you big waiting time reduction. You can start to change the extract priorities manually to see how it goes. Just be aware that any re-publishing of extracts will change priority back to default 50.
How to automate Step 2? I have not seen API for it. However I just had a program to update tasks.priority directly.
Why I do not recommend to have higher priority for more frequent jobs? I know that it is one of the recommended best practices by a lot of Tableau practitioners. However I just think that it drives a wrong behavior – it will encourage publishers to increase the extract from weekly to daily or to hourly just in order to get their jobs higher priority, which in turn causing more extract delays. I think that job duration and incremental high priority give much better incentive for publishers to make their extracts more effective, which becomes a positive cycle.

Scaling Tableau to Enterprise

Monthly Archives: December 2016

SCALING TABLEAU (3/10) – USE PUBLISHED DATA SOURCES

Scaling Tableau (2/10) – Set Extract Priority Based on Duration

Tableau Enterprise Deployment