Last week’s blog post on merging tables using Outer joins has proved to be pretty popular. (I guess I’m not the only one who struggled with this concept!) This week we’ll look at the remaining three options, showing how to merge tables using Inner and Anti joins.
The Inner and Anti join types:
Again, we have three join types to explore this week:
Inner Join
Left Anti Join
Right Anti Join
If you read last week’s article you may already have an idea of what you’ll be seeing here, but we’ll make sure we go through it in full anyway. (If you HAVEN’T read last week’s article, you might want to do so, as this one just builds on steps that readers will already be comfortable with.)
Sample Data
We’re going to work with the same set of data as we did last week, although we have a different sample file for it (to hold the completed queries.) That file can be downloaded here.
… two tables of data, one called Transactions, and one called ChartOfAccounts.
Now, the key piece you want to watch here is that we need the Account-Dept combination to exist in both tables in order to make a perfect join between them. Think VLOOKUP for a second… if you tried to make a VLOOKUP for account 10045 from the Transactions table against the ChartOfAccounts table, what would you get? Of course, you’d get #N/A since 10045 doesn’t exist in the ChartOfAccounts table.
In this case we have items in both tables that don’t exist in the other. (The yellow records in the Transactions table don’t have a match in the ChartOfAccounts table, and the red records in ChartOfAccounts don’t have a match in the Transactions table.) With these differences we can test how each of the remaining three join types behaves when we merge the data in both tables together.
We’ve already got the ChartOfAccounts and Transactions queries set up as connection only queries, so we’re ready to jump right into comparing the join types.
Merge Tables using Inner and Anti Joins – Inner Join
This join type stands out somewhat from the others in that there is no “left” or “right” version. Let’s build the join to explore why that is:
Open the Workbook Queries pane
Right click the Transactions query and choose Merge
Select ChartOfAccounts for the bottom table
For the top query (Transactions) select Account, hold down CTRL and select Dept
For the bottom query (ChartOfAccounts) select Account, hold down CTRL and select Dept
Change the Join type to “Inner (only matching rows)”
Click OK
Like last week, the data lands in Power Query, and we’ll take the following steps to expand the rows:
Right click the NewColumn column –> Rename –> COA
Click the Expand icon on the top right of the COA column
We can now see the meaning of the Inner Join. Unlike the Full Outer Join that pulls in all records from both tables whether there is a match or not, the Inner Join only pulls in rows that exist in both the left and right tables. In other words… all those red and yellow rows shown in our original data set? They’re missing from this output.
I can see this being very useful for identifying matching records, giving a list to show which ones matched without polluting the data set with non-matching items.
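If it helps to see the Inner Join logic outside the Power Query UI, here is a minimal sketch in Python (not M). The rows are made-up sample values, not the workbook's real data, and `inner_join` is a hypothetical helper name used only for illustration.

```python
# Hypothetical sample rows; the real workbook data is not reproduced here.
transactions = [
    {"Account": 10010, "Dept": 120, "Amount": 500.0},
    {"Account": 10020, "Dept": 150, "Amount": 250.0},
    {"Account": 10045, "Dept": 150, "Amount": 75.0},   # no match in chart_of_accounts
]
chart_of_accounts = [
    {"Account": 10010, "Dept": 120, "Name": "Cash"},
    {"Account": 10020, "Dept": 150, "Name": "Receivables"},
    {"Account": 10030, "Dept": 120, "Name": "Inventory"},  # no match in transactions
]


def join_key(row):
    """The Account-Dept combination used to match rows."""
    return (row["Account"], row["Dept"])


def inner_join(left, right):
    """Return only the rows whose key exists in both tables."""
    lookup = {}
    for r in right:
        lookup.setdefault(join_key(r), []).append(r)
    # Each left row is combined with every right row sharing its key;
    # left rows without a match produce nothing.
    return [{**r, **l} for l in left for r in lookup.get(join_key(l), [])]


joined = inner_join(transactions, chart_of_accounts)
for row in joined:
    print(row)
```

Accounts 10045 and 10030 both drop out, because each one exists in only one of the two tables.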
Let’s finalize this query and move to the Anti Joins:
Change the name of the query to Inner
Go to Home –> Close & Load –> Close & Load To… –> Only Create Connection
Merge Tables using Inner and Anti Joins – Left Anti Join
Now that we’ve seen Outer and Inner joins, give some thought as to what an Anti join might do…
Done? Let’s go see if you’re right.
Right click the Transactions query and choose Merge
Select ChartOfAccounts for the bottom table
For the top query (Transactions) select Account, hold down CTRL and select Dept
For the bottom query (ChartOfAccounts) select Account, hold down CTRL and select Dept
Change the Join type to “Left Anti (rows only in first)”
Click OK
And when the data gets to Power Query:
Right click the NewColumn column –> Rename –> COA
Click the Expand icon on the top right of the COA column
So this time, we only see the records from the left table (Transactions) that had no matching record in the ChartOfAccounts (right) table. How cool is that? This allows us to immediately identify records with no matches at all.
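For anyone who thinks better in code, the Left Anti logic can be sketched in a few lines of Python (not M). The sample rows and the `left_anti` helper are assumptions made for illustration only.

```python
# Hypothetical sample rows, keyed on the Account-Dept combination.
transactions = [
    {"Account": 10010, "Dept": 120, "Amount": 500.0},
    {"Account": 10045, "Dept": 150, "Amount": 75.0},
]
chart_of_accounts = [
    {"Account": 10010, "Dept": 120, "Name": "Cash"},
    {"Account": 10030, "Dept": 120, "Name": "Inventory"},
]


def left_anti(left, right, keys=("Account", "Dept")):
    """Return the left rows whose key does not appear anywhere in the right table."""
    right_keys = {tuple(r[k] for k in keys) for r in right}
    return [l for l in left if tuple(l[k] for k in keys) not in right_keys]


unmatched = left_anti(transactions, chart_of_accounts)
print(unmatched)
```

Only the 10045 transaction survives, since it is the only left row with no partner on the right.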
Let’s finalize this query as well:
Change the name of the query to Left Anti
Go to Home –> Close & Load –> Close & Load To… –> Only Create Connection
Merge Tables using Inner and Anti Joins – Right Anti Join
By this point I’m sure you can predict where this is going. So let’s get to it:
Right click the Transactions query and choose Merge
Select ChartOfAccounts for the bottom table
For the top query (Transactions) select Account, hold down CTRL and select Dept
For the bottom query (ChartOfAccounts) select Account, hold down CTRL and select Dept
Change the Join type to “Right Anti (rows only in second)”
Click OK
And when the data gets to Power Query:
Right click the NewColumn column –> Rename –> COA
Click the Expand icon on the top right of the COA column
Leave all the defaults and click OK
And, in anti-climactic fashion (ha!) we end up with the following:
Yes indeed, as you probably predicted, only the records shown highlighted in red in the ChartOfAccounts (right) table show up in this set. No record that has a match in the Transactions (left) table shows up, nor does any record in the Transactions table without a match in the ChartOfAccounts table.
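One way to keep the three join types straight is to think of them as set operations on the join keys. The Python sketch below (not M) uses made-up Account-Dept pairs, and the analogy is only exact when each key appears once per table.

```python
# Hypothetical Account-Dept keys for each table.
transaction_keys = {(10010, 120), (10020, 150), (10045, 150)}
coa_keys = {(10010, 120), (10020, 150), (10030, 120)}

inner = transaction_keys & coa_keys        # keys present in both tables
left_anti = transaction_keys - coa_keys    # keys only in the left (Transactions) table
right_anti = coa_keys - transaction_keys   # keys only in the right (ChartOfAccounts) table

print(sorted(inner))
print(sorted(left_anti))
print(sorted(right_anti))
```

Seen this way, a Right Anti join is just a Left Anti join with the two tables swapped.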
We can now finish our final comparison:
Change the name of the query to Right Anti
Go to Home –> Close & Load –> Close & Load To… –> Only Create Connection
Final Thoughts
I think this is pretty cool stuff. If you’d asked me how many ways there are to join data, I think I would have been hard pressed to answer six until I wrote this up. But seeing it laid out in full, with each option detailed… I can see where these would each be useful in their own right. I hope you enjoyed taking the journey with me here!
Oh… and by the way… if you’d like to download one workbook with all six join types included… you can do that here.
Just a quick note here to let you know that the blog is closed this week. (We’re taking a well deserved break to enjoy some family time, and hope you get the chance to do the same.) But don’t worry, we’ll be back in January with more posts!
I also wanted to throw out a quick thank you to all of you who have been faithfully reading the blog on a weekly basis. Your support means a great deal, especially since I took the plunge to go full time with Excelguru back in May. It’s been a great year in that regard, as it’s given me the opportunity to build many things, including M is for Data Monkey. Much of the material for this book was inspired by blog comments and email, and I feel fortunate to have been able to translate those into a book that is being so well received.
As it’s the beginning of a new year, I thought it might be interesting to show my spin on creating a custom calendar in Power Query. This topic has been covered by many others, but I’ve never put my own signature on it.
Our goal
If you’re building calendar intelligence in Power Pivot for custom calendars, you pretty much need to use Rob Collie’s GFITW pattern as shown below:
Note: The pattern as written above assumes that your calendar table is called “Calendar445”. If it isn’t, you’ll need to change that part.
This pattern is pretty robust, and, as shown above, will allow you to return the value of the measure for the prior period you provide. But the big question here is how you create the needed columns to do that. So this article will focus on building a calendar with the proper ID columns that you can use to create a 445, 454, 544 or 13 month/year calendar. By doing so, we open up our ability to use Rob Collie’s GFITW pattern for custom calendar intelligence in Power Pivot.
If you’ve never used one of these calendars, the main concept is this: Comparing this month vs last month doesn’t provide an apples to apples comparison for many businesses. This is because months don’t have a consistent number of days. In addition, comparing May 1 to May 1 is only good if your business isn’t influenced by the day of the week. Picture retail for a second. Wouldn’t it make more sense to compare Monday to Monday? Or the first Tuesday of this month vs the first Tuesday of last month? That’s hard to do with a standard 12 month calendar.
So this is the reason for the custom calendar. It basically breaks up the year into chunks of weeks, with four usual variants:
445: These calendars have 4 quarters per year, 3 “months” per quarter, with 4 weeks, 4 weeks and 5 weeks respectively.
454: Similar to the 445, but works in a 4 week, 5 week, 4 week pattern.
544: Again, similar to 445, but works in a 5 week, 4 week, 4 week pattern.
13 periods: These calendars have 13 “months” per year, each made up of 4 weeks
The one commonality here is that, unlike a standard calendar, the custom calendar will always have 364 days per year (52 weeks x 7 days), meaning that their year end is different every year.
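To see how the year end drifts, here is a small Python sketch (not Power Query) that walks forward in 364-day years. It assumes a first fiscal year starting on Sunday, Jan 3, 2016; `fiscal_year_bounds` is a hypothetical helper name.

```python
from datetime import date, timedelta

DAYS_PER_FISCAL_YEAR = 52 * 7  # 364 days, always


def fiscal_year_bounds(first_start, years):
    """Return (start, end) date pairs for consecutive 364-day fiscal years."""
    bounds = []
    start = first_start
    for _ in range(years):
        end = start + timedelta(days=DAYS_PER_FISCAL_YEAR - 1)
        bounds.append((start, end))
        start = end + timedelta(days=1)
    return bounds


bounds = fiscal_year_bounds(date(2016, 1, 3), 3)
for start, end in bounds:
    print(start.isoformat(), "to", end.isoformat())
```

With that start date the year ends fall on Dec 31, 2016, then Dec 30, 2017, then Dec 29, 2018, while every year still begins on a Sunday.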
Creating a custom calendar
In order to work with Rob’s pattern, we need a date column plus 5 ID columns:
A contiguous date column (to link to our Fact table)
YearID
QuarterID
MonthID
WeekID
DayID
With each of those, we can pretty much move backwards or forwards in time using the GFITW pattern.
Creating a custom calendar – Creating a contiguous date column
To create our contiguous date column, we have a few options. We could follow the steps in this blog post on creating a dynamic calendar table. Or we could skip the fnGetParameter function, and just directly query our parameter table. Whichever method you choose, there is one REALLY important thing you need to do:
Your calendar start date must be the first date of (one of) your fiscal year(s).
It can be this year or last year, but you need to determine that. I’m going to assume for this year that my year will start on Sunday, Jan 3, 2016, so I’ll set up a basic table in Excel to hold the dates for my calendar:
Notice the headers are “Parameter” and “Value”, and I also named this table “Parameters” via the Table Tools –> Design tab. For reference, the Start Date is hard coded to Jan 3, 2016, and the End Date is a formula of B4+364*2 (running the calendar out two years plus a day.)
Now I’m ready to pull this into Power Query and build my contiguous list of dates.
Select any cell in the table –> Create a new query –> From Table
Remove the Changed Type step (as we don’t really need it)
This should leave you with a single step in your query (Source), and a look at your data table.
Click the fx button on the formula bar to add a new custom step
This will create a new step that points to the previous step, showing =Source in the formula bar. Let’s drill in to one of the values on the table. Modify the formula to:
=Source[Value]{0}
Reading this, we’ve taken the Source step, drilled into the [Value] column, and extracted the value in position 0. (Remembering that Power Query starts counting from 0.)
Now this is cool, but I’m going to want to use this in a list, and to get a range of values in a list, I need this as a number. So let’s modify this again.
We’ve now got what we need to create our calendar:
Click the fx button to create a new step
Replace the text in the formula bar with this:
={StartDate..EndDate}
If you did this right, you’ve got a nice list of numbers (if you didn’t, check the spelling, as Power Query is case sensitive). Let’s convert this list into something useable:
Go to List Tools –> Transform –> To Table –> OK
Right click Column1 –> Rename –> DateKey
Right click DateKey –> Change Type –> Date
Change the query name to Calendar445
Right click the Changed Type step –> Rename –> DateKey
The result is a nice contiguous table of dates that runs from the first day of the fiscal year through the last date provided:
Creating a custom calendar – Adding the PeriodID columns
Now that we have a list of dates, we need to add our PeriodID columns which will allow the GFITW to function.
Creating a custom calendar – DayID column
This column is very useful when calculating other columns, but can also be used in the GFITW formula to navigate back and forward over days that overlap a year end. To create it:
Go to Add Column –> Index –> From 1
Change the formula that shows up in the formula bar to:
=Table.AddIndexColumn(DateKey, "DayID", 1, 1)
Right click the Added Index step –> Rename –> DayID
NOTE: The last two steps are optional. Instead of changing the formula in the formula bar, you could right click and rename the Index column to DayID. Personally, I like to have fewer steps in my window though, and by renaming those steps I can see exactly where each column was created when I’m reviewing it later.
What we have now is a number that starts at 1 and goes up for each row in the table. If you scroll down the table, you’ll see that this value increases to 729 for the last row of the table. (Day 1 + 364*2 = Day 729).
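As a quick cross-check of that row count, the same date list and DayID can be sketched in Python (not M). It assumes the Jan 3, 2016 start date and an end date 364*2 days later, as in the parameter table.

```python
from datetime import date, timedelta

start_date = date(2016, 1, 3)
end_date = start_date + timedelta(days=364 * 2)  # mirrors the B4+364*2 formula

# One row per day, inclusive of both ends, with DayID counting up from 1.
calendar = [
    (start_date + timedelta(days=offset), offset + 1)
    for offset in range((end_date - start_date).days + 1)
]

print(calendar[0])
print(calendar[-1])
print(len(calendar))
```

Both end dates are included, which is why the table has 729 rows rather than 728.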
Creating a custom calendar – YearID column
Next, let’s create a field that will let us navigate over different years. To do this, we will write a formula that targets the DayID column:
Go to Add Column –> Add Custom Column
Name: YearID
Formula: =Number.RoundDown(([DayID]-1)/364)+1
Right click the Added Custom step –> Rename –> YearID
If you scroll down the table, you’ll see that our first year shows a YearID of 1, and when we hit day 365 it changes:
The reason this works for us is this: We can divide the DayID by 364 and round it down. This gives us 0 for the first year values, hence the +1 at the end. The challenge, however, is that this only works up to the last day of the year, since dividing 364 by 364 equals 1. For that reason, we subtract 1 from the DayID column before dividing it by 364. The great thing here is that this is a pattern that we can exploit for some other fields…
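The effect of subtracting 1 is easy to check with a little arithmetic. This Python sketch (not M) compares the formula with and without the adjustment; `period_id` and `naive_year_id` are hypothetical names used for illustration.

```python
def period_id(day_id, days_per_period):
    """Python equivalent of Number.RoundDown(([DayID]-1)/n)+1."""
    return (day_id - 1) // days_per_period + 1


def naive_year_id(day_id):
    """The same idea without subtracting 1 first, to show the off-by-one."""
    return day_id // 364 + 1


for day in (1, 364, 365, 728, 729):
    print(day, period_id(day, 364), naive_year_id(day))
```

Day 364 stays in year 1 with the adjusted formula, but jumps to year 2 a day early without it. The same helper with 91 or 7 as the period length follows the identical pattern.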
Creating a custom calendar – QuarterID column
This formula is very similar to the YearID column:
Go to Add Column –> Add Custom Column
Name: QuarterID
Formula: =Number.RoundDown(([DayID]-1)/91)+1
Right click the Added Custom step –> Rename –> QuarterID
The result is a column that increases its value every 91 days:
It’s also worth noting here that this value does not reset at the year end, but rather keeps incrementing every 91 days.
Creating a custom calendar – MonthID column
The formula for this column is the tricky one, and depends on which version of the calendar you are using. We’re still going to create a new custom column, and we’ll call it MonthID. But you’ll need to pick the appropriate formula from this list based on the calendar you’re using:
Calendar Type
Formula
445
Number.RoundDown([DayID]/91)*3+
( if Number.Mod([DayID],91)=0 then 0
else if Number.Mod([DayID],91)<=28 then 1
else if Number.Mod([DayID],91)<=56 then 2
else 3
)
454
Number.RoundDown([DayID]/91)*3+
( if Number.Mod([DayID],91)=0 then 0
else if Number.Mod([DayID],91)<=28 then 1
else if Number.Mod([DayID],91)<=63 then 2
else 3
)
544
Number.RoundDown([DayID]/91)*3+
( if Number.Mod([DayID],91)=0 then 0
else if Number.Mod([DayID],91)<=35 then 1
else if Number.Mod([DayID],91)<=63 then 2
else 3
)
13 periods
Number.RoundDown(([DayID]-1)/28)+1
As I’m building a 445 calendar here, I’m going to go with the 445 pattern, which will correctly calculate an ever increasing month ID based on a pattern of 4 weeks, 4 weeks, then 5 weeks. (Or 28 days + 28 days + 35 days.)
This formula is a bit tricky, and – like the GFITW pattern – you honestly don’t have to understand it to make use of it. In this case this is especially true, as the formula above never changes.
If you’re interested however, the most important part to understand is what is happening in each of the Number.Mod functions. That is the section that is influencing how many weeks are in each period. The key values you see there:
0: Means that you hit the last day of the quarter
28: This is 4 weeks x 7 days
35: This is 5 weeks x 7 days
56: This is 8 weeks x 7 days
63: This is 9 weeks x 7 days
The Number.RoundDown portion divides the value in the DayID column by 91, then rounds down. Within the first year that returns 0 through 3 for every day except the last (day 364 returns 4, and the number keeps climbing in later years). We then multiply that number by 3 in order to return values of 0, 3, 6, 9 and so on (which turns out to be the month of the end of the prior quarter.)
The final piece of this equation is to add the appropriate value to the previous step in order to get it in the right month. For this we look at the Mod (remainder) of days after removing all multiples of 91. In the case of the 445, if the value is <= 28 that means we’re in the first 4 weeks, so we add one. If it’s >28 but <=56, that means it’s in the second 4 weeks, so we add two. We can assume that anything else should add 3… except if there was no remainder. In that case we don’t add anything as it’s already correct.
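If you'd like to convince yourself that the formulas really produce the right number of days per month, here is a Python translation (not M) of the three 91-day patterns. The `PATTERNS` table and function names are my own labels; the boundary values come straight from the formulas above.

```python
from collections import Counter

# Last day-of-quarter belonging to the first and second "month" of each pattern.
PATTERNS = {"445": (28, 56), "454": (28, 63), "544": (35, 63)}


def month_id(day_id, pattern="445"):
    """Python translation of the MonthID formula for the 445, 454 and 544 calendars."""
    first_end, second_end = PATTERNS[pattern]
    base = (day_id // 91) * 3        # months completed in earlier quarters
    remainder = day_id % 91          # day within the quarter; 0 means the quarter's last day
    if remainder == 0:
        offset = 0
    elif remainder <= first_end:
        offset = 1
    elif remainder <= second_end:
        offset = 2
    else:
        offset = 3
    return base + offset


def days_per_month(pattern, total_days=364):
    """Count how many days land in each MonthID over one fiscal year."""
    counts = Counter(month_id(day, pattern) for day in range(1, total_days + 1))
    return [counts[month] for month in sorted(counts)]


for name in PATTERNS:
    print(name, days_per_month(name))
```

For the 445 pattern every quarter comes out as 28, 28 and 35 days, and day 91 lands in month 3 because the base has already rolled over to 3 while the offset is 0.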
Creating a custom calendar – WeekID column
WeekID is fortunately much easier, returning to the same pattern we used for the YearID column:
Go to Add Column –> Add Custom Column
Name: WeekID
Formula: =Number.RoundDown(([DayID]-1)/7)+1
Right click the Added Custom step –> Rename –> WeekID
The result is a column that increases its value every 7 days:
The last thing we should do before we load our calendar is define our data types. Even though they all look like numbers here, the reality is that many are actually defined as the “any” data type. This is frustrating, as you’d think a Number.Mod function would return a number and not need subsequent conversion.
Right click the DateKey column –> Change Type –> Date
Right click each ID column –> Change Type –> Decimal Number
Go to Home –> Close & Load To…
Choose Only Create Connection
Check Add to Data Model
Click OK
And after a quick sort in the data model, you can see that the numbers have continued to grow right through the last date:
We now have everything we need in order to use the GFITW pattern and get custom calendar intelligence from Power Pivot. Simply update the PeriodID with the period you wish to use. For example, if we had a Sales$ measure defined, we can get last month’s sales using the following:
As an added bonus, as we’re using Power Query, the calendar will update every time we refresh the data in the workbook. We never really have to worry about updating it, as we can use a dynamic formula to drive the start and end dates of the calendar.
As you can see from reading the post, the tricky part is really about grabbing the right formula for the MonthID. The rest are simple and consistent; it’s just that one that gets a bit wonky, as the number of weeks can change. (To be fair, this would be a problem for the quarter in a 13 period calendar as well… one of those quarters will need 4 periods where the rest will need 3.)
One thing we don’t have here is any fields to use as Dimensions (Row or Column labels, Filters, or for Slicers.) The reason I elected not to include those here is that the post is already very long, and they’re not necessary to the mechanics of the GFITW formula.
If you’d like a copy of the completed calendar, you can download it here. Be warned though, that I created this in Excel 2016. It should work nicely with Excel 2013 and higher, but you may have to rebuild it in a new workbook if you’re using Excel 2010 due to the version difference on the Power Pivot Data Model.
We’ll announce the winners on the http://powerquery.training website on February 1, 2016, and contact the winners by email.
Oh by the way…
We sent this challenge out to the people who subscribed to our Power Query newsletter… and got a pretty cool comment back from one entrant:
“After taking your class, this seemed to be a pretty straightforward problem.”
Awesome! That is EXACTLY what we wanted to hear! Do you want to feel that comfortable manipulating data? We’ve still got spots open in our next course intake starting on February 3, 2016. Register here!
Even though this hits on techniques used on this blog before, a colleague asked today “I have a lot of garbage names in a column and only want to keep rows that begin with an alphabetical character.” I figured we’d explore that here.
The issue
The data at hand has a bunch of garbage records, as well as some that are valid. Anything that starts with a number, punctuation or other special character needs to go, leaving only items that start with alphabetical characters.
So basically, we want to take the table in blue, and convert it to the table shown in green, below:
Naturally, the first thing we need to do is bring the data into Power Query:
Select a cell in the data range
Create a new query –> From Table
Now, with the data in the table, our first temptation is to immediately try to come up with a way to figure out how to filter to only alpha characters. Unfortunately, there are 26 of them, and a custom filter will only allow 2 at a time:
Based on the concepts covered in the previous post, this formula takes the Name column, converts it to lower case, then splits the text at any occurrence of the letters provided between the quotes. This actually returns a list, so we use {0} to drill into the first instance. The trick here is that if the text starts with a letter, it splits and results in a blank record. If it’s not alphabetical, however, it gives us the character(s) that aren’t alphabetical:
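The formula itself isn't reproduced above, so here is a Python sketch (not M) that simply follows the description: lower-case the name, split at any letter, and look at the first piece. The function names and sample values are assumptions for illustration.

```python
import re


def leading_non_alpha(name):
    """Return whatever sits in front of the first letter (empty if the name starts with a letter)."""
    # Splitting at every a-z character mirrors "split at any of these letters";
    # index 0 is the piece before the first letter.
    return re.split("[a-z]", name.lower())[0]


def starts_with_letter(name):
    """A row is worth keeping when nothing precedes its first letter."""
    return leading_non_alpha(name) == ""


samples = ["Smith", "123 Main", "#N/A", "!!!"]
for value in samples:
    print(repr(value), repr(leading_non_alpha(value)), starts_with_letter(value))
```

Rows where the first piece is empty are the ones to keep. Note that this sketch only treats a to z as alphabetical, so accented letters would be filtered out.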
Without question, the Expand feature that shows up when you are merging tables using Power Query is very useful. But one of the things I’ve never called out is that – when you are merging tables – you have the opportunity to aggregate data while expanding columns.
The one on the left is Inventory items, and holds the SKU (a unique product identifier), the brand, type and sale price. The table on the right is our Sales table, and holds the transaction date, SKU sold (many instances here), brand (for some reason) and the sales quantity. And, as it happens, I already have two queries set up as connections to these tables:
(Both of these were created by selecting a cell in the table, creating a new query –> From Table, setting the data types, then going to Close & Load To… –> Only Create Connection.)
Step 1: Join the Sales table to the Inventory table
The first thing we need to do is merge the two tables together. We will use the default (Left Outer) join (as described in this post) to make this happen:
Go to the Workbook Queries pane –> right click Inventory –> Merge
The key to understanding this is that the fields in the top will be preserved, the fields in the bottom will be aggregated (or grouped) together. Any columns in your original data set that you don’t specify will just be ignored.
Now let’s look at an alternate method to do the same thing…
Start by following Step 1 exactly as shown above. None of that changes. It’s not until we get to the part where we have the tables merged and showing a column of tables that the methods depart.
So this time:
Click the expand button
Click the Aggregate button at the top of the expand window:
The logic here is that, if the field is a date or text, it defaults to offering a count of the data in that column for each sales item I have. But if I click on the Sum of Sales Quantity, I get the option to add additional aggregation levels:
This is cool, as we don’t have to first expand, then group. And while I haven’t tested this, it only stands to reason that this method should be faster than having to expand all records then group them afterwards.
One thing that is a bit of a shame is that we can’t name the columns in the original aggregation, so we do have to do that manually now:
Right click Sum of Sales Quantity –> Rename –> Total Units Sold
Right click Average of Sales Quantity –> Rename –> Avg Units Sold
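Conceptually, aggregating while expanding boils down to grouping the Sales rows by SKU and attaching the totals to each Inventory row. Here is a minimal Python sketch (not M) with made-up SKUs and quantities; how Power Query reports items with no sales is assumed here to be an empty value.

```python
from collections import defaultdict

# Hypothetical sample data; the real tables are not reproduced here.
inventory = [
    {"SKU": "A100", "Brand": "Acme", "Sale Price": 9.99},
    {"SKU": "B200", "Brand": "Bolt", "Sale Price": 4.50},
    {"SKU": "C300", "Brand": "Acme", "Sale Price": 12.00},  # never sold
]
sales = [
    {"SKU": "A100", "Sales Quantity": 2},
    {"SKU": "A100", "Sales Quantity": 4},
    {"SKU": "B200", "Sales Quantity": 5},
]


def merge_and_aggregate(inventory_rows, sales_rows):
    """Left outer join on SKU, summarising the matching sales instead of expanding them."""
    quantities = defaultdict(list)
    for sale in sales_rows:
        quantities[sale["SKU"]].append(sale["Sales Quantity"])

    result = []
    for item in inventory_rows:
        sold = quantities.get(item["SKU"], [])
        result.append({
            **item,
            # Assumption: items with no sales get None rather than 0.
            "Total Units Sold": sum(sold) if sold else None,
            "Avg Units Sold": sum(sold) / len(sold) if sold else None,
        })
    return result


summary = merge_and_aggregate(inventory, sales)
for row in summary:
    print(row)
```

Every Inventory row is preserved exactly once, no matter how many Sales rows match it.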
This post solves a tricky issue of removing offset duplicates or, in other words, removing items from a list that exist not only in a different column, but also on different rows.
Problem History
This data format is based on a real life example that my brother in law sent me. He is a partner in a public practice accounting firm*, and has software to track all his clients. As he’s getting prepared for tax season he wants to get in contact with all of his clients, but his tax software dumps out lists in a format like this:
As you can see, the clients are matched to their spouses, but each client (spouse or not) has their own row in the data too. While this is great to build a list of unique clients, we only want to send one letter to each household.
The challenge we have to deal with here is to create a list of unique client households by removing the spouse (whoever shows up second) from the list. The things we need to be careful of:
Not accidentally removing too many people based on last name
Getting the duplicate removal correct even if the spouse has a different last name
So basically, what we have here now is a client ID for each "Client" (not spouse) in our list.
Figuring out the Spouse's ClientID
The next step to this problem is to work out the Spouse's client ID for each row as well. To do that we're going to employ a little trick I've actually been dying to need to use.
See, ever since I've started teaching Power Query to people, I've mentioned that when you go to append or merge tables, you have the option to merge the table you're working on against itself. As I've said for ages "I don't know when I'll need to use this, but one day I will, and it's comforting to know that I can." Well… that day is finally here!
Go to Home –> Merge Queries
From the drop down list, pick to merge the query to itself
Now comes the tricky part… we want to merge the Client with the Spouse, so that we can get the ClientID number that is applicable to the entries in the Spouse columns. So:
In the top table, select Client FirstName –> hold down CTRL –> select Client LastName
In the bottom table, select Spouse FirstName –> hold down CTRL –> select Spouse LastName
Looks good, and if you check the numbers, you'll see that our new column has essentially looked up the spouse's name and pulled the correct value from the ClientID column. (Zoe Ng has a client ID of 2. Zoe is also Tony Fredrickson's spouse – as we can see on row 4 – and the Spouse ID points back to Zoe's value of 2.)
Remember how I mentioned to pay attention to the order of the records in the previous step? Have a look at the ClientID column now. I have NO IDEA why this changed, but it happened as soon as we expanded the merged column. I'm sure there must be some logic to it, but it escapes me. If you know, please share in the comments. It doesn't affect anything – we could sort it back into ClientID order easily - it's just odd.
At any rate, we can now fully solve the issue!
Removing Offset Duplicates
So we have finally arrived at the magic moment where we can finish this off. How? With the use of a custom column:
Go to Add Column –> Add Custom Column
Provide a name of "Keep?"
Enter the following formula:
if [ClientID]<[SpouseID] then "Keep" else "Remove"
The trick here is that we are using the first person in the list as the primary client, and the spouse as the secondary, since the list is numbered from top to bottom. Since we've looked up the spouse's ID number, we can then use some very simple math to check if the ClientID number is less than the Spouse's ClientID. If it is we have the primary client, if not, we have the spouse.
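The whole technique can be sketched end to end in Python (not M). Zoe and Tony sit in rows 2 and 4 as in the example; the other couple and the single client are made up, and keeping clients with no spouse is my own assumption, since that case isn't covered above.

```python
# Each row: (client first, client last, spouse first, spouse last).
rows = [
    ("Ann", "Lee", "Bob", "Lee"),             # hypothetical couple
    ("Zoe", "Ng", "Tony", "Fredrickson"),
    ("Bob", "Lee", "Ann", "Lee"),
    ("Tony", "Fredrickson", "Zoe", "Ng"),
    ("Sam", "Solo", None, None),              # hypothetical client without a spouse
]


def unique_households(client_rows):
    """Keep one row per household: whichever partner appears first in the list."""
    # ClientID is simply the row position, counting from 1.
    client_ids = {
        (first, last): client_id
        for client_id, (first, last, _, _) in enumerate(client_rows, start=1)
    }
    kept = []
    for client_id, row in enumerate(client_rows, start=1):
        # Look the spouse's name up among the clients to find the spouse's ClientID.
        spouse_id = client_ids.get((row[2], row[3]))
        # Assumption: a client with no matching spouse row is kept.
        if spouse_id is None or client_id < spouse_id:
            kept.append(row)
    return kept


households = unique_households(rows)
for row in households:
    print(row[0], row[1])
```

Matching on first and last name together is what stops Ann Lee and Bob Lee from being confused with each other, and it still pairs Zoe Ng with Tony Fredrickson despite the different last names.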
So let's filter this down now:
Filter the Keep? column and uncheck the Remove item in the filter
Select the ClientID, SpouseID and Keep? columns –> right click –> remove
And finally we can go to Home –> Close & Load
And there you are… a nice list created by removing offset duplicates to leave us with a list of unique households:
Just a quick note to say that even though I'm an accountant, my brother in law Jason is so good at tax that I use him to do mine. If you need a good accountant in BC, Canada, look him up here.
The intake is closing soon for our first Power Query class of 2016, which starts on February 3, 2016.
If you haven't heard about this, or you've been considering taking it but haven't signed up yet, you've been missing out. We truly believe that you'll never take a course that can have this much impact on your job. If you routinely clean up and prepare data before you can analyze it, and you're NOT using Power Query to do it, you're putting in too much effort and doing too many things over again. Quite simply, you (or your staff) are wasting their time. You owe it to yourself to join us and find out how you can significantly decrease or eliminate data preparation time and devote your skills to what they were hired for: analyzing results and reacting to them.
What is included?
We've reviewed the course since we started airing it last year, and overall have been very pleased with the feedback that we've received, as well as the way it's been delivered. In case you weren't aware, every registration includes:
Full downloadable recordings of the entire training event. (So if you have to miss some time, it's okay, as you get to download it later to re-watch it on your schedule. We've found people really like this, as it helps not only with time zone issues, but also allows you to review the material at a later date when you are trying to implement your own solutions.)
Copies of every workbook used in the workshop delivery
Access to our SQL Azure database so you can practice working with data in SQL
6 practice labs with full written and video solutions.
Real world examples to explain not only how to do the job, but also the value proposition of using Power Query
Explanations and demos of pitfalls, hurdles and gotchas!
A free digital copy of M is for Data Monkey
A Q&A day to ask questions about applying the techniques to YOUR data
Course Improvements
We're really proud of all of that. But one part bothered us in our initial setup… we felt that our Q&A day came a bit too fast, and didn't allow people enough time to really use Power Query to any great degree. To that end we are still offering a Q&A day – heck, we think this is a huge value proposition to the course as you can submit your own issues and we will solve them for you! – but we have bumped the date out a bit. Instead of hosting our Q&A session one week after the main course, we are now hosting it two weeks after the final day. We feel that this should allow more time for our attendees to experiment with their data and submit even more challenges for Miguel and me to solve and demo for you. And remember, the entire Q&A session is recorded for you to download too… so even if you can't make it, you can still submit your questions and get them answered!
Long lasting training resources
We've worked really hard on this course, and tried to make this one of the most complete training packages on the planet. We've included as many resources as possible to get you up and running with kick-butt and maintainable solutions as quickly as possible. We've worked hard to give you resources that will FAR outlast your time in our class and impact the way you work with data forever. Don't miss your opportunity to jump on this, as our next intake won't be until some time in April! Why miss out on a whole 2 months of productivity gains?
Even better, the skills you'll learn here aren't just applicable to Excel 2010 and higher… they are also applicable to Power BI and Power BI desktop. So you're learning material that will help you with multiple programs in one session!
Need professional development hours?
We are more than happy to provide you with a certificate of completion, as well as the actual hours you are logged in online.
Discounts available if you register now!
This training will pay for itself, we're sure of it. But to make that even more likely, we're offering you a 10% discount on the list price of $595 USD. Use code GPCPA1 at checkout and we'll knock $59.50 off the price, but only until January 31…which is coming up in a couple of days!
Register for the first Power Query class of 2016 here
To register or learn more about the course, head on over to http://powerquery.training/course We hope you'll join us so that we can help transform your Excel skills into a whole new level of awesomely efficient!
Yes, you read that right. If you haven't heard yet, I'll be coming to New Zealand and Australia in just under a month! And the entire purpose of the trip is to come and share Excel knowledge with my friends and colleagues south of the Equator.
I'm pretty jazzed about this, and not just because I get to go to the southern hemisphere for the first time in my life. And also not just because I get to talk about Excel when I'm there. That would be enough, but no… I'm jazzed because I get to do this with some pretty cool friends who are world respected leaders in their area.
Excel Summit South
The main purpose for my trip is the Excel Summit South conference. Two days, two tracks of advanced Excel material in 3 different cities:
Mar 3&4: Auckland, New Zealand
Mar 6&7: Sydney, Australia
Mar 9&10: Melbourne, Australia
And the best part about this conference is that – while it's sponsored by Price Waterhouse Coopers – registration is open to everyone. So basically, you can check the schedule, pick the sessions that interest you, and learn things that will impact your Excel skills. In other words, if Valuation Modelling isn't your thing, then you can go to a Power Query class. And if Power Query isn't your thing… well… you're kind of odd, but there will be something that is.
The cast and crew for this conference really can't be beat. Charles Williams, Bill Jelen, Jon Peltier, Zack Barresse and Liam Bastick are all Excel MVPs on the bill (as well as Ingeborg Hawighorst in our New Zealand appearance.) Heck, we've even got a couple of guys from Microsoft attending and presenting as well. This is a fantastic opportunity to not only meet some of the big hitter independent Excel folks out there, but also to talk to Microsoft directly. How can you pass that up?
My Sessions
If it hasn't shown yet, I'm seriously looking forward to this conference. Personally I'll be leading two sessions:
An End to Manual Effort: The Power Query Effect
What Power Query is, why you care, and how it can re-shape and transform the data experience. What's really special about this session is that I'm going to take this data and turn it over to Jon Peltier who is then going to take it and turn it into a dashboard. This is perfect, as I'm demoing how to automate data cleanup, and Jon will show you how to use it to add true business value… the real life cycle of Excel data in just a couple of hours.
The Impact of Power Pivot
This one will be fascinating, especially for those who have never seen Power Pivot in action before. In just an hour I'll show you how big business BI (business intelligence) is at the fingertips of anyone with an Excel Pro Plus license. It's applicable to companies as small as one employee, and scales up to multi-employee small businesses, and even large businesses. (Departments in large corporations eat this up, as they effectively just act as a small business within the larger whole.)
Register for Excel Summit South now!
Tickets are going fast for this event in all cities, so we encourage you to register sooner rather than later, and hope to see you there! You can find out more details and register at: https://excelsummitsouth.wordpress.com/
I was a bit surprised to see some Excel 2016 updates when I opened it up this morning. For reference, I am on an Office 365 early release program – so I might get these a bit before you do – but how cool is this? Some of the key ones that made me take note:
New Formulas
We’ve got some new formulas to add to our arsenal. I haven’t tried any of them yet, but the ones listed were:
I put this last, but to me this is the biggest deal of the whole bunch. The Power Query engine has been updated to version 2.29.4217.xxx. It’s hard to see what’s been added, as the update hasn’t been released for Excel 2010/2013 yet, nor has a detailed feature page…
Having said that, a feature that I asked for a while back has finally been implemented: Monospaced Fonts.
The importance of this is huge. Power Query has always been big on using a pretty font, which wasn’t monospaced. I.e. the characters weren’t the same width. This is a big problem if you are trying to split by number of characters, as they just don’t line up.
Now, there is still an issue… Power Query is still aggressively trimming spaces (something that started with version 2.28.xxx) as you can see below:
How much easier will that be for splitting columns based on width? Like 1000% easier, that’s how much!
Dear Power Query team
This is a fantastic feature, thank you. I’ve got two asks for you:
Can you get us the update for Excel 2010/2013 fairly soon? We need this there as well.
Can you please give me an option to set Monospaced as the default way to display my queries? This is not due to the overzealous trimming issue (which I do want to see fixed) but rather because this is the way I need to see my data come in every time.
I’m only a few days away from my flight to New Zealand to kick off the first leg of Excel Summit South. I’m really looking forward to it. And if you’ve been sitting on the fence as to why you should attend… just ask Jeff Weir. (Seriously, read his post, it’s awesome!) But you need to act quick here, as it’s pretty much your last chance to register for Excel Summit South now.
What it’s all about
This will be a great opportunity to keep up with modeling practices, extend your analysis skills, and see what’s happening with Excel. Full details about the Summit can be found at the Excel Summit South 2016 web page, but you can read about some of the high points below, or – did I mention that you should read Jeff Weir’s Why I’m going to Excel Summit South. (And why you should too) post on Daily Dose of Excel?
When and where is Excel Summit South?
The Summit will take place at these cities on the dates shown:
Auckland: Thurs-Fri 3-4 March (register by 28 February)
Sydney: Mon-Tues 7-8 March (register by 1 March)
Melbourne: Thurs-Fri 10-11 March (register by 6 March)
Last Chance to register for Excel Summit South - Discounts available!
As an additional incentive, we’ve arranged a last chance registration discount, but only up to the date above. Simply REGISTER HERE and use the code LASTCHANCE to save 30% on your registration fees.
23 Excel Master Classes
With your registration, you can choose from 23 master class sessions over two days. There are twin tracks for modelers and analysts alike, and you can jump between if you’d prefer to do so.
Modeling Track – Manage Spreadsheet Chaos, Testing Spreadsheets, Avoiding Common Errors, Modeling Best Practices, Simulation Analysis Without VBA, Power Pivot.
Analysis Track – Tables, Pivot Tables, Power Query, Data Visualization, Dashboards, Automating Excel.
The Who’s Who of Excel…
Learn from six (seven!) leading Excel MVPs as they discuss the Excel topics most useful to you.
Liam Bastick (AU), Zack Barresse (US), Bill “Mr Excel” Jelen (US), Ken Puls (CA), Jon Peltier (US), Charles Williams (UK), with a guest appearance by Ingeborg Hawighorst (NZ) in Auckland.
Hear industry leading speakers about Financial Modeling best practices, standards and spreadsheet risk.
Smita Baliga (PwC), Félienne Hermans (Delft U), Ian Bennett (PwC), Andrew Berkley (F1F9).
Interact with members of the Microsoft Excel Dev Team as you explore with them the future of Excel.
Ben Rampson and Carlos Otero from the Microsoft Excel product team.
Network and Interact
As if the classes weren’t enough, we’ll also have Panel Discussions, Ask The Experts sessions, Demonstrations of Commercial Excel Tools, and even an Evening Meet-up where you can ask your Excel questions over a beer. (Full caveat… the quality of the answers may decline as the evening progresses!)
A shout out to our principal sponsor
Our principal sponsors for this Summit are PwC Australia and PwC New Zealand. We appreciate them coming on board to host this event!
I got a question on the blog recently about creating a banding function in Power Query, or creating buckets for Accounts Receivable transactions. (30-60 days, 60-90 days, etc.) As this is something that can be applied to a lot of areas, I thought it might make a good post to cover.
Picture that you have a list of transactions that could be from 1 – 170 days overdue, and you'd like to group them as follows:
0-30 days (current)
31-60 days
61-90 days
91-120 days
>120 days
You could create a table with 365 days in column 1 and the appropriate description in column 2, then merge them, but that seems like a lot of work. It would be much easier to create a simple little function that banded them correctly for us. Especially if you happen to have a little template that you can refer to…
The Banding function
The banding function template we need is shown below:
days (highlighted in yellow) is the variable that we'll pass into our function to evaluate
ARBand is the name of our function
Between the indented curly braces we have a list of the potential outcomes we'd like to use for our bands. If the value of x (which we will test) is less than 31, it is labelled "Current". If not, then -- if it's less than 61 -- it is labelled "30-60 Days" and so on. The final clause (=>true) basically returns an "else" statement.
The Result line then checks the days variable against the list and returns the correct match or the "else" clause if no match is found (">120 Days" in our case)
This banding function is a super useful template that you can modify to suit any grouping needs. If you are updating this function for your own scenario, make sure that the yellow pieces match, the orange pieces match, then change the number bands and offsetting text pairs (ensuring that they remain wrapped in quotes.)
You can add as many steps (bands) as you need, just make sure that each line ends with a comma, and the =>true line stays at the end of the list.
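For reference, here is a minimal sketch of what such a banding function could look like in M. It is reconstructed from the description above rather than copied from the original, and the labels for the two middle bands are assumptions that simply follow the same naming pattern:

```powerquery
// Sketch only: takes a number of days and returns the matching band label
(days as number) =>
let
    // Each band is a {test function, label} pair, checked from top to bottom
    ARBand =
    {
        {(x) => x < 31, "Current"},
        {(x) => x < 61, "30-60 Days"},
        {(x) => x < 91, "60-90 Days"},     // label assumed
        {(x) => x < 121, "90-120 Days"},   // label assumed
        {(x) => true, ">120 Days"}         // the "else" clause, must stay last
    },
    // Keep the pairs whose test passes for days, take the first one, return its label
    Result = List.First(List.Select(ARBand, each _{0}(days))){1}
in
    Result
```

If this were saved as a query called DayBanding, then DayBanding(45) would return "30-60 Days", since 45 fails the first test and passes the second.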
To implement the function:
Create a new query –> from blank query
Enter the Advanced Editor
Paste in the code shown above
Modify your bands to suit
Click OK to exit the advanced editor
Name the function
I obviously didn't need to edit mine, and I called mine "DayBanding".
Setting up the data
There are two pieces that I need to deal with for my scenario. I have a transactions table, but it only lists the original transaction dates. In order to work out the day bands, I need to create a way to show how many days have elapsed. Easy enough to do, I just need to pull in today's date from somewhere.
So I created a simple table that holds today's date. (It's hard coded in the same file, since the transaction dates are hard coded as well.) Regardless, it looks like this.
Since I'm going to need the date to work out the number of days outstanding, I'll start there. The steps to accomplish this:
Select a cell in the parameter table –> New Query –> From Table
Rename the query to "Today"
Click the fx icon in the formula bar
Modify the formula to show as follows:
= Date.From(#"Changed Type"[Value]{0})
(I've discussed this technique a lot on the blog in the past – like in this post – but basically we are drilling in to the first item in the [Value] column of that table, then wrapping the item with the Date.From() function to extract the date.) We'll use this shortly, but first…
Go to home –> Close & Load To… –> Only create connection
And we now have a way to pull up the date when needed.
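Put together, the Today query might look something like the sketch below. The table name "Parameters" is hypothetical (use whatever your parameter table is called), and the sketch assumes the table has a column named Value, as in the formula above:

```powerquery
let
    // "Parameters" is a hypothetical table name, not one taken from the workbook
    Source = Excel.CurrentWorkbook(){[Name = "Parameters"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source, {{"Value", type date}}),
    // Drill in to the first item of the [Value] column and convert it to a date
    Today = Date.From(#"Changed Type"[Value]{0})
in
    Today
```

Because the final step returns a single date rather than a table, any other query can refer to Today as if it were a simple value.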
Grabbing the transactions table
Next I needed to pull in the ARTransactions table, include the date, work out the number of days outstanding, then band it all. Here are the steps I used:
Select a cell in the ARTransactions table –> New Query –> From Table
Add a Custom Column
Name: Today
Formula: =Today
This works since we called our original query Today, and we drilled right in to the date.
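As a rough sketch, working out the days outstanding and applying the band could look like this in M. The [Date] column name and the new column names are assumptions; ARTransactions, Today and DayBanding are the names used in this post:

```powerquery
let
    Source = Excel.CurrentWorkbook(){[Name = "ARTransactions"]}[Content],
    // [Date] is an assumed column name; it needs to be typed as a date
    Typed = Table.TransformColumnTypes(Source, {{"Date", type date}}),
    // Today refers to the connection only query created earlier
    AddedToday = Table.AddColumn(Typed, "Today", each Today, type date),
    // Subtracting two dates gives a duration; Duration.Days extracts the whole days
    AddedDays = Table.AddColumn(AddedToday, "Days Outstanding", each Duration.Days([Today] - [Date]), Int64.Type),
    // Pass the day count to the banding function to get the label
    AddedBand = Table.AddColumn(AddedDays, "Band", each DayBanding([Days Outstanding]), type text)
in
    AddedBand
```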
Now I'd like to build a Pivot Table using this, but I'm not really in love with the idea that I have to load this data to a table first. I mean really, I only added a single column. Normally I'd load this to the data model, but I don't really need Power Pivot for what I want to do. So let's take a look at another little trick that will let us avoid the data duplication that would be caused by loading this to either the Data Model or the Worksheet.
Close & Load To… –> Only Create Connection
Now we need to build the Pivot Table. I'm going to show the steps for this in Excel 2016 (because I'm working on a computer that only has Excel 2016), but you should be able to make this work in Excel 2010/2013 as well.
Insert –> Pivot Table
Choose External Data Source (yes, you read that right) –> Choose Connection
I showed a couple of tricks here: How to use a Banding function, and how to build a Pivot Table directly against a connection only query without having to go through Power Pivot. Both useful things that you should have in your arsenal of tools.
I know that this comes with limited notice but… as many of you know I'm currently in Sydney, Australia, and I'll be in Melbourne in a couple of days for Excel Summit South. Well, as it happens, I'm actually staying in Melbourne for another week to deliver some live Power Query and Power Pivot training for a client.
Well guess what… we still have a bit of room, so we are going to open it up to the general public. If you're interested in a full day of hands on training on either Power Pivot or Power Query, check out what we are doing at Parity Analytic's website, or download the individual brochures here:
A while back I got an email from someone who had taken my Power Query training course online. They were asking how to create a running total, although with some added twists and turns for calculating taxable gains and losses for a stock portfolio. I decided to tackle that using the List.Accumulate() function.
Now, to be fair, I'm not going to demo the whole stock portfolio thing, but I do want to look at the List.Accumulate function as I found this a bit… confusing… to build. It's super useful to be sure, but the help article… it needs work.
The Data
I'm using a pretty simple dialog box, inspired by my time in Australia. You can download a copy from this link, but here's what it looks like:
List.Accumulate(list as list, seed as any, accumulator as function) as any
Arguments:
Argument: Description
list: The List to check.
seed: The initial value seed.
accumulator: The value accumulator function.
Example:
// This accumulates the sum of the numbers in the list provided.
List.Accumulate({1, 2, 3, 4, 5}, 0, (state, current) => state + current) equals 15
Using the List.Accumulate Function
So this formula looks pretty promising. Let's go see how it works…
Click in the table of data –> create a new query –> From Table
Go to Add Column –> Add Custom Column
Formula Name: Initial
Formula:
=List.Accumulate(
#"Changed Type"[Sales],
0,
(state, current) => state + current
)
The tricky part here is the #"Changed Type"[Sales], which provides the list of the sales values from the Changed Type step of the query (that was automatically created when we pulled the data in.)
So this is a bit weird, as it shows the total for all rows, rather than the running total. I figured that you should be able to change the accumulator function… except that there is no documentation about what the options are! (I left some critical feedback on the MSDN site, and would suggest you do too, as that's pretty poor.)
At any rate, I tried dropping the "+ current" from the end, leaving just => state. The result was a 0 value all the way down the column. So that plainly didn't work. Then I tried modifying the formula again, leaving => current instead. The result was 231 on all rows (so the last value in the accumulator.) How 0 + 231 = 1095 I'm not quite sure but whatever. state + current returns the overall total.
So plainly, we can't just use this function on its own.
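One way to read those three results: the accumulator function is called once per list item, state holds whatever the previous call returned (starting with the seed), and current is the item being visited. A small sketch using the list from the documentation example, with step names of my own choosing:

```powerquery
let
    Numbers = {1, 2, 3, 4, 5},
    // Returns state unchanged on every call, so the seed (0) comes back out
    KeepState = List.Accumulate(Numbers, 0, (state, current) => state),
    // Discards state and keeps the item being visited, so the last item (5) comes back out
    KeepCurrent = List.Accumulate(Numbers, 0, (state, current) => current),
    // Adds each item to the carried total: 0+1, 1+2, 3+3, 6+4, 10+5 = 15
    SumAll = List.Accumulate(Numbers, 0, (state, current) => state + current),
    // Collect the three results in a record for easy viewing
    Result = [ReturnState = KeepState, ReturnCurrent = KeepCurrent, ReturnSum = SumAll]
in
    Result
```

Seen this way, the "=> current" variation is not adding anything to the seed at all. It just throws away the running value each time and ends up holding the final item in the list.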
We need the List.Range function!
With the List.Accumulate function returning a total of all rows fed into it, it became plain that we needed to control what was being fed into the list used as a parameter. So I reached back out to MSDN and browsed the site until I located the List.Range function.
Function:
List.Range(list as list, offset as number, optional count as number) as list
Arguments:
Argument: Description
list: The List to check.
offset: The index to start at.
optional count: Count of items to return.
Example:
List.Range({1..10},3,5) equals {4,5,6,7,8}
Using the List.Range function
In order to use the List.Range function, we are going to need to figure out which rows we want. To do that, we need to add an Index column
Add Column –> Add Index Column –> From 1
Then add a column that makes use of List.Range()
Go to Add Column –> Add Custom Column
Formula Name: Initial
Formula:
=List.Range(#"Added Index"[Sales],0,[Index])
So what I'm doing here is feeding in the Added Index step (from adding the Index column), and providing the [Sales] column to get a list. But I'm asking it to return the list for the number of rows as contained in the [Index] column. The result is a green word that says List all the way down the column. But if I select the whitespace beside any of those List items, we can see what is contained within it. Shown below is the list for the Stuffed Koala row:
The final step is to put these together. So let's add a new column again, but this time we'll use that List.Range() function instead of #"Changed Type"[Sales] as shown below
Go to Add Column –> Add Custom Column
Formula Name: Success
Formula:
=List.Accumulate(
List.Range(#"Added Index"[Sales],0,[Index]),
0,
(state, current) => state + current
)
And the result gives us what we were originally looking for:
The only thing left to do is remove the columns we used along the way. Of course, we could just remove those steps, as they never really needed to happen, but I'm going to select them and remove them so that you can see the work in progress.
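Here is a self-contained sketch of the whole technique as a single query. The inline sample data is made up for illustration (it is not the data from the download), and the step names are my own:

```powerquery
let
    // Hypothetical sample data standing in for the worksheet table
    Source = #table(
        type table [Item = text, Sales = number],
        {{"Item A", 100}, {"Item B", 250}, {"Item C", 75}}
    ),
    // Index starting at 1, so it doubles as the count of rows to include
    AddedIndex = Table.AddIndexColumn(Source, "Index", 1, 1),
    // For each row, sum the Sales values from the first row down to the current one
    RunningTotal = Table.AddColumn(
        AddedIndex,
        "Running Total",
        each List.Accumulate(
            List.Range(AddedIndex[Sales], 0, [Index]),
            0,
            (state, current) => state + current
        ),
        type number
    )
in
    RunningTotal
```

With this sample data the Running Total column works out to 100, 350 and 425.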
I got an email from a friend today who was using some complicated logic to replace specific records in a table with records from another table. His query was running pretty slow, so he reached out for a little help. In this post I'll show how to replace records via joins in Power Query; a much easier (and what should be a faster) solution to his issue.
Data Background
The data footprint that was sent to me looked something like this:
So basically, we want to take the record for Unit002 from the Override table and replace the Unit002 value in the Original Data table.
At first glance, this looks hard. And my friend cooked up something pretty complicated to make this work. Funny thing is (and believe me… I've had this happen to me as recently as last week…) when you put another pair of eyes on it, you suddenly realize it's much easier than you first saw.
In this case we can actually solve this very easily by using a couple of Power Query's different Join types!
Laying the Groundwork
If you want to follow along, grab the sample workbook here. You'll notice that we have taken the following actions already:
Select any cell in the Original Data table
Create a New Query –> From Table
Go to Home –> Close & Load To… –> Connection Only
Select any cell in the Override With table
Create a New Query –> From Table
Go to Home –> Close & Load To… –> Connection Only
Which leaves us with the following queries in the Workbook Queries pane:
This actually takes a Merge and an Append in order to complete the job. So let's start at the merge.
Right click the "Original" query –> Reference
This creates a pointer to the data in the "Original" query, showing all four rows of data in the table. The challenge here is that we only want the rows which are NOT being replaced. The secret to getting those? An Anti-Join!
Go to Home –> Combine –> Merge Queries
Choose the Override query
Select the Unit column on both the top and bottom queries
Change the Join Kind to "Left Anti (rows only in first)"
Why only 3 rows? Because the Left Anti Join only returns the rows which don't match what is in the other table. So where Unit002 exists in the second table, it causes the join to pull everything EXCEPT Unit002 from the left table. (For more on using Anti-Joins in Power Query, see this blog post.)
Joining tables does create a new column however, even if it is full of null values (as this one is.) Since we don't need it, let's just delete that column:
Right click the NewColumn column –> Remove
Now we just need to add the record(s) from the Override table to this list. That's fairly easy:
Go to Home –> Combine –> Append
Choose the Override table
Right click the Unit column –> Sort –> Ascending (this step is optional, and done for readability only.)
And you're done! 5 steps (after the connection only queries were created), 100% user interface driven, and should perform quite quickly.
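For those who prefer to see it as code, the same pattern can be sketched in M roughly as follows. The inline tables and the Value column are hypothetical stand-ins for the Original and Override queries; only the Unit column and the join kind come from the steps above:

```powerquery
let
    // Hypothetical sample data standing in for the two connection only queries
    Original = #table(
        type table [Unit = text, Value = number],
        {{"Unit001", 10}, {"Unit002", 20}, {"Unit003", 30}, {"Unit004", 40}}
    ),
    Override = #table(
        type table [Unit = text, Value = number],
        {{"Unit002", 99}}
    ),
    // Left Anti: keep only the rows of Original with no matching Unit in Override
    Merged = Table.NestedJoin(Original, {"Unit"}, Override, {"Unit"}, "NewColumn", JoinKind.LeftAnti),
    // The join adds a new column that we don't need
    Removed = Table.RemoveColumns(Merged, {"NewColumn"}),
    // Append the override record(s) to the rows that remain
    Appended = Table.Combine({Removed, Override}),
    // Optional, for readability only
    Sorted = Table.Sort(Appended, {{"Unit", Order.Ascending}})
in
    Sorted
```

With this sample data the result has four rows, with Unit002 carrying the override value of 99 instead of 20.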
I'm honestly not sure what's taken me so long to do this, but I'm pleased to say that I've finally added a Power Query specific help forum at Excelguru. I'm hoping that this forum becomes THE place to ask and answer Power Query (or Get and Transform) related questions for both Excel and Power BI desktop. After all, we wrote the book, so it only makes sense that we try and host the Q&A on the topic.
Extra monitoring of the Power Query Specific help forum
As the forum gets up and running there are a couple of key people I've added for email notification as well. The intent here is that we get notified when people post questions, and will try to focus on making sure that they get addressed and (hopefully) solved. If you are a Power Query expert and would like to be included in that list, just email me or post on this thread. I'll get you set up. (Make sure you've signed up for an account on the site, as I'll need your user ID to do this.)
Naturally, if you've got a question pertaining to the topic posted on the blog, you can still ask it here. If the question is a bit more general though, I'd encourage you to sign up at www.excelguru.ca/forums and post the question in the Power Query forum.
Re-focusing on the Power Technologies
While I was setting this up, I also took the time to set up specific forums for some of the other "Power" stack:
The April 2016 Power Query Update was just released for Office 365 subscribers, and I can confirm that it is available to the First Release customers, as I’ve already got it installed. (If you’re on a later branch it may be a bit longer.) It’s also available for download for Excel 2010/2013 customers.
WARNING TO EXCEL 2016 USERS!!
If you read Power Query data from named ranges, I HIGHLY recommend that you avoid updating your software to the newest release right now if you can. The latest builds on the insider track have caused a rather large issue if you are sourcing from a named range that doesn’t have an equal offset of rows/column. I.e. if your source range doesn’t start in A1, B2, C3, D4, etc… then it pulls the wrong range. Tables are fine, named ranges are the issue. Microsoft knows, has architected a fix, but it hasn’t been pushed out yet. I’ll update this as soon as it has.
The problem is not an issue in Excel 2010/2013 running version 2.29.x or the current 2.30.x. It is only affecting Excel 2016.
What’s in the April 2016 Power Query update?
At first glance it doesn’t seem like a ton – only two that they are calling out – but I think that this will make a few people pretty happy.
ODBC Connectivity Improvement
The first is that they’ve added the ability to easily select from a list of available DSNs when you’re setting up a Power Query connection against an ODBC data source. No new functionality there, but it saves you the headache of having to manually enter the connection string (which you can still do if needed.)
Hey everyone, we need your votes to make a difference in Power Query and Power Pivot! There are a couple of items in the uservoice forums that I’d like to bring your attention to, and hopefully entice you to vote them up. The more votes we get, the easier it is for the program managers in the design teams to get the support to actually implement these features. They ARE interested, they just need you to up-vote them to get them done.
Where we need your votes:
#1 - Add Intellisense to the M Editor
So the idea here is simple: Add Intellisense and better general editing capabilities in the Advanced Editor. This would make a huge difference to those of us writing M code, and I’ve also suggested in the comments that this be extended to the Add Custom Column dialog.
What kills me on this is the signature of the original submitter: “Software Engineer, Power BI Desktop”. I don’t think I’ve ever seen a clearer case where they need our help to justify the budget to get this done. Please go there and throw some votes on this.
#2 – Modification of the Power Pivot Field List experience
Back in November I posted a suggestion to improve the Pivot Table experience which would benefit everyone, but especially Power Pivot users. Full details of my suggestion can be found on the blog here, but the basic summary is this: Allow the fields area to be collapsible in side by side view. This would make it WAY easier to rearrange fields by reducing unused whitespace.
I was really encouraged to see Ashvini Sharma’s response which, paraphrased, says: “We want to do this too, so please get enough votes to help us justify it!”
Please, take some time to throw some votes on these ideas, and encourage every other user you know to do the same. It’s super easy to do, just go there, click the Vote button, assign as many as you want and verify you’re real with your email address. (The only email I’ve ever received from this is when they confirmed a feature got implemented.)
Again, we need your votes. Help us out! I’d like to see both of these hit 500 votes in order to give Microsoft the justification they need to get these done.
Well, there’s good news and bad this time. I just updated my Excel 2016 to 16.0.6868.2048 (First release version) and there is a fix and a new bug evident.
… if you are sourcing from a named range that doesn’t have an equal offset of rows/column. I.e. if your source range doesn’t start in A1, B2, C3, D4, etc… then it pulls the wrong range. Tables are fine, named ranges are the issue…
The issue was reported, fixed, and build 16.0.6868.2048 (which I finally got today) has fixed the issue.
I have to say that this is pretty cool. Even though I was frustrated having to wait 2 weeks for a fix, the fact that it was only 2 weeks is pretty darned amazing. In past cycles, this would have been several years until a new build of Excel came out. So even though we see new bugs, we also need to recognize that the team is working very hard to try and be responsive to them and get the fixes pushed VERY quickly.
Unfortunately, a new bug
I’m not actually sure if this is new in a more recent update, or if it was there in the previous build and I just didn’t notice. (While I do check updates almost daily, I don’t actually use every feature of Power Query every day.) While this bug doesn’t prevent you from using your models, it is pretty irritating… Since I know this was a pretty major contention point for many in the past, I figured I should talk about it.
Once again, this is Excel 2016 specific, and doesn’t affect Excel 2010/2013.
So let’s assume you have some data as shown below:
The green table is a simple query that imports the original data, and sets the first column to a Date data type, then loads it to the table. (Nothing fancy, it’s just that simple.)
The orange table was created by right clicking the green table’s query and choosing “Duplicate”, then loading it to the worksheet. It is an EXACT copy of the query that leads to the green table.
Make sense so far? Now, let’s add a couple of rows to the blue table, then hit refresh all. What you should expect to get is this:
Check out the green table… those dates are pretty impressively formatted as serial numbers, not dates. But yet, the orange table – an exact duplicate – is fine. Huh? Doesn’t this feel like a throw-back to early Power Query days, where tables didn’t hold the formatting properly?
Here’s the best we have from a temporary workaround point of view (courtesy of the Excel/Power Query team):
Select the green table –> Table Tools –> Design –> Properties
Stellar! The number formats have remained, but the table style formatting has changed to a different one. Ugh.
Now, it IS still a table. But unfortunately the style and the number formats all seem to be controlled by that one selection. So until they fix this, it appears that you can either have your tables pretty, or you can have your number formatting correct.
Or maybe you can create your query, immediately duplicate it for your reports, then delete the original, as the second one seems to behave properly. (I have no idea why this is.)
Final thoughts on “a fix and a new bug”
The subscription model is a new thing for us, and personally, I’m pretty high on it, despite these kinds of issues. My hope is that – with the connections I have at Microsoft – I’m in the first ring of testers, and can get this stuff fixed before it hits you. I’d highly suggest you also have one person in your company in the “First Release” program for this reason as well.
My understanding of this method is that the fixes we get into the First Release band are implemented before that version is shipped to the General Release band of users. That’s a good thing, as the last thing we want to see is our end users having to experience two months of the first issue listed here!
With regards to the new bug, I’ve again reported this to the Power Query team. They’re aware, and we are having some active dialog about it. I know they are going to fix it, but I’m not sure how soon. (I really hope it’s as quickly as the last one, as this is pretty visible!)
One of the questions I get quite frequently is how we can pass parameters to SQL queries, allowing us to make them dynamic. Someone asked a question in the forum on the topic, so that finally inspired me to write it up.
Before you do this…
There are a lot of reasons that you should NOT go this route.
Power Query is designed to take advantage of query folding when targeted against a database. While this isn’t technically correct, you can look at query folding like this… As you connect to the database and start filtering and grouping via the user interface, Power Query passes these commands to the database to execute. This means that the majority of the heavy lifting is done at the database, not your workstation. This continues to occur until you hit a command that the database doesn’t know, at which point it returns the data to Power Query, and the rest is done in your instance of Excel.
Why is this an issue? Dynamic parameters use custom lines of code, which can’t be folded. In the case of the poster’s question, this means that he just asked to bring 40 million records into Excel, rather than filtering them down on the server. Take a guess as to which one is faster?
But what if I want to Pass Parameters to SQL Queries dynamically?
Listen, I totally get it. I love dynamic queries too. And anyone who knows me is aware that I’m super obsessed with making my data tables as short and narrow as possible. This almost always involves filtering based on parameters that the user needs to pass at run time. So are we dead in the water? No, not by a long shot.
How to Pass Parameters to SQL Queries – Method 1
I’m a big fan of query folding, and would encourage you to use it wherever possible. In order to do that, you should use the user interface to connect to the database and drive as many filter, sort and group by operations as you possibly can. The goal is to push as much work to the server as possible, resulting in the narrowest and shortest data table that you can accomplish. Once you have that, land it in a Connection Only query. And from there use your dynamic parameters to filter it down further to get just what you need.
I have no idea what was in the 40 million row data set that the user was working with, but let’s assume it was 40 years of data related to 4 divisions and 30 departments. Assume that our user (let’s call him Mark) wants to bring in the last 2 years of data, is focussing only on the Pacific division, and wants to give the user a choice over which department they need to work with. For ease of assumption, we’ll assume that each year is 1/40 of the total record load and each division provides 1/4 of the total records. (Yes, I’m aware that nothing is that easy… this is an illustration of concept only!)
The recommended steps would be to do this:
Create the Staging query
Connect to the raw database table
Filter to only the Pacific Division – Server reduces 40m to 10m records
Filter to only the 2 years of data – Server reduces 10m to 500k records (10/40*2)
Land this output into a staging query – 500k records total
Create the parameter table and the fnGetParameter query
Create a query that references the Staging query and filters the department to the one pulled via the fnGetParameter query
That should take as much advantage of query folding as possible, and means that Power Query only needs to process 500k records against our dynamic criteria.
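The steps above could be sketched roughly as two queries. Everything named here (server, database, table, the FiscalYear column, the 2015 cut-off and the "Department" parameter name) is an assumption for illustration. The first query holds only the hard coded filters and is loaded as Connection Only:

```powerquery
// Query name: Staging (load as Connection Only)
let
    // Placeholder server and database names (assumptions)
    Source = Sql.Database("myserver.example.com", "SalesDB"),
    // Hypothetical raw table
    Transactions = Source{[Schema = "dbo", Item = "Transactions"]}[Data],
    // Hard coded filters, so the server can do the work
    PacificOnly = Table.SelectRows(Transactions, each [Division] = "Pacific"),
    // Hypothetical numeric FiscalYear column; 2015 stands in for "the last 2 years"
    LastTwoYears = Table.SelectRows(PacificOnly, each [FiscalYear] >= 2015)
in
    LastTwoYears
```

The second query references Staging and applies the dynamic filter, assuming fnGetParameter returns the value typed into the parameter table:

```powerquery
// Query name: Department
let
    // Reference the staging query rather than the database directly
    Source = Staging,
    // Pull the user's choice from the worksheet parameter table
    SelectedDept = fnGetParameter("Department"),
    // This filter is the part evaluated against the 500k staged records
    Filtered = Table.SelectRows(Source, each [Department] = SelectedDept)
in
    Filtered
```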
Where Method 1 breaks down
But what if the data set is still too big? What if you need to parameterize the Division, the Date Range and the Department? In order to avoid issues from the formula firewall, you would have to do this:
Create the Staging query
Connect to the raw database table
Create the parameter table and the fnGetParameter query
Create a query that references the Staging query and…
Collects the Division, Date and Department variables
Filters to only the Pacific Division
Filters to only the 2 years of data
Filters to only the selected department
Seems about the same, doesn’t it? Except that this time we can’t take advantage of query folding. To pass a parameter to the database, we have to separate the database connection from the parameters in order to avoid the formula firewall. This means that we break query folding. And this means that Power Query needs to pull in all 40 million records, and process them. Not the server, your Excel instance.
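For comparison, a sketch of that fully parameterized version might look like the following. It assumes a Staging query that does nothing but connect to the raw table, and the parameter names ("Division", "Start Year", "Department") and column names are hypothetical. Per the description above, every one of these filters ends up being evaluated by Power Query rather than by the server:

```powerquery
// Query name: Filtered (Staging here only connects to the raw table)
let
    Source = Staging,
    // Collect the Division, Date and Department variables from the parameter table
    SelectedDivision = fnGetParameter("Division"),
    StartYear = fnGetParameter("Start Year"),
    SelectedDept = fnGetParameter("Department"),
    // All three filters now depend on values that come from the worksheet
    FilteredDivision = Table.SelectRows(Source, each [Division] = SelectedDivision),
    FilteredYears = Table.SelectRows(FilteredDivision, each [FiscalYear] >= StartYear),
    FilteredDept = Table.SelectRows(FilteredYears, each [Department] = SelectedDept)
in
    FilteredDept
```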
I don’t know how much RAM you have (and don’t care unless you’re on 64 bit), or how many processor cores you have (as Power Query is single threaded), but you’re in for a LOOONNNGGG wait… if it doesn’t just tip over on you.
So how do we fully parameterize this stuff?
How to Pass Parameters to SQL Queries – Method 2
The good news is that it can be done; the bad news is that you need:
SQL Skills, and
An adjustment to the default Power Query load behaviour
Let’s talk SQL Skills
The reason you need SQL skills is that you need to be able to write the most efficient query you possibly can, and pass this into the query when you connect to the database. (Those italics, as you’ll see, are key.) So, let’s assume that I want to connect to the old AdventureWorks database and pull records from the Sales.SalesOrderHeader table where CustomerID = 29825. You need to be able to write this:
SELECT * FROM Sales.SalesOrderHeader WHERE CustomerID='29825'
Why? Because you need to include that query when you’re building/testing your code. It goes in the advanced options, below:
(You may need to trust the Native Database Query to get to the next step.)
So that created the following code for me:
let
    Source = Sql.Database("azuredb.powerqueryworkshop.com", "AdventureWorks2012", [Query="SELECT * FROM Sales.SalesOrderHeader WHERE CustomerID='29825'"])
in
    Source
Modifying the hard coded query to use dynamic parameters
let
    //Pull in the values from the parameter table
    dbURL = fnGetParameter("Database URL"),
    dbName = fnGetParameter("Database Name"),
    dbTable = fnGetParameter("Database Table"),
    sFilterField = fnGetParameter("Filter Field"),
    sFieldValue = Text.From(fnGetParameter("Field Value")),
    //Create the query
    dbQuery = "Select * FROM " & dbTable & " WHERE " & sFilterField & "='" & sFieldValue & "'",
    //Get the data
    Source = Sql.Database(dbURL,dbName,[Query=dbQuery])
in
    Source
I won’t go through the fnGetParameter function, as it’s quite well detailed in the blog post, but the key to understand here is that we are pulling a variety of items from the worksheet table, and putting them all together to feed the lines starting with dbQuery and Source. Between them, these lines dynamically source the database path, database name and the SQL query. Wouldn’t this be heaven if it worked?
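For readers who don't have that function handy, a minimal sketch of what a lookup like fnGetParameter could look like is below. It assumes a worksheet table named Parameters with two columns called Parameter and Value; those names are assumptions for this sketch, so adjust them to match your own parameter table:

```powerquery
// Query name: fnGetParameter
(ParameterName as text) =>
let
    // Assumes an Excel table named "Parameters" with columns "Parameter" and "Value"
    ParamSource = Excel.CurrentWorkbook(){[Name = "Parameters"]}[Content],
    // Keep only the row whose Parameter column matches the requested name
    ParamRow = Table.SelectRows(ParamSource, each [Parameter] = ParameterName),
    // Return null when the parameter name is not found, otherwise its Value
    Value = if Table.IsEmpty(ParamRow) then null else ParamRow{0}[Value]
in
    Value
```

With a row such as "Database Name" / "AdventureWorks2012" in that table, fnGetParameter("Database Name") would return the text AdventureWorks2012, which is how the query above picks up each of its pieces.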
My thought was that I could load this as a connection only query, then reference it and add the dynamic query afterwards. Unfortunately, I got nowhere. I’m not saying it can’t be done, but I couldn’t figure this out. It seems that the only way to pass a query to the database is to pass it during the initial connection. But doing so, by its very nature, violates the formula firewall.
So how can we get past it?
Bypassing the Formula Firewall
Yes, we can make this work. But doing so involves turning off Privacy settings. I’m not generally a big fan of turning off firewalls, but this one just aggravates me to no end. Normally we turn this off to avoid prompting messages. In this case it just flat out prevents us from running. So let’s kill it:
Go to Query Options –> Privacy
Select Ignore the Privacy Levels and potentially improve performance
And at this point you’ll find out that it works beautifully… for you. You’ll most likely need to have anyone else who uses the file set the above option as well.
Security Settings
The security setting we changed above is workbook specific. Now I’m not recommending you do this, but if you get tired of the privacy settings overall, you can turn them off for all workbooks in the Query Options –> Security tab. The second arrow is pointing at the relevant option:
And how about that? Check out the first option… that one lets you avoid prompting about Native Database Queries. (I seriously have to question why a SELECT query should trip a warning.)
I’ve set it up to talk to a SQL Server in order to demo this, as it’s the big databases with query folding that cause us this issue. You might notice that not too many people provide free access to a SQL server since they cost money to run. For this reason, you need a login and password to refresh the solution and really see it work.
As it happens, if you’ve bought a copy of M is for Data Monkey, or attended our PowerQuery.Training workshop, then you’ve already got the credentials. (They are on page 66 of the book.) And if you haven’t… what are you waiting for?