
The Dashboard Effect
NEW EPISODES EVERY OTHER THURSDAY!!!
The Dashboard Effect Podcast: Simplifying Data for Smarter Business Decisions
Welcome to The Dashboard Effect, the go-to podcast for mid-market businesses and private equity-backed companies looking to harness the power of data.
Hosted by Brick Thompson from Blue Margin, we demystify data analytics and business intelligence, offering practical insights and actionable strategies that drive accountability, performance, and growth.
From breaking down complex data concepts to sharing real-world success stories, we cover topics like Power BI, data lakes, dashboards, automation, and the latest trends shaping the future of business intelligence. Whether you're a CEO, operator, or BI professional, we’ll help you unlock the potential of your data for smarter, faster decisions.
Tune in to The Dashboard Effect and discover how the right data, at the right time, can transform your business.
Subscribe now and take the first step toward becoming a data-driven organization!
Learn More: BlueMargin.com
The Dashboard Effect
Mastering Delta Loads: Caleb's Best Practices, Tips, and Workarounds
Brick and Caleb get technical this week with a special focus on delta loading. They break down the concept of delta loading and highlight its benefits over traditional full loading techniques in terms of speed and resource efficiency. Caleb covers the typical obstacles faced when implementing delta loading, including issues with legacy systems and API limitations. He also shares strategies for effective delta loading, with practical tips and workarounds that you can apply directly to your operations.
Watch this episode on our YouTube channel.
Blue Margin increases enterprise value for PE-backed, mid-market companies by serving as their fractional data team. We advise on, build, and manage data platforms. Our strategy, proven with over 300 companies to-date, expands multiples through data transformation, as presented in our book, The Dashboard Effect.
Subscribe here to get more episodes of The Dashboard Effect podcast on your favorite podcast app.
Visit Blue Margin's library of additional BI resources.
Welcome to The Dashboard Effect Podcast. I'm Brick Thompson.
Caleb Ochs:I'm Caleb Ochs.
Brick Thompson:Caleb, our last two episodes have been pretty technical. Talking about Power BI data models. Talking about report performance, DAX measures... I think we'll stay on that theme and talk about data pipelines. And by pipelines, I mean, sort of what we used to refer to as ETL tools. Now, it's commonly an ELT, as you're pulling data into to a data lake and then building a semantic model and so on. But particularly wanted to talk about delta loading. And some of the challenges that happen when you're pulling data from an API.
Caleb Ochs:Yeah, ol' delta loads.
Brick Thompson:Let's define delta loads first.
Caleb Ochs:Yeah, good. Good point. So 'delta load' is when you're pulling only new and changed records from from your source. So that I guess the opposite of that, which I think is helpful, is a full load, which you're pulling every single record, every time you run your pipeline. It's not a great way to go because you would like your pipeline to run faster than that. In some cases. that's just not feasible, even. So only pulling new and changed records, so you can update your lake. And then you know, you have up to date information is best practice.
Brick Thompson:Yeah, for a lot of reasons. One, you can pull it a lot faster. You're moving a lot less data. You can update your lake more often. Because you know, if if it takes you three hours to do a full load, you're not going to do that very often - once or twice a day. If it takes you three or four minutes to do a delta load - great, you can do that every hour or every half hour, whenever you want. So, there's other reasons to do delta loads. And that's to keep track of changes. So for example, if you have a dimension, let's say, a salesperson dimension a salespeople are assigned to a region. And let's say a salesperson gets moved to a different region. If you're just doing full loads, and somehow not keeping track of when that salesperson was in the previous region, when you do historical reporting, it'll be wrong. Because it'll only know about the current region they're in.
Caleb Ochs:Yeah, it kind of simplifies that process, right? Yeah, if you if you're doing a full load, technically you'd still do it. But you're going to have to basically do a full comparison of the data you have versus the data you just pulled and kind of do it that way.
Brick Thompson:Find your own delta.
Caleb Ochs:Right. Yeah, find the differences. Which is not awesome. So yeah, if you're just pulling the changes then it's easy. You know that this record changed.
Brick Thompson:Why are you prevented from always just doing a delta load? What causes problems?
Caleb Ochs:Yeah, so there can be a few different ways. One, They won't report they won't let you say hey, just let me pull that's becoming a lot less common with newer systems and stuff. But some of the old ERP systems just do not have a modified date or any way to identify a record that changed. things that have changed since the certain date timestamp. Really, you need a modified date, you need to know, when was the last time this record was touched? So then basically, the way that you do it is you say,"Okay, last time our process ran, it started at this time." So you capture that, and then the next time your process runs, you say, "Give me everything that was updated after the last time I ran." So that's the way to do a delta load. Some of the old systems just don't have timestamps on them. They don't keep track of when records were Right, and that is still pretty common. You might have to... touched. Like I said, that's becoming less and less of a problem, but we still run into it here and there. Another common thing that is a little bit more common is there will be like an 'updated date' on a record, but your detail table may not have it. Now, you can still do it. So in this in this scenario I'm talking about like, if you had an Invoice Header, one row for an Invoice Header, and then you have like five like some APIs (this is less common) some API's will only let lines on your invoice. Those are typically stored in two separate tables. So Invoice Header, you might be able to say okay, this invoice was adjusted, but really it was one of the lines on the system. So you can still kind of do a pseudo delta, but then you'd have to say, "Okay, all of these invoices changed, so give me all the lines for every single one of these invoices." It's not awesome. The other thing that can come up is your APIs. Some APIs just don't give you a good way to do delta loads. you pull data if you have IDs. And then there's a separate endpoint to pull all of the IDs. So you have to like pull all the IDs, then you have to loop through and pull every single record for every single ID. It can be kind of challenging, however, you know back to the delta load thing that can be a big problem from an API, just not having the ability to do it.
Brick Thompson:So if you don't have a good API that supports it. Seems like people are gonna need to support that more and more. The solution, the workaround, of doing a full load and then doing your own compare to get the delta load just seems really inefficient. And in fact, the people that are providing the API, they don't want people to having to pull full loads. It's a big tax on their system.
Caleb Ochs:Yeah, I think kind of what we're seeing right now is the same type of thing that we saw with ERPs and stuff. Where it really wasn't a big use case, at first, like APIss are not built for data extraction. They are built for third party developers to build something on top of their...
Brick Thompson:Yeah.
Caleb Ochs:But people are using it more and more for that. So to your point, I mean, it's in their best interest to give you a easy way to pull changes.
Brick Thompson:Yeah, interesting. There's also APIs that just simply don't let you get at detailed data to even be able to do a delta load. Really, you're pulling it at a layer that's so high that I mean, that's its own its own class of problem.
Caleb Ochs:Yeah, right. Right. Right.
Brick Thompson:All right. What else about delta loads? Any any suggestions for people how to deal with it when they get one of these bad APIs?
Caleb Ochs:Yeah, well, I mean, sometimes you just can't. I guess you got to try and make every effort you can to do it. It's gonna pay off in the long run. I guess the other topic that I did want to touch on was that topic of deletions in your data. So those can be a real challenge.
Brick Thompson:If they just disappear,
Caleb Ochs:If they just disappear. So if you're if you're delta loading, so that's where delta loading can actually cause you a problem, if you don't handle deletions. Because you can't assume down the line because you're only getting changed records. You're not doing a full load, and you don't know if it disappeared.
Brick Thompson:If they're not giving you hey, this this record changed, and it changed by becoming deleted. If it's just disappearing, you don't know,
Caleb Ochs:Right, you don't know if it changed, or if it didn't change, or if it got deleted, right? So the way that to handle that is, at least the way that we handle it, because it works in every scenario, is to pull the primary keys from every table that you need to reconcile deletions from. So like a customer dimension, you probably don't
Brick Thompson:Okay, need to actually, I would recommend you do not do deletions. Because you might mess up your integrity if a customer is deleted in the records in the fact table is being used somewhere. So you in your dimension tables, typically you want to keep everything. So it's not every table that you have to do this for. But like your invoice table, for example, you want to pull all of the invoice IDs -- just the ID column. Don't pull any of the information. So you can tell like, okay, is every single ID still in the data set? And then you can reconcile yours with that. Okay, so that pull of just the IDs column is a lot cheaper than pulling everything.
Caleb Ochs:Exactly.
Brick Thompson:And you can reconcile on that. Oh, that's cool. Yeah, that's a good workaround. Yeah.
Caleb Ochs:Yeah, it's still not ideal. The ideal case is if the But yeah, most systems do not have that. Maybe system could tell you, add a deletions table or something that you could pull. we'll see that change.
Brick Thompson:Yeah. Any other cool tricks like that?
Caleb Ochs:I think that pretty much covers it. If you can do that, if you can figure out how to use delta loads and then do your deletions. You're in a pretty good spot.
Brick Thompson:Yeah, okay. All right. Good topic. Thank you.
Caleb Ochs:Yeah. See ya.