The Dashboard Effect

Schema on Write vs. Schema on Read: What’s Faster for Your Business?

Brick Thompson, Jon Thompson, Caleb Ochs | Episode 142

In this episode of The Dashboard Effect Podcast, Brick and Landon break down the differences between schema on read and schema on write. They discuss when to use each approach, the pros and cons, and real-world applications for businesses of different sizes. If you're navigating the complexities of data lakes, data warehouses, or lakehouses, this episode will give you clarity.

Tune in to learn how these data strategies can impact your reporting, analytics, and overall business performance.

If you found this episode helpful, don’t forget to hit the like button and subscribe to our channel for more insights on data analytics, business intelligence, and strategies to help your business succeed. Turn on notifications so you never miss an episode!

Looking for smarter, faster ways to make data-driven decisions? At Blue Margin, we specialize in making data accessible and actionable for mid-market companies. Visit bluemargin.com to learn how we can help you unlock the full potential of your data.

Brick Thompson:

Welcome to The Dashboard Effect Podcast. I'm Brick Thompson.

Landon Ochs:

and I'm Landon Ochs.

Brick Thompson:

Hey Landon, good to see you. We wanted to talk about a topic today that's a little bit technical, but we'll try to explain it in a way that's not too complicated. Basically, it's the concept of schema on read versus schema on write, which already sounds complicated. Maybe we can define it. Let's start with that: what is schema on write? What does that mean?

Landon Ochs:

Yeah, so schema on write is the whole idea of storing your data in the format that you want it to be in, mainly for reporting.

Brick Thompson:

For reporting. So it's already got the relationships, the transforms, anything that needs to happen. The data is sitting in place, sort of the old data warehouse model where you've got a SQL Server, you've got the relationships, the tables. Everything's already set up.

Landon Ochs:

Yeah, yeah. With databases, you know, you define: this is my table, these are the columns I need, these are the constraints. So it's there, it's defined.
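
To make that concrete, here's a minimal sketch of schema on write in SQL Server-style T-SQL. The schema, table, and column names are made up for illustration; the point is that the structure and constraints are defined before any data lands, and the transforms happen as the data is written.

```sql
-- Schema on write: the table, types, keys, and constraints are defined up front.
CREATE TABLE dbo.FactSales (
    SaleID     INT            NOT NULL PRIMARY KEY,
    CustomerID INT            NOT NULL,
    SaleDate   DATE           NOT NULL,
    Amount     DECIMAL(18, 2) NOT NULL,
    CONSTRAINT FK_FactSales_DimCustomer
        FOREIGN KEY (CustomerID) REFERENCES dbo.DimCustomer (CustomerID)
);

-- The transforms happen on the way in, so the stored data is already report-ready.
INSERT INTO dbo.FactSales (SaleID, CustomerID, SaleDate, Amount)
SELECT
    src.sale_id,
    src.customer_id,
    CAST(src.sale_ts AS DATE),
    CAST(src.amount  AS DECIMAL(18, 2))
FROM staging.RawSales AS src;  -- hypothetical staging table for raw source data
```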

Brick Thompson:

Okay. And so then what is schema on read?

Landon Ochs:

Yeah, so schema on read is the whole idea of taking your data and putting, let's say, a SELECT statement on top of it, as an easy example. Your SELECT statement is what's actually going to transform your data and serve it out in the way that you want it. So you're not actually saving the result anywhere. You're just saving how to get the data into that format whenever you need it.

Brick Thompson:

Kind of, okay. So the easy example is you have a data lake. It's got raw data in there; it just sits there. You've got a semantic model defined that gets activated when you're running some BI reports or doing a query, something like that, and that is sort of the translation layer into the data lake. So the schema is being defined as you're reading it. Yeah, exactly. Okay. So that's sort of the modern way that we're doing a lot of stuff these days. There are still use cases where you want schema on write, so that the data is sitting there with the schema already defined. What typically would lead to that?
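
As a rough sketch of that translation layer: the raw files stay in the lake exactly as they landed, and a view applies the schema only when it's queried. The OPENROWSET flavor below is what an Azure Synapse-style serverless SQL pool would use (the episode mentions serverless SQL but doesn't name an engine, and the storage path, view, and column names here are hypothetical).

```sql
-- Schema on read: the parquet files in the lake are untouched; this view is just
-- the saved "recipe" for shaping them, applied every time someone queries it.
CREATE VIEW reporting.Sales AS
SELECT
    CAST(raw.sale_id     AS INT)            AS SaleID,
    CAST(raw.customer_id AS INT)            AS CustomerID,
    CAST(raw.sale_ts     AS DATE)           AS SaleDate,
    CAST(raw.amount      AS DECIMAL(18, 2)) AS Amount
FROM OPENROWSET(
    BULK 'https://examplelake.dfs.core.windows.net/raw/sales/*.parquet',  -- hypothetical path
    FORMAT = 'PARQUET'
) AS raw;
```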

Landon Ochs:

Yeah, definitely. I will say we haven't seen this too often with the size of clients we typically work with, but there's a prominent use case that comes to mind, where this particular firm had tens of billions of rows that they were dealing with. They also had six to seven different sources, and all of these sources served up data differently. They had different rules, different logic that needed to happen to get the insights they were hoping to get out of the data. So when we're talking that much data and that complex of transforms, it was just crazy trying to use the whole schema on read methodology. It just spun and spun.

Brick Thompson:

Okay, so yeah, you were asking the system basically to create conformed dimensions and all the stuff you need on read, and it just took a lot of compute. I'm sure you could scale the compute so it would perform well, but then that gets expensive. So you're just better off doing it on write, sort of the old-style, Kimball-style data warehouse. Okay, that makes sense. So when we're doing schema on read, typically for our clients, as he said, our size of clients, just so listeners know, a client of ours is typically somewhere in the $50 million revenue range up to maybe a billion or two billion in revenue. So a very mid-sized company, not a huge enterprise company. When we're setting up schema on read, we're typically doing that with serverless SQL, so users still feel like it's sort of the old model, like they're just dealing with tables with rows and relationships and so on.
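
From the report author's side, that layer really does feel like the old model. A hypothetical query against views like the one sketched above (reporting.Sales, and a similar reporting.Customer view, are assumptions for illustration, not real client objects):

```sql
-- Looks like ordinary tables with a relationship, but both "tables" are views
-- that resolve against raw lake files at the moment the query runs.
SELECT
    c.CustomerName,
    SUM(s.Amount) AS TotalSales
FROM reporting.Sales    AS s
JOIN reporting.Customer AS c
    ON c.CustomerID = s.CustomerID
GROUP BY c.CustomerName;
```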

Landon Ochs:

Yeah, exactly. They can go through, select the tables they want, select the columns they want, and in most cases it's about the same speed as if we had a physical table that they were using.

Brick Thompson:

Yeah, these tools have gotten so good. That's amazing. Okay, so for you, then, the decision about whether to do schema on write, having the data sitting there with a schema already defined, not going through a semantic layer (I guess there's an implicit semantic layer there), comes down to how much compute it's going to take to get to that schema. So if you have billions of rows and lots of different data sources, basically you're going to have to do lots of transforms along the way, and you might as well go ahead and store the result, so you're not redoing it every time a report accesses the data.
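
When the transforms get that heavy, the fix is essentially to run the same logic once on write and persist the result, instead of recomputing it on every read. A rough sketch, using a Spark-SQL-style CREATE TABLE AS SELECT (the exact statement varies by engine, e.g. CETAS in a serverless SQL pool, and these table names are made up):

```sql
-- Schema on write inside the lakehouse: run the expensive transform once,
-- say on a nightly schedule, and point reports at the stored result so the
-- work isn't redone every time a report refreshes.
CREATE TABLE curated.FactSalesConformed AS
SELECT
    s.SaleID,
    s.SaleDate,
    s.Amount,
    c.CustomerKey        -- conformed customer key resolved here, not at report time
FROM reporting.Sales    AS s
JOIN reporting.Customer AS c
    ON c.CustomerID = s.CustomerID;
```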

Landon Ochs:

Yeah, yeah. And I like to tell my team to simplify it even more: don't worry about the compute, worry about the time. It's easier, everybody can attach to it, and it can be hard to know how much compute you're using in these serverless environments. But you know, if we have a report that's taking five hours to refresh, ideally we'd want to be in 20 minutes or less. So that's usually...

Brick Thompson:

Yeah, okay. But the report itself is going to be much more performant than that.

Landon Ochs:

Oh, yeah. This is just getting the data into the report. Once it's stored in the report, it's going to be snappy.

Brick Thompson:

Very few seconds. Okay, all right, that's an interesting topic. I mean, we've been seeing this change over the last decade, going from where basically all data stores for reporting were data warehouses, Kimball-style or Inmon-style data warehouses, to now the data lake being the main form, and really the data lakehouse. So you've got the data lake, and then you've got these schema on read semantic layers ready to go.

Landon Ochs:

Yep, exactly. Yeah, it's actually been cool to see that. I will say, I was expecting to have to do that whole schema on write a lot more when we initially went with this design.

Brick Thompson:

Which design? The lakehouse?

Landon Ochs:

Yeah, so when we pivoted from data warehouses to lakehouses, I was pleasantly surprised, and I still am, at what we're able to do without that, you know, without doing it on write.

Brick Thompson:

Yep, exactly.

Landon Ochs:

And it's just faster, more efficient for our team, faster for our clients, because we're able to add columns, tweak logic. It's just way more flexible.

Brick Thompson:

Yeah, so fast. So you don't have to rewrite a bunch of ETLs and rewrite reports to deal with that.

Landon Ochs:

Exactly.

Brick Thompson:

Yeah, interesting. Okay, well, that's what I wanted to cover on this. Thanks.

Landon Ochs:

Great. Thank you. All right.