
Improving Entity Framework performance by tracking changes manually

Introduction

While working on a recent personal project, I had the not-too-unfamiliar pleasure of fighting Entity Framework to try and get some half-decent performance out of a console application that I was using to bulk add and update records based on some external data feeds.

It’s often said that EF isn’t really geared towards bulk database operations and that other approaches should be considered (ADO.NET, Integration Services, etc.), but there is naturally a trade-off between time spent implementing a solution and the performance gained. Most people (myself included) choose EF not only because it’s familiar (Integration Services? I’m no DBA!) but because they’re already using it elsewhere in the solution (in my case, an MVC web application). Going back to ADO.NET feels like re-inventing the wheel when, with some tweaking, Entity Framework can be made to run reasonably quickly if you know how to fine-tune your implementation.

The following article should provide some useful guidance for anyone needing to add or update large-ish data sets using Entity Framework. I’m using EF 6 but I believe the principles described are applicable to all recent versions.

How not to do it

Let’s start by examining the typical approach you would take if starting such a project from scratch. I’ll assume you’ve already set up EF and have a DbContext instance ready to add and load/update data with. In my scenario, I’m taking a source data set (in this case loaded from a large XML file), looping through the element collection and creating an instance of an entity class from the data extracted from each node. I then check whether a matching entity already exists in my database (based on some key properties of the entity); if so, I grab it and update it, otherwise I insert the entity as new. Finally, I save the changes. In pseudo-code the process looks like this:

Load data from XML file into memory. 
Select elements from XML. 
For each element in selected elements: 
  Create entity from element nodes/attributes. 
  Check if matching entity exists on DbContext entity set. 
  If match found update entity properties, else add entity to DbContext. 
End For 
Save changes.

In this example, you would probably start by creating an instance of your DbContext up-front and using the same instance the whole way through. This certainly works, but as I discovered, it’s anything but fast. What’s interesting is that, after adding console output to show each record being processed, it’s obvious that the program gets slower and slower as each successive record is saved. Why is this?
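Concretely, the naive version of that loop might look something like the following sketch. MyDbContext, MyEntity, the Key property and the "record"/"key" XML names are all placeholders for whatever your own model and feed look like:

```csharp
using System.Linq;
using System.Xml.Linq;

// A sketch of the naive single-context approach described above.
// MyDbContext, MyEntity, Key and "record"/"key" are placeholder names.
using (var context = new MyDbContext())
{
    var document = XDocument.Load("feed.xml");

    foreach (var element in document.Descendants("record"))
    {
        var key = (string)element.Attribute("key");

        // One SELECT per iteration; every entity loaded here is tracked
        var entity = context.MyEntities.SingleOrDefault(x => x.Key == key);
        if (entity == null)
        {
            entity = new MyEntity { Key = key };
            context.MyEntities.Add(entity);
        }

        // ...update the remaining entity properties from the element...

        // Each save triggers change detection over the whole tracked set,
        // so this gets slower with every record processed.
        context.SaveChanges();
    }
}
```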

The answer lies in EF’s change tracking capabilities: whenever you call SaveChanges on the DbContext, Entity Framework first needs to work out what exactly it has to do in order to synchronise your local object instances with what’s currently stored in the database. I won’t go into detail about how this process works (there are plenty of other articles online about it) but it’s fairly obvious that the more objects the DbContext is aware of, the more work it has to do each time you try and save changes. Armed with this knowledge, we can move forward and try to improve matters by changing the way we interact with the DbContext.
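You can observe this slowdown directly by timing each save. A minimal sketch, assuming a MyDbContext with a MyEntities set and a string Key property (placeholder names, not from a real schema):

```csharp
using System;
using System.Diagnostics;

// Sketch: watch SaveChanges slow down as the tracked entity set grows.
// MyDbContext, MyEntities and Key are placeholder names.
using (var context = new MyDbContext())
{
    var stopwatch = new Stopwatch();

    for (var i = 0; i < 1000; i++)
    {
        context.MyEntities.Add(new MyEntity { Key = "key-" + i });

        stopwatch.Restart();
        // DetectChanges runs here and scans every entity the context tracks
        context.SaveChanges();
        stopwatch.Stop();

        Console.WriteLine("Save {0}: {1} ms", i, stopwatch.ElapsedMilliseconds);
    }
}
```

The per-save time printed should grow steadily as the loop progresses, which is exactly the symptom described above.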

Beware of multiple DbContext instances

The first and perhaps the most obvious solution to rectify our earlier problem is simply creating a new DbContext instance inside our main execution loop:

// Load XML 
foreach (var element in xmlElements) 
{ 
  using (var context = new MyDbContext()) 
  { 
    // Rest of code 
    context.SaveChanges(); 
  } 
}

On first inspection it might seem like this has solved all our problems – it’s certainly a lot faster than the first attempt, due in no small part to the fact that creating new instances of DbContext objects was intentionally made as cheap (i.e. fast) as possible. But when our example becomes a little more complicated we can see this approach has a major flaw which renders it pretty unusable.

Let’s imagine we want to speed things up a bit by pre-loading our table of existing entities from the database, so we don’t incur the cost of performing a SELECT query for each iteration of our loop. We can easily grab the entire list of entities and store them in a list or array:

// Create a context instance 
var myDbContext = new MyDbContext(); 
// Preload existing data 
var existingEntities = myDbContext.MyEntities.ToList();

But what happens when we combine this with our new solution and need to update an existing entity, then save the changes back to the data store?

// Preload data 
var context = new MyDbContext(); 
var existingEntities = context.MyEntities.ToList(); 

// Load XML... 
foreach (var element in xmlElements) 
{ 
  context = new MyDbContext(); 
  if (existingEntities.Any(x => x.Key == element.Value)) 
  { 
    var entity = existingEntities.Single(x => x.Key == element.Value); 
    // Update entity properties with data from xml... 
  } 
  else 
  { 
    // Code to handle new entities... 
  } 

  context.SaveChanges(); 
}

At first, I thought I’d found the Holy Grail of solutions as this seemed to execute MUCH faster, but when I checked my database table I found that none of my records had actually been updated! Anyone familiar with EF will see the problem straight away: as soon as we overwrite the context object with a new instance of MyDbContext, Entity Framework loses track of all the existing entities we loaded at the start of our method. When we update the properties and call SaveChanges(), nothing actually happens as the current context instance has no idea the entity even exists!
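You can verify this with EF’s own state tracking: an entity loaded through one context is simply Detached from the point of view of a fresh context, so the new context has nothing to save. A small sketch (MyDbContext and MyEntities are placeholder names):

```csharp
using System;
using System.Linq;

// Sketch: an entity loaded by one context is unknown to another.
using (var first = new MyDbContext())
using (var second = new MyDbContext())
{
    var entity = first.MyEntities.First();

    // Tracked by the context that loaded it...
    Console.WriteLine(first.Entry(entity).State);  // Unchanged

    // ...but completely unknown to the new one, so calling SaveChanges
    // on 'second' would do nothing for this entity.
    Console.WriteLine(second.Entry(entity).State); // Detached
}
```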

Where do we go from here?

Now we have a sticky problem: we can’t create new contexts each time we loop through our source data set or we lose track of all entities loaded via that context instance, and it appears we can’t use the same context all the way through without the incremental slowdown we saw originally…or can we?

The answer lies in the change tracking feature we discussed at the start of the article. If we allow EF to manage change tracking for us, it wastes a lot of time checking for modifications to objects that we know won’t have changed at all. By taking control of this ourselves, we can get the performance boost of pre-loading our existing data and still have EF update our entities in a fast and efficient manner. Here’s how we do it:

var context = new MyDbContext();

// Disable automatic change tracking 
context.Configuration.AutoDetectChangesEnabled = false; 

// Preload existing data 
var existingEntities = context.MyEntities.ToList();

// Load XML... 
foreach (var element in xmlElements) 
{
  if (existingEntities.Any(x => x.Key == element.Value)) 
  { 
    var entity = existingEntities.Single(x => x.Key == element.Value); 
    // Update entity properties with data from xml... 
    // Tell EF our entity has been modified 
    context.Entry(entity).State = EntityState.Modified; 
  } 
  else 
  { 
    // Code to handle new entities... 
  } 

  context.SaveChanges(); 
}

The important changes are the two new statements: disabling automatic change detection before the loop, and explicitly marking each entity as modified inside it. context.Configuration.AutoDetectChangesEnabled = false tells EF that we don’t want it to automatically perform change tracking on entities loaded from this context. context.Entry(entity).State = EntityState.Modified tells EF that this entity is ‘dirty’, that is, it has been modified and its changes should be persisted on the next save.

Summary

By disabling automatic change tracking within Entity Framework and managing object state ourselves, we can achieve usable performance in situations where many records are being processed as part of a batch or long-running operation.

Note that in most (if not all) cases, pure ADO.NET would still offer better performance, but at the cost of additional time spent writing stored procedures or parameterized insert/update statements, not to mention the headache of maintaining these each time the underlying schema changes. With EF migrations we can easily change our database schema and continue to use the existing code with little or no interruption.

Caution

It should be noted that pre-loading an entire entity set is not recommended, due to the memory overhead of keeping all of those objects around for the duration of the program. My example is somewhat contrived, to illustrate the common pitfall of re-creating DbContexts and assuming they’ll magically know about entities we loaded elsewhere.
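If the full entity set really is too large to hold in memory, one compromise (a sketch under assumed placeholder names, not part of the original approach) is to preload only the key values and fetch individual entities on demand:

```csharp
using System.Collections.Generic;
using System.Data.Entity;
using System.Linq;

// Sketch: preload keys only, not full entities, to reduce memory usage.
// MyDbContext, MyEntities, Key and xmlElements are placeholder names.
using (var context = new MyDbContext())
{
    context.Configuration.AutoDetectChangesEnabled = false;

    // Materialise just the key column; a HashSet gives O(1) lookups in the loop
    var existingKeys = new HashSet<string>(
        context.MyEntities.Select(x => x.Key));

    foreach (var element in xmlElements)
    {
        var key = (string)element.Attribute("key");
        if (existingKeys.Contains(key))
        {
            // Fetch the single matching entity on demand and update it
            var entity = context.MyEntities.Single(x => x.Key == key);
            // ...update properties from the element...
            context.Entry(entity).State = EntityState.Modified;
        }
        else
        {
            // ...add as a new entity...
        }
    }

    context.SaveChanges();
}
```

This trades the per-record SELECT back in for updates, but avoids keeping the whole table in memory; which balance is right depends on the size of your data set.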

Last modified: 16/02/2020