I think I’ve come up with a reasonable UPSERT approach that strikes a balance between performance and ease of coding. For now, this post is just an outline; I’ll convert it to a proper post with code samples if it actually pans out. I have tested it by UPSERTing hundreds of records into a database containing over 2 million records, and it is still very fast (the db select statement took hundreds of milliseconds and the update statements were faster). So it might not be too bad for some UPSERT use cases.

Outline:

  1. widgetsInput = list of entities to be inserted, or updated if they already exist (UPSERT). Could be parsed from an import file, for example. Use the actual entity class from the model.
  2. widgetsInputIds<int/string> = list of unique identifiers from widgetsInput that will be used to match records in the database. Would be easiest if this were an integer. May or may not be the actual PK from the database. (In my example it’s a list of unique bank transaction Ids, which are a non-PK unique index in the associated table.)
  3. Start a database transaction with isolation level REPEATABLE READ or SERIALIZABLE to avoid issues with other updates being made to records between the initial select and the update/insert statements. Both levels have drawbacks; read up on them. TL;DR:
    • If REPEATABLE READ, the update statements are guaranteed to be safe, but the insert statements could cause PK/index conflicts if another script has swooped in and inserted a new record between the select and when the insert is committed. Also, an update might be chosen as a deadlock victim if another script updates the same data during the transaction. So use a try/catch to handle these scenarios.
    • If SERIALIZABLE, the update and insert statements are guaranteed to be safe, but if the range of locks is massive or the database is heavily used, the locking could be deadly.
  4. widgetsExist = db-selected list of existing entities whose identifiers are in widgetsInputIds. Use ToList() so the results are materialized locally. The list will contain (worst case) as many db entities as there are records in the input file. Don’t load related data unless absolutely required.
  5. widgetsExistIds<int/string> = list of identifiers from widgetsExist
  6. widgetsInputNew = list of entities from widgetsInput whose identifier is NOT in widgetsExistIds
  7. context.Widgets.AddRange(widgetsInputNew)
  8. widgetsInputExist = list of entities from widgetsInput whose identifier IS in widgetsExistIds
  9. Iterate through widgetsInputExist and:
    1. widgetToUpdate = the widget from widgetsExist that matches widgetsInputExist[i]
    2. Update widgetToUpdate’s properties to match widgetsInputExist[i]
    3. Optionally check if the entity was actually updated (more on this here) and potentially update other columns as well (LastModifiedDate, BatchUpdateId, etc.)
  10. context.SaveChanges();
  11. Commit the transaction
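The outline above can be sketched roughly as follows. This is a hypothetical illustration, not the final implementation: the `Widget` entity, its `TransactionId` identifier, the `Name`/`Amount` properties, and the `AppDbContext` class are all made-up names standing in for your own model.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using Microsoft.EntityFrameworkCore;

public class Widget
{
    public int Id { get; set; }                   // database PK
    public string TransactionId { get; set; }     // non-PK unique identifier used for matching
    public string Name { get; set; }
    public decimal Amount { get; set; }
}

public class AppDbContext : DbContext
{
    public DbSet<Widget> Widgets => Set<Widget>();
}

public static class WidgetUpserter
{
    public static void Upsert(AppDbContext context, List<Widget> widgetsInput)
    {
        // Step 2: identifiers used to match existing rows
        var widgetsInputIds = widgetsInput.Select(w => w.TransactionId).ToList();

        // Step 3: wrap the select + insert/update in one transaction
        using var tx = context.Database.BeginTransaction(IsolationLevel.RepeatableRead);
        try
        {
            // Step 4: one query for all matching rows, materialized locally
            var widgetsExist = context.Widgets
                .Where(w => widgetsInputIds.Contains(w.TransactionId))
                .ToList();

            // Step 5
            var widgetsExistIds = widgetsExist.Select(w => w.TransactionId).ToHashSet();

            // Steps 6–7: inputs with no match in the db are inserts
            var widgetsInputNew = widgetsInput
                .Where(w => !widgetsExistIds.Contains(w.TransactionId))
                .ToList();
            context.Widgets.AddRange(widgetsInputNew);

            // Steps 8–9: inputs that matched are updates; copy values onto the tracked entities
            var widgetsInputExist = widgetsInput
                .Where(w => widgetsExistIds.Contains(w.TransactionId));
            foreach (var input in widgetsInputExist)
            {
                var widgetToUpdate = widgetsExist.First(w => w.TransactionId == input.TransactionId);
                widgetToUpdate.Name = input.Name;       // copy each updatable property
                widgetToUpdate.Amount = input.Amount;
            }

            // Steps 10–11
            context.SaveChanges();
            tx.Commit();
        }
        catch (DbUpdateException)
        {
            // Under REPEATABLE READ a concurrent insert can still cause a
            // unique-index violation here; roll back, then retry or rethrow.
            tx.Rollback();
            throw;
        }
    }
}
```

Note that the new-vs-existing split (steps 6 and 8) happens entirely in local memory against the `HashSet` of matched identifiers, so only the step-4 select touches the database before `SaveChanges()`.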

Some Pros:

  • The insertion of new records works just like it always does in EF Core 3: a single merge statement (for 3+ records), which only gets split into multiple statements when dealing with lots of records (each statement will handle hundreds of inserts)
  • Update statements are only generated for entities where a value actually changed; overwriting a value with an identical value is ignored, which saves you from writing code to compare values before updating entities (just update ’em all)
  • The update statements will ninja each record by only updating columns that actually changed, not all columns in the record (shorter & more efficient update queries)
  • You can leverage EF’s entity state and know which entities will be considered “updated” before saving the context, allowing you to do extra work to that record (e.g. set update date, update related entities, etc.)
  • Only one query to the database for matching existing entities. The rest of the logic is done in local memory (e.g. lists of new vs existing, ids, etc.)
  • Avoids having to drop down to ADO.NET
  • Avoids having to use the dangerous AddOrUpdate() Migrations method
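As a sketch of that entity-state pro: before calling `SaveChanges()`, EF’s change tracker already knows which entities it considers modified, so you can stamp extra columns on just those. The `Widget` entity and its `LastModifiedDate` property are hypothetical names for illustration.

```csharp
// Inside the transaction, after copying input values onto tracked entities
// but before SaveChanges(): find only the entities EF will actually update.
var modified = context.ChangeTracker.Entries<Widget>()
    .Where(e => e.State == EntityState.Modified);

foreach (var entry in modified)
{
    // Stamp audit columns only on records that genuinely changed
    entry.Entity.LastModifiedDate = DateTime.UtcNow;
}
```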

Some Cons:

  • The initial query for existing values could be expensive, since you’re searching a table for values that exist in a potentially long list. But if the column being searched is indexed, the search should be reasonably efficient.
  • A separate update statement is generated for every entity that is updated (same connection so still fast, just not perfect)

Some Warnings:

  • This might perform terribly if the comparison in the step 4 query is done against a non-indexed field, particularly a long string.
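One way to guard against that warning is to make sure the lookup column carries a unique index, which EF Core can declare via the fluent API in `OnModelCreating` (the `Widget`/`TransactionId` names here are illustrative placeholders for your own model):

```csharp
// In the DbContext: a unique index on the matching column lets the
// step-4 query seek the index instead of scanning the whole table.
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<Widget>()
        .HasIndex(w => w.TransactionId)
        .IsUnique();
}
```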

Note to self: See a test implementation of this in Program.cs in my budget app