Jan 18, 2026 · 5 min read

OAuth/SSO integration: the part of fullstack nobody warned me about

Enterprise SSO sounds like a config job. It isn't. Three weeks, eight tenants, and a list of mistakes I keep paying for.

The first time I shipped enterprise SSO for a customer, I gave my project manager an estimate of "three or four days, depending on the IdP." It took three weeks. Eight customers. Two production incidents. One quiet apology to a team in São Paulo whose Monday morning I broke.

This is the post-mortem I never wrote at the time, packaged as the warning I wish someone had given me.

The protocol is the easy part

If you've only read the OAuth 2.0 RFC, you'd reasonably conclude this is a solved problem. There's a chart. The boxes have arrows. The arrows are labeled. The library you npm install claims to handle it.

The library does handle the protocol. The protocol is not the problem.

The problem is everything around it:

Clock skew. The customer's IdP server clock is 47 seconds ahead of yours, and assertions are dying as "future-dated."
Certificate rotation. Their cert expired six hours ago and nobody noticed because the previous one had two years of runway.
Attribute mapping. You're told their email arrives as nameid. Their nameid is a UUID nobody uses. Their actual email is in urn:oid:0.9.2342.19200300.100.1.3.
Tenant isolation. Two customers have an employee with the same email address, because both companies bought from a third company that you also have as a customer.
The customer's IT person retired. The IdP was configured in 2019. Nobody has the admin password. The export option is greyed out.

None of this is in the RFC. All of it is in production.
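To make the clock-skew item concrete, here is a minimal sketch of a validity-window check with a tolerance. The AssertionWindow shape, the checkWindow name, and the 60-second default are all assumptions for illustration. Many SAML libraries ship their own skew setting, and that is what you'd normally configure instead of hand-rolling this.

```typescript
// Hypothetical shape: the two timestamps an assertion's validity conditions carry.
interface AssertionWindow {
  notBefore: Date;
  notOnOrAfter: Date;
}

type WindowCheck = "ok" | "not_yet_valid" | "expired";

// Accept assertions whose window is off by at most `skewMs` in either direction.
function checkWindow(
  w: AssertionWindow,
  now: Date,
  skewMs: number = 60_000, // assumed tolerance; tune it per tenant
): WindowCheck {
  const t = now.getTime();
  // "Future-dated": still before notBefore even after granting the tolerance.
  if (t + skewMs < w.notBefore.getTime()) return "not_yet_valid";
  // Expired: at or past notOnOrAfter even after granting the tolerance.
  if (t - skewMs >= w.notOnOrAfter.getTime()) return "expired";
  return "ok";
}
```

With an IdP clock 47 seconds ahead, a zero tolerance rejects the assertion as not yet valid, while the 60-second default accepts it. Keep the tolerance small: every second of slack is also a second an expired assertion stays usable.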

What I'd tell past me (a checklist)

I keep a list in my notes from that period. Here it is, cleaned up.

Before you wire the first tenant

The number of decisions you can't reverse later is higher than you think. Pick wisely on day one.

1. Decide whether email is your identity, before anything else

You will be tempted to use the email address as the user identity, because that's what every spec doc shows. Don't, unless you've earned the right.

What happens later:

A customer renames their domain (acme.com → acmegroup.com) and now every user looks new.
A customer migrates IdPs and the new IdP exposes a different attribute as the "primary" email.
An employee changes their email when they get married. Their account is now orphaned.

The safer pattern: identity = (tenant_id, subject_id), where subject_id is whatever immutable identifier the IdP swears is stable for that user. Email is a display attribute that you update, not a key you join on.

I learned this on tenant #6, after I'd already shipped it the wrong way for tenants #1 through #5. The migration was not fun.
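Here is a rough sketch of the composite-key pattern described above, using an in-memory map as a stand-in for a users table with a unique index on the tenant and subject pair. UserStore and upsertOnLogin are hypothetical names, not anything from a real library.

```typescript
interface User {
  tenantId: string;
  subjectId: string; // whatever the IdP guarantees is stable for this user
  email: string;     // display attribute only, never a join key
}

// In-memory stand-in for a table with a unique (tenant_id, subject_id) index.
class UserStore {
  private byKey = new Map<string, User>();

  private static key(tenantId: string, subjectId: string): string {
    // JSON encoding keeps ("a", "b:c") and ("a:b", "c") distinct.
    return JSON.stringify([tenantId, subjectId]);
  }

  // Find-or-create on the composite key; refresh the email on every login.
  upsertOnLogin(tenantId: string, subjectId: string, email: string): User {
    const k = UserStore.key(tenantId, subjectId);
    const existing = this.byKey.get(k);
    if (existing) {
      existing.email = email; // a renamed domain or a new surname lands here
      return existing;
    }
    const created: User = { tenantId, subjectId, email };
    this.byKey.set(k, created);
    return created;
  }

  get size(): number {
    return this.byKey.size;
  }
}
```

A domain rename then updates the email on the existing row instead of minting a new user, and two tenants that share an email address stay two separate users because the tenant is part of the key.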

2. Build the unhappy-path UI first

The happy path is redirect → callback → set cookie → done. That part is forty lines of code and it works on day one.

The unhappy path is everything else:

The IdP returned a 500. You want to show the user a real error, not an infinite loop back to the same redirect.
The assertion was valid but the attribute mapping is missing. You want to fall back to manual input and page the customer's admin.
The tenant is configured but disabled. Show "your account is suspended, contact billing@" — not a generic 401.
The user's IdP session is alive but their app session expired. Re-establish the app session without the IdP prompting the user for credentials again, or your help desk will get "why does it sign me in but then immediately sign me out" tickets.

I built the unhappy paths last. By the time I got to them, I'd shipped two of the most embarrassing tickets of my career.
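One way to force yourself through the unhappy paths early is to model them as a closed set and make the compiler complain when one is unhandled. A minimal sketch, assuming the four failure cases listed above; the type names, action names, and message wording are all invented for illustration.

```typescript
type LoginFailure =
  | { kind: "idp_error"; status: number }
  | { kind: "missing_attribute"; attribute: string }
  | { kind: "tenant_disabled" }
  | { kind: "app_session_expired" };

type NextStep =
  | { action: "show_error"; message: string }
  | { action: "manual_input"; notifyAdmin: true; message: string }
  | { action: "silent_reauth" };

function nextStepFor(f: LoginFailure): NextStep {
  switch (f.kind) {
    case "idp_error":
      // A real error page, not another redirect to the same IdP.
      return {
        action: "show_error",
        message: `Your identity provider returned an error (${f.status}). Please try again later.`,
      };
    case "missing_attribute":
      // Valid assertion, unusable mapping: let the user continue, page their admin.
      return {
        action: "manual_input",
        notifyAdmin: true,
        message: `We could not read "${f.attribute}" from your sign-in. Please enter it manually.`,
      };
    case "tenant_disabled":
      return {
        action: "show_error",
        message: "Your account is suspended. Please contact billing.",
      };
    case "app_session_expired":
      return { action: "silent_reauth" };
    default: {
      // Adding a new failure kind without handling it is a compile error here.
      const unhandled: never = f;
      throw new Error(`Unhandled login failure: ${JSON.stringify(unhandled)}`);
    }
  }
}
```

The point is less the mapping itself than the never check: a fifth failure mode cannot ship without someone deciding what the user sees.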

3. Treat metadata expiry as a scheduled outage

IdP metadata documents have a validUntil field. They expire. When they expire, every login for that tenant fails.

I had eight tenants. Three had metadata that expired within the same month, but on different days. I didn't have a job to refresh them. I didn't have a job to alert me. I had a Sev-1 incident the morning the first one died.

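// the cron job I should have written on day one
// (db, fetchIdpMetadata, and pageOps are this app's own helpers;
// fresh.validUntil is assumed to be a Date)
import { addDays } from "date-fns";

async function refreshTenantMetadata(tenantId: string) {
  const tenant = await db.tenant.findUnique({ where: { id: tenantId } });
  if (!tenant) throw new Error(`Unknown tenant ${tenantId}`);

  const fresh = await fetchIdpMetadata(tenant.metadataUrl);

  // page when the freshly fetched document expires within 14 days
  if (fresh.validUntil < addDays(new Date(), 14)) {
    await pageOps(
      `Tenant ${tenant.slug} metadata expires ${fresh.validUntil.toISOString()}, ` +
      `please coordinate refresh with their IT.`,
    );
  }

  // store the fresh copy either way
  await db.tenant.update({
    where: { id: tenantId },
    data: { metadataXml: fresh.xml, validUntil: fresh.validUntil },
  });
}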

Page yourself at least two weeks before expiry. That's how long it takes a customer's IT team to schedule, review, sign, and re-export new metadata. I learned this number by paying it three times.

4. The customer's "test environment" is a fiction

When you ask the customer for an IdP sandbox to test against, you will get one of these answers:

"We don't have one."
"We have one, but it's behind the VPN."
"We have one, but the admin who set it up has left."
"We have one, but its certs expired six months ago."

In every case, the answer is: plan to test on their production IdP during a scheduled window, on a single test account they create for you.

This is not a workaround. This is the actual process. Schedule the integration call. Bring a checklist. Test five scenarios end-to-end in that one window:

Successful login.
Successful logout (often the broken one).
Wrong-password login (you want their error UI, not yours).
Disabled account.
Re-login after session expiry.

If you've already shipped the unhappy-path UI from rule #2, the checklist takes ten minutes. If you haven't, the checklist takes a week and the customer hates you.

5. Logging is not "nice to have"

Log every assertion that enters or leaves your system with:

tenant_id
request_id
subject_id (hashed if you must)
the issuer
the timestamp the assertion claims
the timestamp you observed

Don't log the assertion body itself in plaintext. Do log enough to answer "why did user X fail to log in on day Y at time Z" without asking them to reproduce. They will not reproduce. They will switch to a competitor who logs.
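The fields listed above fit in one small structured record. A minimal sketch, assuming Node's built-in crypto module for the hash; the function name, the input shape, and the extra skew_ms field are my own additions for illustration.

```typescript
import { createHash } from "node:crypto";

interface AssertionLogEntry {
  tenant_id: string;
  request_id: string;
  subject_id_hash: string; // SHA-256 hex digest, never the raw subject ID
  issuer: string;
  claimed_at: string;      // timestamp the assertion claims (ISO 8601)
  observed_at: string;     // timestamp we observed (ISO 8601)
  skew_ms: number;         // observed minus claimed; negative means the IdP is ahead
}

function buildAssertionLogEntry(input: {
  tenantId: string;
  requestId: string;
  subjectId: string;
  issuer: string;
  claimedAt: Date;
  observedAt: Date;
}): AssertionLogEntry {
  // Hash tenant and subject together so equal subject IDs in different
  // tenants do not produce the same digest.
  const subject_id_hash = createHash("sha256")
    .update(`${input.tenantId}\u0000${input.subjectId}`)
    .digest("hex");

  return {
    tenant_id: input.tenantId,
    request_id: input.requestId,
    subject_id_hash,
    issuer: input.issuer,
    claimed_at: input.claimedAt.toISOString(),
    observed_at: input.observedAt.toISOString(),
    skew_ms: input.observedAt.getTime() - input.claimedAt.getTime(),
  };
}
```

Storing the difference between the two timestamps is a convenience: a tenant whose skew_ms drifts toward your tolerance shows up in a dashboard before it shows up as failed logins. A plain hash is pseudonymization rather than strong protection, so treat these logs as sensitive anyway.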

I had a bug for two months where one tenant's users hit a 1-in-50 login failure. I couldn't reproduce it. I had no logs. I shipped logging, waited a week, and found it in twenty minutes — they had a load balancer in front of their IdP that occasionally rewrote the RelayState parameter in transit. I would never have guessed.
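In hindsight, that particular bug is cheap to detect directly: remember the RelayState you sent and compare it with what comes back. A minimal sketch, assuming the callback can be tied to the original request by its request ID (for example, via the response's InResponseTo value) and using an in-memory map where production code would use a shared store with expiry. All names here are hypothetical.

```typescript
// Hypothetical in-memory record of RelayState values we issued, keyed by request ID.
const issuedRelayStates = new Map<string, string>();

function rememberRelayState(requestId: string, relayState: string): void {
  issuedRelayStates.set(requestId, relayState);
}

type RelayStateResult =
  | { ok: true }
  | { ok: false; reason: "unknown_request" }
  | { ok: false; reason: "relay_state_mismatch"; expected: string; received: string };

function checkRelayState(requestId: string, received: string): RelayStateResult {
  const expected = issuedRelayStates.get(requestId);
  if (expected === undefined) return { ok: false, reason: "unknown_request" };

  issuedRelayStates.delete(requestId); // one-shot: each request is checked once

  if (expected !== received) {
    // Log both values: the difference is the clue to what rewrote it.
    return { ok: false, reason: "relay_state_mismatch", expected, received };
  }
  return { ok: true };
}
```

A mismatch logged with both values would have pointed at a rewrite in transit on the first failure instead of after two months.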

The unglamorous summary

Enterprise SSO isn't a config job. It's a customer-success job that happens to involve protocols. Most of the work isn't writing code — it is:

talking to a customer's IT person you've never met,
about software they configured years ago,
to fix a problem they don't think is their problem,
on a deadline somebody else set.

The code part is the easy part. The library will do that.

What the library can't do:

decide what your identity model is,
own the unhappy paths,
own metadata refresh,
own the relationship with the customer,
own the logs,
own the cert rotation calendar.

That part is your job. It is also the part that decides whether the integration is a quiet success or a Friday evening pager. Choose which you want to be on call for.

If you found this useful and want the next post from the trenches in your inbox, there's a subscribe button at the bottom of the page.
