OAuth/SSO integration: the part of fullstack nobody warned me about
Enterprise SSO sounds like a config job. It isn't. Three weeks, eight tenants, and a list of mistakes I keep paying for.
The first time I shipped enterprise SSO for a customer, I gave my project manager an estimate of "three or four days, depending on the IdP." It took three weeks. Eight customers. Two production incidents. One quiet apology to a team in São Paulo whose Monday morning I broke.
This is the post-mortem I never wrote at the time, packaged as the warning I wish someone had given me.
The protocol is the easy part
If you've only read the OAuth 2.0 RFC, you'd reasonably conclude this is a
solved problem. There's a chart. The boxes have arrows. The arrows are
labeled. The library you npm install claims to handle it.
The library does handle the protocol. The protocol is not the problem.
The problem is everything around it:
- Clock skew. The customer's IdP server clock is 47 seconds ahead of yours, and assertions are dying as "future-dated."
- Certificate rotation. Their cert expired six hours ago and nobody noticed because the previous one had two years of runway.
- Attribute mapping. Their mail claim is nameid. Their nameid is a UUID nobody uses. Their actual email is in urn:oid:0.9.2342.19200300.100.1.3.
- Tenant isolation. Two customers have an employee with the same email address, because both companies bought from a third company that you also have as a customer.
- The customer's IT person retired. The IdP was configured in 2019. Nobody has the admin password. The export option is greyed out.
None of this is in the RFC. All of it is in production.
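To make the clock-skew item concrete, here is a minimal sketch of a validity-window check with a tolerance. The ValidityWindow type, the function name, and the 60-second default are assumptions for illustration, not any particular library's API; most SAML libraries have their own setting for this, so check yours before writing your own.

```typescript
// Hypothetical shape: the validity window parsed from an assertion's
// NotBefore / NotOnOrAfter conditions. Not any specific library's type.
interface ValidityWindow {
  notBefore: Date;
  notOnOrAfter: Date;
}

type WindowCheck = "ok" | "not_yet_valid" | "expired";

// Checks the window against `now`, accepting up to `toleranceMs` of clock
// drift in either direction. The 60-second default is an assumption.
function checkValidityWindow(
  window: ValidityWindow,
  now: Date,
  toleranceMs: number = 60_000,
): WindowCheck {
  const t = now.getTime();
  // "Future-dated": still before NotBefore even after allowing for drift.
  if (t + toleranceMs < window.notBefore.getTime()) return "not_yet_valid";
  // Expired: at or past NotOnOrAfter even after allowing for drift.
  if (t - toleranceMs >= window.notOnOrAfter.getTime()) return "expired";
  return "ok";
}
```

With an IdP clock 47 seconds ahead, a zero tolerance rejects the assertion as not yet valid, while a 60-second tolerance accepts it. Keep the tolerance small: every extra second also widens the window in which an old assertion still passes this check.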
What I'd tell past me (a checklist)
I keep a list in my notes from that period. Here it is, cleaned up.
1. Decide whether email is your identity, before anything else
You will be tempted to use the email address as the user identity, because that's what every spec doc shows. Don't, unless you've earned the right.
What happens later:
- A customer renames their domain (acme.com → acmegroup.com) and now every user looks new.
- A customer migrates IdPs and the new IdP exposes a different attribute as the "primary" email.
- An employee changes their email when they get married. Their account is now orphaned.
The safer pattern: identity = (tenant_id, immutable_subject_id) where
subject_id is whatever the IdP swears is stable for that user. Email
is a display attribute that you update, not a key you join on.
I learned this on tenant #6, after I'd already shipped it the wrong way for tenants #1 through #5. The migration was not fun.
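A minimal sketch of that pattern, assuming an in-memory map in place of a real users table with a unique index on (tenant_id, subject_id). The UserDirectory class and its method names are hypothetical; the point is only that lookup never touches the email.

```typescript
// Hypothetical user record. In a real schema this would be a table with a
// unique index on (tenant_id, subject_id).
interface User {
  id: number;
  tenantId: string;
  subjectId: string; // whatever the IdP swears is stable for this user
  email: string; // display attribute only; never used for lookup
}

class UserDirectory {
  private users = new Map<string, User>();
  private nextId = 1;

  // JSON-encode the pair so no tenant/subject combination can collide.
  private key(tenantId: string, subjectId: string): string {
    return JSON.stringify([tenantId, subjectId]);
  }

  // Find the user by (tenantId, subjectId). Create the record on first
  // login; otherwise refresh the email in place.
  upsertFromLogin(tenantId: string, subjectId: string, email: string): User {
    const k = this.key(tenantId, subjectId);
    const existing = this.users.get(k);
    if (existing) {
      existing.email = email;
      return existing;
    }
    const created: User = { id: this.nextId++, tenantId, subjectId, email };
    this.users.set(k, created);
    return created;
  }
}
```

A domain rename then becomes an email update on the same record, and the same email address appearing in two tenants produces two separate users.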
2. Build the unhappy-path UI first
The happy path is redirect → callback → set cookie → done. That part
is forty lines of code and it works on day one.
The unhappy path is everything else:
- The IdP returned a 500. You want to show the user a real error, not an infinite loop back to the same redirect.
- The assertion was valid but the attribute mapping is missing. You want to fall back to manual input and page the customer's admin.
- The tenant is configured but disabled. Show "your account is suspended, contact billing@" — not a generic 401.
- The user's IdP session is alive but their app session expired. Force re-auth without re-prompting the IdP, or your help desk will get "why does it sign me in but then immediately sign me out" tickets.
I built the unhappy paths last. By the time I got to them, I'd shipped two of the most embarrassing tickets of my career.
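One way to force yourself to build these first is to write the failure cases down as a type before the happy path exists. The sketch below is an assumption about how that might look: the failure kinds, the action names, and the message wording are all hypothetical, one per unhappy path listed above.

```typescript
// Hypothetical failure categories, one per unhappy path in the list.
type LoginFailure =
  | { kind: "idp_error"; status: number }
  | { kind: "missing_attribute"; attribute: string }
  | { kind: "tenant_disabled" }
  | { kind: "app_session_expired" };

interface FailureResponse {
  action: "show_error" | "manual_entry" | "redirect_to_idp";
  userMessage: string;
  notifyTenantAdmin: boolean;
}

function respondToFailure(failure: LoginFailure): FailureResponse {
  switch (failure.kind) {
    case "idp_error":
      // Show a real error page. Redirecting again would loop forever.
      return {
        action: "show_error",
        userMessage: `Your company's sign-in service returned an error (${failure.status}). Please try again in a few minutes.`,
        notifyTenantAdmin: false,
      };
    case "missing_attribute":
      // Valid assertion, broken mapping: let the user continue by hand
      // and tell the customer's admin which attribute is missing.
      return {
        action: "manual_entry",
        userMessage: `We could not read "${failure.attribute}" from your sign-in. Please enter it manually.`,
        notifyTenantAdmin: true,
      };
    case "tenant_disabled":
      return {
        action: "show_error",
        userMessage: "Your account is suspended. Please contact billing.",
        notifyTenantAdmin: false,
      };
    case "app_session_expired":
      // Send the user through the IdP once; its live session should
      // complete the round trip without prompting for credentials.
      return {
        action: "redirect_to_idp",
        userMessage: "Your session expired. Signing you back in.",
        notifyTenantAdmin: false,
      };
  }
}
```

Because the switch covers every member of the union, adding a new failure kind later makes the compiler complain until it has a response of its own.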
3. Treat metadata expiry as a scheduled outage
IdP metadata documents have a validUntil field. They expire. When they
expire, every login for that tenant fails.
I had eight tenants. Three had metadata that expired within the same month, but on different days. I didn't have a job to refresh them. I didn't have a job to alert me. I had a Sev-1 incident the morning the first one died.
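    // the cron job I should have written on day one
    // (db, fetchIdpMetadata, pageOps and addDays are assumed to exist
    // elsewhere in the app; addDays behaves like the one in date-fns)
    async function refreshTenantMetadata(tenantId: string) {
      const tenant = await db.tenant.findUnique({ where: { id: tenantId } });
      // findUnique returns null for an unknown id; fail loudly here
      // instead of crashing on tenant.metadataUrl
      if (!tenant) throw new Error(`Unknown tenant: ${tenantId}`);

      const fresh = await fetchIdpMetadata(tenant.metadataUrl);
      // page if even the freshly fetched metadata expires within 14 days
      if (fresh.validUntil < addDays(new Date(), 14)) {
        await pageOps(
          `Tenant ${tenant.slug} metadata expires ${fresh.validUntil}, ` +
            `please coordinate refresh with their IT.`,
        );
      }

      // store the new document and its expiry either way
      await db.tenant.update({
        where: { id: tenantId },
        data: { metadataXml: fresh.xml, validUntil: fresh.validUntil },
      });
    }

Page yourself at least two weeks before expiry. That's how long it takes a customer's IT team to schedule, review, sign, and re-export new metadata. I learned this number by paying it three times.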
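With several tenants expiring in the same month, a single list ordered by expiry date can be easier to act on than separate pages. A small sketch, assuming the stored validUntil values are already loaded; the TenantExpiry type and expiringWithin function are hypothetical names.

```typescript
// Hypothetical view of a tenant row: only the fields this report needs.
interface TenantExpiry {
  slug: string;
  validUntil: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns tenants whose metadata expires within `days` of `now`
// (including any already expired), soonest first.
function expiringWithin(
  tenants: TenantExpiry[],
  now: Date,
  days: number,
): TenantExpiry[] {
  const cutoff = now.getTime() + days * DAY_MS;
  return tenants
    .filter((t) => t.validUntil.getTime() <= cutoff)
    .sort((a, b) => a.validUntil.getTime() - b.validUntil.getTime());
}
```

Running this daily with days set to 14 matches the two-week lead time described above.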
4. The customer's "test environment" is a fiction
When you ask the customer for an IdP sandbox to test against, you will get one of these answers:
- "We don't have one."
- "We have one, but it's behind the VPN."
- "We have one, but the admin who set it up has left."
- "We have one, but its certs expired six months ago."
In every case, the answer is: plan to test on their production IdP during a planned window, on a single test account they create for you.
This is not a workaround. This is the actual process. Schedule the integration call. Bring a checklist. Test five scenarios end-to-end in that one window:
- Successful login.
- Successful logout (often the broken one).
- Wrong-password login (you want their error UI, not yours).
- Disabled account.
- Re-login after session expiry.
If you've already shipped the unhappy-path UI from rule #2, the checklist takes ten minutes. If you haven't, the checklist takes a week and the customer hates you.
5. Logging is not "nice to have"
Every assertion in/out of your system gets logged with:
- tenant_id
- request_id
- subject_id (hashed if you must)
- the issuer
- the timestamp the assertion claims
- the timestamp you observed
Don't log the assertion body itself in plaintext. Do log enough to answer "why did user X fail to log in on day Y at time Z" without asking them to reproduce. They will not reproduce. They will switch to a competitor who logs.
I had a bug for two months where one tenant's users hit a 1-in-50 login
failure. I couldn't reproduce it. I had no logs. I shipped logging, waited
a week, and found it in twenty minutes — they had a load balancer in
front of their IdP that occasionally rewrote the RelayState parameter
in transit. I would never have guessed.
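For reference, the log entry described above might look like the following sketch. The field names and the buildAssertionLogEntry function are assumptions, and the plain SHA-256 of tenant and subject is the simplest thing that works; a keyed hash (HMAC with a server-side secret) would be a stronger choice if subject IDs are guessable.

```typescript
import { createHash } from "node:crypto";

// Hypothetical log record: enough to answer "why did user X fail at time Z"
// without storing the assertion body.
interface AssertionLogEntry {
  tenantId: string;
  requestId: string;
  subjectIdHash: string;
  issuer: string;
  claimedAt: string; // timestamp the assertion claims (ISO 8601)
  observedAt: string; // timestamp we observed (ISO 8601)
  skewMs: number; // claimed minus observed; positive means the IdP clock is ahead
}

function buildAssertionLogEntry(input: {
  tenantId: string;
  requestId: string;
  subjectId: string;
  issuer: string;
  claimedAt: Date;
  observedAt: Date;
}): AssertionLogEntry {
  // Hash tenant and subject together so equal subject IDs in different
  // tenants do not produce the same hash.
  const subjectIdHash = createHash("sha256")
    .update(`${input.tenantId}:${input.subjectId}`)
    .digest("hex");
  return {
    tenantId: input.tenantId,
    requestId: input.requestId,
    subjectIdHash,
    issuer: input.issuer,
    claimedAt: input.claimedAt.toISOString(),
    observedAt: input.observedAt.toISOString(),
    skewMs: input.claimedAt.getTime() - input.observedAt.getTime(),
  };
}
```

Recording the skew directly means a clock-drift problem shows up as a number in the log rather than as a pattern you have to spot by comparing two timestamps by eye.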
The unglamorous summary
Enterprise SSO isn't a config job. It's a customer-success job that happens to involve protocols. Most of the work isn't writing code — it is:
- talking to a customer's IT person you've never met,
- about software they configured years ago,
- to fix a problem they don't think is their problem,
- on a deadline somebody else set.
The code part is the easy part. The library will do that.
What the library can't do:
- decide what your identity model is,
- own the unhappy paths,
- own metadata refresh,
- own the relationship with the customer,
- own the logs,
- own the cert rotation calendar.
That part is your job. It is also the part that decides whether the integration is a quiet success or a Friday evening pager. Choose which you want to be on call for.