VMS Ingestion Automation: 30 sources, one job model

A travel-nursing job arrives at Trusted in one of three ways. A scheduled Sidekiq worker logs into a vendor portal with a headless browser, sets a date filter, downloads an XLSX, and uploads it back into the app. An ops engineer drops a file into an admin form because a portal’s scheduled fetch has been broken for two days and the jobs need to go out anyway. A Zapier webhook fires because a hospital’s VMS emails its weekly position list as an attachment, and email is the integration.

Then it has to become a job. Not a row in a vendor’s spreadsheet. A real Job record with a clinical unit, a business, a bill rate, dates, hours, and the rule scaffolding a clinician’s profile can be validated against. The shape every system downstream of ingestion expects: curation, qualification, packet assembly, search.

Thirty-plus vendor management platforms sit in our ingestion path. Each one is its own dialect: different column names, different authentication, different definitions of what counts as a position. The naive answer is one importer per vendor, written from scratch, glued together by whoever is on call when it breaks. I tried versions of that. They turn into the kind of code nobody wants to touch.

The design decision that did the most work: pushing classification into a rule table so adding a new vendor is a small PR, not a code rewrite.

Three entry paths, one pipeline

The three entry paths look different from the outside and almost identical from the inside. Each one produces the same thing: a raw file, attached to a workflow record, ready to be parsed. From that point on, every job runs through the same four steps.

Fetch. Authenticate with the vendor and pull files down. For the API vendors this is paginated REST calls with a refresh token loop. For scraped portals it is a Capybara/Selenium session with headless Chrome and a small library of selectors per vendor. The output is the same shape in both cases: one or more files attached to the workflow. For manual and Zapier paths Fetch is skipped: the file is already attached, either dropped in by an ops user through an admin form, or posted to a webhook endpoint by Zapier from an inbound email.
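For the API case, the paginated fetch with a refresh loop can be sketched roughly as follows. This is an illustration, not our Fetcher: Response, PaginatedFetcher, StubClient, and the get(page:, token:) interface are all hypothetical names, and the stub stands in for a real HTTP client.

```ruby
# Hypothetical response shape: status code, items on the page, next page number.
Response = Struct.new(:status, :items, :next_page, keyword_init: true)

class PaginatedFetcher
  MAX_REFRESHES = 1

  # client:  responds to get(page:, token:) and returns a Response (assumed interface).
  # refresh: callable that returns a fresh access token.
  def initialize(client:, refresh:, token:)
    @client = client
    @refresh = refresh
    @token = token
  end

  # Walks pages until the vendor reports no next page.
  def fetch_all
    items = []
    page = 1
    while page
      response = get_with_refresh(page)
      items.concat(response.items)
      page = response.next_page
    end
    items
  end

  private

  # Retries a page once after refreshing the token on a 401, then gives up loudly.
  def get_with_refresh(page)
    refreshes = 0
    loop do
      response = @client.get(page: page, token: @token)
      return response unless response.status == 401
      raise "still unauthorized after token refresh" if refreshes >= MAX_REFRESHES

      @token = @refresh.call
      refreshes += 1
    end
  end
end

# Stub vendor API: two pages, rejects any token other than "fresh".
class StubClient
  PAGES = { 1 => [%w[a b], 2], 2 => [%w[c], nil] }.freeze

  def get(page:, token:)
    return Response.new(status: 401, items: [], next_page: nil) unless token == "fresh"

    items, next_page = PAGES.fetch(page)
    Response.new(status: 200, items: items, next_page: next_page)
  end
end

fetcher = PaginatedFetcher.new(client: StubClient.new, refresh: -> { "fresh" }, token: "stale")
jobs = fetcher.fetch_all
```

Capping the refresh count keeps a revoked credential from turning into an infinite loop; the raised error is the signal that a human needs to look at the vendor's credentials.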

Serialize. Parse the raw files. XLSX, CSV, XLS, occasionally a one-off format we wrote a custom reader for. Output: typed JobDataImport rows with their columns coerced to the schema we expect downstream. This step deduplicates on a SHA256 row hash, so re-importing the same file updates timestamps without creating duplicate jobs.
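The dedup behavior can be sketched with an in-memory store standing in for the JobDataImport table. ImportStore, upsert, and last_seen_at are hypothetical names, and sorting keys before hashing is one possible choice for making the digest independent of column order; it is an assumption, not a description of our hash.

```ruby
require "digest"
require "json"

# Hypothetical in-memory stand-in for the JobDataImport table.
class ImportStore
  attr_reader :rows

  def initialize
    @rows = {} # row_hash => { data:, last_seen_at: }
  end

  # Sort keys before hashing so column order in the source file
  # does not change the digest. Assumes string column names.
  def self.row_hash(row)
    Digest::SHA256.hexdigest(JSON.generate(row.sort.to_h))
  end

  # Returns :created for a new row, :touched when the hash is already known.
  def upsert(row, now: Time.now)
    key = self.class.row_hash(row)
    if @rows.key?(key)
      @rows[key][:last_seen_at] = now
      :touched
    else
      @rows[key] = { data: row, last_seen_at: now }
      :created
    end
  end
end

store = ImportStore.new
row = { "Position" => "RN - ICU", "Bill Rate" => "95.00" }
first  = store.upsert(row)
second = store.upsert(row.to_a.reverse.to_h) # same row, columns in a different order
```

The second call only refreshes the timestamp, which is the property that makes re-running an import safe.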

Normalize. The vendor-specific transformation layer. Most of the per-vendor code lives here. The base class handles structure: iteration, error handling, output shape. Each vendor subclass overrides a small set of extraction methods. One vendor knows that position names arrive in the shift column in SPECIALTY - ROLE - UNIT format. Another encodes the same information across three columns. A third tucks the company identifier into a free-text notes field, because someone at the vendor decided years ago that was a reasonable place to put it.
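The split between base class and vendor subclass might look something like the sketch below. BaseNormalizer's internals, the Result struct, ShiftColumnNormalizer, and the "Shift" column name are illustrative assumptions; only the SPECIALTY - ROLE - UNIT format comes from the description above.

```ruby
# Hypothetical base class: owns iteration, error capture, and output shape.
class BaseNormalizer
  Result = Struct.new(:jobs, :errors)

  def normalize(rows)
    result = Result.new([], [])
    rows.each_with_index do |row, index|
      begin
        result.jobs << {
          specialty: extract_specialty(row),
          role: extract_role(row),
          unit: extract_unit(row)
        }
      rescue StandardError => e
        # A bad row is recorded and skipped; the rest of the import continues.
        result.errors << { index: index, message: e.message }
      end
    end
    result
  end
end

# Vendor whose position arrives in one column as "SPECIALTY - ROLE - UNIT".
class ShiftColumnNormalizer < BaseNormalizer
  def extract_specialty(row)
    parts(row)[0]
  end

  def extract_role(row)
    parts(row)[1]
  end

  def extract_unit(row)
    parts(row)[2]
  end

  private

  def parts(row)
    value = row.fetch("Shift")
    pieces = value.split(" - ", 3).map(&:strip)
    raise ArgumentError, "expected SPECIALTY - ROLE - UNIT, got #{value.inspect}" unless pieces.size == 3

    pieces
  end
end

result = ShiftColumnNormalizer.new.normalize([
  { "Shift" => "TRAVEL - RN - MED/SURG TELE" },
  { "Shift" => "CVICU" } # malformed for this vendor: recorded as an error
])
```

The subclass contains only the knowledge that is specific to the vendor; a second vendor that spreads the same information across three columns would override the same three methods with plain column reads.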

Classify. Translate vendor strings into our internal IDs. Position names become clinical_unit_id. Company IDs become business_id. Role names become normalized role keys. This is where the rule-table model lives.

The four steps are the same for every workflow. The shape of the pipeline does not change when we add a new vendor; only the contents of the per-vendor classes change.

The declarative column schema

The Serializer is where the type coercion happens, and the choice that paid off most here was making the column definitions declarative.

Every vendor file has a header row that names its columns. Every internal JobDataImport row has a known schema. Bridging the two used to involve a case statement per vendor and a thicket of to_s/to_i/parse calls. The shape we landed on instead is a small DSL:

require "digest"

class VendorASerializer < BaseSerializer
  field :position_name, source: "Position", type: :string
  field :company_id,    source: "FacilityID", type: :string
  field :start_date,    source: "Start Date", type: :date, format: "%m/%d/%Y"
  field :bill_rate,     source: "Bill Rate", type: :decimal
  field :shifts,        source: "Shifts", type: :integer
  # Computed field: no source column, derived from the whole row.
  field :row_hash,      computed: ->(row) { Digest::SHA256.hexdigest(row.to_s) }
end

VmsField is the underlying schema object. It reads a value out of a source row, coerces it to the right type, applies a custom format when needed (date columns are the obvious case), and computes derived values. The base Serializer iterates the configured fields, applies each one to the incoming row, and writes the result.
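A stripped-down sketch of how a field object and the base class could fit together is below. It is an assumption about the shape, not our actual VmsField: the read method, the class-level fields list, and ExampleSerializer are hypothetical, and error handling is left out.

```ruby
require "bigdecimal"
require "date"

# Hypothetical sketch of a schema field; the real VmsField may differ.
class VmsField
  attr_reader :name

  def initialize(name, source: nil, type: :string, format: nil, computed: nil)
    @name = name
    @source = source
    @type = type
    @format = format
    @computed = computed
  end

  # Reads this field's value out of a source row and coerces it.
  def read(row)
    return @computed.call(row) if @computed

    raw = row[@source].to_s.strip
    return nil if raw.empty?

    case @type
    when :string  then raw
    when :integer then Integer(raw, 10)
    when :decimal then BigDecimal(raw)
    when :date    then @format ? Date.strptime(raw, @format) : Date.parse(raw)
    else raise ArgumentError, "unknown type #{@type.inspect}"
    end
  end
end

class BaseSerializer
  # Each subclass keeps its own list of declared fields.
  def self.fields
    @fields ||= []
  end

  def self.field(name, **options)
    fields << VmsField.new(name, **options)
  end

  # Applies every configured field to one incoming row.
  def serialize(row)
    self.class.fields.to_h { |f| [f.name, f.read(row)] }
  end
end

class ExampleSerializer < BaseSerializer
  field :start_date, source: "Start Date", type: :date, format: "%m/%d/%Y"
  field :bill_rate,  source: "Bill Rate",  type: :decimal
  field :shifts,     source: "Shifts",     type: :integer
end

record = ExampleSerializer.new.serialize(
  { "Start Date" => "03/15/2024", "Bill Rate" => "95.50", "Shifts" => "3" }
)
```

Because the field list is data on the class, the same list can also drive things like header validation, though that is outside this sketch.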

The win is not the line count. The declarative shape makes the data model legible. A new engineer reading VendorASerializer can see, in one screen, what we expect from that vendor’s file. Adding a vendor with a similar shape is mechanical. Changing the type of a field is a one-line edit.

Rules over hardcoding

Vendors do not name their clinical units the way we name ours. We have a normalized taxonomy of clinical units with stable integer IDs. Vendors send us strings. TRAVEL - RN - MED/SURG TELE. ICU-CARD-4E. CVICU. Medical Surgical Telemetry / Stepdown. The same unit at the same hospital can be named two different ways by two different platforms, because the platforms scrape from upstream systems that disagree.

The naive answer is a lookup table: vendor string in, internal ID out. That works for the first hundred mappings and breaks the moment a vendor renames a unit, or the same string means different things at different facilities, or a new vendor introduces a string almost-but-not-quite an existing key and the lookup quietly fails open.

The model we use instead is a rule table: job_ingestion_classification_mappings. Each row has a source (a vendor-specific scope, or NULL for a global fallback), a clinical_unit_id (the destination), and two sets of criteria. The matching criteria say when the rule applies. The exclusion criteria say when to skip it even if the matching criteria fire.

Each set of criteria is held in a small structure we call CriteriaStorage:

class CriteriaStorage
  # Each criterion has three fields. Within a criterion, all
  # populated fields must match (AND). Across criteria in the
  # array, any single match is sufficient (OR).
  Criterion = Struct.new(:equals, :contains_all, :contains_any)

  attr_reader :criteria

  # criteria: an array of Criterion structs (constructor shape assumed here).
  def initialize(criteria)
    @criteria = criteria
  end

  def matches?(input)
    text = input.to_s
    lowered = text.downcase
    criteria.any? do |c|
      [
        c.equals.nil?       || c.equals.casecmp?(text),
        c.contains_all.nil? || c.contains_all.all? { |t| lowered.include?(t.downcase) },
        c.contains_any.nil? || c.contains_any.any? { |t| lowered.include?(t.downcase) }
      ].all?
    end
  end
end

That is most of the engine. A criterion can require an exact match, a set of tokens that all have to appear, a set of tokens where any one appearing is enough, or any combination. Multiple criteria are an OR. Exclusion criteria run the same logic on a separate list and short-circuit the match.

So “any string containing ICU” matches Medical ICU. “contains ICU, excludes cardiac” keeps Cardiac ICU out of that bucket. “contains med/surg AND contains tele” routes telemetry/stepdown jobs into Med-Surg Telemetry instead of plain Med-Surg.
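Those examples, written out as data, might look like the sketch below. The Rule struct, the helper name any_criterion_matches?, and the unit symbols are hypothetical; the matching logic repeats the AND-within, OR-across behavior described above so the block runs on its own.

```ruby
Criterion = Struct.new(:equals, :contains_all, :contains_any, keyword_init: true)

# True when any criterion matches; within one criterion every
# populated field must match.
def any_criterion_matches?(criteria, input)
  text = input.downcase
  criteria.any? do |c|
    (c.equals.nil? || c.equals.casecmp?(input)) &&
      (c.contains_all.nil? || c.contains_all.all? { |t| text.include?(t.downcase) }) &&
      (c.contains_any.nil? || c.contains_any.any? { |t| text.include?(t.downcase) })
  end
end

# Hypothetical rule shape: a destination plus matching and exclusion criteria.
Rule = Struct.new(:unit, :matching, :exclusion, keyword_init: true) do
  def matches?(input)
    any_criterion_matches?(matching, input) && !any_criterion_matches?(exclusion, input)
  end
end

# "contains ICU, excludes cardiac"
medical_icu = Rule.new(
  unit: :medical_icu,
  matching: [Criterion.new(contains_any: ["ICU"])],
  exclusion: [Criterion.new(contains_any: ["cardiac"])]
)

# "contains med/surg AND contains tele"
tele_stepdown = Rule.new(
  unit: :tele_stepdown,
  matching: [Criterion.new(contains_all: ["med/surg", "tele"])],
  exclusion: []
)

checks = {
  icu_plain:   medical_icu.matches?("Medical ICU Nights"),
  icu_cardiac: medical_icu.matches?("Cardiac ICU"),
  tele_full:   tele_stepdown.matches?("TRAVEL - RN - MED/SURG TELE"),
  tele_plain:  tele_stepdown.matches?("MED/SURG")
}
```

One thing the sketch makes visible: matching is by substring, so a string like CVICU also contains ICU. A rule set built this way needs an exclusion token for each such case, not only the word "cardiac".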

The precedence rule is the second half of the system. When we classify an input, we look for matching rules in two passes:

def classify(input, source:)
  source_rules = ClassificationMapping
    .where(source: source)
    .order(:created_at)

  match = source_rules.detect { |r| r.matches?(input) }
  return match if match

  global_rules = ClassificationMapping
    .where(source: nil)
    .order(:created_at)

  global_rules.detect { |r| r.matches?(input) }
end

Source-specific rules win. Global rules are the fallback. Within each scope, the first match by created_at wins, so the oldest matching rule takes precedence. That keeps the effect of an addition predictable: a new rule only claims strings that no earlier rule in its scope already matches.
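The precedence can be exercised without a database. The sketch below is an in-memory stand-in, not the ActiveRecord version: Mapping, its single-token matcher, and the unit IDs are hypothetical simplifications, and the ascending sort mirrors order(:created_at) in the code above.

```ruby
# Hypothetical in-memory mapping with a simplified one-token matcher.
Mapping = Struct.new(:source, :clinical_unit_id, :token, :created_at, keyword_init: true) do
  def matches?(input)
    input.downcase.include?(token.downcase)
  end
end

# Source-specific rules first, then global (source nil); within a scope,
# the earliest created_at wins.
def classify(input, source:, mappings:)
  [source, nil].uniq.each do |scope|
    match = mappings
      .select { |m| m.source == scope }
      .sort_by(&:created_at)
      .find { |m| m.matches?(input) }
    return match if match
  end
  nil
end

mappings = [
  Mapping.new(source: nil,        clinical_unit_id: 101, token: "ICU", created_at: 1),
  Mapping.new(source: "vendor_a", clinical_unit_id: 303, token: "ICU", created_at: 3),
  Mapping.new(source: "vendor_a", clinical_unit_id: 202, token: "ICU", created_at: 2)
]

vendor_hit   = classify("ICU-CARD-4E", source: "vendor_a", mappings: mappings)
fallback_hit = classify("ICU-CARD-4E", source: "vendor_b", mappings: mappings)
no_hit       = classify("PEDS", source: "vendor_a", mappings: mappings)
```

For vendor_a the global rule is older than both vendor rules and still loses, because scope is checked before age. Inside the vendor_a scope, the rule created at 2 beats the one created at 3.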

The practical effect: a clinical curation manager can add a new mapping through an admin UI and the next ingestion run picks it up. No deploy. No migration. No case statement in a Ruby file that has to be PR’d and reviewed. The rule table is data, not code, and the engineering team gets out of the way of an operational change that does not need engineering judgement.

Adding a vendor is a small PR

Adding an entirely new vendor platform is a four-file change: a Fetcher, a Serializer, a Normalizer, and a Workflow that wires them together. Adding a new client on an existing platform is usually three to five lines per file. The platform shape is already known. The new client has different credentials and a different facility ID range, and that is most of the diff.

The bigger payoff shows up in the long tail. The 90th-percentile new mapping is not a new platform integration. It is a new variation of a string that an existing platform started sending us last week. With a hardcoded lookup, that would be an engineering ticket. With the rule table, it is a row insert through an admin form, and no engineer is involved.

What is actually hard

The pipeline itself is clean. The operational surface around it is where the real engineering goes. A few of the failure modes that shape the design more than any architecture diagram suggests:

Portals change without notice. A vendor swaps a <button> for a <div> with a click handler and our Capybara selector stops matching. The fix is mechanical; the slow part is detecting the failure and routing it to the right person before yesterday’s jobs go stale.

Authentication drifts. API tokens expire on schedules we do not control. Some vendors rotate them monthly, some on demand, a couple silently. We monitor for 401s in the fetch step and alert.

Date formats fail silently. A vendor that has been sending us MM/DD/YYYY for a year quietly switches to YYYY-MM-DD for a subset of their platform. The Serializer coerces unparseable dates to nil instead of raising, because a hard failure stops the whole import and a Tuesday is the wrong time to find out half your jobs won’t parse. We surface coercion failures as JobDataImport errors and treat the count as a health metric.
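The coerce-to-nil behavior can be sketched as follows. CoercionLog and coerce_date are hypothetical names standing in for the JobDataImport error records; the only library behavior relied on is that Date.strptime raises an ArgumentError (Date::Error in current Rubies, which is a subclass) when the input does not fit the format.

```ruby
require "date"

# Hypothetical collector standing in for JobDataImport error records.
class CoercionLog
  attr_reader :failures

  def initialize
    @failures = []
  end

  def record(field, raw)
    @failures << { field: field, raw: raw }
  end
end

# Returns a Date, or nil plus a logged failure when the value does not
# fit the expected format. Never raises for a bad value.
def coerce_date(raw, format:, field:, log:)
  Date.strptime(raw.to_s, format)
rescue ArgumentError
  log.record(field, raw)
  nil
end

log = CoercionLog.new
values = ["03/15/2024", "2024-03-15"] # second value uses the new, unexpected format
dates = values.map { |v| coerce_date(v, format: "%m/%d/%Y", field: :start_date, log: log) }
failure_count = log.failures.size
```

The import keeps going, and failure_count is the number that belongs on a dashboard: a jump from zero is usually the first sign that a vendor changed formats.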

The same job arrives from two sources. A facility runs on two VMS platforms in parallel during a migration. The same position comes in twice with different identifiers and slightly different naming. The Serializer-level dedup only catches identical rows, because it keys on the row hash; the deeper job-level dedup is its own machinery and its own post.

None of these break the pipeline shape. Most of the operational work on this system is keeping the inputs healthy, not reworking the inside.

--- Carlos, Engineering