Operations / SRE Engineer
Table of contents
- What you do in ITIP
- Core workflows
- Using frameworks
- Example: Detect and remediate compliance drift
- Example: Set up observability-driven governance for a new service
What you do in ITIP
You monitor compliance drift against governed quality constraints. You inspect near real-time architecture state sourced from observability platforms. You trigger remediation when Norm violations are detected.
ITIP connects your observability stack to the governance plane. You act on governed constraints, not ad-hoc alerts.
Core workflows
Monitor compliance drift
Observability platforms (Datadog, Grafana, Prometheus) collect SLI/SLO metrics. ITIP ingests this telemetry and continuously evaluates Norm assertions against live data.
When a Norm evaluation fails, ITIP flags a compliance drift:
VIOLATION: payment-gateway
Norm: availability-30d
Assertion: ServiceTelemetry.availability30d >= 0.999
Current value: 0.9973
Tolerance: SUSTAINED (30d window)
Governing Directive: "MUST ENSURE availability ON payment-gateway"
Regulatory preset: NIS2
The violation includes full governance context — which Directive it traces to, which regulatory preset it belongs to, and what the tolerance mode is.
Inspect near real-time architecture state
Navigate to Operations > Architecture State. ITIP shows the current state of all governed Structures:
| Column | What it shows |
|---|---|
| Structure | The governed component |
| Compliance status | Green / Amber / Red based on Norm evaluation |
| Drift indicators | Code-sourced definitions vs. authored definitions |
| Active violations | Norms currently in violation with duration |
| Last telemetry | Timestamp of most recent observability data |
This is the live architecture state — not a diagram drawn months ago. It reflects what is actually running.
Trigger remediation workflows
When a compliance drift is detected:
- ServiceNow incident — ITIP creates an incident linked to the violated Norm, including governance context and current metric values.
- Jira remediation task — created in the owning team’s backlog with the violated Norm as acceptance criteria.
- Governance loop tracking — ITIP tracks resolution: Norm re-evaluation → compliance restored → incident closed.
The remediation workflow is governance-driven. The Norm defines what must be fixed. The tolerance mode defines how quickly a violation must be resolved.
Tolerance modes in practice
The three tolerance modes determine how ITIP evaluates compliance:
| Mode | Ops meaning | Example |
|---|---|---|
| INSTANTANEOUS | Every single measurement must pass | ResponseTime.p99Ms <= 500 — any single violation is a breach |
| AGGREGATED | Statistical aggregate must pass | ErrorRate.percent <= 0.1 over 1h window — individual spikes allowed |
| SUSTAINED | Continuous compliance over a time window | Availability.percent >= 99.9 over 30d — sustained performance required |
AGGREGATED and SUSTAINED modes are where SRE practices (error budgets, burn rates) map directly into ITIP Norms.
Using frameworks
ISO 25010 — Operational quality characteristics
ISO 25010 quality characteristics map to your monitoring targets:
| Characteristic | Ops concern | Typical metrics |
|---|---|---|
| ProductReliability | Availability, fault tolerance, recoverability | Uptime %, MTTR, error rate |
| PerformanceEfficiency | Latency, throughput, resource utilization | p50/p95/p99 latency, RPS, CPU/memory |
| ProductSecurity | Runtime security posture | TLS compliance, auth failure rate, vulnerability scan |
Norms using these characteristics are your governance-driven SLOs.
ISO 25012 — Data quality in production
For data pipelines and data-processing services:
| Characteristic | Ops concern |
|---|---|
| DataCurrentness | Is data fresh enough? Staleness detection, refresh lag |
| DataAvailability | Is the data accessible when needed? Storage uptime, query success rate |
| DataRecoverability | Can data be restored after failure? Backup validation, RPO/RTO compliance |
Example: Detect and remediate compliance drift
Context: The Payment Gateway is in production. The SRE team monitors via Datadog. ITIP has ingested SLI metrics.
- Drift detected — ITIP flags:
availability-30dNorm in violation. Current value 99.73%, threshold 99.9%. - Inspect context — the Norm traces to a NIS2 regulatory preset and the architecture board’s availability Directive.
- Trigger remediation — ServiceNow incident created. Jira task added to the payments team’s sprint.
- Investigate — the Reconciliation view shows a recent deployment changed a timeout configuration. Code-sourced definitions diverged from governed values.
- Resolution — team fixes the configuration. ITIP re-evaluates: compliance restored. Incident auto-closed.
Result: Compliance monitoring is continuous and automated. The governance context tells you why the metric matters, not just that a number is wrong.
Example: Set up observability-driven governance for a new service
Context: A new service is being deployed. You want observability-driven Norm evaluation from day one.
- Check governed Norms — open the Structure’s Norms tab. See the quality Norms defined by the architect.
- Map to metrics — each Norm assertion references a quality dimension. Map those to your telemetry sources (Datadog monitors, Grafana dashboards, Prometheus rules).
- Configure telemetry ingestion — ITIP’s observability integration ingests metrics and evaluates Norms continuously.
- Verify initial state — the Operations dashboard shows all Norms green (or flags gaps if telemetry is missing).
Result: Governance-driven monitoring from the first deployment. SLOs are defined by architecture Norms, not invented ad-hoc by the ops team.