Research / RCRA Operations

RCRA Compliance Complexity Index: Generator Status Does Not Tell the Whole Operational Story

· 2.4M evaluation events analyzed

A joined EPA RCRAInfo and HWIP e-Manifest analysis using 14.1M manifests, 124.1M waste-code records, and predictive models across 17,331 manifesting facilities.


Key Findings

  • The high-complexity tail is different in kind, not just degree: evaluation events in the top decile averaged 29,152 manifests, 182,863 waste lines, and 55 transporters in the two-year lookback before an evaluation.
  • That operating surface showed up in outcomes: the highest complexity decile had a 27% higher violation rate on the next eligible evaluation (30.2% versus 23.7%).
  • Robustness check: the gradient persisted after excluding the top 1% of manifest-heavy facilities, at 26.3% versus 23.3%.
  • A statistical model that included operational complexity and prior compliance history ranked future violation outcomes better than a baseline model using generator status, sector, and state alone.

Operational Complexity Shows Up in Inspection Outcomes

Two facilities can share the same RCRA generator category while facing very different compliance burdens. In the data, that hidden operational complexity matters: evaluation events in the highest complexity decile were followed by a 27% higher violation rate on the next eligible evaluation (30.2% versus 23.7%).

This report builds a RCRA Compliance Complexity Index from actual EPA facility and e-Manifest records to measure that operating burden: manifest volume, waste-code breadth, transporter and receiving-facility networks, submission patterns, corrections, and prior compliance history. The goal is not to replace generator status, but to measure the work that generator status can hide.

It also builds on our earlier RCRA violation recidivism analysis: that report showed that prior violations predict future ones; this one asks whether operational complexity helps explain where compliance systems come under strain.

What we counted as operational complexity

The index uses data available before each evaluation date. Core inputs include manifest volume, waste-line count, federal and state waste-code breadth, management-method breadth, transporter count, receiving-facility count, concentration of transporter and TSDF networks, interstate shipment share, manifest submission mix, and proxies for rejection, residue, and correction activity.

Generator status, state, sector, prior violations, prior Significant Non-Compliance months, prior enforcement, and days since the last evaluation were treated as controls or history features rather than as part of the complexity score. Environmental justice and demographic variables were excluded from the primary model.
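As a rough illustration of the scoring approach, the sketch below combines heavy-tailed count features into a single index and bins it into deciles. The feature names and synthetic data are hypothetical, not the report's actual schema; the log-then-z-score-then-average recipe is an assumption about one reasonable way such an index could be built.

```python
import numpy as np

# Toy evaluation-event features (names illustrative, data synthetic).
rng = np.random.default_rng(0)
features = {
    "manifest_count": rng.lognormal(3, 1.5, 1000),
    "waste_line_count": rng.lognormal(4, 1.5, 1000),
    "federal_waste_codes": rng.lognormal(1.5, 1.0, 1000),
    "transporter_count": rng.lognormal(1.0, 1.0, 1000),
    "receiving_facilities": rng.lognormal(1.0, 1.0, 1000),
}

def complexity_index(features: dict) -> np.ndarray:
    """Average of z-scored, log-transformed inputs: one score per event."""
    zs = []
    for values in features.values():
        x = np.log1p(values)  # heavy-tailed counts -> log scale first
        zs.append((x - x.mean()) / x.std())
    return np.mean(zs, axis=0)

index = complexity_index(features)
# Decile 0 = least complex, decile 9 = most complex.
deciles = np.digitize(index, np.quantile(index, np.arange(0.1, 1.0, 0.1)))
```

Controls and history features (generator status, state, prior violations) stay outside the dictionary, mirroring the separation described above.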

Horizontal bar chart comparing the lowest and highest complexity deciles across manifests, waste lines, federal waste codes, transporters, and receiving facilities
Average pre-evaluation manifest activity in the lowest and highest operational complexity deciles. Source: RCRAReady analysis of EPA RCRAInfo and HWIP e-Manifest data.

The complexity tail is real

The descriptive data shows a wide operational spread. In the lowest complexity decile, the average evaluation event had about 12 manifests, 21 waste lines, 3 federal waste codes, fewer than 2 transporters, and roughly 2 receiving facilities in the lookback window.

In the highest decile, the averages were roughly 29,152 manifests, 182,863 waste lines, 360 federal waste codes, 55 transporters, and 42 receiving facilities in the two-year lookback window. Those facilities are managing a very different operating surface.

Complexity and later violations move together

In the full event-weighted dataset, the lowest complexity decile had a 23.7% violation rate on the next eligible evaluation. The highest decile had a 30.2% violation rate. Put differently, the highest-complexity events were associated with a 27% higher violation rate than the lowest-complexity events.

A robustness slice excluding the top 1% of manifest-heavy facilities still showed a high-to-low increase, though the gap narrowed to 26.3% versus 23.3%: the pattern held without the largest manifesting operations, but those operations account for part of the spread.
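The decile comparison and the top-1% robustness slice can be sketched with a simple group-by on synthetic data. Column names and the simulated violation rates are hypothetical; only the slicing logic reflects the analysis described above.

```python
import numpy as np
import pandas as pd

# Synthetic event table: manifest volume, complexity decile, and a
# next-evaluation violation flag that rises modestly with decile.
rng = np.random.default_rng(1)
n = 20000
df = pd.DataFrame({
    "manifests": rng.lognormal(3, 2, n),
    "decile": rng.integers(1, 11, n),
})
df["violation"] = rng.random(n) < (0.20 + 0.01 * df["decile"])

def rate_by_decile(frame: pd.DataFrame) -> pd.Series:
    """Violation rate within each complexity decile."""
    return frame.groupby("decile")["violation"].mean()

full = rate_by_decile(df)

# Robustness slice: drop the top 1% of manifest-heavy events, re-slice.
cutoff = df["manifests"].quantile(0.99)
trimmed = rate_by_decile(df[df["manifests"] <= cutoff])
```

If the gradient survives in `trimmed`, the pattern is not purely an artifact of the very largest manifesting operations.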

Line chart showing violation rate by operational complexity decile for the full event-weighted dataset and a version excluding the top one percent of manifest-heavy facilities
Violation rate by operational complexity decile, full event-weighted data and top-1%-excluded robustness slice. Source: RCRAReady analysis of EPA RCRAInfo and HWIP e-Manifest data.

The model is strongest as a ranking tool

The nonlinear model's clearest result was ranking power: its highest predicted-risk decile had 1.84x the violation rate of the holdout average. That makes the result useful for research and prioritization, but still too coarse for facility-level prediction.

On ROC AUC, baseline controls alone reached 0.594. Adding prior compliance history improved the logistic model to 0.625. The interpretable logistic complexity model reached 0.614, better than baseline controls but weaker than history controls. The nonlinear complexity benchmark performed best at 0.654, suggesting that operational complexity is not a single straight-line factor. It interacts with prior history, sector, and the shape of the facility's manifest network.
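The top-decile lift metric quoted above has a simple definition: the violation rate among the 10% of events the model scores riskiest, divided by the overall rate. A minimal sketch on synthetic scores and labels (both hypothetical):

```python
import numpy as np

# Synthetic holdout: predicted risk scores and true violation labels,
# correlated so the ranking carries real signal.
rng = np.random.default_rng(2)
n = 10000
scores = rng.random(n)
labels = rng.random(n) < (0.1 + 0.3 * scores)

def top_decile_lift(scores: np.ndarray, labels: np.ndarray) -> float:
    """Violation rate in the top predicted-risk decile / overall rate."""
    threshold = np.quantile(scores, 0.9)
    return labels[scores >= threshold].mean() / labels.mean()

lift = top_decile_lift(scores, labels)  # > 1 when ranking beats chance
```

A lift of 1.84, as reported for the nonlinear benchmark, means the riskiest-scored decile had nearly twice the holdout-average violation rate.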

Bar chart comparing model ROC AUC for baseline controls, history controls, logistic complexity index, and nonlinear complexity benchmark
Time-based holdout model comparison. The descriptive charts use the full joined dataset; model training used a capped training sample with a 2022 holdout. Source: RCRAReady analysis.

The signal varies by sector

Sector slices show that complexity does not mean the same thing in every operating environment. Waste-service and support-service facilities showed a clear increase from low to high complexity, as did government and public-sector facilities. Manufacturing, by contrast, showed a weaker pattern, suggesting that scale and standardization may absorb complexity differently than service, waste-handling, or public-sector operations.

That sector split is useful because it prevents one-size-fits-all scoring. High-volume manifesting appears to mean different things across operating models, so the complexity index is most valuable when it is paired with sector context.
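The sector slicing behind the chart below amounts to a two-way group-by: sector by complexity band. The sketch uses synthetic data with illustrative slopes (waste services and government steeper than manufacturing); sector labels and rates are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic events tagged with a NAICS-style sector and complexity decile.
rng = np.random.default_rng(3)
n = 30000
sectors = rng.choice(["manufacturing", "waste_services", "government"], n)
decile = rng.integers(1, 11, n)
# Illustrative assumption: a steeper complexity gradient outside manufacturing.
slope = np.where(sectors == "manufacturing", 0.002, 0.012)
violation = rng.random(n) < (0.20 + slope * decile)

df = pd.DataFrame({"sector": sectors, "decile": decile, "violation": violation})
df["band"] = np.where(df["decile"] <= 3, "low (1-3)",
             np.where(df["decile"] >= 9, "high (9-10)", "mid"))

# Violation rate per sector in the low and high complexity bands.
table = (df[df["band"] != "mid"]
         .groupby(["sector", "band"])["violation"].mean()
         .unstack("band"))
```

Comparing the low and high columns per sector reproduces the kind of low-versus-high contrast shown in the figure.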

Paired bar chart showing violation rates in low and high complexity bands for manufacturing, waste and support services, and government facilities
Violation rate in low-complexity deciles 1-3 versus high-complexity deciles 9-10 for selected NAICS sectors. Source: RCRAReady analysis.

Interpretation for compliance programs

The practical takeaway is that generator status is a regulatory category, not an operational map. Complex facilities often need stronger systems precisely because their operations demand more coordination across waste streams, containers, manifests, vendors, and receiving facilities.

A compliance program has to manage the real workflow: containers, accumulation dates, labels, waste profiles, manifests, return copies, transporters, TSDFs, corrections, rejections, and record retention. As those moving parts multiply, spreadsheets and disconnected binders have less margin for drift.

Methodology

This analysis used EPA ECHO RCRAInfo bulk data and the EPA HWIP e-Manifest CSV extract. The underlying RCRAInfo evaluation history spans records from 1980 through March 2026, with next-evaluation labels observed through May 2026; e-Manifest-derived complexity features provide the modern operational layer. The local warehouse included 14,095,134 manifest records, 33,960,636 waste-line records, 124,143,550 waste-code records, 22,721,864 transporter records, and a derived feature matrix of 2,424,253 evaluation events covering 17,331 manifesting facilities.

The unit of analysis was a RCRA evaluation event for a manifesting facility. The label was whether the next eligible RCRA evaluation found a violation. Manifest-derived features used a 730-day lookback before the evaluation date, and all feature windows used only records available before the evaluation date to reduce leakage. Descriptive decile charts use the full feature matrix. Model comparisons used a time-based 2022 holdout, with a capped training sample for practical runtime.
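The leak-avoidance rule above, using only records dated strictly before each evaluation within a 730-day window, can be sketched as follows. Table layouts, column names, and dates are hypothetical; only the windowing logic reflects the methodology.

```python
import pandas as pd

# Toy manifests and evaluation events for one facility (dates illustrative).
manifests = pd.DataFrame({
    "facility_id": ["F1"] * 5,
    "shipped_date": pd.to_datetime(
        ["2020-01-15", "2020-07-01", "2021-03-10", "2021-12-20", "2022-05-05"]),
})
evaluations = pd.DataFrame({
    "facility_id": ["F1", "F1"],
    "eval_date": pd.to_datetime(["2021-06-01", "2022-06-01"]),
})

def lookback_manifest_count(evals, manifests, days=730):
    """Count manifests in the window strictly before each evaluation date."""
    counts = []
    for _, ev in evals.iterrows():
        m = manifests[manifests["facility_id"] == ev["facility_id"]]
        start = ev["eval_date"] - pd.Timedelta(days=days)
        in_window = ((m["shipped_date"] >= start)
                     & (m["shipped_date"] < ev["eval_date"]))
        counts.append(int(in_window.sum()))
    return counts

print(lookback_manifest_count(evaluations, manifests))  # [3, 4]
```

The strict `<` comparison against the evaluation date is what keeps same-day and later records out of the feature window.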

The model set compared baseline controls, history controls, an interpretable penalized logistic model with complexity features, and a nonlinear gradient-boosting benchmark. Reported diagnostics include ROC AUC, average precision, top-decile lift, and a permuted-label leakage check.
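A permuted-label leakage check works by shuffling the labels and re-scoring: if AUC stays near 0.5 on permuted labels, the pipeline is not leaking outcome information into the features. A minimal numpy-only sketch, with a rank-based AUC and synthetic scores (all names and data hypothetical):

```python
import numpy as np

def roc_auc(labels: np.ndarray, scores: np.ndarray) -> float:
    """Rank-based AUC: probability a positive outranks a negative."""
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels.astype(bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Synthetic scores with real signal in the labels.
rng = np.random.default_rng(4)
n = 5000
scores = rng.random(n)
labels = (rng.random(n) < (0.1 + 0.4 * scores)).astype(int)

real_auc = roc_auc(labels, scores)                      # well above 0.5
permuted_auc = roc_auc(rng.permutation(labels), scores)  # near 0.5
```

A permuted AUC materially above 0.5 would signal leakage somewhere in the feature or evaluation pipeline.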

Source data: EPA ECHO RCRAInfo downloads and EPA HWIP public data downloads. Supporting files include the reproducibility summary, model metrics, complexity decile inputs, and selected sector inputs used to generate the published figures.

Key limitations: this is correlational, not causal. State reporting completeness varies. Inspections are not random. Facility activity, sector mix, regulator focus, and prior compliance history affect the measured relationship. Optional TRI/FRS enrichment was not included in the primary analysis because the core RCRAInfo and e-Manifest join was sufficient for this version of the report.

If this report is useful, the most relevant follow-on resources are the Audit Readiness Assessment, the Generator Status Classifier, and our guide to SQG vs LQG requirements.