How I Crashed Our Network During a Stress Test (And What I Learned)

Cristina De Luca - December 05, 2025

It was 2:47 AM on a Saturday when I realized I’d made a terrible mistake. Our company’s entire network—supporting 1,200 employees across three offices—was completely offline. And it was entirely my fault.

I’d been running what I thought was a carefully planned network stress test. I’d scheduled it for the weekend, notified management, and prepared my testing tools. What I hadn’t prepared for was discovering our network’s breaking point by actually breaking it. The next four hours taught me more about network stress testing than the previous five years of my career combined.

This is the story of how one overconfident network engineer learned to stress test networks properly—through spectacular failure, emergency troubleshooting at 3 AM, and eventually building a testing methodology that actually works.

The Challenge: Proving We Could Handle Growth

Our company was planning a major expansion—adding 400 new employees over six months. My manager asked a simple question: “Can our network handle it?”

I’d been managing our network infrastructure for three years. We had a solid setup: dual redundant firewalls, gigabit switches throughout, and a 500 Mbps fiber WAN connection. On paper, we had plenty of capacity. Our peak usage barely touched 200 Mbps, and our firewalls typically ran at 30-40% CPU.

“Absolutely,” I told him confidently. “We’ve got headroom for days.”

But he wanted proof. “Run some stress tests this weekend. Show me the numbers.”

I’d never actually stress tested a production network before. I’d used iperf3 in lab environments during training, and I understood the theory. How hard could it be? Generate some traffic, watch the graphs, document the results. Easy.

That confidence would come back to haunt me.

My Biggest Mistake: Assuming “More Traffic” Meant “Better Test”

I spent Friday afternoon preparing. I installed iperf3 on two servers—one in our main office, one in our datacenter. I configured our network monitoring tools to capture detailed metrics. I even created a spreadsheet to document results.

My plan was simple: start with 100 Mbps of test traffic, then increase in 100 Mbps increments until I found our limit. I’d run each test for 60 seconds, check the graphs, and move to the next level.

Saturday at 2:00 AM, I started testing. The first few tests went perfectly:

  • 100 Mbps: Everything normal, latency stable at 4-5ms
  • 200 Mbps: Still good, matching our peak production load
  • 300 Mbps: Slight latency increase to 8ms, but acceptable
  • 400 Mbps: Latency jumped to 15ms, but no packet loss

Then I made the critical error. Instead of continuing with 500 Mbps, I thought, “Let’s really push it and see what happens.” I configured iperf3 to generate 1 Gbps—the maximum our network interfaces could theoretically handle.

Within 15 seconds, everything went dark.

My monitoring dashboard showed red across every metric. The iperf3 test had frozen. I couldn’t ping anything. The VPN connection I was using to access the office network had dropped.

I’d completely crashed our primary firewall.

What I Learned About Network Stress Testing (The Hard Way)

Sitting in my home office at 3 AM, locked out of the network I’d just destroyed, I learned several critical lessons very quickly.

Lesson 1: Firewalls fail at connection limits, not bandwidth limits

When I finally drove to the office and accessed the firewall console directly, I discovered the problem. The firewall hadn’t crashed from bandwidth overload—it had exhausted its connection tracking table.

iperf3 at 1 Gbps had opened thousands of TCP connections simultaneously. Our firewall, configured to track every connection for security logging, ran out of memory and crashed. The bandwidth usage had only reached 680 Mbps when it failed.

I’d been testing the wrong thing. I thought network capacity meant bandwidth, but our actual bottleneck was the firewall’s ability to track connections—something I’d never even considered monitoring.
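
Connection tracking is easy to watch once you know it matters. The firewall in this story was a commercial appliance with its own console, but as an illustration, here’s a minimal Python sketch of the same idea on a Linux box running netfilter’s conntrack (the /proc paths are standard there; the 5-second poll interval is arbitrary):

import time

def conntrack_usage() -> tuple[int, int]:
    """Return (current, maximum) connection-tracking entries from the kernel."""
    with open("/proc/sys/net/netfilter/nf_conntrack_count") as f:
        current = int(f.read())
    with open("/proc/sys/net/netfilter/nf_conntrack_max") as f:
        maximum = int(f.read())
    return current, maximum

while True:
    current, maximum = conntrack_usage()
    print(f"conntrack entries: {current}/{maximum} ({100 * current / maximum:.1f}% used)")
    time.sleep(5)  # keep polling while a test is running

If that percentage climbs steadily during a test, you’re heading for the same wall I hit.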

Lesson 2: “Maintenance window” doesn’t mean “consequence-free”

I’d scheduled testing for Saturday at 2 AM, assuming nobody would be affected. What I didn’t know: our automated backup systems ran from 2-6 AM, transferring data between offices and to our cloud backup provider.

When the network crashed, those backups failed. Monday morning, I discovered we’d missed our backup window for the first time in two years. Our backup administrator was not pleased.

Even “off-hours” testing has consequences. I should have checked what systems ran during my testing window.

Lesson 3: You need an emergency stop plan before you start

When the network crashed, I panicked. I didn’t know if stopping the iperf3 test would help (it wouldn’t—the firewall had already crashed). I didn’t have console access configured for remote management. I didn’t even have a checklist of recovery steps.

I ended up driving 30 minutes to the office at 3 AM to physically reboot the firewall. The entire outage lasted four hours—not because the fix was complicated, but because I hadn’t planned for failure.

Lesson 4: Gradual increases reveal more than maximum load tests

After recovering the network, I spent Sunday analyzing what happened. I realized my testing methodology was fundamentally flawed.

I’d been trying to find the absolute breaking point—the maximum load our network could handle. But what I actually needed to know was when performance started degrading. That happens well before catastrophic failure.

Looking at my monitoring data, I could see warning signs I’d ignored:

  • At 300 Mbps, firewall CPU jumped from 35% to 58%
  • At 400 Mbps, connection table usage increased from 15,000 to 47,000 connections
  • Latency had doubled from baseline before I even reached 500 Mbps

The network was telling me it was struggling. I just wasn’t listening.

What Actually Worked: Building a Real Testing Methodology

After my spectacular failure, I spent two weeks researching proper stress testing methodologies. I talked to colleagues, read network engineering forums on Reddit, and studied how enterprise networks actually conduct capacity testing.

Here’s the methodology I developed—tested successfully over the past 18 months:

Start with baseline documentation

Before any stress test, I now spend a full week documenting normal network behavior. I track bandwidth usage, latency, packet loss, firewall CPU, connection counts, and memory utilization during typical operations.

This baseline tells me what “healthy” looks like. During stress tests, I watch for deviations from baseline—not just absolute limits.
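
Most of those numbers come straight out of our monitoring platform, but even without one you can collect a usable baseline with a small script. Here’s a minimal sketch, assuming a Linux host near the WAN edge; the interface name, the ping target, and the one-minute sample interval are placeholders:

import csv
import re
import subprocess
import time

IFACE = "eth0"         # placeholder: interface facing the WAN
TARGET = "10.0.20.1"   # placeholder: a device on the far side worth pinging

def tx_bytes() -> int:
    """Total bytes transmitted on the interface since boot."""
    with open(f"/sys/class/net/{IFACE}/statistics/tx_bytes") as f:
        return int(f.read())

def avg_latency_ms() -> float:
    """Average round-trip time, parsed from Linux ping's rtt summary line."""
    out = subprocess.run(["ping", "-c", "5", "-q", TARGET],
                         capture_output=True, text=True).stdout
    match = re.search(r"= [\d.]+/([\d.]+)/", out)
    return float(match.group(1)) if match else float("nan")

with open("baseline.csv", "a", newline="") as f:
    writer = csv.writer(f)
    previous = tx_bytes()
    while True:
        time.sleep(60)                              # one sample per minute
        current = tx_bytes()
        mbps = (current - previous) * 8 / 60 / 1e6  # average Mbps over the interval
        previous = current
        writer.writerow([time.strftime("%Y-%m-%d %H:%M"),
                         round(mbps, 1), round(avg_latency_ms(), 1)])
        f.flush()

A week of samples like these is enough to see the daily rhythm and put real numbers on “normal.”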

Test individual components separately

Instead of testing the entire network at once, I now isolate components:

  • WAN link bandwidth capacity (bypassing firewall)
  • Firewall throughput and connection tracking
  • Switch backplane capacity
  • Individual application performance

This reveals which component fails first. In our case, the firewall connection tracking limit (65,000 concurrent connections) was our bottleneck—not our 500 Mbps WAN link.

Increase load gradually with monitoring checkpoints

I now use 10% increments with 5-minute monitoring periods between tests:

Test 1: 50 Mbps (10% of capacity) - Monitor 5 minutes
Test 2: 100 Mbps (20% of capacity) - Monitor 5 minutes
Test 3: 150 Mbps (30% of capacity) - Monitor 5 minutes
...continue until degradation appears

This approach reveals the point where performance starts degrading—usually 60-70% of theoretical maximum capacity.
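
As a sketch of what that loop looks like in practice, here’s a condensed Python version driving iperf3. The server address and the 500 Mbps rated capacity are placeholders, and the real checkpoints also meant reviewing firewall CPU, connection counts, and latency in the monitoring dashboards, which this sketch doesn’t automate:

import json
import subprocess
import time

SERVER = "10.0.20.5"    # placeholder: iperf3 server in the datacenter (run iperf3 -s there)
CAPACITY_MBPS = 500     # rated WAN capacity

def run_iperf3(target_mbps: int, seconds: int = 60) -> float:
    """Run one iperf3 client test at a target bitrate and return the achieved Mbps."""
    out = subprocess.run(
        ["iperf3", "-c", SERVER, "-b", f"{target_mbps}M", "-t", str(seconds), "-J"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)["end"]["sum_received"]["bits_per_second"] / 1e6

for step in range(1, 11):                    # 10% increments of rated capacity
    target = CAPACITY_MBPS * step // 10
    achieved = run_iperf3(target)
    print(f"requested {target} Mbps, achieved {achieved:.0f} Mbps")
    if achieved < 0.9 * target:              # something in the path can't keep up
        print("Throughput is falling short of the target - stopping the ramp here")
        break
    time.sleep(300)                          # 5-minute checkpoint before the next step

The goal is the knee of the curve, not the cliff: stop the ramp the moment any monitored metric starts drifting away from baseline.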

Use realistic traffic patterns

I learned about Open Traffic Generator from Reddit discussions. Instead of just blasting bandwidth with iperf3, I now capture actual traffic patterns during peak usage and replay them at higher volumes.

This revealed issues that pure bandwidth tests missed. Our VoIP traffic, for example, degraded at 320 Mbps even though we had bandwidth available—a QoS configuration problem that bandwidth tests alone wouldn’t have caught.
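
Open Traffic Generator tooling handles this properly, but even a rough do-it-yourself capture-and-replay gets you most of the benefit. Here’s a sketch using tcpdump and tcpreplay from Python, with placeholder interface and file names and a 2x rate multiplier; both commands need root, and replays belong on an isolated test segment, not the production LAN:

import subprocess

IFACE = "eth0"              # placeholder: interface to capture from and replay onto
PCAP = "peak_hour.pcap"

# Capture five minutes of peak-hour traffic (timeout ends tcpdump after 300 seconds).
subprocess.run(["timeout", "300", "tcpdump", "-i", IFACE, "-w", PCAP], check=False)

# Later, in the test window: replay the capture at twice its original rate, three times over.
subprocess.run(["tcpreplay", "--intf1=" + IFACE, "--multiplier=2.0", "--loop=3", PCAP],
               check=True)

Replayed traffic keeps the real mix of small VoIP packets, bursts, and long-lived flows, which is exactly what pure bandwidth tests flatten out.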

Always have a kill switch ready

I now configure hard bandwidth limits in all testing tools and set up automated monitoring alerts that notify me immediately if production traffic is affected. I also have documented emergency stop procedures—including remote console access to all critical devices.

Most importantly, I test the emergency stop procedure before running actual stress tests.
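
The shape of my kill switch is simple: a watchdog probes a production address while the load generator runs, and terminates the test the moment latency crosses a limit. Here’s a minimal sketch with placeholder addresses and thresholds, assuming Linux ping output:

import subprocess
import time

PROBE_HOST = "10.0.10.1"    # placeholder: a production gateway that must stay responsive
LATENCY_LIMIT_MS = 20       # abort threshold, chosen from the baseline

def avg_latency_ms(host: str) -> float:
    """Average round-trip time; treat an unreachable host as infinite latency."""
    out = subprocess.run(["ping", "-c", "3", "-q", "-W", "1", host],
                         capture_output=True, text=True).stdout
    try:
        return float(out.split("= ")[-1].split("/")[1])  # avg field of the rtt summary
    except (IndexError, ValueError):
        return float("inf")

# Start the load generator in the background with a hard bandwidth cap and a time limit.
load = subprocess.Popen(["iperf3", "-c", "10.0.20.5", "-b", "300M", "-t", "600"])

while load.poll() is None:                   # watchdog loop while the test is running
    if avg_latency_ms(PROBE_HOST) > LATENCY_LIMIT_MS:
        load.terminate()                     # the kill switch: stop generating load now
        print("Production latency exceeded the limit - test aborted")
        break
    time.sleep(5)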

Lessons Learned: What I’d Tell My Past Self

If I could go back to that Friday afternoon before my disastrous test, here’s what I’d say:

Your network’s limit isn’t what you think it is. Bandwidth is just one constraint. Connection tracking, CPU capacity, buffer sizes, and QoS policies all create bottlenecks that pure bandwidth tests won’t reveal.

Failure during testing is valuable—if you’re prepared for it. Discovering that our firewall crashes at 65,000 connections was important information. Discovering it by crashing production at 3 AM was stupid. Test in isolated environments first, or have robust recovery procedures ready.

Start small and increase gradually. There’s no prize for finding the absolute maximum capacity in a single test. Gradual increases reveal performance degradation patterns that matter more than catastrophic failure points.

Monitor everything, not just bandwidth. CPU utilization, memory usage, connection counts, buffer statistics, and error counters tell you what’s actually happening. Bandwidth graphs only show one dimension of network performance.

Learn from the community. The network engineering community on Reddit and professional forums has solved these problems before. I wasted weeks figuring out testing methodologies that others had already documented and shared.

Your Action Plan: How to Stress Test Without Crashing Everything

Based on my hard-won experience, here’s how to stress test your network properly:

Step 1: Document your baseline (1 week)

Capture normal network behavior during typical operations. Record bandwidth usage, latency, device CPU/memory, and connection counts. This becomes your reference point for identifying degradation during tests.

Step 2: Test in isolation first (if possible)

If you have a test environment that mirrors production, test there first. If not, test individual network segments separately before testing end-to-end paths.

Step 3: Start at 25% of expected capacity

Begin with conservative load levels. Verify your monitoring works, confirm your emergency stop procedures function, and ensure production traffic isn’t affected.

Step 4: Increase gradually with monitoring breaks

Use 10-15% increments with 5-minute monitoring periods between tests. Watch for early warning signs: CPU increases, latency growth, connection count spikes.

Step 5: Stop at degradation, not failure

When performance metrics deviate significantly from baseline (latency doubles, CPU exceeds 70%, packet loss appears), you’ve found your practical capacity limit. Don’t push to catastrophic failure.
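
Expressed as a check, the stop rule from this step is just a comparison against the Step 1 baseline. A tiny sketch, where the metric names are placeholders for whatever your monitoring exports:

def should_stop(baseline: dict, current: dict) -> bool:
    """Stop the test as soon as any of the Step 5 criteria trips."""
    return (current["latency_ms"] > 2 * baseline["latency_ms"]   # latency has doubled
            or current["cpu_pct"] > 70                           # device CPU above 70%
            or current["packet_loss_pct"] > 0)                   # any packet loss at all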

For comprehensive ongoing visibility, consider integrated solutions like PRTG Network Monitor that combine stress testing capabilities with continuous monitoring. This catches performance degradation as it develops rather than during quarterly manual tests.

The Results: What Changed After I Fixed My Approach

Using my new methodology, I successfully validated our network capacity for the planned expansion. The results were eye-opening:

What I discovered:

  • Our WAN link could handle 480 Mbps sustained (96% of rated capacity)
  • Our firewall became the bottleneck at 350 Mbps due to connection tracking limits
  • VoIP quality degraded at 320 Mbps due to misconfigured QoS policies
  • Our practical safe operating capacity was 300 Mbps—not the 500 Mbps I’d assumed

What we fixed:

  • Upgraded firewall to model with 250,000 connection tracking capacity
  • Reconfigured QoS policies to properly prioritize VoIP traffic
  • Implemented bandwidth monitoring tools with alerts at 250 Mbps (just over 80% of our practical capacity)
  • Created documented stress testing procedures for future capacity validation

The outcome:

We successfully onboarded 400 new employees over six months with zero network-related incidents. Our monitoring showed peak usage reached 380 Mbps—which would have crashed our old firewall but ran smoothly on the upgraded infrastructure.

More importantly, I learned that network stress testing isn’t about proving everything works—it’s about discovering what breaks and fixing it before users notice. That 3 AM disaster taught me more than any certification course ever could.