EdgeRouter 4: routing, VLANs and banging one’s head against the wall

I spent most of my Labour Day trying to accomplish two tasks with an EdgeRouter 4 and the other miscellaneous networking gear in the house: setting up a simple VLAN and getting my backup DSL connection working.

Two WANs and a LAN

With two WAN connections (one DHCP/cable, one PPPoE/DSL), I wanted specific local network ranges to send traffic out through (and receive forwarded traffic from) a specific WAN connection. Note that this isn’t quite the load balancing feature (which I don’t want), but more “IP range A uses cable, IP range B uses DSL”. I went through the gauntlet of EdgeRouter support articles and forum posts without much success.

I haven’t yet solved the problem, but I believe the issue is related to the PPPoE connection not injecting default routes into the main table (hence the need for policy-based routing), plus my second SNAT rule didn’t seem to match traffic. The PPPoE connection has a very volatile dynamic IP address, so a source NAT rule that translates to a fixed address, rather than masquerading, wouldn’t work.
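For reference, a minimal sketch of source-based policy routing on EdgeOS might look like the following. It is untested here and rests on several assumptions: the DSL link is pppoe0, the DSL-bound hosts live in 192.168.16.0/24 on eth1 VLAN 16, and the table number (16), the DSL_ROUTE rule set name and the NAT rule number (5016) are arbitrary placeholders.

```
configure

# Table 16 holds only a default route out the PPPoE interface
set protocols static table 16 interface-route 0.0.0.0/0 next-hop-interface pppoe0

# Keep traffic between local networks in the main table
set firewall modify DSL_ROUTE rule 5 action modify
set firewall modify DSL_ROUTE rule 5 destination address 192.168.0.0/16
set firewall modify DSL_ROUTE rule 5 modify table main

# Everything else sourced from the DSL range uses table 16
set firewall modify DSL_ROUTE rule 10 action modify
set firewall modify DSL_ROUTE rule 10 source address 192.168.16.0/24
set firewall modify DSL_ROUTE rule 10 modify table 16

# Apply the rule set to traffic entering from VLAN 16
set interfaces ethernet eth1 vif 16 firewall in modify DSL_ROUTE

# Masquerade instead of translating to a fixed address, since the PPPoE address changes
set service nat rule 5016 outbound-interface pppoe0
set service nat rule 5016 type masquerade

commit
save
```

The point of the separate table is that it does not matter whether the PPPoE default route ever lands in the main table; whether this actually fixes the behaviour described above is still an open question.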

In any event, I’m sure this will be another weekend problem, but it was compounded by…

Why can’t I ping hosts on the VLAN?

Using some details from the “Router on a Stick” configuration, I wanted to split out hosts that would be on the DSL network from the cable network. I added a new VLAN (16) to eth1, stood up a DHCP server in the appropriate IP block, and configured /etc/network/interfaces on my Ubuntu 16.04 box using approximately these instructions from Debian and microHOWTO. The system got a lease in the correct range, but hosts on VLAN 1 (192.168.1.0/24) were unable to ping or access the server in VLAN 16 (192.168.16.0/24).
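For the tagged case, a VLAN sub-interface stanza in /etc/network/interfaces looks roughly like this. It is a sketch rather than the exact config used here: it assumes the parent NIC is eno2, that the vlan package is installed so ifupdown understands vlan-raw-device, and that the address comes from DHCP.

```
# /etc/network/interfaces (fragment)
# Tagged sub-interface for VLAN 16 on top of eno2 (interface name assumed)
auto eno2.16
iface eno2.16 inet dhcp
    vlan-raw-device eno2
```

With this in place, the switch port has to carry VLAN 16 tagged; if the port is instead untagged/PVID 16, the plain eno2 interface is configured and no sub-interface is needed.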

I went through a large number of troubleshooting steps, including:

  • Can I ping from VLAN 16 to VLAN 1?
    • Yes, but the server still had an interface on VLAN 1, so this wasn’t really a valid test.
  • Can I ping the router IP address?
    • Yes, clients from VLAN 1 could ping 192.168.16.1, which is the EdgeRouter IP on VLAN 16.
  • What does tcpdump say?
    • The Linux box on VLAN 16 was getting ping packets, but not replying to them.
  • Are there firewall rules on the EdgeRouter that might be preventing VLAN-to-VLAN traffic? 
    • The default action seems to be “accept”, but adding explicit accept rules with logging enabled only showed the inbound traffic.
  • Is the switch not permitting VLAN traffic?
    • The Cisco SG500-52P purchased as surplus gear has the most awful web interface. I tried changing the port mode from “Trunk” to “General” and back again, specifically setting the port for the server as untagged/PVID 16 and then updating the config on the Linux box to avoid tagging the VLAN – no change. I also took the opportunity to upgrade the firmware.
  • Is the EdgeRouter somehow not permitting the reply ICMP traffic at a lower level that I can’t easily see?
    • At this point I busted out the old pfSense box and hooked it into an EdgeSwitch Lite, configured VLANs and firewall settings correctly there and tried to ping the server on VLAN 16 from another system. No change.

At this point I had changed out all components in the equation except for the server, so after dinner I poked around with a few more settings in the switch and then tried a different scenario:

  • Using a “known good” Netgear GS742 switch that wasn’t connected to the rest of the network, I configured port 3 with VLAN 16, untagged/PVID
  • A Windows desktop computer was connected to port 1 with VLAN 1 untagged
  • A macOS laptop was connected to port 3
  • The pfSense box was connected to port 24 and offered DHCP on VLANs 1, 16 and 32

When all components were connected, the desktop on VLAN 1 at 192.168.1.101 was able to ping the laptop on VLAN 16 at 192.168.16.101 successfully.

The next test was to move the laptop downstairs, plug it into the Cisco SG500-52P, and assign the port VLAN membership as 16, untagged, PVID. The laptop picked up a DHCP lease from the EdgeRouter, and a system on VLAN 1 elsewhere on the network was able to ping the laptop on VLAN 16!

Investigating the server

At this point, the trouble seemed to lie with the server itself. After some Googling, I ran across an Ubuntu Forums post that talked about VLAN routing issues – the last post suggested checking the rp_filter setting with the following command:

sysctl -a | grep \.rp_filter

The setting is described in sysctl.conf as:

# Uncomment the next two lines to enable Spoof protection (reverse-path filter)
# Turn on Source Address Verification in all interfaces to
# prevent some spoofing attacks
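
In practice, strict reverse-path filtering drops a packet when the route back to its source address points out a different interface than the one the packet arrived on. The toy model below (not kernel code; the function name rp_check is made up for illustration) shows the three modes: 0 is off, 1 is strict, 2 is loose. It assumes the server's VLAN 1 interface is eno1 and its VLAN 16 interface is eno2, which would explain why pings from 192.168.1.0/24 showed up in tcpdump but were never answered.

```shell
# Toy model of the reverse-path check, for illustration only.
# $1 = rp_filter mode (0 off, 1 strict, 2 loose)
# $2 = interface the packet arrived on
# $3 = interface the route back to the source uses (empty if no route)
rp_check() {
    mode=$1 in_if=$2 back_if=$3
    case "$mode" in
        0) echo accept ;;
        1) if [ "$in_if" = "$back_if" ]; then echo accept; else echo drop; fi ;;
        2) if [ -n "$back_if" ]; then echo accept; else echo drop; fi ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

# A ping from 192.168.1.101 arrives on eno2 (VLAN 16), but the server's
# route back to 192.168.1.0/24 is its directly connected eno1 (VLAN 1).
rp_check 1 eno2 eno1   # strict
rp_check 2 eno2 eno1   # loose
rp_check 0 eno2 eno1   # off
```

In this model the strict case is dropped while the loose and off cases are accepted, so loose mode (2) may be a middle ground worth trying instead of turning the filter off entirely.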

On my IBM x3650 server with a large number of interfaces, it turns out rp_filter was enabled in the “all”, “default” and “eno2” categories (among others):

net.ipv4.conf.all.arp_filter = 0
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.br-0fce6441466a.arp_filter = 0
net.ipv4.conf.br-0fce6441466a.rp_filter = 1
net.ipv4.conf.br-f02b395ad2f3.arp_filter = 0
net.ipv4.conf.br-f02b395ad2f3.rp_filter = 1
net.ipv4.conf.default.arp_filter = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.docker0.arp_filter = 0
net.ipv4.conf.docker0.rp_filter = 1
net.ipv4.conf.eno1.arp_filter = 0
net.ipv4.conf.eno1.rp_filter = 1
net.ipv4.conf.eno2.arp_filter = 0
net.ipv4.conf.eno2.rp_filter = 1
...

I made the following adjustments to /etc/sysctl.conf, then ran sysctl -p:

net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.all.rp_filter=0

I then manually made the same adjustment for the eno2 interface (the write has to happen as root, hence sudo tee):

echo 0 | sudo tee /proc/sys/net/ipv4/conf/eno2/rp_filter

After this command was run, I was able to successfully ping the server’s IP address in VLAN 16 from a desktop in VLAN 1.
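
Because the kernel applies the higher of conf/all/rp_filter and the per-interface value when validating packets on an interface, it can help to look at the effective value rather than at each key on its own. The helper below is a rough sketch: the function name effective_rp_filter is made up, and the optional directory argument exists only so it can be pointed at a copy of the tree. By default it reads /proc/sys/net/ipv4/conf.

```shell
# Print the configured and effective rp_filter value for each interface.
# Effective value = the higher of conf/all and conf/<interface>.
# $1 = conf directory (optional, defaults to the live /proc tree)
effective_rp_filter() {
    conf_dir=${1:-/proc/sys/net/ipv4/conf}
    all=$(cat "$conf_dir/all/rp_filter")
    for dir in "$conf_dir"/*/; do
        iface=$(basename "$dir")
        # "all" and "default" are not real interfaces
        case "$iface" in all|default) continue ;; esac
        [ -r "$dir/rp_filter" ] || continue
        own=$(cat "$dir/rp_filter")
        if [ "$own" -gt "$all" ]; then eff=$own; else eff=$all; fi
        printf '%s own=%s effective=%s\n' "$iface" "$own" "$eff"
    done
}
```

Running effective_rp_filter with no arguments prints one line per interface. An interface set to 0 still reports effective=1 while conf/all is 1, so lowering only one of the two values would not be enough.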

Follow-up tasks

So that I don’t forget, here are some follow-up tasks that I’d like to finish for this project (in addition to sorting out the PPPoE routing):

  • Do some reading and better understand the rp_filter mechanism. Try firing up a VM or system with only one interface (instead of one on VLAN 1 and one on VLAN 16) to see if this affects the behaviour.
  • Reboot the server in question and see if the rp_filter setting persists on the eno2 interface based on the “conf.default” and “conf.all” settings.
  • Review switch port settings; see if some ports can be changed to “General” from “Trunk”. Consider replacing the switch with something that will cause less irritation.
  • See if merely tagging the port with VLAN 16 (and not setting it as untagged/primary) and configuring an eno2.16 interface still allows traffic to flow.
  • Apply firewall rules on the EdgeRouter (starting from a “deny all” basis) and confirm that only authorized traffic is permitted.
  • Ensure VLAN hardware offload is enabled on the EdgeRouter.
  • Add another VLAN now that the first one was figured out!