Understanding the BGP Table Version – Part 3: Troubleshooting

More on the BGP table version: the least known and least explained BGP value, and one that I rarely ever troubleshoot without.

This is part 3, the final post in the 3-part series “Understanding the BGP Table Version”. If you haven’t already read parts 1 & 2, I strongly recommend reading them before this post.

Understanding the BGP Table Version – Part 1: Introduction to BGP Table Version

Understanding the BGP Table Version – Part 2: BGP Table Version in Action

In the opening of Part 1 of this blog series I said:

I cannot honestly imagine troubleshooting BGP without understanding the BGP table version.  I use it all the time.  Sometimes it is just a quick “eyeballing” of it to check to see if all the BGP table versions are in sync…. or if there is work to be done.  I see people “eyeballing” the up/down time for a BGP peer when they are troubleshooting.  And sometimes I see them quickly eyeball the InQ/OutQ columns.  But I rarely ever see anyone using the BGP table version.  And, honestly, I just can’t imagine not eyeballing all 4 of those –  up/down & prefixes learned, InQ, OutQ, BGP table version.  It is these 4 that I use to give me the “whole picture” when I’m troubleshooting.

Before getting into the nitty gritty of using the BGP table version for troubleshooting, let’s take a brief second to look at the four things mentioned above that I always “eyeball” whenever I’m troubleshooting.

  • Up/down & Prefixes Learned
  • InQ
  • OutQ
  • BGP Table Version

Up/Down & Prefixes Learned

One of the first things I usually look at is what I refer to as “Up and Learning”.  Common questions I ask when looking here are

Question: Are all the BGP neighbors up that should be up?

Question: Does the # of prefixes I’ve learned from each neighbor look like about what I expect?

NOTE: For the above two questions… obviously one needs to know what “normal” is in order to be able to differentiate what is a clue from what is just a fact.  Knowledge is key.

network_detective

Question: If I’m troubleshooting something… is there a BGP neighbor in there that has been up for a period of time that looks “off”?  For example… below both BGP peers have only been up for 2 minutes and 20 seconds.  That doesn’t seem very long.  Is this a “clue” to the “who done it” I’m trying to solve?

check1

InQ/OutQ

While troubleshooting and looking at InQ & OutQ I’m mostly just checking whether they are “0”. If they are “0” I just move on from those columns.

InQ is “information” coming in from that neighbor that is sitting in my BGP input queue, be it the initial startup or new best paths that the neighbor feels it needs to inform me about.

OutQ is “new information” that this router needs to send to that neighbor. Be it the initial startup or new best paths.

Don’t be worried if you have initial startup or new best path changes and you don’t happen to ever catch anything in the InQ/OutQ.  🙂
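If you would rather script the “is it 0 or not” check than eyeball it, a minimal Python sketch could look like the one below. It assumes the classic IOS-style neighbor table from “show ip bgp summary” with ten whitespace-separated columns. The sample text is made up for illustration (the AS numbers, message counters, queue depths and prefix counts are invented), and `parse_neighbors` and `busy_queues` are hypothetical helper names.

```python
# Illustrative sample only: AS numbers, counters, queue depths and prefix
# counts are invented. A real router's output may differ in layout.
SAMPLE_SUMMARY = """\
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.100.100.3    4 65003     120     118    39641    0    0 00:02:20       10
20.2.20.20      4 65020     130     125    39041    0   25 00:02:20       12
"""


def parse_neighbors(summary_text):
    """Return one dict per neighbor row of an IOS-style BGP summary."""
    rows = []
    for line in summary_text.splitlines():
        fields = line.split()
        # Assumption: neighbor rows have 10 columns and start with an IPv4 address.
        if len(fields) != 10 or fields[0].count(".") != 3:
            continue
        rows.append({
            "neighbor": fields[0],
            "tbl_ver": int(fields[5]),
            "in_q": int(fields[6]),
            "out_q": int(fields[7]),
            "up_down": fields[8],
            "state_pfx": fields[9],
        })
    return rows


def busy_queues(rows):
    """Neighbors whose InQ or OutQ is not zero."""
    return [r["neighbor"] for r in rows if r["in_q"] or r["out_q"]]


rows = parse_neighbors(SAMPLE_SUMMARY)
print(busy_queues(rows))
```

A single non-zero reading is not a verdict. The point is only to know which neighbors deserve a second and third look.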

check2

If they are NOT zero then what I do next depends on what those values are and a lot of other factors. Hard to give a complete list and cover every issue I’ve ever seen. So let me just summarize some of the common ones you might see.

First let’s deal with the boring ones. 🙂 The ones that have nothing to do, really, with BGP or routing and everything to do with your environment.  These will be CPU, Interface issues, MTU.

  • Router CPU Capabilities – the device’s CPU capabilities might not exactly be a match for the work it is being tasked to do.  Check CPU and memory utilization.
  • Interface Issues – look at the connected interface used to peer.  Make sure it is not oversubscribed and does not have any input/output drops or errors.
  • MTU – yup yup. Check this too while you are at it.

check2

Okay… now that the boring ones are done.  Let’s get to the fun ones.  🙂

  • New Neighbor
    The BGP neighbor in question is new and this router has a LOT (OutQ) to tell that peer about and/or the BGP peer has a LOT (InQ) to tell this router about.  Give it a little time and a few more runs of “show ip bgp summary”. If this is just a new neighbor with a high InQ/OutQ, then once the two are done passing their initial BGP best paths to each other, the InQ/OutQ should quiet back down to 0s.

And now on to my FAVORITE to troubleshoot.

ROUTE CHURN!!

(Yea… I know… easy to say when one doesn’t live in a production environment)  This is also the one where I most heavily rely on the BGP table version and what it is doing.

So for this one I’m “eyeballing” the 3 table versions (the BGP table version, the main routing table version, and the per-neighbor table version).  This tells me what is going on.  As Part 1 and Part 2 mentioned, when the BGP table version increments, that means we have a new best path for something.  Via the BGP table version we can eyeball the rate of change (new best paths) that the router is experiencing.  Again… we do have to know what is normal and whether this rate of change is expected.
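To put a number on “rate of change”, one option is to note the BGP table version twice and divide the difference by the time between the two readings. The Python sketch below does just that. The sample values and the 60-second interval are assumptions for illustration, and `version_rate` is a hypothetical helper name.

```python
def version_rate(first, second):
    """Table-version increments per second between two (seconds, version) samples."""
    (t1, v1), (t2, v2) = first, second
    if t2 <= t1:
        raise ValueError("second sample must be taken after the first")
    return (v2 - v1) / (t2 - t1)


# Hypothetical readings taken 60 seconds apart.
rate = version_rate((0, 1000), (60, 1300))
print(rate)  # 5.0 increments per second
```

Whether 5 per second is a problem depends entirely on what is normal for that router.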

check_final

Ready to play?  🙂


troubleshooting_2

So let’s say you are experiencing some weirdness in your network.  The list goes on and on and on as to the varying things you could be experiencing that could be related to route churn.  So I’m not going to list them all out.

So let’s keep it at – we are trying to solve a problem here. You go into R2 below, run “show ip bgp summary” and do the quick eyeballing.  From what you know as “normal” in your network… the up/down times sync up and the # of prefixes received is also in line with what you expect. But those table versions?

one_with_router

  • BGP Table Version: 39641
  • Main Routing Table Version: 39641
  • Table Version for neighbor 10.100.100.3: 39641
  • Table Version for neighbor 20.2.20.20: 39041

Wow. So neighbor 20.2.20.20 is 600 versions behind (39641 - 39041). With only 2 neighbors in the BGP table it is a little easier to see what is going on.  But say we didn’t easily see it. What would I do next?
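The gap is simply the BGP table version minus each neighbor’s table version. As a small Python sketch using the four values listed above (`version_lag` is a hypothetical helper name):

```python
bgp_table_version = 39641
neighbor_versions = {
    "10.100.100.3": 39641,
    "20.2.20.20": 39041,
}


def version_lag(table_version, per_neighbor):
    """How far each neighbor's table version trails the BGP table version."""
    return {nbr: table_version - ver for nbr, ver in per_neighbor.items()}


lag = version_lag(bgp_table_version, neighbor_versions)
print(lag)  # {'10.100.100.3': 0, '20.2.20.20': 600}
```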

If the BGP table version is incrementing like this… some BGP best path somewhere is changing and then it is updating other routers.  The question becomes… where is the start of the churn?  Is it this router?  Or is this router just an innocent propagator of the route churn someone else is sending it?

As Gil Grissom said in CSI Las Vegas, “Let the evidence guide you.”

When troubleshooting I like approaching it with a methodology much like a detective’s. After all, when troubleshooting in IT, we are all Network Detectives.

The command below can help you see the BGP best path churn. There are two ways to run this command.  Either put in a specific BGP table version number and get everything that is tagged with that version number or above, OR just use “recent” followed by a number.

two
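To make the difference between the two options concrete, here is a rough Python model of them. It is only a sketch of the idea, not of what the router does internally. The prefixes and version tags are invented, and treating “recent N” as “current version minus N, or above” is an assumption about the exact boundary. `tagged_at_or_above` and `recent` are hypothetical helper names.

```python
# Hypothetical prefix -> table-version tags (all values invented).
entries = {
    "192.0.2.0/24": 39641,
    "198.51.100.0/24": 39636,
    "203.0.113.0/24": 39100,
    "10.0.0.0/8": 120,
}
current_version = 39641


def tagged_at_or_above(entries, version):
    """Option 1: everything tagged with a given version number or above."""
    return sorted(p for p, v in entries.items() if v >= version)


def recent(entries, current, offset):
    """Option 2: a window that trails the current version by 'offset'.

    Assumption: the window includes the boundary value itself.
    """
    return tagged_at_or_above(entries, current - offset)


print(tagged_at_or_above(entries, 39000))
print(recent(entries, current_version, 10))
```

The fixed number in option 1 goes stale as the table version climbs, while the window in option 2 moves with it.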

Admittedly, when I’m troubleshooting churn I don’t ever really use the first option. Why? Well… cause there is route churn and I don’t know how often the churn is happening. For example… look below… see the BGP table version is already at 47441?  It has gone up by 7,800 since the 39641 we saw earlier.

What I’m looking for are the most recent BGP best paths and who is the next hop sending them here. Then… follow the breadcrumbs.
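If the list of recent best paths is long, tallying the next hops makes the loudest one stand out. A minimal Python sketch, assuming you have already pulled (prefix, next hop) pairs out of the output. The pairs below are invented, and `top_next_hops` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical (prefix, next_hop) pairs; the prefixes are invented.
recent_changes = [
    ("192.0.2.0/24", "30.3.30.101"),
    ("198.51.100.0/24", "30.3.30.101"),
    ("203.0.113.0/24", "30.3.30.101"),
    ("203.0.113.128/25", "30.3.30.101"),
    ("10.0.0.0/8", "10.100.100.3"),
]


def top_next_hops(changes):
    """Next hops ordered by how many recent best-path changes point at them."""
    return Counter(next_hop for _, next_hop in changes).most_common()


print(top_next_hops(recent_changes))
# [('30.3.30.101', 4), ('10.100.100.3', 1)]
```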

three

So the above is the list, at this specific moment in time, of the 10 most recent BGP best path changes sent to this router from next hop 30.3.30.101. Who is that?

topo

30.3.30.101 is the router in the bottom right, an OSPF neighbor of R3.  Why is something that R3 receives via OSPF showing up in the BGP table on R2?  Because R3 is redistributing OSPF into BGP.

R3_bgp

I call this a “wide open redistribute”.  When you configure this, you are telling your router to keep your BGP table and all your BGP peers informed of every change happening in your IGP.  🙂

… and that is exactly what you are getting here… and this is the source of the route churn. We have found the root cause of the incrementing BGP table version.  It is coming from route churn in the IGP.   Have we found the root cause of the route churn in the IGP?  🙂  No… but those are the next breadcrumbs to follow in your network with whatever routing protocol is your IGP.

As an additional note: when I’m troubleshooting route churn, I often also end up using the following command to see the most recent updates to the RIB.  This helps me see the RIB’s churn and then follow those breadcrumbs.

sir
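The same “show me only the fresh ones” idea can be applied to saved routing table output. The Python sketch below keeps only lines whose age field is under a given number of seconds. It assumes IOS-style route lines with an hh:mm:ss age, and the sample lines are invented. `routes_younger_than` is a hypothetical helper name.

```python
import re

# Matches an hh:mm:ss age such as 00:00:04. Older routes shown in other
# formats (for example 2d03h) do not match and are skipped.
AGE = re.compile(r"\b(\d{2}):(\d{2}):(\d{2})\b")


def routes_younger_than(lines, max_seconds):
    """Return the route lines whose age is below max_seconds."""
    hits = []
    for line in lines:
        match = AGE.search(line)
        if not match:
            continue
        hours, minutes, seconds = map(int, match.groups())
        if hours * 3600 + minutes * 60 + seconds < max_seconds:
            hits.append(line)
    return hits


# Invented sample lines in an IOS-like layout.
sample_routes = [
    "O E2  192.0.2.0/24 [110/20] via 30.3.30.101, 00:00:04, GigabitEthernet0/1",
    "B     198.51.100.0/24 [20/0] via 10.100.100.3, 01:12:40",
    "O     203.0.113.0/24 [110/2] via 30.3.30.101, 2d03h, GigabitEthernet0/1",
]

fresh = routes_younger_than(sample_routes, 60)
print(fresh)
```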

How did I cause this?  🙂  I got some toys.  🙂  I was kinda using a traffic generator, a Spirent TestCenter, off of that OSPF router and injecting the churn into it.


If that “Tips from a Network Detective” thing interested you… you can go to that PacketPushers blog series by clicking on the jpeg below.

network_detective



Categories: BGP, Routing, Troubleshooting

9 replies

  1. Hi Denise,

    This is great. Too bad, I don’t see a similar command in NXOS. Do you know if something similar can be used on NXOS?

    Amit

  2. I liked your “sh ip bgp recent” and “sh ip route | include 00:00” commands – very useful. Keep up the good work.

  3. I like this post, thank you!
