Bugs

Funny thing here: I only stumbled over issues unrelated to as4, triggered by my test environment (which you find described somewhere in the log of this project). This whole testing effort became more of a "what bugs are still in bgpd?" than an "are we as4 ready?" thing. While that was also a good thing, it cost time I wanted to spend on purely as4 things, sigh.

This page does not list any bug that hampers the AS4 functions: there are no known bugs in the AS4 patch implementation in version 05 (as of 2007-03-14)!

  1. SOLVED, NOT as4 related!!

    In the test setup (see somewhere in the log) the session between the two as4 speakers locked up after a longish time: one peer stayed in "Clearing", the other retried steadily to get a session.

    This happened after a hold time expired. The peer where the hold time expired stayed in "Clearing", received OPEN messages from the other side and rejected them with the notification "refused/cease connect".

    Symptoms: bgp_ignore was called very often, the outgoing queue held 1 packet to send, the log file filled up with bgp_ignore entries, ...

    We never went from "Clearing" to "Idle" and were stuck in Clearing. Hmmm.

    Yep, the bug was known and is not as4 related; it is bug 302 in the quagga bugzilla. It was supposed to be fixed, but was not.

    Fix: The fsm entry for state "Established" on receiving "Hold_Timer_expired" read "{bgp_fsm_holdtime_expire, Clearing}". That was wrong. It must be "{bgp_fsm_holdtime_expire, Established}". Reasoning: In the course of bgp_fsm_holdtime_expire, BGP_Stop is added as an event, which is then supposed to stop the session and enqueue "Clearing_Completed". If we go directly into "Clearing" when a hold time expires, the enqueued BGP_Stop will never cause bgp_stop to be called, because we are already in "Clearing", where every event besides "Clearing_Completed" is ignored. So we stayed stuck in "Clearing". Uuh, Ooh.

    Well, a patch (with another solution) is in cvs now. Thus, newer versions of the AS4 patch do not need any patch for this any more.
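
To make the reasoning above concrete, here is a toy model of that slice of the state machine. It is only a sketch and not bgpd's real table: the state and event names come from the text, while `build_table`, `run`, the Python function names and the assumption that BGP_Stop in "Established" leads to "Clearing" are made up for illustration. Events without a table entry are counted as ignored.

```python
from collections import deque

# Toy model, NOT bgpd's real FSM table.
# Each entry maps (state, event) -> (action, next_state).

def holdtime_expire(queue):
    # As described in the text: the action enqueues BGP_Stop.
    queue.append("BGP_Stop")

def stop(queue):
    # Stopping the session enqueues Clearing_Completed.
    queue.append("Clearing_Completed")

def ignore(queue):
    pass

def build_table(state_after_hold_expiry):
    """Hypothetical helper: build the table with either variant of the entry."""
    return {
        ("Established", "Hold_Timer_expired"): (holdtime_expire, state_after_hold_expiry),
        # Assumption: BGP_Stop in Established moves the peer to Clearing.
        ("Established", "BGP_Stop"): (stop, "Clearing"),
        ("Clearing", "Clearing_Completed"): (ignore, "Idle"),
    }

def run(table, state, first_event):
    """Process events until the queue is empty; return (final state, ignored count)."""
    queue = deque([first_event])
    ignored = 0
    while queue:
        event = queue.popleft()
        entry = table.get((state, event))
        if entry is None:
            ignored += 1          # event has no entry in this state: ignored
            continue
        action, next_state = entry
        action(queue)
        state = next_state
    return state, ignored

buggy = run(build_table("Clearing"), "Established", "Hold_Timer_expired")
fixed = run(build_table("Established"), "Established", "Hold_Timer_expired")
```

With the "Clearing" variant the enqueued BGP_Stop is ignored and the toy peer never leaves "Clearing"; with the "Established" variant it ends up in "Idle".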

  2. circumvented, not directly as4 related

    A session between one of the as4 speakers and one of the ciscos goes down, and stays down. Both sides try to reconnect, but do not succeed.

    This was caused by a malformed aspath going out from the as4 speaker. Synthesized aspaths (i.e. built from an AS4_PATH and an ASPATH attribute) were not normalised, which hit a bug in aspath_put.

    Aspath_put as it is in the code base has all its length-based decisions out of kilter: the length of the aspath attribute has already been put onto the stream by the calling function, but aspath_put may change this predetermined length by either splitting an assegment (thus needing space for one more assegment header) or merging two segments (thus needing one assegment header less). In both situations an aspath is sent to a neighbour with wrong length information, causing the peer to drop the session. Aspath_put also checks whether the stream it writes to has enough space left, but takes no action other than ceasing to write when it does not, which leads to the same situation.

    This is independent of AS4 and should be dealt with at some time: one aspath longer than 255 as numbers will break all quaggas in the world.

    Solution: call stream_resize if the space does not suffice, and patch up the already written length in the caller if it changed. Patching up may also mean moving all written bytes by one byte if the change makes the attribute size cross 255 bytes, turning it from an "expanded" (two-byte length information) attribute into a non-expanded (one-byte length information) attribute or vice versa.

    I circumvented this bug by normalising the synthesized aspaths, which I should have done anyway.
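
A minimal sketch of the idea, assuming nothing about the real aspath_put: it normalises the segments first, encodes the body, and only then writes the attribute header, so the length never has to be patched up afterwards. This is a different approach than the stream_resize/patch-up solution named above. The function names are hypothetical; the segment types, flag values and the one-octet segment count follow the BGP-4 wire format. "Normalise" is taken here to mean merging adjacent AS_SEQUENCE segments and dropping empty ones, which is my assumption.

```python
import struct

AS_SET, AS_SEQUENCE = 1, 2        # path segment types
ATTR_FLAG_TRANSITIVE = 0x40
ATTR_FLAG_EXTLEN = 0x10           # attribute length takes two bytes
ATTR_TYPE_AS_PATH = 2
MAX_SEG_ASNS = 255                # the segment count field is one octet

def normalise(segments):
    """Merge adjacent AS_SEQUENCE segments and drop empty segments."""
    out = []
    for seg_type, asns in segments:
        if not asns:
            continue
        if out and seg_type == AS_SEQUENCE and out[-1][0] == AS_SEQUENCE:
            out[-1] = (AS_SEQUENCE, out[-1][1] + list(asns))
        else:
            out.append((seg_type, list(asns)))
    return out

def encode_segments(segments, asn_size=2):
    """Encode the segments, splitting sequences longer than 255 as numbers."""
    fmt = {2: "!H", 4: "!I"}[asn_size]
    body = bytearray()
    for seg_type, asns in normalise(segments):
        if seg_type == AS_SET and len(asns) > MAX_SEG_ASNS:
            # Splitting a set is not attempted in this sketch.
            raise ValueError("AS_SET too long for one segment")
        for i in range(0, len(asns), MAX_SEG_ASNS):
            chunk = asns[i:i + MAX_SEG_ASNS]
            body += bytes([seg_type, len(chunk)])
            for asn in chunk:
                body += struct.pack(fmt, asn)
    return bytes(body)

def encode_as_path_attr(segments, asn_size=2):
    """Build the complete attribute; the length comes from the finished body."""
    body = encode_segments(segments, asn_size)
    if len(body) > 255:
        header = struct.pack("!BBH", ATTR_FLAG_TRANSITIVE | ATTR_FLAG_EXTLEN,
                             ATTR_TYPE_AS_PATH, len(body))
    else:
        header = struct.pack("!BBB", ATTR_FLAG_TRANSITIVE,
                             ATTR_TYPE_AS_PATH, len(body))
    return header + body

short_attr = encode_as_path_attr([(AS_SEQUENCE, [1, 2]), (AS_SEQUENCE, [3])])
long_attr = encode_as_path_attr([(AS_SEQUENCE, list(range(1, 301)))])
```

In `short_attr` the two sequences are merged into one segment; in `long_attr` the 300 as numbers are split into segments of 255 and 45, and the header uses the two-byte length because the body is longer than 255 bytes.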

  3. not solved, not as4 related, will ignore this one

    A "no router bgp xxx" crashes bgpd (i.e. hits an assert) if there are neighbors defined. There is NO crash if you deconfigure the peers first.

    This also happens in current CVS quagga, so it is not AS4 related. In fact, there it is not an assert but a hard segmentation fault, yuck!

    quagga as4:

    2006/10/31 10:02:28 BGP: 10.1.1.197 rcv OPEN, version 4, remote-as 23456, holdtime 180, id 10.1.1.197
    2006/10/31 10:02:28 BGP: 10.1.1.197 bad OPEN, remote AS is 145.15, expected 0
    2006/10/31 10:02:28 BGP: 10.1.1.197 sending NOTIFICATION 2/2 (OPEN Message Error/Bad Peer AS) 2 bytes 5b a0
    2006/10/31 10:02:28 BGP: 10.1.1.197 send message type 3, length (incl. header) 23
    2006/10/31 10:02:28 BGP: 10.1.1.197 went from OpenSent to Deleted
    2006/10/31 10:02:51 BGP: 10.1.1.197 [FSM] TCP_connection_open (Active->OpenSent)
    2006/10/31 10:02:51 BGP: 10.1.1.197 passive open
    2006/10/31 10:02:51 BGP: 10.1.1.197 went from Active to OpenSent
    2006/10/31 10:02:51 BGP: 10.1.1.197 rcv message type 1, length (excl. header) 34
    2006/10/31 10:02:51 BGP: 10.1.1.197 rcv OPEN w/ OPTION parameter len: 24, peeking for as32
    2006/10/31 10:02:51 BGP: 10.1.1.197 PEEKING: OPEN w/ optional parameter type 2 (Capability) len 6
    2006/10/31 10:02:51 BGP: 10.1.1.197 PEEKING: OPEN w/ optional parameter type 2 (Capability) len 2
    2006/10/31 10:02:51 BGP: 10.1.1.197 PEEKING: OPEN w/ optional parameter type 2 (Capability) len 2
    2006/10/31 10:02:51 BGP: 10.1.1.197 PEEKING: OPEN w/ optional parameter type 2 (Capability) len 6
    2006/10/31 10:02:51 BGP: 10.1.1.197 OPEN peeking found 4BYTEAS capability
    2006/10/31 10:02:51 BGP: 10.1.1.197 rcv OPEN, version 4, remote-as 23456, holdtime 180, id 10.1.1.197
    2006/10/31 10:02:51 BGP: 10.1.1.197 bad OPEN, remote AS is 145.15, expected 0
    2006/10/31 10:02:51 BGP: 10.1.1.197 sending NOTIFICATION 2/2 (OPEN Message Error/Bad Peer AS) 2 bytes 5b a0
    2006/10/31 10:02:51 BGP: 10.1.1.197 send message type 3, length (incl. header) 23
    2006/10/31 10:02:51 BGP: 10.1.1.197 went from OpenSent to Deleted
    2006/10/31 10:03:12 BGP: 10.1.1.197 went from Active to Deleted
    2006/10/31 10:03:12 BGP: Assertion `(node)->data != ((void *)0)' failed in file bgp_route.c, line 1418, function bgp_process_main
    2006/10/31 10:03:12 BGP: Backtrace for 8 stack frames:
    2006/10/31 10:03:12 BGP: [bt 0] /usr/lib/libzebra.so.0(zlog_backtrace+0x1f) [0xb7ed6e98]
    2006/10/31 10:03:12 BGP: [bt 1] /usr/lib/libzebra.so.0(_zlog_assert_failed+0x83) [0xb7ed70b8]
    2006/10/31 10:03:12 BGP: [bt 2] /usr/lib/quagga/bgpd [0x80697e7]
    2006/10/31 10:03:12 BGP: [bt 3] /usr/lib/libzebra.so.0(work_queue_run+0xba) [0xb7edf14e]
    2006/10/31 10:03:12 BGP: [bt 4] /usr/lib/libzebra.so.0(thread_call+0x62) [0xb7ecc65b]
    2006/10/31 10:03:12 BGP: [bt 5] /usr/lib/quagga/bgpd(main+0x273) [0x805bcaf]
    2006/10/31 10:03:12 BGP: [bt 6] /lib/tls/libc.so.6(__libc_start_main+0xc8) [0xb7d3eea8]
    2006/10/31 10:03:12 BGP: [bt 7] /usr/lib/quagga/bgpd [0x805b8e1]
    

    Don't be perturbed by the logged as numbers; some of the log messages were not yet as4 aware when I did this. This is not an as4 related bug. A neighbor going from "OpenSent to Deleted" and then from "Active to Deleted" may also point to a general bug: it looks like something is being deleted twice, and the second time it is not there any more...

    quagga-as2:

    2006/10/29 23:31:45 BGP: Import timer expired.
    2006/10/29 23:32:00 BGP: Import timer expired.
    2006/10/29 23:32:11 BGP: 10.1.1.199 [FSM] Timer (routeadv timer expire)
    2006/10/29 23:32:15 BGP: Import timer expired.
    2006/10/29 23:32:22 BGP: 10.1.1.199 rcv message type 4, length (excl. header) 0
    2006/10/29 23:32:22 BGP: 10.1.1.199 KEEPALIVE rcvd
    BGP: Received signal 11 at 1162161148 (si_addr 0x8, PC 0x8067a6a); aborting...
    Program counter: /usr/lib/quagga/bgpd[0x8067a6a]
    Backtrace for 18 stack frames:
    /usr/lib/libzebra.so.0(zlog_backtrace_sigsafe+0x28)[0xb7f00bbb]
    /usr/lib/libzebra.so.0(zlog_signal+0x230)[0xb7f00b8b]
    /usr/lib/libzebra.so.0[0xb7f083a6]
    /lib/tls/libc.so.6[0xb7d7c9e0]
    /usr/lib/quagga/bgpd(bgp_info_delete+0x15)[0x8067a3d]
    /usr/lib/quagga/bgpd(bgp_static_withdraw+0x88)[0x806bf23]
    /usr/lib/quagga/bgpd(bgp_static_delete+0xa3)[0x806c3b4]
    /usr/lib/quagga/bgpd(bgp_delete+0x16)[0x805e20e]
    /usr/lib/quagga/bgpd[0x80832b1]
    /usr/lib/libzebra.so.0[0xb7ef1c7d]
    /usr/lib/libzebra.so.0(cmd_execute_command+0xb6)[0xb7ef1d3b]
    /usr/lib/libzebra.so.0[0xb7eec3c0]
    /usr/lib/libzebra.so.0[0xb7eed592]
    /usr/lib/libzebra.so.0[0xb7eed936]
    /usr/lib/libzebra.so.0(thread_call+0x62)[0xb7ef665b]
    /usr/lib/quagga/bgpd(main+0x273)[0x805bbdf]
    /lib/tls/libc.so.6(__libc_start_main+0xc8)[0xb7d68ea8]
    /usr/lib/quagga/bgpd[0x805b811]
    
  4. not solved, not as4 related, will ignore this one

    This is a long-standing one which probably many of us know, but we never got around to doing something about it. If a configured logfile hits a system-imposed maximum length and writing is no longer possible, bgpd starts to behave non-deterministically.

    Often, I've seen it simply crash. Here, when testing as4, it kept a peering session in the status "closing" forever. That could also have been the first bug listed here, though.

    IMHO, having a non-writable logfile should have no impact whatsoever on the function of a routing daemon.
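
What "no impact whatsoever" could look like, as a rough sketch: a log sink that swallows write errors, counts the lost messages and optionally echoes them to a fallback stream. This is not how bgpd's logging works; `FailSoftLog` and `FullDisk` are hypothetical names made up for illustration. A real daemon on a POSIX system would presumably also have to deal with SIGXFSZ, which is raised when a file size limit is exceeded; the sketch does not cover that.

```python
import errno
import io

class FailSoftLog:
    """Log sink that never lets a write error reach the caller."""

    def __init__(self, stream, fallback=None):
        self.stream = stream
        self.fallback = fallback    # optional second sink, e.g. sys.stderr
        self.dropped = 0            # messages that did not reach the main stream
        self.broken = False         # set after the first failed write

    def write(self, message):
        """Return True if the message reached the main stream, else False."""
        if not self.broken:
            try:
                self.stream.write(message + "\n")
                self.stream.flush()
                return True
            except (OSError, ValueError):
                # e.g. EFBIG (size limit hit), ENOSPC, or a closed file
                self.broken = True
        self.dropped += 1
        if self.fallback is not None:
            try:
                self.fallback.write(message + "\n")
            except (OSError, ValueError):
                pass                # logging must never take the caller down
        return False

class FullDisk:
    """Stand-in stream whose writes always fail, like a logfile at its limit."""

    def write(self, text):
        raise OSError(errno.EFBIG, "File too large")

    def flush(self):
        pass

if __name__ == "__main__":
    echo = io.StringIO()
    log = FailSoftLog(FullDisk(), fallback=echo)
    log.write("peer went from Established to Clearing")
    print("dropped:", log.dropped, "echoed:", echo.getvalue().strip())
```

The caller only ever sees a boolean, so the routing code can carry on whatever happens to the logfile.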