Donatas Abraitis

Scaling Chef Server

Fri, 26 Mar 2021 00:00:00 +0000

If you search for how to scale Chef Server, you are lucky: there is a SINGLE post on the internet. I appreciate the author of that blog post, which helped me go in the right direction (moving from a standalone instance to multiple).

I’m gonna switch into German mode here and express a big disappointment with the Chef community: it’s horrible. There is a lack of proper documentation, and the authors of Chef suggest either not using Chef (huh?) or paying money. Not helpful at all. Unlike Ansible. But let’s leave this topic untouched here.

Standalone Chef Server served us well for a long time, until we reached more than 700 nodes with heavy searches and non-optimized cookbooks. Scaling vertically was good enough until around 1000 nodes. We went from 4 CPU cores to 8, 16, 32, and then we were done. Chef Server is CPU-intensive; memory does not matter much.

The main CPU-hungry process is erchef, which is responsible for serving API requests for all necessary components: search, cookbooks, nodes, attributes, etc.

Then we started looking at the tuning recommendations from the aforementioned blog post, such as enabling cookbook caching, tuning worker counts, timeouts, etc. Frankly speaking, it doesn’t help at all once you reach the state of 32 cores running at full rate.

That was a reminder that it was time to start scaling horizontally, because one day shit could happen (and, as a rule of thumb, it happens when the service is most needed).

We run chef-client every 30min. + 15min. of splay time.
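To put those timers into numbers, here is a rough Python sketch. It assumes the splay is drawn uniformly between zero and the configured maximum and added to the interval on every run; the node count of 1000 comes from the text above and the function name is mine.

```python
def mean_runs_per_second(nodes: int, interval_min: float, splay_min: float) -> float:
    """Average chef-client runs per second hitting the server.

    Assumes each node waits interval + uniform(0, splay) between runs,
    so the mean period is interval + splay / 2.
    """
    mean_period_s = (interval_min + splay_min / 2) * 60
    return nodes / mean_period_s


rate = mean_runs_per_second(nodes=1000, interval_min=30, splay_min=15)
print(f"{rate:.2f} chef-client runs per second on average")
```

Under these assumptions that is a bit less than one run starting every two seconds, and each run issues many API requests.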

First, eliminate all possible *:*, name:*, fqdn:* or any other wildcard searches in your recipes. These are handled in erchef which, as I said before, is CPU-hungry. It means you fetch all attributes from PostgreSQL (the backend) and parse them in Erlang, instead of fetching them from Solr (already chewed). In our case, queries like fqdn:* ate 16 cores (nice and easy).

The next step was to replace all search() calls with filtered searches. That reduces the load for the erchef process and overall.

One more thing to consider is to review all iterations through data_bag_item() and replace them with search(:something, 'id:*'). Erchef does not like this much, but eventually, it’s good enough.
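To illustrate why filtered searches are cheaper, here is a minimal Python sketch of the projection a filtered search performs: instead of whole node objects, only the requested attribute paths come back. The node data, the project helper and the attribute names are made up for illustration; this models the idea, not erchef’s implementation.

```python
import json


def project(node: dict, wanted: dict) -> dict:
    """Return only the requested attribute paths from a node object.

    `wanted` maps an output key to a path of nested keys, similar in
    spirit to the filter map passed to a filtered Chef search.
    """
    result = {}
    for out_key, path in wanted.items():
        value = node
        for key in path:
            value = value.get(key) if isinstance(value, dict) else None
            if value is None:
                break
        result[out_key] = value
    return result


# Hypothetical node object; real ones carry far more attributes.
node = {
    "name": "web1",
    "fqdn": "web1.example.net",
    "ipaddress": "192.0.2.10",
    "kernel": {
        "release": "5.4.0",
        "modules": {f"mod{i}": {"size": i} for i in range(200)},
    },
}

wanted = {"fqdn": ["fqdn"], "ip": ["ipaddress"], "kernel": ["kernel", "release"]}
slim = project(node, wanted)

print(slim)
print(len(json.dumps(node)), "bytes full vs", len(json.dumps(slim)), "bytes filtered")
```

Multiply the size difference by every node matched and every chef-client run, and it becomes clear where the CPU and bandwidth go.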

Before the optimizations, converge time was around 15-30min., depending on the location. We had geo-distributed nodes with a single Chef Server in Europe, while the nodes are everywhere: US, SG, BR, ID.

After the optimizations, converge time dropped to 5-20min. or so. Well, almost cut in half. Not bad, not good.

But still, there are fqdn:* searches that can’t be replaced due to limitations in our infrastructure. Erchef is still angry about that. We decided to replace our standalone Chef Server with a tiered architecture: one backend and multiple frontend servers, two frontends per location, with anycast for load balancing and high availability.

Back up the standalone Chef Server instance and restore it on the new frontends. That’s it; you only need to change the configs afterward.

This is the graph (CPU drop) after we launched the frontend servers. Erchef is happier than ever before.

chef-client converge time dropped noticeably.

Still, the problem is that the frontends are scaled per region while the backend is single (especially PostgreSQL). From Singapore to the UK or so, latency is around 160-300ms, depending on how the traffic is routed. We had a problem with one of our ISPs in Singapore, and download bandwidth dropped below 1Mbps. And guess what: Chef stopped working completely, and every query to the backend timed out. When this happens, even a 10s timeout is not enough. The biggest problem was (again) the fqdn:* searches, which consume more traffic and cause timeouts. Non-wildcard queries were basically fine if the response size was around 100Kb.

Started thinking quickly about what to do here. Decided to launch pgpool-II instances as a sidecar to Chef frontends and separate memcached instances per region to cache data from pgpool-II.

Cache hit ratio to memcached instances isn’t high, around 10%, but that’s enough to offload wildcard searches.
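The caching layer can be pictured roughly like this. The sketch below is a simplified Python model of a query-result cache with expiry and table-based invalidation, loosely mirroring the memqcache_expire and memqcache_auto_cache_invalidation settings from the config at the end of this post. The class and method names are hypothetical and the backend is a stub; this is not how pgpool-II is implemented.

```python
import hashlib
import time


class QueryCache:
    """Toy model: cache SELECT results, expire by TTL, invalidate on writes."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}   # key -> (expires_at, tables, rows)
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(sql: str) -> str:
        return hashlib.sha256(sql.encode()).hexdigest()

    def select(self, sql: str, tables: set, run_on_backend):
        key = self._key(sql)
        entry = self.entries.get(key)
        if entry is not None and entry[0] > self.clock():
            self.hits += 1
            return entry[2]
        self.misses += 1
        rows = run_on_backend(sql)  # the expensive cross-region round trip
        self.entries[key] = (self.clock() + self.ttl, set(tables), rows)
        return rows

    def write(self, table: str):
        """A write to `table` drops every cached result that read from it."""
        self.entries = {k: e for k, e in self.entries.items() if table not in e[1]}


backend_calls = []


def backend(sql):
    backend_calls.append(sql)
    return [("node1",), ("node2",)]


cache = QueryCache(ttl_seconds=14400)
cache.select("SELECT name FROM nodes", {"nodes"}, backend)   # miss
cache.select("SELECT name FROM nodes", {"nodes"}, backend)   # hit
cache.write("nodes")                                         # invalidates
cache.select("SELECT name FROM nodes", {"nodes"}, backend)   # miss again
print(f"hits={cache.hits} misses={cache.misses} backend_calls={len(backend_calls)}")
```

Every hit is a query that never crosses the ocean, which is why even a low hit ratio pays off when the cached results are the big wildcard ones.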

And now converge time has dropped to 1-5min.

The traffic from the frontends to the backend server dropped by half, from ~500Mbps to ~250Mbps.

FIN

I hope this post will be handy for others facing similar problems.

Bonus - pgpool.conf

listen_addresses = 'localhost'
port = 31337
socket_dir = '/tmp'
listen_backlog_multiplier = 2
pcp_listen_addresses = 'localhost'
pcp_port = 9898
pcp_socket_dir = '/tmp'
backend_hostname0 = 'chef-backend.donatas.net'
backend_port0 = 5432
enable_pool_hba = off
pool_passwd = 'something'
authentication_timeout = 120
num_init_children = 32
max_pool = 16
child_life_time = 900
child_max_connections = 1000
connection_life_time = 600
pid_file_name = '/var/run/pgpool/pgpool.pid'
logdir = '/var/log/pgpool'
connection_cache = on
memory_cache_enabled = on
memqcache_method = 'memcached'
memqcache_memcached_host = 'chef-memcached.donatas.net'
memqcache_memcached_port = 11211
memqcache_total_size = 134217728
memqcache_max_num_cache = 10000000
memqcache_expire = 14400
memqcache_auto_cache_invalidation = on
memqcache_maxcache = 1048576
memqcache_cache_block_size = 2097152
memqcache_oiddir = '/var/log/pgpool/oiddir'
white_memqcache_table_list = ''
black_memqcache_table_list = ''

Bonus - frontend config

nginx['enable_ipv6'] = true
nginx['ssl_certificate'] = '/var/opt/opscode/nginx/ca/donatas.net.crt'
nginx['ssl_certificate_key'] = '/var/opt/opscode/nginx/ca/donatas.net.key'
opscode_erchef['depsolver_worker_count'] = 8
opscode_expander['nodes'] = 8
opscode_erchef['nginx_bookshelf_caching'] = :on
opscode_erchef['s3_url_expiry_window_size'] = '100%'
opscode_erchef['db_pool_queue_max'] = 32
opscode_erchef['db_pooler_timeout'] = 300000
opscode_erchef['depsolver_pool_queue_max'] = 10
opscode_erchef['depsolver_pooler_timeout'] = 300000
opscode_erchef['db_pool_size'] = 16
opscode_erchef['sql_db_timeout'] = 300000
opscode_erchef['authz_timeout'] = 300000
oc_bifrost['db_pooler_timeout'] = 300000
oc_bifrost['db_pool_queue_max'] = 32
oc_bifrost['db_pool_size'] = 16
lb['cache_cookbook_files'] = true
lb['redis_connection_timeout'] = 300000
lb['redis_keepalive_timeout'] = 300000
postgresql['vip'] = '127.0.0.1'
postgresql['port'] = 31337

topology 'tier'

server 'backend1.donatas.net',
  :ipaddress => 'X.X.X.X',
  :role => 'backend',
  :bootstrap => true

backend_vip 'backend1.donatas.net',
  :ipaddress => 'X.X.X.X',
  :device => 'ens192'

server 'frontend1.donatas.net',
  :ipaddress => 'Y.Y.Y.1',
  :role => 'frontend'

server 'frontend2.donatas.net',
  :ipaddress => 'Y.Y.Y.2',
  :role => 'frontend'

server 'frontend3.donatas.net',
  :ipaddress => 'Y.Y.Y.3',
  :role => 'frontend'

server 'frontend4.donatas.net',
  :ipaddress => 'Y.Y.Y.4',
  :role => 'frontend'

...

api_fqdn 'chef.donatas.net'

Unnecessary Updates in BGP

Sun, 29 Nov 2020 00:00:00 +0000

It is well known that BGP, as a distance-vector protocol, suffers from path exploration: For every withdrawn route (AS path or any other mandatory attribute change), the next best, supposedly valid route is selected and announced, until there are no more candidates left in the router’s RIB.

BGP routers can generate multiple identical announcements with empty community attributes if stripped at egress. This is an undesired behavior. Why do we need to send an UPDATE if this is not an actual change?

Imagine the topology below:

Assume a converged network. In order to induce updates for prefix p, we disable the link between Y1 and Y2 and wait for updates arriving at collector C1.

This is the FRRouting configuration at X1:

router bgp 65002
  no bgp ebgp-requires-policy
  neighbor 10.0.1.1 remote-as external
  neighbor 10.0.1.1 timers 3 10
  neighbor 10.0.2.2 remote-as external
  neighbor 10.0.2.2 timers 3 10
  address-family ipv4 unicast
    redistribute connected
    neighbor 10.0.1.1 route-map c1 out
  exit-address-family
!
bgp community-list standard c1 seq 1 permit 65004:2
bgp community-list standard c1 seq 2 permit 65004:3
!
route-map c1 permit 10
  set comm-list c1 delete
!

X1 receives paths with communities 65004:2 and 65004:3 from Y1 because those paths are tagged at Y2 and Y3 appropriately.

When X1 sends the best path to C1, it generates duplicate updates because of the attribute change (community, 65004:2 -> 65004:3). But since we strip all communities at egress, it’s absolutely not necessary to send duplicates towards C1, because the Adj-RIB-Out hasn’t changed.

I developed a patch for FRRouting which prevents sending duplicate updates if the path has not actually changed. This is valid for all attributes.
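The idea can be sketched in a few lines of Python: apply the egress policy first, then compare the result with what was last advertised to that peer, and only send an UPDATE when they differ. This is a simplified model with made-up names (Path, strip_communities, advertise) and a made-up prefix and AS path, not the actual FRRouting code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    prefix: str
    as_path: tuple
    communities: frozenset = frozenset()


def strip_communities(path: Path, to_delete: set) -> Path:
    """Egress policy, in the spirit of 'set comm-list c1 delete' above."""
    return Path(path.prefix, path.as_path, path.communities - to_delete)


def advertise(adj_rib_out: dict, best: Path, to_delete: set) -> bool:
    """Return True if an UPDATE has to be sent to the peer."""
    outgoing = strip_communities(best, to_delete)
    if adj_rib_out.get(best.prefix) == outgoing:
        return False            # nothing changed after policy: suppress
    adj_rib_out[best.prefix] = outgoing
    return True


delete = {"65004:2", "65004:3"}
rib_out = {}

# Same AS path, only the community differs (65004:2 -> 65004:3).
first = Path("192.0.2.0/24", (65003, 65004), frozenset({"65004:2"}))
second = Path("192.0.2.0/24", (65003, 65004), frozenset({"65004:3"}))

print(advertise(rib_out, first, delete))    # True: initial announcement
print(advertise(rib_out, second, delete))   # False: identical after stripping
```

The comparison happens after the outbound policy is applied, so a change that the policy erases never reaches the wire.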

For more detailed information and experiments, please read https://www.cmand.org/communityexploration/. You will find more configuration examples for other routers (Cisco, Juniper, Nokia) and other software routing daemons (Bird, OpenBGPD, Quagga).

ARP table quiz in Cumulus Linux

Fri, 17 Jul 2020 00:00:00 +0000

A weird situation: you notice that IPv4 has packet loss, while IPv6 works as expected over the same physical link:

--- 2a02:4780:face:f00d::1 ping statistics ---
1368 packets transmitted, 1368 received, 0% packet loss, time 1367006ms
rtt min/avg/max/mdev = 0.217/0.285/0.667/0.045 ms
root@sg-nme-leaf1:~#

--- 10.0.31.1 ping statistics ---
1366 packets transmitted, 1348 received, 1% packet loss, time 1365005ms
rtt min/avg/max/mdev = 0.147/0.227/0.874/0.064 ms
root@sg-nme-leaf1:~#

There is a lot to check, but this is what I did to solve the issue.

First I checked the ptmd status, which surprised me even more. The BFD status was failed, but the BGP session was UP. That’s because we use a single IPv6 session for both IPv4 and IPv6.

10.0.31.1 is down while 2a02:4780:face:f00d::1 is up. Why? It’s for the same reason (packet loss).

Checking for drops/errors with ethtool -S swp48 | grep -iE "drop|err|disc" gave nothing except the usual drop counters, which is normal due to ACLs, bursty buffer congestion, etc.

If IPv6 works while IPv4 does not, it seems related to the ARP table. I double-checked the ARP table entries with ip neigh | wc -l. It was around 6k. Nothing special as well, just a well-worn node.

Unfortunately, Broadcom devices do not have native ASIC monitoring, which could provide me with the stats about the buffers, packets count, queue lengths, etc.

I thought I would inspect buffer congestions or so, but anyway. Continuing.

Running this command in a terminal I noticed host entries drop when packet loss happens:

watch -n 1 'date >> /tmp/host_count.log ; cat /cumulus/switchd/run/route_info/host/count_0 >> /tmp/host_count.log ; tail /tmp/host_count.log'

Fri Jul 17 08:14:06 UTC 2020
12702
Fri Jul 17 08:14:07 UTC 2020
9281

It went from something like 12k to 9k and kept varying. The maximum is 16k, but that’s not an issue since we are nowhere near 16k.

dmesg is clean. If it were garbage collection of stale ARP entries, there would be an error message in the dmesg output, like: kernel: Neighbour table overflow.

Just in case, I checked net.ipv4.neigh.default.gc_thresh1, which was at the default of 128. As I mentioned above, there were around 6k ARP entries at the time.

128 is the minimum number of ARP entries kept in the cache: garbage collection won’t be triggered while the table holds fewer than 128 entries. We have 6k, so the garbage collector is free to purge entries. That’s why I noticed host entries drop when packet loss happens. I double-checked this a few times and confirmed it.

Setting gc_thresh1 to slightly less than the number of ARP entries we have fixed the problem.

If you raise gc_thresh1 higher than the number of entries you have, you will probably end up with lots of FAILED/STALE entries in the neighbor table, and fun things could start happening (lots of spurious TCP retransmissions, out-of-order packets, and so on).

So keep a careful eye on this ;-)
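As a rough mental model of how the three kernel thresholds interact, here is a small Python sketch. It follows the descriptions in the kernel’s ip-sysctl documentation; the function name and the returned labels are mine, and the gc_thresh2/gc_thresh3 values are hypothetical, not read from the switch.

```python
def neighbor_gc_state(entries: int, thresh1: int, thresh2: int, thresh3: int) -> str:
    """Classify a neighbor table size against the gc_thresh* sysctls."""
    if entries < thresh1:
        return "no-gc"          # below the minimum: the collector leaves the table alone
    if entries < thresh2:
        return "periodic-gc"    # stale entries may be purged by the periodic collector
    if entries < thresh3:
        return "aggressive-gc"  # entries older than 5 seconds are purged
    return "overflow"           # hard maximum: "Neighbour table overflow" territory


# 6000 entries as in this post, with two different gc_thresh1 values.
for thresh1 in (128, 8192):
    print(thresh1, neighbor_gc_state(6000, thresh1, thresh2=16384, thresh3=32768))
```

With gc_thresh1 above the entry count nothing ever gets collected, which is the FAILED/STALE pile-up described above.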

Two years in FRRouting

Fri, 08 May 2020 00:00:00 +0000

Since the beginning of my contribution to FRRouting, I raised myself to the top 15 contributors (that’s a huge win for me personally):

% git shortlog --summary --numbered | head -n15
  Donald Sharp
  Quentin Young
  David Lamparter
  Renato Westphal
  paul
  Philippe Guibert
  Lou Berger
  Russ White
  Paul Jakma
  Rafael Zalamena
  Mark Stapp
  Christian Franke
  Daniel Walton
  Martin Winter
  Donatas Abraitis

Now I feel that I can read the code, I can find root causes faster, I can even review others’ code. I’m not as dumb and green as I was 2 years ago.

When I started digging into FRRouting (because it was picked by Cumulus Networks), I solved old issues (3-4 years old); there were really lots of them. I managed to solve quite a lot, with more still to go, but it’s never-ending for a huge and fast-moving project. Of course, I’m not referring to github.com/ansible/ansible, which is, I would say, brain dead.

What do I do now? I participate in the development process more and more. Mostly I contribute to the BGP protocol (my favorite one), packaging, and testing environments, and sometimes to other areas (I don’t have much experience with or knowledge of other protocols yet).

Overall my commits during two years (including merge commits of other contributors):

% git shortlog --summary --numbered --author='Donatas Abraitis'
   365  Donatas Abraitis

Commits owned by me:

% git shortlog --summary --numbered --no-merges --author='Donatas Abraitis'
   204  Donatas Abraitis

Doing the math, that’s one commit every third day or so. For some people I know, that would be like a full-time job. For comparison with my regular work, the two most active repositories I contribute to are:

% git shortlog --summary --numbered --no-merges --since="2 years ago" --author='Donatas Abraitis'
  1148  Donatas Abraitis

% git shortlog --summary --numbered --no-merges --since="2 years ago" --author='Donatas Abraitis'
   355  Donatas Abraitis

That sounds like a full-time vs. a part-time job.

I absolutely do not regret that I spent plenty of my free time contributing to FRRouting because that helped me grow as a person and professionally. Also, I started understanding lots of things about how huge open-source projects work, how the deployments run, communication, rules. Third home.

I see that with time you get more trust from the community and you get valued. At the beginning, I thought I was being very annoying and angry when reviewing others’ code, but it seems not; people value that.

We just can’t push bad code that does not meet the requirements or lacks documentation updates and tests. This is not a private project used by 10 people (which is what you usually see in most companies). There is no excuse for pushing bad or untested code.

Basically, we are trying to keep requirements like the Linux kernel does: checkpatch.pl, clang-format, etc.

I remember one day I was tackling a dnsmasq issue at Hostinger and wanted to check how caching is implemented in dnsmasq, because the performance difference compared to tinydns is really significant (tinydns is 10x faster).

I pulled the source code to dig into the problem I was facing (not in the scope of this blog post). Guess what my first (blunt) words were: “Abandoned university bachelor’s code.” Yes, that’s the very notable difference between code that is community-driven and code that is just one more project.

How could it get better than committing to, and even maintaining, a world-class project used by such big players as Microsoft, Amazon, Red Hat, VMware, Cumulus Networks, etc.? Wait, Cumulus Networks was acquired by NVIDIA, which complicates things even more :)

My kids sometimes ask me, why do you work all the time? Well, I answer honestly that I’m not working, I’m learning. It’s my hobby, the same as you watching “Nastya”, “Roma and Diana”, “AcroYoga”, etc., or riding a bike.

Sometimes I do that on weekends, usually when I have nothing else planned. I don’t expect you to do the same.

I always think about leaving the project, but it’s hard, it’s like a third family, where you have an amazing community, good practices, super-hero developers, you just learn, you can’t leave your home :)

This is not the same as a regular job, you don’t get money. But you kinda drive a project which is brilliant. Which is emotionally awesome.

To sum up, being a maintainer doesn’t mean you have to be pro-level in everything. I’ve formed the opinion that it’s enough to be active, responsive, and willing to help, to learn a lot, to dig very deep into the protocols, and of course to go wild reading RFCs.

Became maintainer of FRRouting

Wed, 04 Sep 2019 00:00:00 +0000

This is the main reason I stopped blogging (or blog less): I’m more interested in how the internet works and put all my effort into improving the internet overall.

At first, the title sounds amazing. Yes, that’s really cool and motivating. But you also have to pay more attention to what you do as a developer, review others’ code in the areas where you have some expertise, continue reading the source code, and help the community adopt the product for their use.

I started contributing to the FRRouting open-source project around a year and a half ago. I just went over existing issues and tried to reduce their number (by solving them). In the beginning, I felt like I had got a new job: setting up a development environment, new tools, formatters, linters, strict requirements similar to the Linux kernel’s workflow, etc. It was fun, I learned lots of really new stuff, and I met new, very smart people.

After a year I was promoted to maintainer of the FRRouting project.

What does it mean?

Maintaining a huge world-class project is a huge responsibility because you can quickly lose trust. We have weekly virtual meetings to discuss current issues and pull requests, and then decide who takes which part or who will review what. We have Slack channels where we discuss only related things. No rush. It’s always better to have a slow loop rather than going through the fast loop and fixing things again and again.

A PR sometimes takes even months to be reviewed. Code reviewers have to actually look at the code; they can’t just hope the community will somehow perform a proper review. A project also cannot rely just on tools, as they will never do the full job. Proper code review takes a lot of time, and it needs to be done by experienced developers.

Responsibility comes when you have to review or fix parts you have already changed or improved. I always had a fear of commenting on, merging and discussing things where I’m not the best compared with others. With time I became more comfortable being part of the project.

Noticed an interesting issue and want to try it as a challenge? Assign it to yourself and start the journey. Sometimes I have no clue whether I can implement the assigned issue at all, but it’s even more challenging than a regular job because you must write good code and keep styling, testing and other requirements in place. In regular jobs you can mostly ignore that, while here you can’t. Compared to a permanent job, code review can be stricter in open-source projects because you have to keep everything unified.

Companies require you to push features or bugfixes as fast as possible, even without caring about performance, quality and other factors. Just push forward. In open-source, things differ. Your commits could stay unmerged for months. And that’s normal. Why? Because of both quality and performance.

One of my favorite phrases from Linus:

The New Linus: “I really like you as a person and all. But that code will only get committed over my dead body.”

Again, maintaining a global project doesn’t mean you have full control and can do what you want. For instance, I wanted to implement a BGP version capability, but it’s absolutely not worth doing this without a published draft/RFC.

So I created a draft, then we discussed all the options and pitfalls on the IDR mailing list. Some agreed with me, some did not, but it’s absolutely fine to be rejected eventually. However, writing a draft/RFC is really not the most fun thing to do: first it must be written in XML, then you must pass the validations (syntax, grammar, paragraphs, references, etc.), discuss with the community whether it’s worth it, and code a working example/prototype as a reference.

Fun facts

Google (and, rumor has it, Facebook as well) uses FRRouting internally;
I was invited to a job interview and asked them why they had invited me. “We saw you are a maintainer of a project we use in our devices for routing”;
The best thing that has happened to me apart from the birth of a child: a huge motivation to move forward.

LLDP does not work with i40en driver in VMWare ESXi

Thu, 06 Jun 2019 00:00:00 +0000

Imagine your services critically depend on the Link Layer Discovery Protocol (LLDP). And this is the brownie if you have $VMWare ESXi and Intel 710/711722 series NICs.

LLDP just doesn’t work well with the i40en driver. Well, it doesn’t work at all.

Test it on the guest:

# lldpctl
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------

The main problem is that the NIC doesn’t forward LLDP packets to guests. Hence, you can’t see who your neighbors are from inside the guest. A bigger problem arises if you depend on LLDP and other services fail because of that: a cascading failure scenario.

To overcome this limitation and allow guests to see LLDP packets, you MUST use the newest i40en driver and modify the flash.

Here are the commands to upgrade the i40en driver to the newest version and disable the LLDP agent. When you disable the LLDP agent, the NIC forwards LLDP packets to guests:

# wget http://donatas.net/VMW-ESX-6.7.0-i40en-1.8.6-13636624.zip
# esxcli software vib install -v /vmfs/volumes/datastore1/__tmp/i40en-1.8.6-1OEM.670.0.0.8169922.x86_64.vib
# esxcli system module parameters list -m i40en
# esxcli system module parameters set -m i40en -p LLDP=0,0
# reboot
# wget http://donatas.net/x722-lldp.tar.gz

As you can see from the output above, it’s not enough to upgrade the driver. We need to use a third-party tool released by Lenovo to flash the firmware and disable the LLDP agent.

[root@us-imm-esxi7:/vmfs/volumes/5cef6df8-fabe2ba4-7e68-ac1f6b5a176e/__tmp/X722-LLDP-FW-Setting-Tool/esxin64e] ./LLDPn64e -disable -debug
Connection to QV driver failed - please reinstall it!
LLDP 1.0.5
Copyright (C) 2018 Intel Corporation

Scanning for matching Intel(R) devices...

## Vend Dev  MAC          Branding-String
-- ---- ---- ------------ ---------------
MAC type is 327683
 1 8086 37D2 AC1F6B5A176E Intel(R) Ethernet Connection X722 for 10GBASE-T
   FW: 3033 ETrack: 80000E48  FW:3.1 API:1.5  NVM: 3.33 MAP2.31
MAC type is 327683
 2 8086 37D2 AC1F6B5A176F Intel(R) Ethernet Connection X722 for 10GBASE-T
   FW: 3033 ETrack: 80000E48  FW:3.1 API:1.5  NVM: 3.33 MAP2.31
-- ---- ---- ------------ ---------------

2 matching adapters discovered.

About to disable LLDP in NVM on port 0
Flash size is 6004736
EMP is 8034
EMPSettingsModuleHeaderOffsetW is 0001A000
LLDPConfigurationPtr is 0007
LLDPConfigurationOffset is 0001A00D
LLDPConfigurationLength is 0008
LLDPAdminStatusOffsetW is 0001A00E
Note: LLDPAdminWord is 3333
LLDPCRCOffset is 0001A015
LLDPCRC is 8038
Changed word 0001A00E to 0000
New CRC is A4
Changed word 0001A015 to 80A4
Writing modified flash of size 6004736...done.

After that we have it:

# lldpctl
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface:    ens192, via: LLDP, RID: 1, Time: 0 day, 01:52:28
  Chassis:
    ChassisID:    mac 80:a2:35:21:a2:71
    ...

Conclusion

It took me four-plus hours of debugging and reading half the internet.

Git in operations

Wed, 01 May 2019 00:00:00 +0000

I was asked by developers to talk about how we use git in the operations team (aka SRE/DevOps). In an ideal world, developers should teach operations guys how to use git at scale. But sometimes reality is far from that.

Below are the short slides I prepared one hour before the stand-up: keep it simple, stupid.

I will explain each command from the slides and how I use it, not only for work but for open-source contributions as well.

git config --list

List all the configurations per project or globally. I don’t think I use it at all, just for the setup phase maybe.

git config --global push.default current

This is very handy to avoid specifying the branch name you want to push. It makes git push use the branch you are currently working on by default.

git config --global branch.master.remote origin

Usually, you have to type for instance git push origin master. Using this configuration knob and the previous one, you cut this command to just git push.

The trick for GitHub users: git config --global remote.origin.fetch +refs/pull/*/head:refs/remotes/pr/*. With this setting, git fetch also fetches pull request heads, named by pull request number. If a pull request has the number 123, you can check it out locally by typing git checkout pr/123. Cool? Not? Continue reading.
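To see what that refspec does, here is a tiny Python sketch of the mapping: the part before the colon is matched against refs on the remote, and the text captured by * is substituted into the part after the colon. The helper name is mine and it only handles the single-* form used here.

```python
from typing import Optional


def map_refspec(refspec: str, remote_ref: str) -> Optional[str]:
    """Map a remote ref to its local name using a '<src>:<dst>' refspec with one '*'."""
    # A leading '+' only means "update even if it is not a fast-forward".
    src, dst = refspec.lstrip("+").split(":", 1)
    prefix, suffix = src.split("*", 1)
    if not (remote_ref.startswith(prefix) and remote_ref.endswith(suffix)):
        return None
    if len(remote_ref) < len(prefix) + len(suffix):
        return None
    captured = remote_ref[len(prefix):len(remote_ref) - len(suffix)]
    return dst.replace("*", captured, 1)


spec = "+refs/pull/*/head:refs/remotes/pr/*"
print(map_refspec(spec, "refs/pull/123/head"))    # refs/remotes/pr/123
print(map_refspec(spec, "refs/heads/master"))     # None
```

Because the result lands under refs/remotes/, git treats pr/123 like any other remote-tracking branch.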

git config --global alias.last 'log -1 HEAD'

I always tend to use git log -n 1. Using git aliases it could be more human readable like git last.

git remote add upstream https://

When you work on a local fork of an upstream repository where you don’t have write permissions, the branches of your fork and the upstream get out of sync. git fetch --all helps you have all the upstream branches locally as well.

git log

Mostly I use git log, but if I want to find a relevant commit id by commit message I use git log --oneline | grep nginx or git log --grep 'nginx'. If I need to find the commit by looking into the code, not only at the commit message, I use git log -S 'nginx'.

git shortlog

I rarely use this command. Mostly I use it when I want to see who the most active contributors to an arbitrary project are; it’s also useful when publishing release notes.

The output is well known for some people:

Donatas (2241):
      Merge pull request #1 from hostinger/feature/add_supermarket_cookbook
      Merge pull request #2 from hostinger/feature/add_hostinger-machine_cookbook

git shortlog -s -n --all --no-merges gives us a more statistics-oriented view:

% git shortlog -s -n --all --no-merges
  2685  Donatas Abraitis
   415  FirstName LastName

git add

Every day I usually use git add . to add all files recursively in the current directory, but sometimes I have to use git add <path> to add a specific file or directory and skip some temporary files. I have never used git add --interactive. It’s quite fun but doesn’t look very handy at first glance. It just slows down the process, IMHO.

I don’t use git add --patch often, but it’s useful if you changed more than one place in the code and in the end need to add only a few of the changes (not all). Patch splits your changes into hunks and you are able to select which ones to include.

To unstage a file you accidentally added, type git rm --cached <path>.

git commit

Sorry, all IDE lovers, but git commit -m is really big crap. First of all, it teaches you to not think much about the commit message and description at all. Why? Because if you use -m you probably want to push this commit as fast as possible without caring about the details. Lazy developers syndrome:

* git add .
* git commit -m 'I do not care much about that'
* git add .
* git commit -m 'I do not care much about that 2'
* git add .
* git commit -m 'fix'
* git push

Let’s talk about git commit --signoff. Some big projects require you to sign off your commits before merging.

It is used to say that you certify that you have created the patch in question, or that you certify that to the best of your knowledge, it was created under an appropriate open-source license, or that it has been provided to you by someone else under those terms.

git commit --amend

I always use git commit --amend --allow-empty --signoff when committing. Amend is the feature which allows you to modify the previous commit while keeping the same commit message and description. It doesn’t create an additional commit, it just adds the code you want. Some projects require a single commit per pull request, because some linters or CI/CD platforms run tests per commit and not per pull request. What happens when you have two commits where the later one fixes a syntax error that the former introduced? It would start two deployments, one per commit, and both should fail.

bad commit example

commit a5140910088f33ec6edd3869a1354ebfafb63ff8
Author: joni2back <xxx@gmail.com>
Date:   Tue Mar 3 16:58:54 2015 -0300

    fix error

What can you extract from this message? Absolutely absurd. Nothing. Oh, I can say that the author is doing his career well.

good commit example

commit afad5cedf1be827238b376e63b0b93bb555c928e
Author: Donatas Abraitis <donatas.abraitis@gmail.com>
Date:   Mon Feb 25 21:16:02 2019 +0200

    bgpd: Add peer action for PEER_FLAG_IFPEER_V6ONLY flag

    peer_flag_modify() will always return BGP_ERR_INVALID_FLAG because
    the action was not defined for PEER_FLAG_IFPEER_V6ONLY flag.

    ```
    global PEER_FLAG_IFPEER_V6ONLY = 16384;
    global BGP_ERR_INVALID_FLAG = -2;

    probe process("/usr/lib/frr/bgpd").statement("peer_flag_modify@/root/frr/bgpd/bgpd.c:3975")
    {
        if ($flag == PEER_FLAG_IFPEER_V6ONLY && $action->type == 0)
                printf("action not found for the flag PEER_FLAG_IFPEER_V6ONLY\n");
    }

    probe process("/usr/lib/frr/bgpd").function("peer_flag_modify").return
    {
        if ($return == BGP_ERR_INVALID_FLAG)
                printf("return BGP_ERR_INVALID_FLAG\n");
    }
    ```
    produces:
    action not found for the flag PEER_FLAG_IFPEER_V6ONLY
    return BGP_ERR_INVALID_FLAG

    $ vtysh -c 'conf t' -c 'router bgp 20' -c 'neighbor eth1 interface v6only remote-as external'

    Signed-off-by: Donatas Abraitis <donatas.abraitis@gmail.com>

A somewhat better example of what a good commit message should look like. I’m not saying it’s ideal, but it’s good enough to understand what’s going on.
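A check along these lines can even be automated. The following Python sketch flags the most obvious problems; the rules (a 'component: Summary' subject of at most 72 characters, a blank second line, a description, a Signed-off-by trailer) are my reading of the good example above, not an official FRRouting checker.

```python
import re


def lint_commit_message(message: str) -> list:
    """Return a list of problems found in a commit message (empty list = looks fine)."""
    problems = []
    lines = message.strip().splitlines()
    subject = lines[0] if lines else ""

    if not re.match(r"^[\w./-]+: \S", subject):
        problems.append("subject should look like 'component: Summary'")
    if len(subject) > 72:
        problems.append("subject is longer than 72 characters")
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line should be blank")
    body = [l for l in lines[2:] if l.strip() and not l.startswith("Signed-off-by:")]
    if not body:
        problems.append("no description explaining why the change is needed")
    if not any(l.startswith("Signed-off-by:") for l in lines):
        problems.append("missing Signed-off-by trailer (git commit --signoff)")
    return problems


for problem in lint_commit_message("fix error"):
    print("-", problem)
```

Something like this could run from a commit-msg hook, but a linter only catches the form; whether the description is actually useful still needs a human reviewer.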

git rebase

Most of the time I use git rebase --interactive master. But if an arbitrary project has release branches I use for instance git rebase --interactive stable/7.0.

git stash

I don’t use stash often, but it’s a cool feature git provides. It’s like a buffer or storage where you put your changes without committing and can jump between branches. Let’s say I work on a feature branch and I need to switch to the master branch to pull new changes. Not possible if I have some changes already. I need to stash my changes to the buffer and retrieve those changes later using git stash pop or restore back to more accurate revision. You can list all stashes with git stash list.

git reset

Let’s say I made a bad commit or amended the previous commit without creating a new commit first (overwriting someone’s changes). In such cases I use git log -n 2 to grab the commit id I want to reset to, then run git reset <sha1>. Then I’m able to commit my changes again.

Git allows you to use the --hard or --soft method for reset action. With soft you lose only the commit history. With hard, you lose the changes as well.
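A toy Python model makes the difference concrete. It treats the repository as three pointers (HEAD, index, working tree) and shows which of them each mode moves; --mixed, the default when no flag is given, is included for completeness. The names are mine, and real git obviously tracks file contents, not labels.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Repo:
    head: str          # commit HEAD points to
    index: str         # snapshot staged for the next commit
    worktree: str      # files on disk


def reset(repo: Repo, target: str, mode: str = "mixed") -> Repo:
    if mode == "soft":      # move HEAD only; changes stay staged
        return replace(repo, head=target)
    if mode == "mixed":     # move HEAD and index; changes stay on disk, unstaged
        return replace(repo, head=target, index=target)
    if mode == "hard":      # move everything; uncommitted changes are gone
        return Repo(head=target, index=target, worktree=target)
    raise ValueError(f"unknown mode: {mode}")


repo = Repo(head="bad-commit", index="bad-commit", worktree="bad-commit")
for mode in ("soft", "mixed", "hard"):
    print(mode, reset(repo, "good-commit", mode))
```

Only the hard variant touches the files on disk, which is why it is the one to be careful with.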

git tag

I’m not a fan of using tags. But some projects use them. Basically, there are two types of tags: annotated and lightweight. The latter just marks the commit with an arbitrary tag, while the former creates some metadata in the history, like date, author, and message.

git bisect

I discovered this feature a few months ago. It’s very cool. Want to find the commit which broke functionality or caused a regression? Bisect can help you. You specify a good and a bad commit at the start:

git bisect start
git bisect bad <sha1>
git bisect good <sha1>
git bisect bad
...
git bisect reset

Or you can automate this by using git bisect run <cmd>, where cmd is a script that marks commits as good or bad through its exit code.
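Under the hood this is a binary search over the commit history. Here is a minimal Python sketch of the idea, assuming a linear history where every commit after the first bad one is also bad (the function name and the fake history are made up; real git also copes with merges).

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Find the first bad commit; commits[0] is known good, commits[-1] known bad."""
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid        # like `git bisect bad`
        else:
            good = mid       # like `git bisect good`
    return commits[bad]


history = [f"c{i:03d}" for i in range(1000)]
tested = []


def check(commit: str) -> bool:
    tested.append(commit)
    return commit >= "c613"     # pretend c613 introduced the regression


print(first_bad_commit(history, check), "found after", len(tested), "tests")
```

A thousand commits need only about ten test runs, which is why bisect pays off so quickly on big projects.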

git cherry-pick

What if I want to pick the commit which is already on the master and recommit it to another branch, let’s say stable/7.0? Create one more commit or pick the same commit which is on the master branch and backport it to the stable/7.0? git cherry-pick <sha1> does the trick.

git reflog

I think I have used reflog fewer times than I have fingers on my hands. But it’s a really powerful feature. For instance, if I amended and broke my commit, I’m able to recover the previous state of the same commit by traveling in time ;-) It’s like a git time machine.

Conclusion

I can’t imagine my daily work without git. It’s a critical tool for day-to-day work, even in operations. It sounds silly, but even lawyers nowadays are using GitHub to publish chapters publicly.

When you code for work, you mostly do not need git commands such as git bisect or git cherry-pick and so on. That’s why I’m always looking for open-source contributions - to learn something new again and again.

Keep commits as small as possible so that bisect lands on the truly relevant commit instead of leaving you to review an elephant-sized one.

Got my own IPv6 address 2019::E

Sun, 10 Mar 2019 00:00:00 +0000

Preparation

Experience required?

After passing the exam I can say that the list of materials above is not quite enough to prepare, because some questions I knew only from my past experience with Cisco certifications.

How long does it take to prepare?

I took one week of vacation and read as much as I could, played with virtual environments, and attended a boot camp.

Testing environment

This is the most interesting piece.

First of all, Cumulus decided to work with Proctoru.com for certifications. Ok, I went to test whether my Linux workstation could meet the Proctoru requirements.

Unfortunately, Linux is not supported by Proctoru.

I decided to use VirtualBox with Windows 7 as a guest. Shared my webcam, microphone, etc. The test was green. I fully trusted that everything was set and I could calmly wait for the exam date.

Exam date came. I filled in all the necessary data like passport details, pictures, etc. But in the last stage, I got an error asking me to disable “VirtualBox integrated camera”. Wait. What?

Contacted support. They said that using VirtualBox is not allowed at all. Well done Proctoru.com, well done.

I called my co-worker to ask if he could lend me a laptop with OSX. Ok, I got my new laptop which is fully supported now.

I was sitting in our meeting room alone. The supervisor asked me to show her around the room and to put the phone in front of my laptop. That’s fine. Later on, she said that this place is not private and not allowed by the institution. Damn it, I had to move to another place.

I rushed to our other office where we actually have silent boxes.

Again, I did the same procedure and eventually another supervisor said - it’s not a private place. What the hell is this? I asked where I should go. Can I take the exam inside the car? She said - yes, because the car is a private space. No problem, I ran to my car and did the same procedures.

Surprise surprise? Wonder what she said to me? This is NOT a private space, move to a private space! I won’t say what I said to them (angry). I asked them to sort it out and look at the chat history, because the previous supervisor had told me to move to the car. It was a misunderstanding from their point of view. I rescheduled the exam for the third time already.

I decided to go to my home where I don’t live yet (under construction). This is a private space according to them. Ok, sweaty, warm and angry, I started my exam adventure.

Difficulty

One guy from Cumulus said that the exam is really not so easy, but I took the risk and tried to pass it right away without studying further.

For me personally, it wasn’t hard. I would say due to my past and current networking experience.

The most difficult part was that I was quite stressed due to the truly stupid examination process.

Compare with CCNP?

Compared with Cisco exams, especially CCNP, this one looks harder because you have to know more than just command-line commands and theory. CCNOP includes command line, theory, Linux and automation chapters.

Another advantage while the exam program is young is that there are no braindumps to peck at and go. I’m sure someone will leak questions for this exam as well. That’s why I decided to pass it before that happens instead of cheating.

Was it worth it?

I think it was worth it. I’m not going to continue with Cisco certification anymore because white-box networking is the future. Even better, you can always spot the problem and try to figure it out yourself instead of just filing an issue. And for extra brownie points maybe contribute to upstream?

Prevent route leaks by explicitly defining policy

Tue, 19 Feb 2019 00:00:00 +0000

Route leaks or even hijacks are one of the biggest flaws in global routing.

What are route leaks? Literally, they are prefixes announced accidentally due to wrongly configured import/export filters, or because those filters do not exist at all.

If a customer or a peer sends routes outside its scope and a provider accepts them, we call those events route leaks.

Route leaks usually refer to accidental events, while hijacks are an illegitimate takeover of IP addresses by corrupting routing tables. In such a case, scammers can route the traffic as they wish. To avoid the aforementioned events providers MUST maintain strict filters on what to accept from peers and customers.

The best-known filters are:

Filter	Easy to implement	Major vendors support	Pros	Cons
RPKI		X	Accepts only valid prefixes from peers and customers. No need to define any prefix-lists or other ACLs.	Requires external validators. Takes more time for peers to implement. Not very precise at the moment because peers do not update ROA records properly.
prefix-list	X	X	Accepts only defined ranges of prefixes. Allows specifying a dynamic prefix mask (ge, le).	Can’t filter out prefixes by AS-PATH.
distribute-list	X	X	Same as prefix-list, but less mature.	Can’t filter out prefixes by AS-PATH. Cannot assign a dynamic prefix mask as prefix-list does.
as-path access-lists	X	X	Filters routes by AS-PATH only.	Cannot filter routes by prefixes.
maximum-prefix	X	X	Sets the maximum number of prefixes for a peer or a customer to avoid overfilling the Adj-RIBs-In.	This is too aggressive if one side decides to restart the session under some circumstances.
WHOIS database filtering			Accepts only prefixes defined in a WHOIS database (RIPE, APNIC, etc.). This is more accurate at the moment compared with RPKI.	Takes more time to converge. Changes are typically visible within 24 hours. Requires external services to update the control plane.
RFC8212	X		Accept and/or announce prefixes only if one of the aforementioned filtering techniques is applied. It works in both directions because it filters inbound and outbound updates.	Only FRRouting supports this RFC at the moment.

Another interesting approach I found is implementing roles, where each neighbor declares its role: customer, peer, internal, provider. This approach extends the BGP OPEN message to establish an agreement on the relationship between two neighbors. Propagated routes are then marked with the iOTC (Internal Only To Customer) attribute according to the agreed relationship, which allows preventing route leaks.

More about that you can read on https://tools.ietf.org/html/draft-ietf-idr-bgp-open-policy-02.

We discussed this draft briefly in the FRRouting group, but at the moment there is nothing to implement while it’s not released as an RFC. It is also questionable whether this draft would have any reasonable effect right now. It would take a huge amount of time to implement on both sides. It’s absolutely fine if you use only FRRouting daemons, but what about a vendor-agnostic solution?

Instead of using roles in updates and open messages, I found https://tools.ietf.org/html/rfc8212 which sounds reasonable to implement.

The RFC defines that both peering sides should require import/export filters to be explicitly defined. What does it mean? By default, all(?) vendors do not require a route-map to be configured for a neighbor.

The snippet below will announce everything it has in its Adj-RIB-Out for neighbor 192.168.3.1.

router bgp 65031
 neighbor 192.168.3.1 remote-as 65032
 !
 address-family ipv4 unicast
  redistribute connected
 exit-address-family
!

Here we have a new BGP knob, bgp ebgp-requires-policy, which requires a filter (filter-list, distribute-list, prefix-list or route-map) to be defined for every eBGP session.

router bgp 65031
 bgp ebgp-requires-policy
 neighbor 192.168.3.1 remote-as 65032
 !
 address-family ipv4 unicast
  redistribute connected
  neighbor 192.168.3.1 prefix-list exit4-in in
  neighbor 192.168.3.1 route-map exit4-out out
 exit-address-family
!
ip prefix-list exit4-in permit 10.0.0.0/24
!
route-map exit4-out permit 10
!

To clarify, if we remove inbound prefix-list from this neighbor, we will receive zero prefixes due to default behavior.

router bgp 65031
 bgp ebgp-requires-policy
 neighbor 192.168.3.1 remote-as 65032
 !
 address-family ipv4 unicast
  redistribute connected
  neighbor 192.168.3.1 route-map exit4-out out
 exit-address-family
!
route-map exit4-out permit 10
!

The Adj-RIB-In table for neighbor 192.168.3.1 will contain zero prefixes. This prevents accepting any prefixes if someone forgets to define the policy explicitly.

We also see an informational warning, Inbound updates discarded due to missing policy, under show bgp neighbors 192.168.3.1.

Hence, instead of just reading the documentation, it’s more fun to play archeologist and try to implement this in real-life examples. An attempt to be cool again.
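The default-deny behaviour on the inbound side can be sketched in a few lines of Python. This is only a model of the decision described above, not FRRouting code; accepted_prefixes and its parameters are hypothetical names, and the policy is reduced to a simple predicate.

```python
from typing import Callable, Iterable, List, Optional

Policy = Callable[[str], bool]  # returns True if the prefix is permitted


def accepted_prefixes(
    received: Iterable[str],
    is_ebgp: bool,
    import_policy: Optional[Policy],
    ebgp_requires_policy: bool = True,
) -> List[str]:
    """Model which received prefixes end up accepted from a neighbor."""
    if is_ebgp and ebgp_requires_policy and import_policy is None:
        # RFC 8212 behaviour: no explicit policy means nothing is accepted.
        return []
    if import_policy is None:
        # Legacy behaviour: no policy means everything is accepted.
        return list(received)
    return [prefix for prefix in received if import_policy(prefix)]


updates = ["10.0.0.0/24", "172.16.0.0/16"]
print(accepted_prefixes(updates, is_ebgp=True, import_policy=None))
print(accepted_prefixes(updates, is_ebgp=True,
                        import_policy=lambda p: p == "10.0.0.0/24"))
```

The same check applies in the outbound direction, where a missing export policy means nothing is announced.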

Traffic steering using GeoDNS and IPv6

Fri, 14 Dec 2018 00:00:00 +0000

Using DNS-based load balancing doesn’t save you from failure. A DNS server doesn’t know if the backend is up or down. It just responds without caring about the state of the backend.

So if you have two or more IPs per record, the DNS server will respond in a round-robin manner. Let’s say one backend of two is under maintenance (down) and the other is alive. Eventually, all connections affected by the first backend’s failure will be re-established to the other instance (if the TTL is small enough, e.g.: 30 seconds).

Typically the TTL is set to one hour or so. Some resolvers override the source TTL to cache DNS responses longer. If you use DNS as a balancing layer then a small TTL must be used, like 30 seconds.

This could be improved by using a circuit breaker between the server and the client.

A further step to achieve more granular stickiness would be to use a GeoDNS service, e.g.: PowerDNS. It has full support for MaxMind legacy GeoIP and GeoIP2. It could return an address by country, city, continent. Just keep in mind that the GeoIP dat format is not maintained anymore. Consider using the mmdb format. PowerDNS decides what to return by the source IP of the resolver, or by the EDNS Client Subnet option if the resolver forwards it.

If your DNS server is dual-stacked, make sure you use both families of GeoIP data (IPv4 and IPv6). Otherwise, it will respond with surprising results.
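The lookup logic can be sketched roughly as follows. This is not how PowerDNS implements it; the map, region names, backend hostnames and the pick_region helper are all hypothetical, and the networks come from the documentation ranges. The point is the order of preference (client subnet first, resolver address second) and what happens when one address family is missing from the map.

```python
import ipaddress
from typing import Dict, Optional

# Hypothetical geo map covering both address families.
GEO_MAP: Dict[str, str] = {
    "192.0.2.0/24": "eu",
    "198.51.100.0/24": "us",
    "2001:db8:1::/48": "eu",
    "2001:db8:2::/48": "us",
}
BACKENDS = {"eu": "eu.example.net.", "us": "us.example.net."}
DEFAULT_REGION = "us"  # returned when nothing in the map matches


def pick_region(
    resolver_ip: str,
    ecs_subnet: Optional[str] = None,
    geo_map: Dict[str, str] = GEO_MAP,
) -> str:
    """Pick a region from the client subnet if present, else the resolver IP."""
    if ecs_subnet is not None:
        source = ipaddress.ip_network(ecs_subnet, strict=False).network_address
    else:
        source = ipaddress.ip_address(resolver_ip)
    best = None  # (prefix length, region) of the most specific match
    for cidr, region in geo_map.items():
        net = ipaddress.ip_network(cidr)
        if net.version == source.version and source in net:
            if best is None or net.prefixlen > best[0]:
                best = (net.prefixlen, region)
    return best[1] if best else DEFAULT_REGION


print(BACKENDS[pick_region("192.0.2.53")])
# An IPv4-only map sends an EU IPv6 resolver to the default region.
print(pick_region("2001:db8:1::53", geo_map={"192.0.2.0/24": "eu"}))
```

The second call shows the surprising result mentioned above: with only IPv4 data loaded, every IPv6 query silently falls through to the default.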

You could even craft your own mapping conditions on how to respond to arbitrary queries, like Facebook does with its offline DNS map cartographer service. The simplest way is to write it in CSV and export to dat/mmdb.

But again this does not guarantee high availability if the backend is down.

Here come CDN providers. Put your website under CDN, cache as much as possible masking backend failures. Some CDN providers have their own load balancers thus you should not care much about how they route the traffic to your website. Again, that’s not free from failures.

Instead of using a CDN load balancer, implement anycast over more locations to spread the traffic to the nearest location. Sometimes one is better than many. By one I mean a single anycast address which is deployed across a few locations.

I must mention that anycast is expensive to deploy because maintenance and monitoring are hard. In addition, anycast is not always end-user friendly because routing is not very fair. Sometimes even detours occur in the path from the source to the destination.

You should allocate a whole /24 block for it (yeah, this is the longest prefix generally accepted in the global BGP table). So if you are planning to use only a few addresses from the whole /24 block - that’s not the way to go.

If you have only a few locations then there is no point in using anycast globally. Too much headache.

Even if you have enough PoPs, you must ensure your whois information is up-to-date because some ISPs generate route maps according to the inetnum country field. Hence a lot of monitoring and traffic engineering is involved.

One more interesting approach is using a CDN in front and anycast at the backends. In this case, the CDN provider will load balance your traffic using the anycast address, but again you are not sure if your content is served from the right region. To improve this setup we can install a GeoDNS server and ask the CDN’s resolver to query a CNAME record instead of A/AAAA. GeoDNS, in turn, will respond with the appropriate backend’s address according to the resolver’s (or EDNS Client Subnet) source IP.

As I mentioned above IPv4 anycast is an expensive solution, but not for a good old friend - IPv6!

In this blog post, I would like to explain a sophisticated (not for networking guys) approach on how to handle failovers gracefully. With IPv6 things change. As drawn in the diagram you should see that every PoP has two prefixes announced.

One global anycast prefix plus a region-allocated prefix. The two prefixes overlap, which allows a smooth failover if one region goes down completely. For instance, your GeoDNS server responds to a CNAME record with IP 2A02:4780:C3::1 for the CDN’s resolver and at that moment this region is down. New connections will be redirected to the PoP with the shortest AS-PATH because of the overlapping global anycast network.
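The failover relies on longest-prefix matching: while a region is up its more specific prefix wins, and once it is withdrawn the covering anycast prefix takes over. Below is a small Python model of that idea. The prefixes are from the IPv6 documentation range, and the PoP names, prefix lengths and AS-PATH lengths are assumptions made up for the example, not the real deployment.

```python
import ipaddress
from typing import Dict, Iterable, List, Optional

# Hypothetical addressing plan: one covering anycast prefix, one /48 per region.
GLOBAL_ANYCAST = ipaddress.ip_network("2001:db8::/32")
REGIONAL = {
    "eu": ipaddress.ip_network("2001:db8:c3::/48"),
    "us": ipaddress.ip_network("2001:db8:a1::/48"),
    "ap": ipaddress.ip_network("2001:db8:b2::/48"),
}
# Hypothetical AS-PATH length towards each PoP, as seen from one client.
AS_PATH_LEN = {"eu": 2, "us": 3, "ap": 1}


def announcements(alive_pops: Iterable[str]) -> Dict[ipaddress.IPv6Network, List[str]]:
    """Every live PoP announces the global prefix plus its own regional one."""
    table: Dict[ipaddress.IPv6Network, List[str]] = {}
    for pop in alive_pops:
        table.setdefault(GLOBAL_ANYCAST, []).append(pop)
        table.setdefault(REGIONAL[pop], []).append(pop)
    return table


def serving_pop(dst: str, alive_pops: Iterable[str]) -> Optional[str]:
    """Longest prefix match first, then the shortest AS-PATH among announcers."""
    addr = ipaddress.ip_address(dst)
    table = announcements(alive_pops)
    matches = [net for net in table if addr in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return min(table[best], key=lambda pop: AS_PATH_LEN[pop])


print(serving_pop("2001:db8:c3::1", ["eu", "us", "ap"]))  # regional /48 wins
print(serving_pop("2001:db8:c3::1", ["us", "ap"]))        # falls back to anycast
```

Note that while the region is alive the regional prefix wins even over a PoP with a shorter AS-PATH, which is what keeps traffic pinned to the address GeoDNS handed out.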

You should deploy this setup if you care about the infrastructure - it allows you to turn off the whole PoP (or DC) for testing (or Chaos engineering) purposes. Everything you are not testing is breaking.

Chaos engineering will allow you to personally meet all of your colleagues within a short time – whether you want to or not!